CSVDataLoader
Data loader that ingests tabular data from CSV files into DashAI datasets.
Reads one or more CSV files, optionally samples rows, and splits the result
into train/validation/test DashAIDataset splits according to the ratios
specified in the schema. The separator is normalised from human-readable
aliases ("blank space", "tab") to Python character literals before
delegating to pandas.read_csv.
Handles multi-file uploads by concatenating all CSVs before splitting,
and supports header detection, column selection, and row skipping via the
CSVDataloaderSchema parameters.
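The separator normalisation mentioned above can be sketched as a small lookup table. The mapping and function name here are illustrative assumptions, not DashAI's actual internals:

```python
# Hypothetical sketch: map human-readable separator aliases to the
# character literals that pandas.read_csv expects. Unknown separators
# pass through unchanged.
SEPARATOR_ALIASES = {
    "blank space": " ",
    "tab": "\t",
}

def normalise_separator(separator: str) -> str:
    """Return the character literal for a known alias, or the input as-is."""
    return SEPARATOR_ALIASES.get(separator, separator)
```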
Parameters
- name : string, default=""
- Custom name to register your dataset. If no name is specified, the name of the uploaded file will be used.
- separator : string, default=","
- A separator character that delimits the data in a CSV file.
- header : string, default="infer"
- Row number(s) containing column labels and marking the start of the data (zero-indexed). The default behavior is to infer the column names. If column names are passed explicitly, this should be set to '0'. Header can also be a list of integers that specify row locations for a MultiIndex on the columns.
- names, default=None
- Comma-separated list of column names to use. If the file contains a header row, you should explicitly pass header=0 to override the column names. Example: 'col1,col2,col3'. Leave empty to use the file's headers.
- encoding : string, default="utf-8"
- Encoding to use when reading the file. The most common encodings are provided.
- na_values, default=None
- Comma-separated additional strings to recognize as NA/NaN. Example: 'NULL,missing,n/a'
- keep_default_na : boolean, default=True
- Whether to include the default NaN values when parsing the data (True recommended).
- true_values, default=None
- Comma-separated values to consider as True. Example: 'yes,true,1,on'
- false_values, default=None
- Comma-separated values to consider as False. Example: 'no,false,0,off'
- skip_blank_lines : boolean, default=True
- If True, skip blank lines rather than interpreting them as NaN values.
- skiprows, default=None
- Number of data rows to skip after reading the header. Leave empty to skip none.
- nrows, default=None
- Number of rows to read from the file. Leave empty to read all rows.
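The schema parameters above map naturally onto pandas.read_csv keyword arguments. The following sketch shows one plausible translation; the function name and exact mapping are assumptions, but each handled field corresponds to a parameter in the list:

```python
def build_read_csv_kwargs(params: dict) -> dict:
    """Translate CSVDataloaderSchema-style parameters into keyword
    arguments for pandas.read_csv. A hedged sketch, not DashAI's code."""
    kwargs = {
        "sep": params.get("separator", ","),
        "header": params.get("header", "infer"),
        "encoding": params.get("encoding", "utf-8"),
        "keep_default_na": params.get("keep_default_na", True),
        "skip_blank_lines": params.get("skip_blank_lines", True),
    }
    # Comma-separated string fields become Python lists; empty means unset.
    for field in ("names", "na_values", "true_values", "false_values"):
        raw = params.get(field)
        if raw:
            kwargs[field] = [v.strip() for v in raw.split(",")]
    # Optional integer fields; leave them out when not provided.
    for field in ("skiprows", "nrows"):
        if params.get(field) is not None:
            kwargs[field] = int(params[field])
    return kwargs
```

The resulting dict can be splatted directly into `pandas.read_csv(path, **kwargs)`.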
Methods
load_data(self, filepath_or_buffer: str, temp_path: str, params: Dict[str, Any], n_sample: int | None = None) -> 'DashAIDataset'
Load the uploaded CSV files into a DatasetDict.
Parameters
- filepath_or_buffer : str, optional
- A URL where the dataset is located, or a FastAPI/Uvicorn uploaded file object.
- temp_path : str
- The temporary path where the files will be extracted and then uploaded.
- params : Dict[str, Any]
- Dict with the dataloader parameters. The options are: separator (str): the character that delimits the CSV data.
- n_sample : int | None
- Indicates how many rows to load from the dataset; all rows if None.
Returns
- DatasetDict
- A HuggingFace DatasetDict with the loaded data.
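The split step that load_data performs (cutting the concatenated rows into train/validation/test by the schema's ratios) can be sketched on plain lists. The real loader operates on HuggingFace Datasets and returns DashAIDataset splits; the names and default ratios below are illustrative:

```python
import random

def split_rows(rows, train_ratio=0.7, val_ratio=0.1, seed=42):
    """Shuffle rows and cut them into train/validation/test partitions.
    A simplified stand-in for the DashAIDataset split step."""
    rows = rows[:]  # avoid mutating the caller's list
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_ratio)
    n_val = int(len(rows) * val_ratio)
    return {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],  # remainder goes to test
    }
```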
load_preview(self, filepath_or_buffer: str, params: Dict[str, Any], n_rows: int = 100)
Load a preview of the CSV dataset using streaming.
Parameters
- filepath_or_buffer : str
- Path to the CSV file.
- params : Dict[str, Any]
- Parameters for loading the CSV (separator, encoding, etc.).
- n_rows : int, optional
- Number of rows to preview. Default is 100.
Returns
- pd.DataFrame
- A DataFrame containing the preview rows.
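The streaming idea behind load_preview is to read only the first n_rows records rather than the whole file. A stdlib sketch of that behaviour follows; the actual method returns a pandas DataFrame (e.g. via read_csv's nrows machinery), and the helper name here is an assumption:

```python
import csv

def load_preview_rows(buffer, n_rows=100, separator=","):
    """Stream at most n_rows records from an open text buffer without
    loading the entire file. Returns (header, rows)."""
    reader = csv.reader(buffer, delimiter=separator)
    header = next(reader, None)          # first record is the header row
    rows = [row for _, row in zip(range(n_rows), reader)]  # stop after n_rows
    return header, rows
```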
extract_files(self, file_path: str, temp_path: str) -> str
Extract a ZIP archive into a subdirectory of temp_path. (Inherited from BaseDataLoader.)
Parameters
- file_path : str
- Path to the ZIP archive to extract.
- temp_path : str
- Base temporary directory; the extraction target is <temp_path>/files/.
Returns
- str
- Path of the directory containing the extracted files (<temp_path>/files/).
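A minimal implementation matching the documented contract could look like this. It is a sketch under the assumption that no extra validation is performed; the real method may add safety checks:

```python
import os
import zipfile

def extract_files(file_path: str, temp_path: str) -> str:
    """Extract a ZIP archive into <temp_path>/files/ and return that path."""
    files_path = os.path.join(temp_path, "files")
    os.makedirs(files_path, exist_ok=True)
    with zipfile.ZipFile(file_path) as zf:
        zf.extractall(path=files_path)
    return files_path
```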
get_schema(cls) -> dict
Generates the component's related JSON Schema. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
prepare_files(self, file_path: str, temp_path: str) -> Tuple[str, str]
Resolve a file path or URL into a local path suitable for loading. (Inherited from BaseDataLoader.)
Parameters
- file_path : str
- Path to a local file, a ZIP archive, or an HTTP(S) URL.
- temp_path : str
- Temporary directory used for extraction of ZIP or URL downloads.
Returns
- tuple of (str, str)
- (path, type_path), where type_path is "dir" for extracted archives/URLs or "file" for plain local files.
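The dispatch implied by that return contract can be sketched as a small classifier. The helper name is hypothetical; it only shows how the "dir"/"file" decision could be made:

```python
import os

def classify_source(file_path: str) -> str:
    """Decide how prepare_files treats its input: URLs and ZIP archives
    end up extracted into a directory ("dir"); plain local files pass
    through ("file")."""
    if file_path.startswith(("http://", "https://")):
        return "dir"  # downloaded into temp_path, then extracted
    if os.path.splitext(file_path)[1].lower() == ".zip":
        return "dir"  # extracted via extract_files
    return "file"
```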
validate_and_transform(self, raw_data: dict) -> dict
Takes the data given by the user to initialize the model and returns it with all the objects that the model needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
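As a rough illustration of what validation against the schema from get_schema() involves, here is a minimal, hypothetical stand-in that fills defaults and checks primitive types. DashAI's actual ConfigObject validation is richer; the function and schema shape below are assumptions:

```python
def validate_and_transform_sketch(raw_data: dict, schema: dict) -> dict:
    """Fill in defaults and type-check values against a minimal
    JSON-Schema-like 'properties' map. Illustrative only."""
    type_map = {"string": str, "boolean": bool, "integer": int}
    validated = {}
    for name, spec in schema["properties"].items():
        value = raw_data.get(name, spec.get("default"))
        expected = type_map.get(spec.get("type"))
        if value is not None and expected is not None and not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {spec['type']}")
        validated[name] = value
    return validated
```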