DashAIDataset
What Is DashAIDataset?
`DashAIDataset` is DashAI's core dataset primitive. It extends the HuggingFace `Dataset` class with two additional responsibilities:
- Semantic type metadata — a `_types` dictionary (`Dict[str, DashAIDataType]`) that maps every column name to its DashAI semantic type (see Semantic Types). This metadata is persisted inside the Apache Arrow schema so it survives save/load round-trips.
- Split metadata — a `splits` dictionary that records which row indices belong to which split (`train`, `test`, `validation`), plus aggregate statistics computed during upload.
Every piece of data that flows through DashAI — upload, notebook transformations, model training, predictions — is represented as a DashAIDataset.
Internal Structure
Instance Attributes
| Attribute | Type | Description |
|---|---|---|
| `_table` | `pyarrow.Table` | The underlying Arrow table. All column data lives here. |
| `_types` | `Dict[str, DashAIDataType]` | Semantic type for each column. Kept in sync with Arrow metadata. |
| `splits` | `dict` | Split indices and computed statistics (see below). |
The splits Dictionary
`splits` is a plain Python dict. Its keys are populated progressively:
| Key | Set by | Content |
|---|---|---|
| `split_indices` | `split_dataset` / `update_dataset_splits` | `{"train": [...], "test": [...], "validation": [...]}` — row-index lists |
| `column_names` | `compute_base_metadata` / `save_dataset` | List of column names |
| `total_rows` | `compute_base_metadata` / `save_dataset` | Total row count |
| `nan` | `compute_base_metadata` | Per-column NaN counts |
| `general_info` | `compute_metadata` | Row/column counts, memory, duplicates, dtype map |
| `numeric_stats` | `compute_metadata` | Per-column descriptive statistics for numeric columns |
| `categorical_stats` | `compute_metadata` | Per-column value counts and top-5 values for categorical columns |
| `text_stats` | `compute_metadata` | Length and word-count statistics for text columns |
| `quality_info` | `compute_metadata` | Completeness, constant columns, high-cardinality columns, quality score |
| `correlations` | `compute_metadata` | Pearson correlation matrix for numeric columns |
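To make the base-metadata keys concrete, here is a minimal, illustrative sketch of how `column_names`, `total_rows`, and `nan` could be computed from plain columnar data. This is a stand-in written for this page, not DashAI's implementation:

```python
import math


def base_metadata(columns: dict) -> dict:
    """Sketch of the base-metadata keys stored in `splits`.

    `columns` maps column name -> list of values; None and float NaN
    both count as missing.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n_rows = len(next(iter(columns.values()), []))
    return {
        "column_names": list(columns),
        "total_rows": n_rows,
        "nan": {
            name: sum(is_missing(v) for v in values)
            for name, values in columns.items()
        },
    }


meta = base_metadata({"age": [25, None, 40], "city": ["NY", "LA", None]})
# meta["total_rows"] is 3; each column has one missing value
```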
Arrow Metadata
Semantic types are serialised to the Arrow table's schema metadata under the key `dashai_types`. This means:

```python
table.schema.metadata[b"dashai_types"]  # JSON-serialised type map
```
The utilities `save_types_in_arrow_metadata()` and `get_types_from_arrow_metadata()` in `DashAI/back/types/utils.py` handle this serialisation. When `DashAIDataset` is constructed with `types=None`, it reads the type map directly from the Arrow metadata.
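The mechanism itself is simple: Arrow schema metadata is a bytes-to-bytes mapping, so the type map is stored as JSON under a bytes key. The following sketch mimics that round-trip with plain dicts (function names here are stand-ins, not DashAI's utilities):

```python
import json


def save_types_in_metadata(metadata: dict, types: dict) -> dict:
    """Store a {column: type-name} map as JSON under a bytes key."""
    updated = dict(metadata or {})
    updated[b"dashai_types"] = json.dumps(types).encode("utf-8")
    return updated


def get_types_from_metadata(metadata: dict) -> dict:
    """Recover the type map from the metadata mapping."""
    return json.loads(metadata[b"dashai_types"].decode("utf-8"))


meta = save_types_in_metadata({}, {"age": "Integer", "city": "Categorical"})
assert get_types_from_metadata(meta) == {"age": "Integer", "city": "Categorical"}
```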
On-disk Format
`save_dataset` writes two files to a directory:

```
<dataset_path>/
├── data.arrow    # PyArrow IPC file — table + type metadata in schema
└── splits.json   # JSON — split indices, row counts, NaN map, computed stats
```
`load_dataset` reads both files and reconstructs the `DashAIDataset`. Types are recovered from the Arrow schema metadata, so no separate type file is needed.
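The `splits.json` half of this round-trip is plain JSON and can be sketched with the standard library alone (writing `data.arrow` would additionally require `pyarrow`, so it is omitted here):

```python
import json
import pathlib
import tempfile

# Minimal example of what splits.json might contain (illustrative values).
splits = {
    "split_indices": {"train": [0, 1, 2], "test": [3], "validation": [4]},
    "total_rows": 5,
}

dataset_path = pathlib.Path(tempfile.mkdtemp())
(dataset_path / "splits.json").write_text(json.dumps(splits))

# Reading it back recovers the same split metadata.
loaded = json.loads((dataset_path / "splits.json").read_text())
```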
Key Instance Methods
get_split(split_name)
Returns a new DashAIDataset containing only the rows of the requested split.
```python
train_ds = dataset.get_split("train")  # uses split_indices["train"]
```
`split_name` must exist in `splits["split_indices"]`. The returned dataset's `splits` metadata contains only that split's index list.
select_columns(column_names)
Returns a new DashAIDataset with only the specified columns, copying the corresponding types.
```python
features_ds = dataset.select_columns(["age", "income", "city"])
```
sample(n, method, seed)
Returns `n` rows as a plain Python dict. `method` controls selection:
| Method | Behaviour |
|---|---|
"head" | First n rows |
"tail" | Last n rows |
"random" | n random rows (reproducible with seed) |
compute_metadata()
Runs all EDA computations and stores the results in `self.splits`. Called once after upload. Requires `_types` to be set. Returns `self`.
compute_base_metadata()
Lightweight subset of `compute_metadata()` — only sets `column_names`, `total_rows`, and `nan`. Used when full statistics are not needed.
keys()
Returns the list of available split names from `splits["split_indices"]`. Mirrors the `DatasetDict.keys()` interface so `DashAIDataset` can be used interchangeably in some contexts.
Data Lifecycle Functions
These module-level functions cover the full journey from raw input to training-ready splits.
Loading and Conversion
to_dashai_dataset(dataset, types=None)
Universal converter. Accepts `DashAIDataset` (pass-through), HuggingFace `Dataset`, HuggingFace `DatasetDict`, or `pandas.DataFrame`. Multi-split `DatasetDict` inputs are merged via `merge_splits_with_metadata`.
```python
from DashAI.back.dataloaders.classes.dashai_dataset import to_dashai_dataset

ds = to_dashai_dataset(my_hf_dataset, types=my_type_map)
```
merge_splits_with_metadata(dataset_dict)
Concatenates all splits of a `DatasetDict` into a single `DashAIDataset` and records the row-index ranges in `splits["split_indices"]`. The splits are merged in sorted key order.
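The merge-and-record step can be sketched over plain lists; this illustrative function (a stand-in, not DashAI's implementation) shows how sorted key order determines where each split's rows land:

```python
def merge_splits(dataset_dict: dict):
    """Concatenate splits in sorted key order, recording index ranges.

    `dataset_dict` maps split name -> list of rows. Returns the merged
    row list and a split_indices mapping of row positions per split.
    """
    merged, split_indices = [], {}
    for name in sorted(dataset_dict):
        start = len(merged)
        merged.extend(dataset_dict[name])
        split_indices[name] = list(range(start, len(merged)))
    return merged, split_indices


rows, indices = merge_splits({"train": ["a", "b"], "test": ["c"]})
# "test" sorts before "train", so "c" ends up at index 0
```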
Persistence
save_dataset(dataset, path, schema=None)
Writes `data.arrow` and `splits.json` to `path`. If `schema` is provided, `transform_dataset_with_schema` is applied first.
load_dataset(dataset_path)
Reads `data.arrow` and `splits.json` from `dataset_path` and returns a `DashAIDataset`. Types are recovered from the Arrow metadata.
get_columns_spec(dataset_path)
Reads only the Arrow schema (no row data) and returns a `Dict[str, Dict]` describing each column's type, dtype, and — for Categorical columns — the category list. Used by the API to return column metadata without loading the full dataset.
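A hypothetical shape of the returned dict, based on the schema format shown later on this page; the per-column field names (in particular `"categories"`) are assumptions for illustration, not confirmed by the source:

```python
# Assumed shape of get_columns_spec's return value (illustrative only).
columns_spec = {
    "age": {"type": "Integer", "dtype": "int64"},
    "income": {"type": "Float", "dtype": "float64"},
    "city": {
        "type": "Categorical",
        "dtype": "string",
        "categories": ["LA", "NY", "SF"],  # hypothetical field name
    },
}
```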
Type Transformation
transform_dataset_with_schema(dataset, schema)
Applies a type schema to the dataset, casting columns to their target Arrow types and updating `_types`. The schema format is:
```python
schema = {
    "age": {"type": "Integer", "dtype": "int64"},
    "income": {"type": "Float", "dtype": "float64"},
    "city": {"type": "Categorical", "dtype": "string", "converted": False},
}
```
Categorical columns have their category list inferred from the actual data at this point, ensuring no values are silently excluded.
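Category inference from the actual data can be sketched in one small function; this is an illustrative stand-in for the inference step, not DashAI's code:

```python
def infer_categories(values):
    """Collect the distinct non-missing values of a column.

    Because categories come from the data itself, no observed value
    can be silently excluded from the category list.
    """
    return sorted({v for v in values if v is not None})


categories = infer_categories(["NY", "LA", "NY", None, "SF"])
```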
Splitting
split_indexes(total_rows, train_size, test_size, val_size, seed, shuffle, stratify, labels)
Returns `(train_indexes, test_indexes, val_indexes)` as lists of integer row indices. Uses a two-phase strategy:

- Split all rows into `train` vs `test + validation`.
- Split `test + validation` into `test` and `validation` proportionally.

Supports stratified splitting: pass the label array in `labels`.
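The two-phase strategy can be sketched as follows; this is a simplified illustration (stratification is omitted, and rounding behaviour is a guess, not DashAI's exact implementation):

```python
import random


def split_indexes(total_rows, train_size, test_size, val_size,
                  seed=None, shuffle=True):
    """Two-phase split: train vs (test + validation), then test vs validation."""
    indexes = list(range(total_rows))
    if shuffle:
        random.Random(seed).shuffle(indexes)
    n_train = round(total_rows * train_size)
    n_test = round(total_rows * test_size)
    train = indexes[:n_train]          # phase 1: train vs test+validation
    rest = indexes[n_train:]
    test, val = rest[:n_test], rest[n_test:]  # phase 2: split the remainder
    return train, test, val


train, test, val = split_indexes(10, 0.6, 0.2, 0.2, seed=0)
# 6 / 2 / 2 rows; together they cover every row exactly once
```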
split_dataset(dataset, train_indexes, test_indexes, val_indexes)
Partitions `dataset` into a `DatasetDict` with `"train"`, `"test"`, and `"validation"` keys, each being a `DashAIDataset`. Column types from the original dataset are preserved.
If all three index arguments are `None`, the function reads the existing `split_indices` from `dataset.splits` instead.
update_dataset_splits(dataset, new_splits, is_random)
Updates `dataset.splits["split_indices"]` in place.

- `is_random=True` — `new_splits` values are float proportions; `split_indexes` is called to generate the row-index lists.
- `is_random=False` — `new_splits` values are already row-index lists and are used directly.
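The two modes can be sketched with a self-contained helper; this is an illustrative stand-in (the sequential slicing used for the random case is a simplification of what `split_indexes` does):

```python
def resolve_split_indices(total_rows, new_splits, is_random):
    """Return {"train": [...], "test": [...], "validation": [...]}.

    If is_random, new_splits holds float proportions and fresh index
    lists are generated; otherwise new_splits already holds row-index
    lists and is used directly.
    """
    if not is_random:
        return dict(new_splits)
    indexes = list(range(total_rows))
    n_train = round(total_rows * new_splits["train"])
    n_test = round(total_rows * new_splits["test"])
    return {
        "train": indexes[:n_train],
        "test": indexes[n_train:n_train + n_test],
        "validation": indexes[n_train + n_test:],
    }


manual = resolve_split_indices(
    5, {"train": [0, 1], "test": [2], "validation": [3, 4]}, is_random=False)
generated = resolve_split_indices(
    10, {"train": 0.8, "test": 0.1, "validation": 0.1}, is_random=True)
```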
Training Preparation
prepare_for_model_session(dataset, splits, output_columns)
High-level entry point used by the job system before training. Handles the full split pipeline:

- Determines the split type from `splits["splitType"]` (`"manual"`, `"predefined"`, or `"random"`).
- Calls `split_dataset` with the appropriate index lists.
- Returns `(prepared_dataset, index_map)`, where `index_map` has keys `train_indexes`, `test_indexes`, and `val_indexes`.

For random splits with `stratify=True`, it reads the output column values and passes them as `labels` to `split_indexes`.
Modifying Data — modify_table
modify_table(dataset, columns, types=None)
Used inside converters and models to replace or add columns while preserving all other columns and metadata.
```python
import pyarrow as pa

from DashAI.back.dataloaders.classes.dashai_dataset import modify_table
from DashAI.back.types.value_types import Float

new_col = pa.array([1.0, 2.0, 3.0], type=pa.float64())
result = modify_table(
    dataset,
    {"scaled_age": new_col},
    types={"scaled_age": Float("float64")},
)
```
- Columns in `columns` that already exist in the dataset are replaced.
- New columns require a matching entry in `types`.
- The original Arrow metadata (including `dashai_types`) is preserved and extended.
Relationship to Other Subsystems
```
DataLoader
 └─ loads raw file → DashAIDataset (types inferred by DashAIPtype)
      │
      ├─ Notebook workspace
      │    └─ Converter.fit_transform(DashAIDataset) → DashAIDataset
      │         (modify_table used internally)
      │
      └─ Model training
           └─ prepare_for_model_session → DatasetDict of DashAIDatasets
                └─ select_columns → (X_train, y_train) pair
```
- DataLoaders produce `DashAIDataset` via `to_dashai_dataset`.
- Converters consume and return `DashAIDataset`, using `modify_table` to update columns.
- Models receive a `DatasetDict` from `prepare_for_model_session` and call `get_split` / `select_columns` to extract their inputs.
- Explorers receive a `DashAIDataset` and read `_types` to dispatch column-appropriate visualisations.
Source Files
| File | Role |
|---|---|
| `DashAI/back/dataloaders/classes/dashai_dataset.py` | `DashAIDataset` class and all module-level functions |
| `DashAI/back/types/utils.py` | Arrow ↔ DashAI type serialisation (`save_types_in_arrow_metadata`, `get_types_from_arrow_metadata`) |
| `DashAI/back/types/value_types.py` | Concrete value type classes used in `_types` |
| `DashAI/back/types/categorical.py` | `Categorical` type with `str2int` / `int2str` encoding |