
class DashAIDataset(table: Table, splits: dict = None, *args, **kwargs)[source]

DashAI dataset wrapper for Huggingface datasets with extra metadata.

__init__(table: Table, splits: dict = None, *args, **kwargs)[source]

Initialize a new instance of a DashAI dataset.


table (Table) – Arrow table from which the dataset will be created


add_column(name, column, new_fingerprint[, ...])

Add column to Dataset.

add_elasticsearch_index(column[, ...])

Add a text index using ElasticSearch for fast retrieval.

add_faiss_index(column[, index_name, ...])

Add a dense index using Faiss for fast retrieval.

add_faiss_index_from_external_arrays(...[, ...])

Add a dense index using Faiss for fast retrieval.

add_item(item, new_fingerprint)

Add item to Dataset.

align_labels_with_mapping(label2id, label_column)

Align the dataset's label ID and label name mapping to match an input label2id mapping.
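The remapping idea can be illustrated with a small plain-Python sketch. It does not call the library: `realign_label_ids` is a hypothetical helper, and the label names and ids are made-up sample values. It only shows how existing integer labels are translated so that they follow a new label2id mapping.

```python
def realign_label_ids(label_ids, current_names, label2id):
    """Translate ids from the current naming order to the ids in label2id.

    current_names[i] is the label name currently associated with id i
    (illustrative helper, not part of DashAI or datasets).
    """
    return [label2id[current_names[i]] for i in label_ids]


# Made-up example: the dataset currently encodes labels in this order.
current_names = ["entailment", "neutral", "contradiction"]
# Target mapping, e.g. the one a pretrained model expects.
label2id = {"contradiction": 0, "neutral": 1, "entailment": 2}

realigned = realign_label_ids([0, 1, 2, 0], current_names, label2id)
```

After the call, every row keeps its label name but carries the id that the target mapping assigns to that name.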

batch(batch_size[, drop_last_batch, ...])

Group samples from the dataset into batches.
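As a rough illustration of what grouping into batches means, the plain-Python sketch below splits a list of rows into fixed-size groups. It does not call the library: `batch_rows` is a hypothetical helper, and it assumes that `drop_last_batch` discards a final group that is smaller than `batch_size`.

```python
def batch_rows(rows, batch_size, drop_last_batch=False):
    """Group rows into lists of at most batch_size items (illustrative helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # Assumed behaviour: an incomplete final batch is dropped on request.
    if drop_last_batch and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


rows = [{"id": i} for i in range(5)]
full = batch_rows(rows, 2)                           # last batch has 1 row
trimmed = batch_rows(rows, 2, drop_last_batch=True)  # incomplete batch removed
```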

cast(*args, **kwargs)

Override of the cast method to leave it in DashAI dataset format.

cast_column(column, feature[, new_fingerprint])

Cast column to feature for decoding.


class_encode_column(column[, include_nulls])

Casts the given column as [~datasets.features.ClassLabel] and updates the table.


Clean up all cache files in the dataset cache directory, except for the currently used cache file, if there is one.


Drop the index with the specified column.

filter([function, with_indices, with_rank, ...])

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.

flatten([new_fingerprint, max_depth])

Flatten the table.

flatten_indices([keep_in_memory, ...])

Create and cache a new Dataset by flattening the indices mapping.

formatted_as([type, columns, output_all_columns])

To be used in a with statement.

from_buffer(buffer[, info, split, ...])

Instantiate a Dataset backed by an Arrow buffer.

from_csv(path_or_paths[, split, features, ...])

Create Dataset from CSV file(s).

from_dict(mapping[, features, info, split])

Convert dict to a pyarrow.Table to create a [Dataset].

from_file(filename[, info, split, ...])

Instantiate a Dataset backed by an Arrow table at filename.

from_generator(generator[, features, ...])

Create a Dataset from a generator.

from_json(path_or_paths[, split, features, ...])

Create Dataset from JSON or JSON Lines file(s).

from_list(mapping[, features, info, split])

Convert a list of dicts to a pyarrow.Table to create a [Dataset].

from_pandas(df[, features, info, split, ...])

Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].

from_parquet(path_or_paths[, split, ...])

Create Dataset from Parquet file(s).

from_polars(df[, features, info, split])

Collect the underlying arrow arrays in an Arrow Table.

from_spark(df[, split, features, ...])

Create a Dataset from Spark DataFrame.

from_sql(sql, con[, features, cache_dir, ...])

Create Dataset from SQL query or database table.

from_text(path_or_paths[, split, features, ...])

Create Dataset from text file(s).


List the index_name/identifiers of all the attached indexes.

get_nearest_examples(index_name, query[, k])

Find the nearest examples in the dataset to the query.

get_nearest_examples_batch(index_name, queries)

Find the nearest examples in the dataset to each query.


Returns a new DashAIDataset corresponding to the specified split.


iter(batch_size[, drop_last_batch])

Iterate through the batches of size batch_size.


Return the available splits in the dataset.


List the index_name/identifiers of all the attached indexes.

load_elasticsearch_index(index_name, ...[, ...])

Load an existing text index using ElasticSearch for fast retrieval.

load_faiss_index(index_name, file[, device, ...])

Load a FaissIndex from disk.

load_from_disk(dataset_path[, ...])

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

map([function, with_indices, with_rank, ...])

Apply a function to all the examples in the table (individually or in batches) and update the table.

push_to_hub(repo_id[, config_name, ...])

Pushes the dataset to the hub as a Parquet dataset.


Remove one or several column(s) in the dataset and the features associated to them.

rename_column(original_column_name, ...[, ...])

Rename a column in the dataset, and move the features associated to the original column under the new column name.

rename_columns(column_mapping[, new_fingerprint])

Rename several columns in the dataset, and move the features associated to the original columns under the new column names.


Reset __getitem__ return format to python objects and all columns.

sample([n, method, seed])

Return sample rows from dataset.

save_faiss_index(index_name, file[, ...])

Save a FaissIndex on disk.


Overrides the default save_to_disk method to save the dataset as a single directory with:
- "data.arrow": the dataset's Arrow table.
- "splits.json": the dataset's splits (e.g., original split indices).
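The directory layout described above can be sketched with the standard library alone. This is only an illustration of the two-file structure, not the DashAI implementation: the keys inside `splits` are hypothetical, and an empty placeholder file stands in for the real Arrow table.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical split metadata; the real keys used by DashAI may differ.
splits = {"train": [0, 1, 2], "test": [3, 4]}

with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "my_dataset"
    dataset_dir.mkdir()
    # "data.arrow" would hold the Arrow table; an empty placeholder is used here.
    (dataset_dir / "data.arrow").touch()
    # "splits.json" stores the split information as JSON.
    (dataset_dir / "splits.json").write_text(json.dumps(splits))

    saved_files = sorted(p.name for p in dataset_dir.iterdir())
    restored = json.loads((dataset_dir / "splits.json").read_text())
```

Keeping the splits in a separate JSON file means they can be inspected or restored without reading the Arrow data.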

search(index_name, query[, k])

Find the indices of the nearest examples in the dataset to the query.

search_batch(index_name, queries[, k])

Find the indices of the nearest examples in the dataset to each query.

select(indices[, keep_in_memory, ...])

Create a new dataset with rows selected following the list/array of indices.

select_columns(column_names[, new_fingerprint])

Select one or several column(s) in the dataset and the features associated to them.

set_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

set_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.

shard(num_shards, index[, contiguous, ...])

Return the index-nth shard from dataset split into num_shards pieces.
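To make "the index-nth shard" concrete, the sketch below computes which row indices would fall into a given shard under two common strategies: contiguous blocks and strided selection. It is a plain-Python illustration rather than the library code; `shard_indices` is a hypothetical helper, and the exact way leftover rows are distributed is an assumption.

```python
def shard_indices(num_rows, num_shards, index, contiguous=True):
    """Return the row indices that belong to shard `index` (illustrative helper)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        # Consecutive blocks; the first `extra` shards receive one more row.
        base, extra = divmod(num_rows, num_shards)
        start = base * index + min(index, extra)
        stop = start + base + (1 if index < extra else 0)
        return list(range(start, stop))
    # Strided: every num_shards-th row, starting at `index`.
    return list(range(index, num_rows, num_shards))


contiguous_shard = shard_indices(10, 3, 0)
strided_shard = shard_indices(10, 3, 1, contiguous=False)
# Every row lands in exactly one shard, whichever strategy is used.
all_rows = sorted(i for s in range(3) for i in shard_indices(10, 3, s))
```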

shuffle([seed, generator, keep_in_memory, ...])

Create a new Dataset where the rows are shuffled.


Create a new [Dataset] that skips the first n elements.

sort(column_names[, reverse, ...])

Create a new dataset sorted according to a single or multiple columns.


Create a new [Dataset] with only the first n elements.

to_csv(path_or_buf[, batch_size, num_proc, ...])

Exports the dataset to CSV.


Returns the dataset as a Python dict.


Get a [datasets.IterableDataset] from a map-style [datasets.Dataset].

to_json(path_or_buf[, batch_size, num_proc, ...])

Export the dataset to JSON Lines or JSON.


Returns the dataset as a Python list.

to_pandas([batch_size, batched])

Returns the dataset as a pandas.DataFrame.

to_parquet(path_or_buf[, batch_size, ...])

Exports the dataset to Parquet.

to_polars([batch_size, batched, ...])

Returns the dataset as a polars.DataFrame.

to_sql(name, con[, batch_size])

Exports the dataset to a SQL database.

to_tf_dataset([batch_size, columns, ...])

Create a tf.data.Dataset from the underlying Dataset.

train_test_split([test_size, train_size, ...])

Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits).
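The idea of a random train/test split can be sketched in plain Python as follows. This is not the library's implementation: `split_indices` is a hypothetical helper, and rounding the test size up with `math.ceil` is an assumption made for the example.

```python
import math
import random


def split_indices(num_rows, test_size=0.25, seed=None):
    """Shuffle row indices and cut them into train/test lists (illustrative helper)."""
    indices = list(range(num_rows))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(num_rows * test_size)  # assumed rounding rule
    return {"train": indices[n_test:], "test": indices[:n_test]}


parts = split_indices(10, test_size=0.2, seed=0)
```

The two lists are disjoint and together cover every row, which is the property a train/test split has to guarantee.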


Return a list of the unique elements in a column.

with_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

with_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.



Provides a clean way to access the underlying PyArrow table.



The cache files containing the Apache Arrow table backing the dataset.



Names of the columns in the dataset.



The Apache Arrow table backing the dataset.









[~datasets.DatasetInfo] object containing all the metadata in the dataset.



Number of columns in the dataset.


Number of rows in the dataset (same as [Dataset.__len__]).


Shape of the dataset (number of rows, number of columns).



[~datasets.NamedSplit] object corresponding to a named dataset split.

