DashAI.back.dataloaders.classes.dashai_dataset.DashAIDataset

class DashAIDataset(table: Table, splits: dict = None, *args, **kwargs)[source]

DashAI dataset wrapper for Hugging Face datasets with extra metadata.

__init__(table: Table, splits: dict = None, *args, **kwargs)[source]

Initialize a new instance of a DashAI dataset.

Parameters:

table (Table) – Arrow table from which the dataset will be created.

splits (dict, optional) – Dictionary with the dataset's split information (e.g., the original split indices).
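
A minimal sketch of building a dataset from an in-memory Arrow table. The import path comes from this page; the splits layout used here (row indices keyed by split name) is an assumption based on the save_to_disk description below:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    # Any pyarrow.Table works as the backing table.
    table = pa.table(
        {
            "sepal_length": [5.1, 4.9, 6.3, 5.8],
            "species": ["setosa", "setosa", "virginica", "virginica"],
        }
    )

    # Assumed splits layout: row indices keyed by split name.
    splits = {"train": [0, 1, 2], "test": [3]}

    dataset = DashAIDataset(table, splits=splits)
    print(dataset.num_rows, dataset.column_names)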

Methods

__init__(table[, splits])

Initialize a new instance of a DashAI dataset.

add_column(name, column, new_fingerprint[, ...])

Add column to Dataset.

add_elasticsearch_index(column[, ...])

Add a text index using ElasticSearch for fast retrieval.

add_faiss_index(column[, index_name, ...])

Add a dense index using Faiss for fast retrieval.

add_faiss_index_from_external_arrays(...[, ...])

Add a dense index using Faiss, built from external arrays, for fast retrieval.

add_item(item, new_fingerprint)

Add item to Dataset.

align_labels_with_mapping(label2id, label_column)

Align the dataset's label ID and label name mapping to match an input label2id mapping.

batch(batch_size[, drop_last_batch, ...])

Group samples from the dataset into batches.

cast(*args, **kwargs)

Override of the cast method so that the result stays in DashAI dataset format.

cast_column(column, feature[, new_fingerprint])

Cast column to feature for decoding.

change_columns_type(column_types)

Change the type of some columns.
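
A hedged sketch of change_columns_type. The mapping of column names to type labels, the accepted type names (here "Categorical"), and whether the call returns the updated dataset are assumptions; consult the method's full documentation for the exact accepted values:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"sepal_length": [5.1, 4.9, 6.3], "species": ["setosa", "setosa", "virginica"]}),
        splits={"train": [0, 1], "test": [2]},  # assumed splits layout
    )

    # Assumed: the argument maps column names to type labels and the call
    # returns the updated dataset.
    dataset = dataset.change_columns_type({"species": "Categorical"})
    print(dataset.features)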

class_encode_column(column[, include_nulls])

Casts the given column as [datasets.features.ClassLabel] and updates the table.

cleanup_cache_files()

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.

drop_index(index_name)

Drop the index with the specified index_name.

filter([function, with_indices, with_rank, ...])

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.

flatten([new_fingerprint, max_depth])

Flatten the table.

flatten_indices([keep_in_memory, ...])

Create and cache a new Dataset by flattening the indices mapping.

formatted_as([type, columns, output_all_columns])

Context manager to temporarily set the __getitem__ return format (to be used in a with statement).

from_buffer(buffer[, info, split, ...])

Instantiate a Dataset backed by an Arrow buffer.

from_csv(path_or_paths[, split, features, ...])

Create Dataset from CSV file(s).

from_dict(mapping[, features, info, split])

Convert dict to a pyarrow.Table to create a [Dataset].

from_file(filename[, info, split, ...])

Instantiate a Dataset backed by an Arrow table at filename.

from_generator(generator[, features, ...])

Create a Dataset from a generator.

from_json(path_or_paths[, split, features, ...])

Create Dataset from JSON or JSON Lines file(s).

from_list(mapping[, features, info, split])

Convert a list of dicts to a pyarrow.Table to create a [Dataset].

from_pandas(df[, features, info, split, ...])

Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].

from_parquet(path_or_paths[, split, ...])

Create Dataset from Parquet file(s).

from_polars(df[, features, info, split])

Convert polars.DataFrame to a pyarrow.Table to create a [Dataset].

from_spark(df[, split, features, ...])

Create a Dataset from Spark DataFrame.

from_sql(sql, con[, features, cache_dir, ...])

Create Dataset from SQL query or database table.

from_text(path_or_paths[, split, features, ...])

Create Dataset from text file(s).

get_index(index_name)

Return the index with the specified index_name.

get_nearest_examples(index_name, query[, k])

Find the nearest examples in the dataset to the query.

get_nearest_examples_batch(index_name, queries)

Find the nearest examples in the dataset to each of the queries.

get_split(split_name)

Returns a new DashAIDataset corresponding to the specified split.

is_index_initialized(index_name)

Check whether the index with the specified index_name has been initialized.

iter(batch_size[, drop_last_batch])

Iterate through the batches of size batch_size.

keys()

Return the available splits in the dataset.
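
A short sketch combining keys() and get_split(), assuming the dataset was built with the splits layout shown in the constructor example above:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "a", "b", "b"]}),
        splits={"train": [0, 1, 2], "test": [3]},  # assumed splits layout
    )

    # keys() lists the available split names.
    print(list(dataset.keys()))  # e.g. ['train', 'test']

    # get_split() returns a new DashAIDataset restricted to one split.
    train_ds = dataset.get_split("train")
    print(train_ds.num_rows)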

list_indexes()

List the columns/identifiers of all the attached indexes.

load_elasticsearch_index(index_name, ...[, ...])

Load an existing text index using ElasticSearch for fast retrieval.

load_faiss_index(index_name, file[, device, ...])

Load a FaissIndex from disk.

load_from_disk(dataset_path[, ...])

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

map([function, with_indices, with_rank, ...])

Apply a function to all the examples in the table (individually or in batches) and update the table.

push_to_hub(repo_id[, config_name, ...])

Pushes the dataset to the hub as a Parquet dataset.

remove_columns(column_names)

Remove one or several column(s) from the dataset and the features associated with them.

rename_column(original_column_name, ...[, ...])

Rename a column in the dataset, and move the features associated with the original column under the new column name.

rename_columns(column_mapping[, new_fingerprint])

Rename several columns in the dataset, and move the features associated with the original columns under the new column names.

reset_format()

Reset the __getitem__ return format to Python objects and all columns.

sample([n, method, seed])

Return sample rows from the dataset.
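
A hedged sketch of sample(). The parameter names match the signature above, but the accepted method values (here "random") and the return format are assumptions:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "a", "b", "b"]}),
        splits={"train": [0, 1, 2], "test": [3]},  # assumed splits layout
    )

    # Draw a few rows for quick inspection; "random" as a method value is an assumption.
    rows = dataset.sample(n=2, method="random", seed=42)
    print(rows)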

save_faiss_index(index_name, file[, ...])

Save a FaissIndex on disk.

save_to_disk(dataset_path)

Overrides the default save_to_disk method to save the dataset as a single directory containing:

- "data.arrow": the dataset's Arrow table.
- "splits.json": the dataset's splits (e.g., original split indices).
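
A minimal sketch of persisting a dataset with save_to_disk. The inherited load_from_disk listed above is documented for the standard datasets directory layout; whether it reads this single-directory format back is not stated here, so loading is left out of the sketch:

    import os
    import tempfile

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0], "label": ["a", "b"]}),
        splits={"train": [0], "test": [1]},  # assumed splits layout
    )

    # Assumes save_to_disk creates the target directory if it does not exist.
    save_dir = os.path.join(tempfile.mkdtemp(), "my_dashai_dataset")
    dataset.save_to_disk(save_dir)

    # The directory should contain the files described above.
    print(sorted(os.listdir(save_dir)))  # expected: ['data.arrow', 'splits.json']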

search(index_name, query[, k])

Find the indices of the nearest examples in the dataset to the query.

search_batch(index_name, queries[, k])

Find the indices of the nearest examples in the dataset to each of the queries.

select(indices[, keep_in_memory, ...])

Create a new dataset with rows selected following the list/array of indices.

select_columns(column_names[, new_fingerprint])

Select one or several column(s) in the dataset and the features associated with them.

set_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

set_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.

shard(num_shards, index[, contiguous, ...])

Return the index-th shard of the dataset split into num_shards pieces.

shuffle([seed, generator, keep_in_memory, ...])

Create a new Dataset where the rows are shuffled.

skip(n)

Create a new [Dataset] that skips the first n elements.

sort(column_names[, reverse, ...])

Create a new dataset sorted according to a single or multiple columns.

take(n)

Create a new [Dataset] with only the first n elements.

to_csv(path_or_buf[, batch_size, num_proc, ...])

Exports the dataset to CSV.

to_dict([batch_size])

Returns the dataset as a Python dict.

to_iterable_dataset([num_shards])

Get an [datasets.IterableDataset] from a map-style [datasets.Dataset].

to_json(path_or_buf[, batch_size, num_proc, ...])

Export the dataset to JSON Lines or JSON.

to_list()

Returns the dataset as a Python list.

to_pandas([batch_size, batched])

Returns the dataset as a pandas.DataFrame.

to_parquet(path_or_buf[, batch_size, ...])

Exports the dataset to Parquet.

to_polars([batch_size, batched, ...])

Returns the dataset as a polars.DataFrame.

to_sql(name, con[, batch_size])

Exports the dataset to a SQL database.

to_tf_dataset([batch_size, columns, ...])

Create a tf.data.Dataset from the underlying Dataset.

train_test_split([test_size, train_size, ...])

Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits).

unique(column)

Return a list of the unique elements in a column.

with_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

with_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.

Attributes

arrow_table

Provides a clean way to access the underlying PyArrow table.

builder_name

cache_files

The cache files containing the Apache Arrow table backing the dataset.

citation

column_names

Names of the columns in the dataset.

config_name

data

The Apache Arrow table backing the dataset.

dataset_size

description

download_checksums

download_size

features

format

homepage

info

[datasets.DatasetInfo] object containing all the metadata in the dataset.

license

num_columns

Number of columns in the dataset.

num_rows

Number of rows in the dataset (same as [Dataset.__len__]).

shape

Shape of the dataset (number of rows, number of columns).

size_in_bytes

split

[datasets.NamedSplit] object corresponding to a named dataset split.

supervised_keys

version
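
A short sketch showing the most commonly used attributes, including the DashAI-specific arrow_table accessor; the dataset construction follows the assumptions from the constructor example at the top of this page:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]}),
        splits={"train": [0, 1], "test": [2]},  # assumed splits layout
    )

    print(dataset.num_rows)       # number of rows
    print(dataset.num_columns)    # number of columns
    print(dataset.column_names)   # column names
    print(dataset.features)       # column types (datasets.Features)
    print(dataset.arrow_table)    # underlying pyarrow.Table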