DashAI.back.dataloaders.classes.dashai_dataset.DashAIDataset
class DashAIDataset(table: Table, splits: dict | None = None, *args, **kwargs)[source]

DashAI dataset wrapper for Huggingface datasets with extra metadata.

__init__(table: Table, splits: dict | None = None, *args, **kwargs)[source]

Initialize a new instance of a DashAI dataset.

Parameters:

- table (Table) – Arrow table from which the dataset will be created.
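A minimal construction sketch (not taken from the DashAI docs themselves): it assumes only that the class is importable from the module path above and that any in-memory pyarrow.Table is accepted, as the signature suggests; the column names and values are illustrative.

```python
import pyarrow as pa

from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

# Any in-memory pyarrow.Table can back the dataset; these columns are made up.
table = pa.table(
    {
        "sepal_length": [5.1, 4.9, 6.3, 5.8],
        "species": ["setosa", "setosa", "virginica", "virginica"],
    }
)

# `splits` is optional extra metadata; None keeps the default behaviour.
dataset = DashAIDataset(table, splits=None)

print(dataset.num_rows)      # 4
print(dataset.column_names)  # ['sepal_length', 'species']
```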
 
Methods

- __init__(table[, splits]) – Initialize a new instance of a DashAI dataset.
- add_column(name, column, new_fingerprint[, ...]) – Add column to Dataset.
- add_elasticsearch_index(column[, ...]) – Add a text index using ElasticSearch for fast retrieval.
- add_faiss_index(column[, index_name, ...]) – Add a dense index using Faiss for fast retrieval.
- add_faiss_index_from_external_arrays(...[, ...]) – Add a dense index using Faiss for fast retrieval.
- add_item(item, new_fingerprint) – Add item to Dataset.
- align_labels_with_mapping(label2id, label_column) – Align the dataset's label ID and label name mapping to match an input label2id mapping.
- batch(batch_size[, drop_last_batch, ...]) – Group samples from the dataset into batches.
- cast(*args, **kwargs) – Override of the cast method that keeps the result in DashAI dataset format.
- cast_column(column, feature[, new_fingerprint]) – Cast column to feature for decoding.
- change_columns_type(column_types) – Change the type of some columns.
- class_encode_column(column[, include_nulls]) – Casts the given column as [~datasets.features.ClassLabel] and updates the table.
- cleanup_cache_files() – Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.
- drop_index(index_name) – Drop the index with the specified column.
- filter([function, with_indices, with_rank, ...]) – Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
- flatten([new_fingerprint, max_depth]) – Flatten the table.
- flatten_indices([keep_in_memory, ...]) – Create and cache a new Dataset by flattening the indices mapping.
- formatted_as([type, columns, output_all_columns]) – To be used in a with statement.
- from_buffer(buffer[, info, split, ...]) – Instantiate a Dataset backed by an Arrow buffer.
- from_csv(path_or_paths[, split, features, ...]) – Create Dataset from CSV file(s).
- from_dict(mapping[, features, info, split]) – Convert dict to a pyarrow.Table to create a [Dataset].
- from_file(filename[, info, split, ...]) – Instantiate a Dataset backed by an Arrow table at filename.
- from_generator(generator[, features, ...]) – Create a Dataset from a generator.
- from_json(path_or_paths[, split, features, ...]) – Create Dataset from JSON or JSON Lines file(s).
- from_list(mapping[, features, info, split]) – Convert a list of dicts to a pyarrow.Table to create a [Dataset].
- from_pandas(df[, features, info, split, ...]) – Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].
- from_parquet(path_or_paths[, split, ...]) – Create Dataset from Parquet file(s).
- from_polars(df[, features, info, split]) – Collect the underlying arrow arrays in an Arrow Table.
- from_spark(df[, split, features, ...]) – Create a Dataset from Spark DataFrame.
- from_sql(sql, con[, features, cache_dir, ...]) – Create Dataset from SQL query or database table.
- from_text(path_or_paths[, split, features, ...]) – Create Dataset from text file(s).
- get_index(index_name) – List the index_name/identifiers of all the attached indexes.
- get_nearest_examples(index_name, query[, k]) – Find the nearest examples in the dataset to the query.
- get_nearest_examples_batch(index_name, queries) – Find the nearest examples in the dataset to the query.
- get_split(split_name) – Returns a new DashAIDataset corresponding to the specified split (see the usage sketch after the attributes list).
- is_index_initialized(index_name)
- iter(batch_size[, drop_last_batch]) – Iterate through the batches of size batch_size.
- keys() – Return the available splits in the dataset.
- list_indexes() – List the columns/identifiers of all the attached indexes.
- load_elasticsearch_index(index_name, ...[, ...]) – Load an existing text index using ElasticSearch for fast retrieval.
- load_faiss_index(index_name, file[, device, ...]) – Load a FaissIndex from disk.
- load_from_disk(dataset_path[, ...]) – Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
- map([function, with_indices, with_rank, ...]) – Apply a function to all the examples in the table (individually or in batches) and update the table.
- nan_per_column() – Calculate the number of NaN values per column in the dataset and add it to the metadata under the 'nan' key.
- push_to_hub(repo_id[, config_name, ...]) – Pushes the dataset to the hub as a Parquet dataset.
- remove_columns(column_names) – Remove one or several column(s) in the dataset and the features associated to them.
- rename_column(original_column_name, ...[, ...]) – Rename a column in the dataset, and move the features associated to the original column under the new column name.
- rename_columns(column_mapping[, new_fingerprint]) – Rename several columns in the dataset, and move the features associated to the original columns under the new column names.
- repeat(num_times) – Create a new [Dataset] that repeats the underlying dataset num_times times.
- reset_format() – Reset __getitem__ return format to python objects and all columns.
- sample([n, method, seed]) – Return sample rows from dataset.
- save_faiss_index(index_name, file[, ...]) – Save a FaissIndex on disk.
- save_to_disk(dataset_path[, max_shard_size, ...]) – Saves a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
- search(index_name, query[, k]) – Find the nearest examples indices in the dataset to the query.
- search_batch(index_name, queries[, k]) – Find the nearest examples indices in the dataset to the query.
- select(indices[, keep_in_memory, ...]) – Create a new dataset with rows selected following the list/array of indices.
- select_columns(column_names[, new_fingerprint]) – Select one or several column(s) in the dataset and the features associated to them.
- set_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
- set_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.
- shard(num_shards, index[, contiguous, ...]) – Return the index-nth shard from dataset split into num_shards pieces.
- shuffle([seed, generator, keep_in_memory, ...]) – Create a new Dataset where the rows are shuffled.
- skip(n) – Create a new [Dataset] that skips the first n elements.
- sort(column_names[, reverse, ...]) – Create a new dataset sorted according to a single or multiple columns.
- take(n) – Create a new [Dataset] with only the first n elements.
- to_csv(path_or_buf[, batch_size, num_proc, ...]) – Exports the dataset to CSV.
- to_dict([batch_size, batched]) – Returns the dataset as a Python dict.
- to_iterable_dataset([num_shards]) – Get an [datasets.IterableDataset] from a map-style [datasets.Dataset].
- to_json(path_or_buf[, batch_size, num_proc, ...]) – Export the dataset to JSON Lines or JSON.
- to_list() – Returns the dataset as a Python list.
- to_pandas([batch_size, batched]) – Returns the dataset as a pandas.DataFrame.
- to_parquet(path_or_buf[, batch_size, ...]) – Exports the dataset to Parquet.
- to_polars([batch_size, batched, ...]) – Returns the dataset as a polars.DataFrame.
- to_sql(name, con[, batch_size]) – Exports the dataset to a SQL database.
- to_tf_dataset([batch_size, columns, ...]) – Create a tf.data.Dataset from the underlying Dataset.
- train_test_split([test_size, train_size, ...]) – Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits).
- unique(column) – Return a list of the unique elements in a column.
- with_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
- with_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.

Attributes

- arrow_table – Provides a clean way to access the underlying PyArrow table.
- builder_name
- cache_files – The cache files containing the Apache Arrow table backing the dataset.
- citation
- column_names – Names of the columns in the dataset.
- config_name
- data – The Apache Arrow table backing the dataset.
- dataset_size
- description
- download_checksums
- download_size
- features
- format
- homepage
- info – [~datasets.DatasetInfo] object containing all the metadata in the dataset.
- license
- num_columns – Number of columns in the dataset.
- num_rows – Number of rows in the dataset (same as [Dataset.__len__]).
- shape – Shape of the dataset (number of columns, number of rows).
- size_in_bytes
- split – [~datasets.NamedSplit] object corresponding to a named dataset split.
- supervised_keys
- version
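The sketch below exercises the DashAI-specific helpers from the methods list (keys, get_split, sample, nan_per_column) together with the inherited save_to_disk. It is a hedged example rather than DashAI's documented usage: it assumes `dataset` is a DashAIDataset whose split metadata already names a "train" split, and that the keyword arguments passed to sample follow the signature shown above; exact return types are not specified on this page.

```python
# Assumes `dataset` is a DashAIDataset created with split metadata that
# includes a "train" split (the structure of that metadata is DashAI-specific).

# Splits recorded in the dataset's metadata.
print(dataset.keys())  # e.g. ['train', 'test'] (illustrative)

# Materialize one split as a new DashAIDataset.
train_split = dataset.get_split("train")

# Draw a couple of example rows; keyword names follow the documented
# signature sample([n, method, seed]), but accepted `method` values are
# not listed on this page, so the default is used.
preview = dataset.sample(n=2, seed=42)

# Count missing values per column; per the description above, the result is
# also stored in the dataset metadata under the 'nan' key.
dataset.nan_per_column()

# Persisting the dataset inherits the Hugging Face datasets behaviour.
dataset.save_to_disk("./my_dashai_dataset")
```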