DashAI.back.dataloaders.classes.dashai_dataset.DashAIDataset
- class DashAIDataset(table: Table, splits: dict | None = None, *args, **kwargs)[source]
DashAI dataset wrapper for Hugging Face datasets with extra metadata.
- __init__(table: Table, splits: dict | None = None, *args, **kwargs)[source]
Initialize a new instance of a DashAI dataset.
- Parameters:
table (Table) – Arrow table from which the dataset will be created.
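Example – a minimal construction sketch based on the signature above (the column names and values are illustrative, not part of the API):

import pyarrow as pa

from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

# Any pyarrow.Table can back the dataset.
table = pa.table({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})

# Wrap the Arrow table. `splits` is optional and omitted here, since its
# expected schema is not documented in this reference.
dataset = DashAIDataset(table)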
Methods
__init__(table[, splits]) – Initialize a new instance of a DashAI dataset.
add_column(name, column, new_fingerprint[, ...]) – Add column to Dataset.
add_elasticsearch_index(column[, ...]) – Add a text index using ElasticSearch for fast retrieval.
add_faiss_index(column[, index_name, ...]) – Add a dense index using Faiss for fast retrieval.
add_faiss_index_from_external_arrays(...[, ...]) – Add a dense index using Faiss for fast retrieval.
add_item(item, new_fingerprint) – Add item to Dataset.
align_labels_with_mapping(label2id, label_column) – Align the dataset's label ID and label name mapping to match an input label2id mapping.
batch(batch_size[, drop_last_batch, ...]) – Group samples from the dataset into batches.
cast(*args, **kwargs) – Override of the cast method that keeps the result in DashAI dataset format.
cast_column(column, feature[, new_fingerprint]) – Cast column to feature for decoding.
change_columns_type(column_types) – Change the type of some columns.
class_encode_column(column[, include_nulls]) – Cast the given column as [~datasets.features.ClassLabel] and update the table.
cleanup_cache_files() – Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.
drop_index(index_name) – Drop the index with the specified index_name.
filter([function, with_indices, with_rank, ...]) – Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
flatten([new_fingerprint, max_depth]) – Flatten the table.
flatten_indices([keep_in_memory, ...]) – Create and cache a new Dataset by flattening the indices mapping.
formatted_as([type, columns, output_all_columns]) – To be used in a with statement.
from_buffer(buffer[, info, split, ...]) – Instantiate a Dataset backed by an Arrow buffer.
from_csv(path_or_paths[, split, features, ...]) – Create Dataset from CSV file(s).
from_dict(mapping[, features, info, split]) – Convert dict to a pyarrow.Table to create a [Dataset].
from_file(filename[, info, split, ...]) – Instantiate a Dataset backed by an Arrow table at filename.
from_generator(generator[, features, ...]) – Create a Dataset from a generator.
from_json(path_or_paths[, split, features, ...]) – Create Dataset from JSON or JSON Lines file(s).
from_list(mapping[, features, info, split]) – Convert a list of dicts to a pyarrow.Table to create a [Dataset].
from_pandas(df[, features, info, split, ...]) – Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].
from_parquet(path_or_paths[, split, ...]) – Create Dataset from Parquet file(s).
from_polars(df[, features, info, split]) – Create Dataset from a polars.DataFrame.
from_spark(df[, split, features, ...]) – Create a Dataset from Spark DataFrame.
from_sql(sql, con[, features, cache_dir, ...]) – Create Dataset from SQL query or database table.
from_text(path_or_paths[, split, features, ...]) – Create Dataset from text file(s).
get_index(index_name) – Return the index with the specified index_name.
get_nearest_examples(index_name, query[, k]) – Find the nearest examples in the dataset to the query.
get_nearest_examples_batch(index_name, queries) – Find the nearest examples in the dataset to each query in the batch.
get_split(split_name) – Return a new DashAIDataset corresponding to the specified split (see the usage sketch after this table).
is_index_initialized(index_name)
iter(batch_size[, drop_last_batch]) – Iterate through the batches of size batch_size.
keys() – Return the available splits in the dataset.
list_indexes() – List the columns/identifiers of all the attached indexes.
load_elasticsearch_index(index_name, ...[, ...]) – Load an existing text index using ElasticSearch for fast retrieval.
load_faiss_index(index_name, file[, device, ...]) – Load a FaissIndex from disk.
load_from_disk(dataset_path[, ...]) – Load a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
map([function, with_indices, with_rank, ...]) – Apply a function to all the examples in the table (individually or in batches) and update the table.
nan_per_column() – Calculate the number of NaN values per column in the dataset and add it to the metadata under the 'nan' key.
push_to_hub(repo_id[, config_name, ...]) – Push the dataset to the hub as a Parquet dataset.
remove_columns(column_names) – Remove one or several column(s) in the dataset and the features associated to them.
rename_column(original_column_name, ...[, ...]) – Rename a column in the dataset, and move the features associated to the original column under the new column name.
rename_columns(column_mapping[, new_fingerprint]) – Rename several columns in the dataset, and move the features associated to the original columns under the new column names.
repeat(num_times) – Create a new [Dataset] that repeats the underlying dataset num_times times.
reset_format() – Reset __getitem__ return format to python objects and all columns.
sample([n, method, seed]) – Return sample rows from the dataset.
save_faiss_index(index_name, file[, ...]) – Save a FaissIndex on disk.
save_to_disk(dataset_path[, max_shard_size, ...]) – Save a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
search(index_name, query[, k]) – Find the nearest example indices in the dataset to the query.
search_batch(index_name, queries[, k]) – Find the nearest example indices in the dataset to each query in the batch.
select(indices[, keep_in_memory, ...]) – Create a new dataset with rows selected following the list/array of indices.
select_columns(column_names[, new_fingerprint]) – Select one or several column(s) in the dataset and the features associated to them.
set_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
set_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.
shard(num_shards, index[, contiguous, ...]) – Return the index-nth shard from the dataset split into num_shards pieces.
shuffle([seed, generator, keep_in_memory, ...]) – Create a new Dataset where the rows are shuffled.
skip(n) – Create a new [Dataset] that skips the first n elements.
sort(column_names[, reverse, ...]) – Create a new dataset sorted according to a single or multiple columns.
take(n) – Create a new [Dataset] with only the first n elements.
to_csv(path_or_buf[, batch_size, num_proc, ...]) – Export the dataset to CSV.
to_dict([batch_size, batched]) – Return the dataset as a Python dict.
to_iterable_dataset([num_shards]) – Get a [datasets.IterableDataset] from a map-style [datasets.Dataset].
to_json(path_or_buf[, batch_size, num_proc, ...]) – Export the dataset to JSON Lines or JSON.
to_list() – Return the dataset as a Python list.
to_pandas([batch_size, batched]) – Return the dataset as a pandas.DataFrame.
to_parquet(path_or_buf[, batch_size, ...]) – Export the dataset to Parquet.
to_polars([batch_size, batched, ...]) – Return the dataset as a polars.DataFrame.
to_sql(name, con[, batch_size]) – Export the dataset to a SQL database.
to_tf_dataset([batch_size, columns, ...]) – Create a tf.data.Dataset from the underlying Dataset.
train_test_split([test_size, train_size, ...]) – Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits).
unique(column) – Return a list of the unique elements in a column.
with_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
with_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.
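The DashAI-specific helpers (keys, get_split, sample, nan_per_column) complement the inherited Hugging Face API. A usage sketch, assuming a dataset whose splits include a "train" split (the split name and path below are illustrative):

# List the available splits and materialize one of them.
available = dataset.keys()            # names of the dataset's splits
train = dataset.get_split("train")    # a new DashAIDataset for that split

# Draw a few rows and record missing values in the metadata.
rows = train.sample(n=5, seed=42)     # keyword names follow the signature above
train.nan_per_column()                # NaN counts stored under the 'nan' metadata key

# Persistence works as for any Hugging Face dataset.
train.save_to_disk("path/to/train_split")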
Attributes
arrow_table – Provides a clean way to access the underlying PyArrow table.
builder_name
cache_files – The cache files containing the Apache Arrow table backing the dataset.
citation
column_names – Names of the columns in the dataset.
config_name
data – The Apache Arrow table backing the dataset.
dataset_size
description
download_checksums
download_size
features
format
homepage
info – [~datasets.DatasetInfo] object containing all the metadata in the dataset.
license
num_columns – Number of columns in the dataset.
num_rows – Number of rows in the dataset (same as [Dataset.__len__]).
shape – Shape of the dataset (number of rows, number of columns).
size_in_bytes
split – [~datasets.NamedSplit] object corresponding to a named dataset split.
supervised_keys
version
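Most attributes are inherited from datasets.Dataset; arrow_table is the DashAI-specific accessor for the backing PyArrow table. Continuing the construction sketch above:

print(dataset.num_rows, dataset.num_columns)  # row and column counts
print(dataset.column_names)                   # names of the columns
print(dataset.shape)                          # (num_rows, num_columns)

# DashAI-specific: clean access to the underlying pyarrow.Table.
pa_table = dataset.arrow_table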