DashAI.back.dataloaders.classes.dashai_dataset.DashAIDataset
- class DashAIDataset(table: Table, splits: dict = None, *args, **kwargs)[source]
DashAI dataset wrapper for Hugging Face datasets with extra metadata.
- __init__(table: Table, splits: dict = None, *args, **kwargs)[source]
Initialize a new instance of a DashAI dataset.
- Parameters:
table (Table) – Arrow table from which the dataset will be created.
splits (dict, optional) – The dataset's splits (e.g., original split indices), as persisted to "splits.json" by save_to_disk.
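A minimal construction sketch; the column names, values, and the assumed shape of the splits dict are illustrative, not prescribed by this page:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    # Illustrative columns and values.
    table = pa.table({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

    # The splits dict is assumed to map split names to row indices,
    # mirroring what save_to_disk persists in "splits.json".
    dataset = DashAIDataset(table, splits={"train": [0, 1, 2], "test": [3]})
    print(dataset.num_rows, dataset.column_names)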
Methods
__init__(table[, splits]) – Initialize a new instance of a DashAI dataset.
add_column(name, column, new_fingerprint[, ...]) – Add column to Dataset.
add_elasticsearch_index(column[, ...]) – Add a text index using ElasticSearch for fast retrieval.
add_faiss_index(column[, index_name, ...]) – Add a dense index using Faiss for fast retrieval.
add_faiss_index_from_external_arrays(...[, ...]) – Add a dense index using Faiss for fast retrieval.
add_item(item, new_fingerprint) – Add item to Dataset.
align_labels_with_mapping(label2id, label_column) – Align the dataset's label ID and label name mapping to match an input label2id mapping.
batch(batch_size[, drop_last_batch, ...]) – Group samples from the dataset into batches.
cast(*args, **kwargs) – Override of the cast method that keeps the result in DashAI dataset format.
cast_column(column, feature[, new_fingerprint]) – Cast column to feature for decoding.
change_columns_type(column_types) – Change the type of some columns; see the sketch below.
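A hedged usage sketch for change_columns_type, continuing from the construction example above; the type strings accepted in column_types are assumptions, not documented here:

    # Hypothetical mapping of column names to DashAI type names; consult the
    # DashAI docs for the actual supported type strings. Assumes the method
    # returns the updated dataset rather than mutating in place.
    dataset = dataset.change_columns_type({"age": "Numerical", "city": "Categorical"})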
class_encode_column(column[, include_nulls]) – Casts the given column as [~datasets.features.ClassLabel] and updates the table.
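Example (inherited from datasets.Dataset; the "label" column is illustrative):

    # Cast the string "label" column to ClassLabel, mapping values to integer ids.
    dataset = dataset.class_encode_column("label")
    print(dataset.features["label"].names)  # the learned label vocabulary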
cleanup_cache_files() – Clean up all cache files in the dataset cache directory, except the currently used cache file, if there is one.
drop_index(index_name) – Drop the index with the specified name.
filter([function, with_indices, with_rank, ...]) – Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
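Example of filter usage (inherited behavior from datasets.Dataset; the "label" column is illustrative):

    # Keep only rows whose label equals 1.
    positives = dataset.filter(lambda example: example["label"] == 1)

    # Batched mode receives dicts of lists and is usually faster on large data.
    positives = dataset.filter(
        lambda batch: [label == 1 for label in batch["label"]],
        batched=True,
    )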
flatten([new_fingerprint, max_depth]) – Flatten the table.
flatten_indices([keep_in_memory, ...]) – Create and cache a new Dataset by flattening the indices mapping.
formatted_as([type, columns, output_all_columns]) – To be used in a with statement.
from_buffer(buffer[, info, split, ...]) – Instantiate a Dataset backed by an Arrow buffer.
from_csv(path_or_paths[, split, features, ...]) – Create Dataset from CSV file(s).
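Example of from_csv (the paths are illustrative; whether the inherited constructor returns a plain datasets.Dataset or a DashAIDataset is not stated on this page):

    # Load one CSV file, or several at once.
    ds = DashAIDataset.from_csv("data/train.csv")
    ds = DashAIDataset.from_csv(["data/part1.csv", "data/part2.csv"])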
from_dict(mapping[, features, info, split]) – Convert dict to a pyarrow.Table to create a [Dataset].
from_file(filename[, info, split, ...]) – Instantiate a Dataset backed by an Arrow table at filename.
from_generator(generator[, features, ...]) – Create a Dataset from a generator.
from_json(path_or_paths[, split, features, ...]) – Create Dataset from JSON or JSON Lines file(s).
from_list(mapping[, features, info, split]) – Convert a list of dicts to a pyarrow.Table to create a [Dataset].
from_pandas(df[, features, info, split, ...]) – Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].
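Example of from_pandas (illustrative DataFrame; same caveat about the return type as for from_csv):

    import pandas as pd

    df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
    ds = DashAIDataset.from_pandas(df)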
from_parquet(path_or_paths[, split, ...]) – Create Dataset from Parquet file(s).
from_polars(df[, features, info, split]) – Convert a polars.DataFrame to a pyarrow.Table to create a [Dataset].
from_spark(df[, split, features, ...]) – Create a Dataset from Spark DataFrame.
from_sql(sql, con[, features, cache_dir, ...]) – Create Dataset from SQL query or database table.
from_text(path_or_paths[, split, features, ...]) – Create Dataset from text file(s).
get_index(index_name) – Return the index with the specified name.
get_nearest_examples(index_name, query[, k]) – Find the nearest examples in the dataset to the query.
get_nearest_examples_batch(index_name, queries) – Find the nearest examples in the dataset to each query in the batch.
get_split(split_name) – Returns a new DashAIDataset corresponding to the specified split; see the sketch below.
is_index_initialized(index_name) – Check whether the index with the specified name is initialized.
iter(batch_size[, drop_last_batch]) – Iterate through the batches of size batch_size.
keys() – Return the available splits in the dataset.
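A sketch combining the DashAI-specific split helpers keys and get_split; the split names are assumptions:

    # Enumerate the available splits, then materialize one of them.
    print(dataset.keys())                  # e.g. ["train", "test"]
    train_ds = dataset.get_split("train")  # assumes a "train" split exists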
list_indexes() – List the columns/identifiers of all the attached indexes.
load_elasticsearch_index(index_name, ...[, ...]) – Load an existing text index using ElasticSearch for fast retrieval.
load_faiss_index(index_name, file[, device, ...]) – Load a FaissIndex from disk.
load_from_disk(dataset_path[, ...]) – Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
map([function, with_indices, with_rank, ...]) – Apply a function to all the examples in the table (individually or in batches) and update the table.
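Example of map usage (inherited from datasets.Dataset; the "text" column is illustrative):

    # Row-wise transform: returned keys create or overwrite columns.
    lowered = dataset.map(lambda example: {"text": example["text"].lower()})

    # Batched mode works on dicts of lists and is typically much faster.
    lowered = dataset.map(
        lambda batch: {"text": [t.lower() for t in batch["text"]]},
        batched=True,
    )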
push_to_hub(repo_id[, config_name, ...]) – Pushes the dataset to the hub as a Parquet dataset.
remove_columns(column_names) – Remove one or several column(s) in the dataset and the features associated to them.
rename_column(original_column_name, ...[, ...]) – Rename a column in the dataset, and move the features associated to the original column under the new column name.
rename_columns(column_mapping[, new_fingerprint]) – Rename several columns in the dataset, and move the features associated to the original columns under the new column names.
reset_format() – Reset __getitem__ return format to python objects and all columns.
sample([n, method, seed]) – Return sample rows from dataset; see the sketch below.
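A hedged sketch for sample; the argument values, in particular the method string, are guesses from the signature rather than documented options:

    # n, method, and seed values are illustrative assumptions.
    rows = dataset.sample(n=5, method="random", seed=42)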
save_faiss_index(index_name, file[, ...]) – Save a FaissIndex on disk.
save_to_disk(dataset_path) – Overrides the default save_to_disk method to save the dataset as a single directory containing "data.arrow" (the dataset's Arrow table) and "splits.json" (the dataset's splits, e.g., original split indices).
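A round-trip sketch using save_to_disk and load_from_disk; the path is illustrative, and exposing load_from_disk directly on the class is assumed from the method list above:

    # Persist to a single directory containing "data.arrow" and "splits.json".
    dataset.save_to_disk("saved_datasets/my_dataset")

    # Round-trip with the matching loader.
    restored = DashAIDataset.load_from_disk("saved_datasets/my_dataset")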
search(index_name, query[, k]) – Find the nearest examples indices in the dataset to the query.
search_batch(index_name, queries[, k]) – Find the nearest examples indices in the dataset to each query in the batch.
select(indices[, keep_in_memory, ...]) – Create a new dataset with rows selected following the list/array of indices.
select_columns(column_names[, new_fingerprint]) – Select one or several column(s) in the dataset and the features associated to them.
set_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
set_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.
shard(num_shards, index[, contiguous, ...]) – Return the index-nth shard from dataset split into num_shards pieces.
shuffle([seed, generator, keep_in_memory, ...]) – Create a new Dataset where the rows are shuffled; see the sketch below.
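Example combining shuffle and select (seed and subset size are illustrative):

    # Reproducible shuffle, then a fixed-size subset via select().
    shuffled = dataset.shuffle(seed=42)
    subset = shuffled.select(range(min(100, shuffled.num_rows)))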
skip(n) – Create a new [Dataset] that skips the first n elements.
sort(column_names[, reverse, ...]) – Create a new dataset sorted according to a single or multiple columns.
take(n) – Create a new [Dataset] with only the first n elements.
to_csv(path_or_buf[, batch_size, num_proc, ...]) – Exports the dataset to CSV.
to_dict([batch_size]) – Returns the dataset as a Python dict.
to_iterable_dataset([num_shards]) – Get an [datasets.IterableDataset] from a map-style [datasets.Dataset].
to_json(path_or_buf[, batch_size, num_proc, ...]) – Export the dataset to JSON Lines or JSON.
to_list() – Returns the dataset as a Python list.
to_pandas([batch_size, batched]) – Returns the dataset as a pandas.DataFrame; see the sketch below.
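Example of to_pandas (batch size is illustrative):

    # Materialize the whole dataset as a single pandas.DataFrame.
    df = dataset.to_pandas()

    # For large datasets, iterate over DataFrame chunks instead.
    for chunk in dataset.to_pandas(batch_size=1_000, batched=True):
        print(len(chunk))  # replace with your own per-chunk processing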
to_parquet(path_or_buf[, batch_size, ...]) – Exports the dataset to Parquet.
to_polars([batch_size, batched, ...]) – Returns the dataset as a polars.DataFrame.
to_sql(name, con[, batch_size]) – Exports the dataset to a SQL database.
to_tf_dataset([batch_size, columns, ...]) – Create a tf.data.Dataset from the underlying Dataset.
train_test_split([test_size, train_size, ...]) – Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits); see the sketch below.
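Example of train_test_split (test fraction and seed are illustrative):

    # Random 80/20 split; returns a DatasetDict with "train" and "test" entries.
    splits = dataset.train_test_split(test_size=0.2, seed=42)
    train_ds, test_ds = splits["train"], splits["test"]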
unique(column) – Return a list of the unique elements in a column.
with_format([type, columns, output_all_columns]) – Set __getitem__ return format (type and columns).
with_transform(transform[, columns, ...]) – Set __getitem__ return format using this transform.
Attributes
arrow_table – Provides a clean way to access the underlying PyArrow table.
builder_name
cache_files – The cache files containing the Apache Arrow table backing the dataset.
citation
column_names – Names of the columns in the dataset.
config_name
data – The Apache Arrow table backing the dataset.
dataset_size
description
download_checksums
download_size
features
format
homepage
info – [~datasets.DatasetInfo] object containing all the metadata in the dataset.
license
num_columns – Number of columns in the dataset.
num_rows – Number of rows in the dataset (same as [Dataset.__len__]).
shape
– Shape of the dataset (number of rows, number of columns).
size_in_bytes
split – [~datasets.NamedSplit] object corresponding to a named dataset split.
supervised_keys
version