DashAI.back.dataloaders.classes.dashai_dataset.DashAIDataset

class DashAIDataset(table: Table, splits: dict = None, *args, **kwargs)[source]

DashAI dataset wrapper for Hugging Face datasets with extra metadata.

__init__(table: Table, splits: dict = None, *args, **kwargs)[source]

Initialize a new instance of a DashAI dataset.

Parameters:

table (Table) – Arrow table from which the dataset will be created.

splits (dict, optional) – Dictionary with the dataset's split information (e.g., the original split indices).
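
A minimal sketch of building a dataset from an in-memory Arrow table. The import path comes from this page; the splits layout used here (row indices keyed by split name) is an assumption based on the save_to_disk description below:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    # Any pyarrow.Table works as the backing table.
    table = pa.table(
        {
            "sepal_length": [5.1, 4.9, 6.3, 5.8],
            "species": ["setosa", "setosa", "virginica", "virginica"],
        }
    )

    # Assumed splits layout: row indices keyed by split name.
    splits = {"train": [0, 1, 2], "test": [3]}

    dataset = DashAIDataset(table, splits=splits)
    print(dataset.num_rows, dataset.column_names)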

Methods

__init__(table[, splits])

Initialize a new instance of a DashAI dataset.

add_column(name, column, new_fingerprint[, ...])

Add column to Dataset.

add_elasticsearch_index(column[, ...])

Add a text index using ElasticSearch for fast retrieval.

add_faiss_index(column[, index_name, ...])

Add a dense index using Faiss for fast retrieval.

add_faiss_index_from_external_arrays(...[, ...])

Add a dense index using Faiss, built from external arrays, for fast retrieval.

add_item(item, new_fingerprint)

Add item to Dataset.

align_labels_with_mapping(label2id, label_column)

Align the dataset's label ID and label name mapping to match an input label2id mapping.

batch(batch_size[, drop_last_batch, ...])

Group samples from the dataset into batches.

cast(*args, **kwargs)

Override of the cast method so that the result stays in DashAI dataset format.

cast_column(column, feature[, new_fingerprint])

Cast column to feature for decoding.

change_columns_type(column_types)

Change the type of some columns.
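
A hedged sketch of change_columns_type. The mapping of column names to type labels, the accepted type names (here "Categorical"), and whether the call returns the updated dataset are assumptions; consult the method's full documentation for the exact accepted values:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"sepal_length": [5.1, 4.9, 6.3], "species": ["setosa", "setosa", "virginica"]}),
        splits={"train": [0, 1], "test": [2]},  # assumed splits layout
    )

    # Assumed: the argument maps column names to type labels and the call
    # returns the updated dataset.
    dataset = dataset.change_columns_type({"species": "Categorical"})
    print(dataset.features)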

class_encode_column(column[, include_nulls])

Casts the given column as [datasets.features.ClassLabel] and updates the table.

cleanup_cache_files()

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.

drop_index(index_name)

Drop the index with the specified index_name.

filter([function, with_indices, with_rank, ...])

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.

flatten([new_fingerprint, max_depth])

Flatten the table.

flatten_indices([keep_in_memory, ...])

Create and cache a new Dataset by flattening the indices mapping.

formatted_as([type, columns, output_all_columns])

Context manager to temporarily set the __getitem__ return format (to be used in a with statement).

from_buffer(buffer[, info, split, ...])

Instantiate a Dataset backed by an Arrow buffer.

from_csv(path_or_paths[, split, features, ...])

Create Dataset from CSV file(s).

from_dict(mapping[, features, info, split])

Convert dict to a pyarrow.Table to create a [Dataset].

from_file(filename[, info, split, ...])

Instantiate a Dataset backed by an Arrow table at filename.

from_generator(generator[, features, ...])

Create a Dataset from a generator.

from_json(path_or_paths[, split, features, ...])

Create Dataset from JSON or JSON Lines file(s).

from_list(mapping[, features, info, split])

Convert a list of dicts to a pyarrow.Table to create a [Dataset].

from_pandas(df[, features, info, split, ...])

Convert pandas.DataFrame to a pyarrow.Table to create a [Dataset].

from_parquet(path_or_paths[, split, ...])

Create Dataset from Parquet file(s).

from_polars(df[, features, info, split])

Convert polars.DataFrame to a pyarrow.Table to create a [Dataset].

from_spark(df[, split, features, ...])

Create a Dataset from Spark DataFrame.

from_sql(sql, con[, features, cache_dir, ...])

Create Dataset from SQL query or database table.

from_text(path_or_paths[, split, features, ...])

Create Dataset from text file(s).

get_index(index_name)

Return the index with the specified index_name.

get_nearest_examples(index_name, query[, k])

Find the nearest examples in the dataset to the query.

get_nearest_examples_batch(index_name, queries)

Find the nearest examples in the dataset to each of the queries.

get_split(split_name)

Returns a new DashAIDataset corresponding to the specified split.

is_index_initialized(index_name)

Check whether the index with the specified index_name has been initialized.

iter(batch_size[, drop_last_batch])

Iterate through the batches of size batch_size.

keys()

Return the available splits in the dataset.
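
A short sketch combining keys() and get_split(), assuming the dataset was built with the splits layout shown in the constructor example above:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "a", "b", "b"]}),
        splits={"train": [0, 1, 2], "test": [3]},  # assumed splits layout
    )

    # keys() lists the available split names.
    print(list(dataset.keys()))  # e.g. ['train', 'test']

    # get_split() returns a new DashAIDataset restricted to one split.
    train_ds = dataset.get_split("train")
    print(train_ds.num_rows)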

list_indexes()

List the columns/identifiers of all the attached indexes.

load_elasticsearch_index(index_name, ...[, ...])

Load an existing text index using ElasticSearch for fast retrieval.

load_faiss_index(index_name, file[, device, ...])

Load a FaissIndex from disk.

load_from_disk(dataset_path[, ...])

Loads a dataset that was previously saved using [save_to_disk] from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem.

map([function, with_indices, with_rank, ...])

Apply a function to all the examples in the table (individually or in batches) and update the table.

push_to_hub(repo_id[, config_name, ...])

Pushes the dataset to the hub as a Parquet dataset.

remove_columns(column_names)

Remove one or several column(s) from the dataset and the features associated with them.

rename_column(original_column_name, ...[, ...])

Rename a column in the dataset, and move the features associated with the original column under the new column name.

rename_columns(column_mapping[, new_fingerprint])

Rename several columns in the dataset, and move the features associated with the original columns under the new column names.

reset_format()

Reset the __getitem__ return format to Python objects and all columns.

sample([n, method, seed])

Return sample rows from the dataset.
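
A hedged sketch of sample(). The parameter names match the signature above, but the accepted method values (here "random") and the return format are assumptions:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "a", "b", "b"]}),
        splits={"train": [0, 1, 2], "test": [3]},  # assumed splits layout
    )

    # Draw a few rows for quick inspection; "random" as a method value is an assumption.
    rows = dataset.sample(n=2, method="random", seed=42)
    print(rows)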

save_faiss_index(index_name, file[, ...])

Save a FaissIndex on disk.

save_to_disk(dataset_path)

Overrides the default save_to_disk method to save the dataset as a single directory containing:

- "data.arrow": the dataset's Arrow table.
- "splits.json": the dataset's splits (e.g., original split indices).
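
A minimal sketch of persisting a dataset with save_to_disk. The inherited load_from_disk listed above is documented for the standard datasets directory layout; whether it reads this single-directory format back is not stated here, so loading is left out of the sketch:

    import os
    import tempfile

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0], "label": ["a", "b"]}),
        splits={"train": [0], "test": [1]},  # assumed splits layout
    )

    # Assumes save_to_disk creates the target directory if it does not exist.
    save_dir = os.path.join(tempfile.mkdtemp(), "my_dashai_dataset")
    dataset.save_to_disk(save_dir)

    # The directory should contain the files described above.
    print(sorted(os.listdir(save_dir)))  # expected: ['data.arrow', 'splits.json']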

search(index_name, query[, k])

Find the indices of the nearest examples in the dataset to the query.

search_batch(index_name, queries[, k])

Find the indices of the nearest examples in the dataset to each of the queries.

select(indices[, keep_in_memory, ...])

Create a new dataset with rows selected following the list/array of indices.

select_columns(column_names[, new_fingerprint])

Select one or several column(s) in the dataset and the features associated with them.

set_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

set_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.

shard(num_shards, index[, contiguous, ...])

Return the index-th shard of the dataset split into num_shards pieces.

shuffle([seed, generator, keep_in_memory, ...])

Create a new Dataset where the rows are shuffled.

skip(n)

Create a new [Dataset] that skips the first n elements.

sort(column_names[, reverse, ...])

Create a new dataset sorted according to a single or multiple columns.

take(n)

Create a new [Dataset] with only the first n elements.

to_csv(path_or_buf[, batch_size, num_proc, ...])

Exports the dataset to CSV.

to_dict([batch_size])

Returns the dataset as a Python dict.

to_iterable_dataset([num_shards])

Get an [datasets.IterableDataset] from a map-style [datasets.Dataset].

to_json(path_or_buf[, batch_size, num_proc, ...])

Export the dataset to JSON Lines or JSON.

to_list()

Returns the dataset as a Python list.

to_pandas([batch_size, batched])

Returns the dataset as a pandas.DataFrame.

to_parquet(path_or_buf[, batch_size, ...])

Exports the dataset to Parquet.

to_polars([batch_size, batched, ...])

Returns the dataset as a polars.DataFrame.

to_sql(name, con[, batch_size])

Exports the dataset to a SQL database.

to_tf_dataset([batch_size, columns, ...])

Create a tf.data.Dataset from the underlying Dataset.

train_test_split([test_size, train_size, ...])

Return a dictionary ([datasets.DatasetDict]) with two random train and test subsets (train and test Dataset splits).

unique(column)

Return a list of the unique elements in a column.

with_format([type, columns, output_all_columns])

Set __getitem__ return format (type and columns).

with_transform(transform[, columns, ...])

Set __getitem__ return format using this transform.

Attributes

arrow_table

Provides a clean way to access the underlying PyArrow table.

builder_name

cache_files

The cache files containing the Apache Arrow table backing the dataset.

citation

column_names

Names of the columns in the dataset.

config_name

data

The Apache Arrow table backing the dataset.

dataset_size

description

download_checksums

download_size

features

format

homepage

info

[datasets.DatasetInfo] object containing all the metadata in the dataset.

license

num_columns

Number of columns in the dataset.

num_rows

Number of rows in the dataset (same as [Dataset.__len__]).

shape

Shape of the dataset (number of rows, number of columns).

size_in_bytes

split

[datasets.NamedSplit] object corresponding to a named dataset split.

supervised_keys

version
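
A short sketch showing the most commonly used attributes, including the DashAI-specific arrow_table accessor; the dataset construction follows the assumptions from the constructor example at the top of this page:

    import pyarrow as pa

    from DashAI.back.dataloaders.classes.dashai_dataset import DashAIDataset

    dataset = DashAIDataset(
        pa.table({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]}),
        splits={"train": [0, 1], "test": [2]},  # assumed splits layout
    )

    print(dataset.num_rows)       # number of rows
    print(dataset.num_columns)    # number of columns
    print(dataset.column_names)   # column names
    print(dataset.features)       # column types (datasets.Features)
    print(dataset.arrow_table)    # underlying pyarrow.Table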