HuggingFaceDatasetSource

DatasetSource
DashAI.back.dataset_sources.HuggingFaceDatasetSource

Dataset source that fetches public datasets from HuggingFace Hub.

Uses huggingface_hub.HfApi; public datasets require no authentication. HfApi.list_datasets returns an iterator rather than exposing native cursors, so pagination treats the cursor as a numeric offset and slices the iterator.
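The offset-as-cursor approach described above can be sketched with itertools.islice. This is a minimal illustration, not the DashAI implementation: the helper name paginate_iterator and the choice to encode the offset as a decimal string are assumptions.

```python
from itertools import islice
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def paginate_iterator(
    items: Iterable[T], limit: int, cursor: Optional[str] = None
) -> Tuple[List[T], Optional[str]]:
    """Return one page of `items` plus the cursor for the next page.

    Hypothetical helper: the cursor is assumed to be the numeric offset
    encoded as a decimal string; None means "start from the beginning".
    """
    offset = int(cursor) if cursor is not None else 0
    # Read one item beyond the page to learn whether another page exists.
    window = list(islice(items, offset, offset + limit + 1))
    page = window[:limit]
    next_cursor = str(offset + limit) if len(window) > limit else None
    return page, next_cursor


# Each call needs a fresh iterator, because slicing re-walks from the start.
first_page, next_cursor = paginate_iterator(iter(range(45)), limit=20)
last_page, end_cursor = paginate_iterator(iter(range(45)), limit=20, cursor="40")
```

One consequence of this design is that reaching page N consumes every item before it, so deep pages cost more than early ones.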

Methods

download_dataset(self, dataset_id: str, temp_path: str) -> str

Defined on HuggingFaceDatasetSource

Download the raw dataset files from HuggingFace Hub.

Parameters

dataset_id : str
HuggingFace dataset identifier (e.g. "stanfordnlp/imdb").
temp_path : str
Local directory to download into.

Returns

str
Path to the directory containing the downloaded files.
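One plausible way to perform such a download is huggingface_hub.snapshot_download with repo_type="dataset". The sketch below assumes that approach; the source does not state which call DashAI uses, and the function name download_dataset_sketch is hypothetical.

```python
from pathlib import Path


def download_dataset_sketch(dataset_id: str, temp_path: str) -> str:
    """Download all files of a public dataset repo into `temp_path`.

    Assumption: snapshot_download stands in for whatever the real
    method calls internally.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    Path(temp_path).mkdir(parents=True, exist_ok=True)
    # snapshot_download returns the local folder holding the files.
    return snapshot_download(
        repo_id=dataset_id,
        repo_type="dataset",
        local_dir=temp_path,
    )
```

Calling download_dataset_sketch("stanfordnlp/imdb", "/tmp/imdb") would need network access and the huggingface_hub package installed.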

get_info(self, dataset_id: str) -> 'DatasetEntry | None'

Defined on HuggingFaceDatasetSource

Return full metadata for a single HuggingFace dataset, including size.

Parameters

dataset_id : str
HuggingFace dataset identifier in "namespace/repo" form.

Returns

DatasetEntry or None
Full metadata entry, or None on error.

search(self, query: str, limit: int = 20, cursor: str | None = None, **filters: Any) -> DashAI.back.dataset_sources.base_dataset_source.SearchPage

Defined on HuggingFaceDatasetSource

Return public HuggingFace datasets matching a query.

Parameters

query : str
Search string passed to HfApi.list_datasets.
limit : int, optional
Maximum number of results, by default 20.
cursor : str or None, optional
Pagination cursor returned by the previous call (encoded numeric offset). None fetches the first page.
**filters : Any
Unused; reserved for future tag/task filters.

Returns

SearchPage
Matching datasets and cursor for the next page (or None).
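A caller can walk every page by feeding each returned cursor into the next call until it comes back as None. The sketch below shows that loop against a stand-in search function. PageSketch and its field names items and next_cursor are assumptions, since the attributes of SearchPage are not documented here, and fake_search is a hypothetical in-memory replacement for HuggingFaceDatasetSource.search.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class PageSketch:
    """Hypothetical stand-in for SearchPage (field names are assumed)."""

    items: List[str]
    next_cursor: Optional[str]


def iter_all_results(
    search: Callable[..., PageSketch], query: str, limit: int = 20
) -> Iterator[str]:
    """Yield every result by following cursors until None is returned."""
    cursor: Optional[str] = None
    while True:
        page = search(query, limit=limit, cursor=cursor)
        yield from page.items
        cursor = page.next_cursor
        if cursor is None:
            break


def fake_search(query: str, limit: int = 20, cursor: Optional[str] = None) -> PageSketch:
    """In-memory search over five fake results, paged by numeric offset."""
    data = [f"{query}-{i}" for i in range(5)]
    start = int(cursor) if cursor is not None else 0
    end = start + limit
    return PageSketch(data[start:end], str(end) if end < len(data) else None)


results = list(iter_all_results(fake_search, "imdb", limit=2))
```

Treating the cursor as an opaque string on the caller side keeps this loop valid even if the encoding of the offset changes.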

get_schema(cls) -> dict

Defined on ConfigObject

Generate the JSON Schema for the component.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Validate the data supplied by the user to initialize the model and return it together with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.