TokenizerConverter

Converter
DashAI.back.converters.hugging_face.TokenizerConverter

Converter that tokenizes text and stores each token ID in a separate column.

Parameters

model_name : string, default=bert-base-uncased
Name of the pre-trained tokenizer model
max_length : integer, default=512
Maximum sequence length for tokenization
batch_size : integer, default=32
Number of samples to process at once
device : string, default=cpu
Device to use for computation
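
To illustrate what "each token ID in a separate column" means, here is a minimal sketch that does not call DashAI or HuggingFace. The toy vocabulary, the whitespace splitting, the helper name tokenize_to_columns, and the "<column>_token_<i>" naming are all assumptions made for illustration. A real tokenizer selected through model_name applies its own vocabulary and subword rules, and the converter's actual column names may differ.

```python
from typing import Dict, List

# Hypothetical toy vocabulary (assumption); a real pre-trained tokenizer
# such as bert-base-uncased ships its own vocabulary and subword rules.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3, "again": 4}


def tokenize_to_columns(
    texts: List[str], column_name: str, max_length: int
) -> Dict[str, List[int]]:
    """Return max_length integer columns, one per token position."""
    # Column naming scheme is an assumption, not DashAI's documented format.
    names = [f"{column_name}_token_{i}" for i in range(max_length)]
    columns: Dict[str, List[int]] = {name: [] for name in names}
    for text in texts:
        ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in text.lower().split()]
        ids = ids[:max_length]  # truncate sequences longer than max_length
        ids += [TOY_VOCAB["[PAD]"]] * (max_length - len(ids))  # pad shorter ones
        for name, token_id in zip(names, ids):
            columns[name].append(token_id)
    return columns


result = tokenize_to_columns(["hello world", "hello again world foo bar"], "text", 4)
```

With max_length=4, the first text is padded to the IDs 2, 3, 0, 0 and the second is truncated to 2, 4, 3, 1, so every row ends up with the same number of integer columns.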

Methods

get_output_type(self, column_name: Optional[str] = None) -> DashAI.back.types.dashai_data_type.DashAIDataType

Defined on TokenizerConverter

Return the DashAI data type produced by this converter for a column.

Parameters

column_name : str, optional
The column name to look up in the fitted encoders. When provided and the encoder has been fitted, the returned type reflects the actual fitted classes. Defaults to None.

Returns

DashAIDataType
An Integer type for each token position column.

changes_row_count(self) -> 'bool'

Defined on BaseConverter

Indicate whether this converter changes the number of dataset rows.

Returns

bool
True if the converter may add or remove rows, False otherwise.

fit(self, x: 'DashAIDataset', y: 'DashAIDataset' = None) -> Type[DashAI.back.converters.base_converter.BaseConverter]

Defined on HuggingFaceWrapper

Validate the input dataset and load the HuggingFace model.

Parameters

x : DashAIDataset
Input dataset whose columns must all be string-typed.
y : DashAIDataset or None, optional
Ignored. Present for API compatibility. Default None.

Returns

HuggingFaceWrapper
The fitted converter instance (self).
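
The requirement that every input column be string-typed can be pictured with the following sketch. It works on a plain dictionary of columns rather than a DashAIDataset, and the helper name check_all_string_columns is hypothetical. The real fit method presumably inspects the dataset's declared column types rather than individual values.

```python
from typing import Dict, List


def check_all_string_columns(columns: Dict[str, List[object]]) -> None:
    """Raise ValueError if any column contains a non-string value.

    Hypothetical helper (assumption): stands in for the validation step
    that fit performs before loading the HuggingFace model.
    """
    bad = [
        name
        for name, values in columns.items()
        if not all(isinstance(value, str) for value in values)
    ]
    if bad:
        raise ValueError(f"Non-string columns: {bad}")


# Passes silently: both columns hold only strings.
check_all_string_columns({"title": ["a", "b"], "body": ["c", "d"]})
```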

get_metadata(cls) -> 'Dict[str, Any]'

Defined on BaseConverter

Get metadata for the converter, used by the DashAI frontend.

Parameters

cls : type
The converter class (injected automatically by Python for classmethods).

Returns

Dict[str, Any]
Dictionary containing display name, short description, image preview path, category, icon, color, and whether the converter is supervised.

get_schema(cls) -> dict

Defined on ConfigObject

Generate the JSON Schema for the component.

Returns

dict
Dictionary representing the JSON Schema of the component.

transform(self, x: 'DashAIDataset', y: 'DashAIDataset' = None) -> 'DashAIDataset'

Defined on HuggingFaceWrapper

Transform the input dataset by running inference in batches.

Parameters

x : DashAIDataset
The dataset to transform. The converter must have been fitted first.
y : DashAIDataset or None, optional
Ignored. Present for API compatibility. Default None.

Returns

DashAIDataset
Transformed dataset with output types set per column.
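
Batched processing of the kind described for transform can be sketched as follows. The helpers iter_batches and transform_in_batches are hypothetical and not part of DashAI. The process_batch callable stands in for whatever the converter runs on each batch, with batch_size playing the role of the parameter of the same name.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transform_in_batches(
    items: Sequence[str],
    batch_size: int,
    process_batch: Callable[[Sequence[str]], List[int]],
) -> List[int]:
    """Apply process_batch to each batch and concatenate results in order."""
    outputs: List[int] = []
    for batch in iter_batches(items, batch_size):
        outputs.extend(process_batch(batch))
    return outputs


# Stand-in for per-batch work (assumption): return the length of each text.
lengths = transform_in_batches(["a", "bb", "ccc"], 2, lambda b: [len(s) for s in b])
```

The last batch may be smaller than batch_size, and row order is preserved because results are concatenated in the order the batches are produced.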

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Take the data provided by the user to initialize the model and return it together with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.
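
As a rough picture of validating user-supplied parameters, the sketch below fills in the defaults documented above and rejects unknown keys, wrong types, and non-positive sizes. The function validate_params is hypothetical and does not reproduce ConfigObject's logic, which may also construct the objects the model needs and apply different rules.

```python
from typing import Any, Dict

# Defaults taken from the Parameters section of this page.
DEFAULTS: Dict[str, Any] = {
    "model_name": "bert-base-uncased",
    "max_length": 512,
    "batch_size": 32,
    "device": "cpu",
}


def validate_params(raw_data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical validator (assumption): merge defaults, check types."""
    unknown = set(raw_data) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **raw_data}  # user values override defaults
    for name, default in DEFAULTS.items():
        if not isinstance(params[name], type(default)):
            raise TypeError(f"{name} must be of type {type(default).__name__}")
    for name in ("max_length", "batch_size"):
        if params[name] <= 0:
            raise ValueError(f"{name} must be positive")
    return params


params = validate_params({"max_length": 128})
```

Here params keeps the documented defaults for model_name, batch_size, and device, while max_length is overridden to 128.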