BertinTransformer
Pre-trained BERTIN model (Spanish RoBERTa) for Spanish text classification.
BERTIN is a Spanish RoBERTa model trained on the Spanish portion of mC4 and
additional Spanish corpora. It applies RoBERTa's improved training recipe to
Spanish and typically outperforms BETO on Spanish NLP benchmarks. Requires
the sentencepiece package for its tokenizer.
References
- [1] de la Rosa, J. et al. (2022). "BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling."
- [2] https://huggingface.co/bertin-project/bertin-roberta-base-spanish
Parameters
- num_train_epochs : integer, default=1
- Total number of training epochs to perform.
- batch_size : integer, default=16
- The batch size per GPU/TPU core/CPU for training.
- learning_rate : number, default=3e-05
- The initial learning rate for the AdamW optimizer.
- device : string, default=CPU
- Hardware on which training is run. If available, GPU is recommended for efficiency; otherwise, use CPU.
- weight_decay : number, default=0.01
- Weight decay is a regularization technique that shrinks the weights of all layers during training to prevent overfitting; in the AdamW optimizer, the 'weight_decay' parameter sets this shrinkage rate, and a value of zero disables it.
- log_train_every_n_epochs : integer or None, default=1
- Log metrics for the train split every n epochs during training. If None, it won't log per epoch.
- log_train_every_n_steps : integer or None, default=None
- Log metrics for the train split every n steps during training. If None, it won't log per step.
- log_validation_every_n_epochs : integer or None, default=1
- Log metrics for the validation split every n epochs during training. If None, it won't log per epoch.
- log_validation_every_n_steps : integer or None, default=None
- Log metrics for the validation split every n steps during training. If None, it won't log per step.
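A minimal construction sketch, assuming keyword-argument instantiation. The import path and constructor style are assumptions (DashAI typically builds models from a validated configuration); only the parameter names and defaults come from the list above.

```python
# Hypothetical import path; adjust to your DashAI installation.
from DashAI.back.models import BertinTransformer

# Keyword-argument construction is an assumption; names and defaults
# match the documented parameters above.
model = BertinTransformer(
    num_train_epochs=1,    # total training epochs
    batch_size=16,         # per-device batch size
    learning_rate=3e-05,   # initial AdamW learning rate
    device="gpu",          # documented default is CPU; GPU recommended if available
    weight_decay=0.01,     # AdamW weight-decay rate
)
```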
Methods
calculate_metrics(self, split: DashAI.back.core.enums.metrics.SplitEnum = <SplitEnum.VALIDATION: 'validation'>, level: DashAI.back.core.enums.metrics.LevelEnum = <LevelEnum.LAST: 'last'>, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Defined in BaseModel. Calculate and save metrics for a given data split and level.
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
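Two hedged calls for illustration, using the enums from the signature above; model is assumed to be a fitted BertinTransformer, and x_test / y_test are assumed to be DashAIDatasets prepared elsewhere.

```python
from DashAI.back.core.enums.metrics import LevelEnum, SplitEnum

# Default behaviour: score the validation split at the coarsest level.
model.calculate_metrics(split=SplitEnum.VALIDATION, level=LevelEnum.LAST)

# Score the test split on explicitly supplied data instead of the
# split stored in the model.
model.calculate_metrics(split=SplitEnum.TEST, x_data=x_test, y_data=y_test)
```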
get_metadata(cls) -> Dict[str, Any]
Defined in BaseModel. Get metadata values for the current model.
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Defined in ConfigObject. Generates the JSON Schema of the component.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
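A quick way to inspect the configuration schema. The classmethod call matches the signature above; the top-level "properties" layout is an assumption about how the JSON Schema is structured.

```python
schema = BertinTransformer.get_schema()  # classmethod, no instance required

# Assumed standard JSON-Schema layout with a "properties" mapping.
print(schema.get("properties", {}).keys())
```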
load(cls, filename: Union[str, ForwardRef('Path')]) -> 'HuggingFaceTextClassificationTransformer'
Defined in HuggingFaceTextClassificationTransformer. Restore a HuggingFaceTextClassificationTransformer instance from disk.
Parameters
- filename : str or Path
- Directory path from which the model files will be read.
Returns
- HuggingFaceTextClassificationTransformer
- The restored model instance with 'fitted' set to the persisted value.
predict(self, x_pred: 'DashAIDataset')
Defined in HuggingFaceTextClassificationTransformer. Predict with the fine-tuned model.
Parameters
- x_pred : DashAIDataset
- Dataset with text data.
Returns
- List
- List of predicted probabilities for each class.
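A hedged prediction sketch; model is assumed to be fine-tuned already, and x_new is assumed to be a DashAIDataset with a single text column matching the training schema.

```python
# Returns one probability vector per input row.
probabilities = model.predict(x_new)

# Pick the most likely class index for each row.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
```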
prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Defined in HuggingFaceTextClassificationTransformer. Apply the model transformations to the dataset.
Parameters
- dataset : DashAIDataset
- The dataset to be transformed.
- is_fit : bool
- Whether this is for fitting (True) or prediction (False).
Returns
- DashAIDataset
- The prepared dataset, ready to be converted to a format the model accepts.
prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Defined in BaseModel. Hook for model-specific preprocessing of output targets.
Parameters
- dataset : DashAIDataset
- The output dataset (target labels) to preprocess.
- is_fit : bool
- Whether the call is part of a fitting phase. Defaults to False.
Returns
- DashAIDataset
- The preprocessed output dataset.
save(self, filename: Union[str, ForwardRef('Path')]) -> None
Defined in HuggingFaceTextClassificationTransformer. Store the fine-tuned model and its configuration to disk.
Parameters
- filename : str or Path
- Directory path where the model files will be written.
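A save/load round trip, sketched under the assumption that the directory path is writable; note that load is a classmethod, so no instance is needed to restore.

```python
# Persist the fine-tuned weights and configuration to a directory.
model.save("runs/bertin_classifier")  # illustrative path

# Restore later, e.g. in a serving process; 'fitted' is restored too.
restored = BertinTransformer.load("runs/bertin_classifier")
```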
tokenize_data(self, dataset: 'DashAIDataset') -> 'DashAIDataset'
Defined in HuggingFaceTextClassificationTransformer. Tokenize the input data.
Parameters
- dataset : DashAIDataset
- Dataset with the input data to preprocess.
Returns
- DashAIDataset
- Dataset with the tokenized input data.
train(self, x_train, y_train, x_validation=None, y_validation=None)
Defined in HuggingFaceTextClassificationTransformer. Fine-tune the model on the provided classification data.
Parameters
- x_train : DashAIDataset
- Input text features for training.
- y_train : DashAIDataset
- Target labels for training.
- x_validation : DashAIDataset, optional
- Input text features for validation. Defaults to None.
- y_validation : DashAIDataset, optional
- Target labels for validation. Defaults to None.
Returns
- HuggingFaceTextClassificationTransformer
- The fine-tuned model instance.
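A minimal fine-tuning sketch; the four arguments are assumed to be DashAIDatasets produced by prepare_dataset and prepare_output, and the validation pair is optional.

```python
# Fine-tune on the training split; passing a validation pair enables the
# per-epoch validation logging configured by the log_validation_* parameters.
model.train(x_train, y_train, x_validation=x_val, y_validation=y_val)
```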
validate_and_transform(self, raw_data: dict) -> dict
Defined in ConfigObject. Validates the data provided by the user to initialize the model and returns it with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
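A hedged example of the configuration round trip. The keys come from the documented parameters above; the assumption here is that unspecified keys fall back to their documented defaults.

```python
raw = {"num_train_epochs": 2, "learning_rate": 5e-05}

# Returns a validated dictionary, augmented with whatever objects the
# model needs to work (assumed behaviour per the description above).
params = model.validate_and_transform(raw)
```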