BertinTransformer
Pre-trained BERTIN model (Spanish RoBERTa) for Spanish text classification.
BERTIN is a Spanish RoBERTa model trained on the Spanish portion of mC4 and
additional Spanish corpora. It applies RoBERTa's improved training recipe to
Spanish and typically outperforms BETO on Spanish NLP benchmarks. Requires
the sentencepiece package for its tokenizer.
References
- [1] de la Rosa, J. et al. (2022). "BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling."
- [2] https://huggingface.co/bertin-project/bertin-roberta-base-spanish
Parameters
- num_train_epochs : integer, default=1
- Total number of training epochs to perform.
- batch_size : integer, default=16
- The batch size per GPU/TPU core/CPU for training.
- learning_rate : number, default=3e-05
- The initial learning rate for the AdamW optimizer.
- device : string, default=CPU
- Hardware on which training is run. If available, GPU is recommended for efficiency; otherwise, use CPU.
- weight_decay : number, default=0.01
- Weight decay is a regularization technique that shrinks the weights of all layers during training to prevent overfitting; in the AdamW optimizer, the 'weight_decay' parameter sets this shrinkage rate, and a value of zero disables it.
- log_train_every_n_epochs : integer or None, default=1
- Log metrics for the train split every n epochs during training. If None, it won't log per epoch.
- log_train_every_n_steps : integer or None, default=None
- Log metrics for the train split every n steps during training. If None, it won't log per step.
- log_validation_every_n_epochs : integer or None, default=1
- Log metrics for the validation split every n epochs during training. If None, it won't log per epoch.
- log_validation_every_n_steps : integer or None, default=None
- Log metrics for the validation split every n steps during training. If None, it won't log per step.
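A minimal construction sketch, assuming keyword-argument instantiation. The import path and constructor style are assumptions (DashAI typically builds models from a validated configuration); only the parameter names and defaults come from the list above.

```python
# Hypothetical import path; adjust to your DashAI installation.
from DashAI.back.models import BertinTransformer

# Keyword-argument construction is an assumption; names and defaults
# match the documented parameters above.
model = BertinTransformer(
    num_train_epochs=1,    # total training epochs
    batch_size=16,         # per-device batch size
    learning_rate=3e-05,   # initial AdamW learning rate
    device="gpu",          # documented default is CPU; GPU recommended if available
    weight_decay=0.01,     # AdamW weight-decay rate
)
```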
Methods
calculate_metrics(self, split: DashAI.back.core.enums.metrics.SplitEnum = <SplitEnum.VALIDATION: 'validation'>, level: DashAI.back.core.enums.metrics.LevelEnum = <LevelEnum.LAST: 'last'>, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Defined in BaseModel. Calculate and save metrics for a given data split and level.
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
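Two hedged calls for illustration, using the enums from the signature above; model is assumed to be a fitted BertinTransformer, and x_test / y_test are assumed to be DashAIDatasets prepared elsewhere.

```python
from DashAI.back.core.enums.metrics import LevelEnum, SplitEnum

# Default behaviour: score the validation split at the coarsest level.
model.calculate_metrics(split=SplitEnum.VALIDATION, level=LevelEnum.LAST)

# Score the test split on explicitly supplied data instead of the
# split stored in the model.
model.calculate_metrics(split=SplitEnum.TEST, x_data=x_test, y_data=y_test)
```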
get_metadata(cls) -> Dict[str, Any]
Defined in BaseModel. Get metadata values for the current model.
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Defined in ConfigObject. Generates the JSON Schema of the component.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
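A quick way to inspect the configuration schema. The classmethod call matches the signature above; the top-level "properties" layout is an assumption about how the JSON Schema is structured.

```python
schema = BertinTransformer.get_schema()  # classmethod, no instance required

# Assumed standard JSON-Schema layout with a "properties" mapping.
print(schema.get("properties", {}).keys())
```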
load(cls, filename: Union[str, ForwardRef('Path')]) -> 'HuggingFaceTextClassificationTransformer'
Defined in HuggingFaceTextClassificationTransformer. Restore a HuggingFaceTextClassificationTransformer instance from disk.
Parameters
- filename : str or Path
- Directory path from which the model files will be read.
Returns
- HuggingFaceTextClassificationTransformer
- The restored model instance with 'fitted' set to the persisted value.
predict(self, x_pred: 'DashAIDataset')
Defined in HuggingFaceTextClassificationTransformer. Predict with the fine-tuned model.
Parameters
- x_pred : DashAIDataset
- Dataset with text data.
Returns
- List
- List of predicted probabilities for each class.
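A hedged prediction sketch; model is assumed to be fine-tuned already, and x_new is assumed to be a DashAIDataset with a single text column matching the training schema.

```python
# Returns one probability vector per input row.
probabilities = model.predict(x_new)

# Pick the most likely class index for each row.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
```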
prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Defined in HuggingFaceTextClassificationTransformer. Apply the model transformations to the dataset.
Parameters
- dataset : DashAIDataset
- The dataset to be transformed.
- is_fit : bool
- Whether this is for fitting (True) or prediction (False).
Returns
- DashAIDataset
- The prepared dataset, ready to be converted to a format the model accepts.
prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Defined in BaseModel. Hook for model-specific preprocessing of output targets.
Parameters
- dataset : DashAIDataset
- The output dataset (target labels) to preprocess.
- is_fit : bool
- Whether the call is part of a fitting phase. Defaults to False.
Returns
- DashAIDataset
- The preprocessed output dataset.
save(self, filename: Union[str, ForwardRef('Path')]) -> None
Defined in HuggingFaceTextClassificationTransformer. Store the fine-tuned model and its configuration to disk.
Parameters
- filename : str or Path
- Directory path where the model files will be written.
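A save/load round trip, sketched under the assumption that the directory path is writable; note that load is a classmethod, so no instance is needed to restore.

```python
# Persist the fine-tuned weights and configuration to a directory.
model.save("runs/bertin_classifier")  # illustrative path

# Restore later, e.g. in a serving process; 'fitted' is restored too.
restored = BertinTransformer.load("runs/bertin_classifier")
```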
tokenize_data(self, dataset: 'DashAIDataset') -> 'DashAIDataset'
Defined in HuggingFaceTextClassificationTransformer. Tokenize the input data.
Parameters
- dataset : DashAIDataset
- Dataset with the input data to preprocess.
Returns
- DashAIDataset
- Dataset with the tokenized input data.
train(self, x_train, y_train, x_validation=None, y_validation=None)
Defined in HuggingFaceTextClassificationTransformer. Fine-tune the model on the provided classification data.
Parameters
- x_train : DashAIDataset
- Input text features for training.
- y_train : DashAIDataset
- Target labels for training.
- x_validation : DashAIDataset, optional
- Input text features for validation. Defaults to None.
- y_validation : DashAIDataset, optional
- Target labels for validation. Defaults to None.
Returns
- HuggingFaceTextClassificationTransformer
- The fine-tuned model instance.
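A minimal fine-tuning sketch; the four arguments are assumed to be DashAIDatasets produced by prepare_dataset and prepare_output, and the validation pair is optional.

```python
# Fine-tune on the training split; passing a validation pair enables the
# per-epoch validation logging configured by the log_validation_* parameters.
model.train(x_train, y_train, x_validation=x_val, y_validation=y_val)
```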
validate_and_transform(self, raw_data: dict) -> dict
Defined in ConfigObject. Validates the data provided by the user to initialize the model and returns it with all the objects the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
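A hedged example of the configuration round trip. The keys come from the documented parameters above; the assumption here is that unspecified keys fall back to their documented defaults.

```python
raw = {"num_train_epochs": 2, "learning_rate": 5e-05}

# Returns a validated dictionary, augmented with whatever objects the
# model needs to work (assumed behaviour per the description above).
params = model.validate_and_transform(raw)
```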