NllbTransformer
Pre-trained transformer for configurable multilingual translation.
This model fine-tunes the facebook/nllb-200-distilled-600M checkpoint from
Meta AI's No Language Left Behind (NLLB) project. The base model supports
translation across 200 languages using a single unified model, identified by
NLLB language codes of the form <iso639>_<script> (e.g. spa_Latn,
eng_Latn). The 600M-parameter distilled variant provides a balance between
translation quality and computational cost.
Target language generation is guided by forced_bos_token_id, which forces
the decoder to start with the target language token. Fine-tuning is performed
with the HuggingFace Seq2SeqTrainer using the AdamW optimizer. Training and
validation metrics are logged at configurable epoch and step intervals via a
custom MetricsCallback.
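The target-forcing mechanism described above can be sketched with the plain Hugging Face transformers API. This is not DashAI's actual implementation; the translate helper and its defaults are illustrative, while the checkpoint name and language codes come from the text above:

```python
from typing import List

CHECKPOINT = "facebook/nllb-200-distilled-600M"

def translate(texts: List[str],
              source_language: str = "spa_Latn",
              target_language: str = "eng_Latn") -> List[str]:
    """Minimal NLLB inference sketch."""
    # Imported lazily so the sketch stays importable without downloading weights.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, src_lang=source_language)
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # forced_bos_token_id fixes the decoder's first token to the
        # target-language code, steering generation into that language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_language),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```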
Parameters
- num_train_epochs : integer, default=1 - Total number of training epochs to perform.
- batch_size : integer, default=4 - The batch size per GPU/TPU core/CPU for training.
- learning_rate : number, default=2e-05 - The initial learning rate for the AdamW optimizer.
- device : string, default=CPU - Hardware on which training runs. GPU is recommended for efficiency when available; otherwise use CPU. If GPU is selected, all available GPUs are used.
- weight_decay : number, default=0.01 - Weight decay is a regularization technique that helps prevent overfitting; in the AdamW optimizer, this parameter is the rate at which the weights of all layers are reduced during training, provided the rate is not zero.
- log_train_every_n_epochs, default=1 - Log metrics for the train split every n epochs during training. If None, metrics are not logged per epoch.
- log_train_every_n_steps, default=None - Log metrics for the train split every n steps during training. If None, metrics are not logged per step.
- log_validation_every_n_epochs, default=1 - Log metrics for the validation split every n epochs during training. If None, metrics are not logged per epoch.
- log_validation_every_n_steps, default=None - Log metrics for the validation split every n steps during training. If None, metrics are not logged per step.
- source_language : string, default=spa_Latn - Source language code for the NLLB tokenizer. Example: spa_Latn for Spanish.
- target_language : string, default=eng_Latn - Target language code for NLLB generation. Example: eng_Latn for English.
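The defaults above can be collected into a plain configuration mapping, for instance when building the model programmatically. The dictionary name is illustrative, not part of DashAI's API:

```python
# Hypothetical mapping mirroring the parameter table above.
DEFAULT_NLLB_PARAMS = {
    "num_train_epochs": 1,
    "batch_size": 4,
    "learning_rate": 2e-05,
    "device": "CPU",
    "weight_decay": 0.01,
    "log_train_every_n_epochs": 1,
    "log_train_every_n_steps": None,       # None disables per-step logging
    "log_validation_every_n_epochs": 1,
    "log_validation_every_n_steps": None,  # None disables per-step logging
    "source_language": "spa_Latn",
    "target_language": "eng_Latn",
}
```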
Methods
load(cls, filename: Union[str, Path])
Restore an NllbTransformer instance from disk.
Parameters
- filename : str or Path
- Directory path from which the model files will be read.
Returns
- NllbTransformer
- The restored model instance, with the fitted attribute set to the persisted value.
predict(self, x_pred: 'DashAIDataset') -> List
Translate source texts to the configured target language.
Parameters
- x_pred : DashAIDataset
- Source-language dataset. Only the first column is used.
Returns
- list of str
- One translated string per input sample, in the same order as
x_pred.
prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Return the dataset unchanged.
Parameters
- dataset : DashAIDataset
- The dataset to be prepared.
- is_fit : bool, optional
- Whether the call is made during fitting. Unused here. Default
False.
Returns
- DashAIDataset
- The original dataset, unmodified.
save(self, filename: Union[str, Path]) -> None
Store the fine-tuned model and its configuration to disk.
Parameters
- filename : str or Path
- Directory path where the model files will be written. If a file exists at that path it is removed and replaced by a directory.
tokenize_data(self, x: 'DashAIDataset', y: Optional['DashAIDataset'] = None) -> 'DashAIDataset'
Tokenize input and optional target datasets for seq2seq training.
Parameters
- x : DashAIDataset
- Source-language dataset. Only the first column is used.
- y : DashAIDataset, optional
- Target-language dataset. When provided, tokenized targets are added as
labels. WhenNone, onlyinput_idsandattention_maskare returned (inference mode).
Returns
- DashAIDataset
- Tokenized dataset with keys input_ids, attention_mask, and optionally labels.
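With the Hugging Face tokenizer API, this two-mode behavior maps onto the text_target argument: passing targets yields a labels key, passing None yields only input_ids and attention_mask. A minimal sketch, not DashAI's actual code; the helper name is hypothetical:

```python
from typing import List, Optional

def tokenize_pairs(sources: List[str],
                   targets: Optional[List[str]] = None,
                   source_language: str = "spa_Latn"):
    """Sketch of seq2seq tokenization for NLLB."""
    # Imported lazily so the sketch stays importable without network access.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/nllb-200-distilled-600M", src_lang=source_language
    )
    # text_target tokenizes the labels with target-side settings; when it is
    # None, the output has no 'labels' key (inference mode).
    return tokenizer(
        sources,
        text_target=targets,
        padding=True,
        truncation=True,
    )
```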
train(self, x_train: 'DashAIDataset', y_train: 'DashAIDataset', x_validation: 'DashAIDataset' = None, y_validation: 'DashAIDataset' = None) -> 'NllbTransformer'
Fine-tune the NLLB model on the configured language pair.
Parameters
- x_train : DashAIDataset
- Input source-language text features for training.
- y_train : DashAIDataset
- Target-language translation labels for training.
- x_validation : DashAIDataset, optional
- Input source-language text features for validation. Default None.
- y_validation : DashAIDataset, optional
- Target-language translation labels for validation. Default None.
Returns
- NllbTransformer
- The fine-tuned model instance.
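Since the text states that fine-tuning goes through the Hugging Face Seq2SeqTrainer with AdamW (the trainer's default optimizer), the training step can be sketched as follows. The fine_tune helper, output directory, and argument choices are assumptions for illustration, not DashAI's implementation:

```python
def fine_tune(model, tokenized_train, tokenized_validation=None,
              num_train_epochs=1, batch_size=4,
              learning_rate=2e-05, weight_decay=0.01):
    """Sketch of the fine-tuning loop; defaults mirror the parameter table."""
    # Imported lazily so the sketch stays importable without transformers.
    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    args = Seq2SeqTrainingArguments(
        output_dir="nllb-finetuned",  # illustrative path
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_validation,
    )
    trainer.train()
    return model
```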
calculate_metrics(self, split: SplitEnum = SplitEnum.VALIDATION, level: LevelEnum = LevelEnum.LAST, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Calculate and save metrics for a given data split and level. (Inherited from BaseModel.)
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
get_metadata(cls) -> Dict[str, Any]
Get metadata values for the current model. (Inherited from BaseModel.)
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Generate the JSON Schema of the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the Json Schema of the component.
prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Hook for model-specific preprocessing of output targets. (Inherited from BaseModel.)
Parameters
- dataset : DashAIDataset
- The output dataset (target labels) to preprocess.
- is_fit : bool
- Whether the call is part of a fitting phase. Defaults to False.
Returns
- DashAIDataset
- The preprocessed output dataset.
validate_and_transform(self, raw_data: dict) -> dict
Take the data provided by the user to initialize the model and return it with all the objects the model needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.