NllbTransformer

Model

DashAI.back.models.hugging_face.NllbTransformer

Pre-trained transformer for configurable multilingual translation.

This model fine-tunes the facebook/nllb-200-distilled-600M checkpoint from Meta AI's No Language Left Behind (NLLB) project. The base model supports translation across 200 languages using a single unified model, identified by NLLB language codes of the form <iso639>_<script> (e.g. spa_Latn, eng_Latn). The 600M-parameter distilled variant provides a balance between translation quality and computational cost.

Target language generation is guided by forced_bos_token_id, which forces the decoder to start with the target language token. Fine-tuning is performed with the HuggingFace Seq2SeqTrainer using the AdamW optimizer. Training and validation metrics are logged at configurable epoch and step intervals via a custom MetricsCallback.
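The effect of forced_bos_token_id can be illustrated with a toy greedy decoder. This is a minimal sketch, not the model's actual generation code: the four-token vocabulary, the fixed score table, and the helper names force_first_token and greedy_decode are all hypothetical, and a real decoder recomputes its scores at every step from the tokens generated so far.

```python
import math


def force_first_token(scores, forced_id):
    """Mask every score except forced_id so only that token can be picked."""
    return [s if i == forced_id else -math.inf for i, s in enumerate(scores)]


def greedy_decode(step_scores, forced_bos_token_id=None):
    """Pick the highest-scoring token at each step.

    step_scores is a list of per-step score lists, a stand-in for model logits.
    When forced_bos_token_id is given, the first generated token is forced.
    """
    output = []
    for step, scores in enumerate(step_scores):
        if step == 0 and forced_bos_token_id is not None:
            scores = force_first_token(scores, forced_bos_token_id)
        output.append(max(range(len(scores)), key=scores.__getitem__))
    return output


# Hypothetical vocabulary: 0=<pad>, 1=eng_Latn, 2=fra_Latn, 3="hello"
TOY_SCORES = [
    [0.1, 0.2, 0.9, 0.3],  # unconstrained, decoding would start with fra_Latn
    [0.0, 0.1, 0.1, 0.8],
]

unforced = greedy_decode(TOY_SCORES)
forced = greedy_decode(TOY_SCORES, forced_bos_token_id=1)
print(unforced, forced)
```

With the forced token set to the eng_Latn id, the first generated token is the target language token regardless of what the scores alone would have selected.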

Parameters

num_train_epochs : integer, default=1: Total number of training epochs to perform.
batch_size : integer, default=4: The batch size per GPU/TPU core/CPU for training.
learning_rate : number, default=2e-05: The initial learning rate for the AdamW optimizer.
device : string, default=CPU: Hardware on which training is run. GPU is recommended when available. If GPU is selected, all available GPUs are used.
weight_decay : number, default=0.01: L2 regularization coefficient applied via the AdamW optimizer to prevent overfitting.
log_train_every_n_epochs, default=1: Log train metrics every N epochs. None disables per-epoch logging.
log_train_every_n_steps, default=None: Log train metrics every N steps. None disables per-step logging.
log_validation_every_n_epochs, default=1: Log validation metrics every N epochs. None disables per-epoch logging.
log_validation_every_n_steps, default=None: Log validation metrics every N steps. None disables per-step logging.
source_language : string, default=spa_Latn: Source language code for the NLLB tokenizer (e.g. spa_Latn for Spanish, eng_Latn for English). Uses BCP-47-style language tags in the format <iso639>_<script>.
target_language : string, default=eng_Latn: Target language code for NLLB generation (e.g. eng_Latn for English, fra_Latn for French). Uses BCP-47-style language tags in the format <iso639>_<script>.
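A quick shape check on the language codes can catch typos such as en or eng-Latn before any training starts. The sketch below collects the documented defaults in a plain dictionary and tests both language parameters against the <iso639>_<script> form. The regular expression assumes a three-letter language code followed by a four-letter script code, and the helper names are hypothetical. It does not confirm that a code is actually among the languages the checkpoint supports.

```python
import re

# Assumed shape: three lowercase letters, underscore, capitalized
# four-letter script code (e.g. spa_Latn).
NLLB_CODE_PATTERN = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")

# Defaults copied from the parameter list above.
default_params = {
    "num_train_epochs": 1,
    "batch_size": 4,
    "learning_rate": 2e-05,
    "device": "CPU",
    "weight_decay": 0.01,
    "log_train_every_n_epochs": 1,
    "log_train_every_n_steps": None,
    "log_validation_every_n_epochs": 1,
    "log_validation_every_n_steps": None,
    "source_language": "spa_Latn",
    "target_language": "eng_Latn",
}


def looks_like_nllb_code(code):
    """Return True when code has the <iso639>_<script> shape."""
    return bool(NLLB_CODE_PATTERN.fullmatch(code))


def malformed_language_params(params):
    """Return the names of language parameters whose value is malformed."""
    keys = ("source_language", "target_language")
    return [key for key in keys if not looks_like_nllb_code(params[key])]


print(malformed_language_params(default_params))
print(malformed_language_params({**default_params, "target_language": "en"}))
```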

Methods

load(cls, filename: Union[str, ForwardRef('Path')])

Defined on NllbTransformer

Restore an NllbTransformer instance from disk.

Parameters

filename : str or Path: Directory path from which the model files will be read.

Returns

NllbTransformer: The restored model instance with fitted set to the persisted value.

predict(self, x_pred: 'DashAIDataset') -> List

Defined on NllbTransformer

Translate source texts to the configured target language.

Parameters

x_pred : DashAIDataset: Source-language dataset. Only the first column is used.

Returns

list of str: One translated string per input sample, in the same order as x_pred.
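The two guarantees stated here (only the first column is read, and outputs keep the input order) can be sketched without the model. In this illustration the dataset is a plain column-oriented dictionary rather than a DashAIDataset, fake_translate stands in for the real translation step, and all helper names are hypothetical.

```python
def first_column(dataset):
    """Return the values of the first column of a column-oriented dict."""
    first_name = next(iter(dataset))
    return dataset[first_name]


def predict_in_batches(dataset, translate_batch, batch_size=4):
    """Translate the first column batch by batch, preserving input order."""
    texts = first_column(dataset)
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        translations.extend(translate_batch(batch))
    return translations


def fake_translate(batch):
    """Stand-in for the model: upper-cases each text."""
    return [text.upper() for text in batch]


data = {
    "text": ["uno", "dos", "tres", "cuatro", "cinco"],
    "notes": ["a", "b", "c", "d", "e"],  # ignored: not the first column
}
result = predict_in_batches(data, fake_translate, batch_size=2)
print(result)
```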

prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'

Defined on NllbTransformer

Return the dataset unchanged.

Parameters

dataset : DashAIDataset: The dataset to be prepared.
is_fit : bool, optional: Whether the call is made during fitting. Unused here. Default False.

Returns

DashAIDataset: The original dataset, unmodified.

save(self, filename: Union[str, ForwardRef('Path')]) -> None

Defined on NllbTransformer

Store the fine-tuned model and its configuration to disk.

Parameters

filename : str or Path: Directory path where the model files will be written. If a file exists at that path it is removed and replaced by a directory.
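The path handling described for save, together with the directory-based load, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the DashAI implementation: the file name params.json and the helpers prepare_model_dir, save_config, and load_config are hypothetical, and only a small configuration dictionary is persisted instead of model weights.

```python
import json
import tempfile
from pathlib import Path


def prepare_model_dir(filename):
    """Ensure filename is a directory, replacing a plain file if one exists."""
    path = Path(filename)
    if path.is_file():
        path.unlink()  # a file at the target path is removed first
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_config(filename, config):
    """Write the configuration into the model directory (hypothetical layout)."""
    model_dir = prepare_model_dir(filename)
    (model_dir / "params.json").write_text(json.dumps(config))
    return model_dir


def load_config(filename):
    """Read the configuration back from the model directory."""
    return json.loads((Path(filename) / "params.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nllb_model"
    target.write_text("stale file")  # simulate a file already at the path
    saved_dir = save_config(target, {"target_language": "eng_Latn", "fitted": True})
    was_directory = saved_dir.is_dir()
    restored = load_config(target)

print(was_directory, restored)
```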

tokenize_data(self, x: 'DashAIDataset', y: Optional[ForwardRef('DashAIDataset')] = None) -> 'DashAIDataset'

Defined on NllbTransformer

Tokenize input and optional target datasets for seq2seq training.

Parameters

x : DashAIDataset: Source-language dataset. Only the first column is used.
y : DashAIDataset, optional: Target-language dataset. When provided, tokenized targets are added as labels. When None, only input_ids and attention_mask are returned (inference mode).

Returns

DashAIDataset: Tokenized dataset with keys input_ids, attention_mask, and optionally labels.
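The shape of the returned data in training and inference mode can be shown with a toy whitespace tokenizer. This sketch only mirrors the documented keys. The vocabulary, the padding id 0, the unknown id 1, and the function names are hypothetical, whereas the real NLLB tokenizer is subword-based and also handles language tokens.

```python
def toy_tokenize(texts, vocab, max_length):
    """Whitespace tokenizer with right padding (0 = padding, 1 = unknown)."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab.get(token, 1) for token in text.split()][:max_length]
        pad = max_length - len(ids)
        input_ids.append(ids + [0] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def tokenize_pair(x_texts, y_texts=None, *, vocab, max_length=4):
    """Tokenize sources; add tokenized targets as labels when provided."""
    batch = toy_tokenize(x_texts, vocab, max_length)
    if y_texts is not None:
        batch["labels"] = toy_tokenize(y_texts, vocab, max_length)["input_ids"]
    return batch


VOCAB = {"hola": 2, "mundo": 3, "hello": 4, "world": 5}

train_batch = tokenize_pair(["hola mundo"], ["hello world"], vocab=VOCAB)
infer_batch = tokenize_pair(["hola"], vocab=VOCAB)
print(train_batch)
print(infer_batch)
```

Passing targets yields the three keys used for training, while omitting them yields only input_ids and attention_mask.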

train(self, x_train: 'DashAIDataset', y_train: 'DashAIDataset', x_validation: 'DashAIDataset' = None, y_validation: 'DashAIDataset' = None) -> 'NllbTransformer'

Defined on NllbTransformer

Fine-tune the NLLB model on the configured language pair.

Parameters

x_train : DashAIDataset: Input source-language text features for training.
y_train : DashAIDataset: Target-language translation labels for training.
x_validation : DashAIDataset, optional: Input source-language text features for validation. Default None.
y_validation : DashAIDataset, optional: Target-language translation labels for validation. Default None.

Returns

NllbTransformer: The fine-tuned model instance.

calculate_metrics(self, split: DashAI.back.core.enums.metrics.SplitEnum = <SplitEnum.VALIDATION: 'validation'>, level: DashAI.back.core.enums.metrics.LevelEnum = <LevelEnum.LAST: 'last'>, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)

Defined on BaseModel

Calculate and save metrics for a given data split and level.

Parameters

split : SplitEnum: The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
level : LevelEnum: The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
log_index : int, optional: Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
x_data : DashAIDataset, optional: Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
y_data : DashAIDataset, optional: Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.

get_metadata(cls) -> Dict[str, Any]

Defined on BaseModel

Get metadata values for the current model.

Returns

Dict[str, Any]: Dictionary containing UI metadata such as the model icon used in the DashAI frontend.

get_schema(cls) -> dict

Defined on ConfigObject

Generate the JSON Schema for this component.

Returns

dict: Dictionary representing the JSON Schema of the component.

prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'

Defined on BaseModel

Hook for model-specific preprocessing of output targets.

Parameters

dataset : DashAIDataset: The output dataset (target labels) to preprocess.
is_fit : bool: Whether the call is part of a fitting phase. Defaults to False.

Returns

DashAIDataset: The preprocessed output dataset.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Validate the data supplied by the user to initialize the model and return it with all the objects the model needs to work.

Parameters

raw_data : dict: A dictionary with the data provided by the user to initialize the model.

Returns

dict: A validated dictionary with the necessary objects.

Compatible with

TranslationTask
