M2M100Transformer
M2M100 multilingual seq2seq model for configurable language-pair translation.
Fine-tunes the facebook/m2m100_418M checkpoint from Meta AI. The base
model supports direct translation across 100 languages using ISO 639-1
language codes (e.g. "en", "es", "fr"). Unlike pivot-based
systems, M2M100 translates directly between any supported pair.
Target language generation is guided by forced_bos_token_id obtained
from tokenizer.get_lang_id(target_language), identical in principle
to the NLLB approach but using simpler ISO codes.
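The forced-BOS mechanism described above can be illustrated with the Hugging Face transformers API. This is a minimal sketch, not the model's actual predict implementation: the function name and loading strategy are illustrative, and the checkpoint is downloaded on first use.

```python
def translate_m2m100(text, src_lang, tgt_lang):
    """Sketch of direct M2M100 translation between two ISO 639-1 codes.

    Assumes the transformers library is installed; the import is deferred
    so the function can be defined without it.
    """
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    # Tell the tokenizer which source language it is encoding.
    tokenizer.src_lang = src_lang  # e.g. "en"
    encoded = tokenizer(text, return_tensors="pt")

    generated = model.generate(
        **encoded,
        # Force the first decoded token to be the target-language id,
        # which steers generation into the requested language.
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),  # e.g. "es"
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Because M2M100 is many-to-many, the same call works for any supported pair (e.g. `translate_m2m100("Bonjour", "fr", "de")`) without pivoting through English.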
References
- [1] https://huggingface.co/facebook/m2m100_418M
- [2] Fan et al. (2021). "Beyond English-Centric Multilingual Machine Translation." JMLR 2021.
Parameters
- num_train_epochs : integer, default=1 - Total number of training epochs to perform.
- batch_size : integer, default=4 - The batch size per GPU/TPU core/CPU for training.
- learning_rate : number, default=2e-05 - The initial learning rate for the AdamW optimizer.
- device : string, default=CPU - Hardware on which training is run. GPU is recommended when available. If GPU is selected, all available GPUs are used.
- weight_decay : number, default=0.01 - L2 regularization coefficient applied via the AdamW optimizer to prevent overfitting.
- log_train_every_n_epochs : integer or None, default=1 - Log train metrics every N epochs. None disables per-epoch logging.
- log_train_every_n_steps : integer or None, default=None - Log train metrics every N steps. None disables per-step logging.
- log_validation_every_n_epochs : integer or None, default=1 - Log validation metrics every N epochs. None disables per-epoch logging.
- log_validation_every_n_steps : integer or None, default=None - Log validation metrics every N steps. None disables per-step logging.
- source_language : string, default=en - Source language ISO 639-1 code (e.g. 'en', 'es', 'fr', 'de'). Supports 100 languages.
- target_language : string, default=es - Target language ISO 639-1 code (e.g. 'en', 'es', 'fr', 'de'). Supports 100 languages.
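Collected together, the defaults above form a complete configuration. The dict below is a sketch; the keys mirror the schema fields, but passing them as constructor keyword arguments (e.g. `M2M100Transformer(**config)`) is hypothetical usage, not a documented call.

```python
# Default hyperparameters from the schema above, gathered as a plain dict.
# Whether M2M100Transformer(**config) accepts exactly these keywords is an
# assumption; consult the generated JSON Schema via get_schema().
config = {
    "num_train_epochs": 1,
    "batch_size": 4,
    "learning_rate": 2e-05,
    "device": "CPU",
    "weight_decay": 0.01,
    "log_train_every_n_epochs": 1,
    "log_train_every_n_steps": None,
    "log_validation_every_n_epochs": 1,
    "log_validation_every_n_steps": None,
    "source_language": "en",   # ISO 639-1 source code
    "target_language": "es",   # ISO 639-1 target code
}
```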
Methods
load(cls, filename: Union[str, Path])
Restore an M2M100Transformer instance from disk.
predict(self, x_pred: 'DashAIDataset') -> List
Translate using forced_bos_token_id for the target language.
prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Return the dataset unchanged.
save(self, filename: Union[str, Path]) -> None
Persist model weights and hyperparameters to disk.
tokenize_data(self, x: 'DashAIDataset', y: Optional['DashAIDataset'] = None) -> 'DashAIDataset'
Tokenize inputs with src_lang set for M2M100.
train(self, x_train: 'DashAIDataset', y_train: 'DashAIDataset', x_validation: 'DashAIDataset' = None, y_validation: 'DashAIDataset' = None)
Fine-tune M2M100 on the configured language pair.
calculate_metrics(self, split: SplitEnum = SplitEnum.VALIDATION, level: LevelEnum = LevelEnum.LAST, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Calculate and save metrics for a given data split and level. (Inherited from BaseModel.)
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
get_metadata(cls) -> Dict[str, Any]
Get metadata values for the current model. (Inherited from BaseModel.)
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Generate the JSON Schema of the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Hook for model-specific preprocessing of output targets. (Inherited from BaseModel.)
Parameters
- dataset : DashAIDataset
- The output dataset (target labels) to preprocess.
- is_fit : bool
- Whether the call is part of a fitting phase. Defaults to False.
Returns
- DashAIDataset
- The preprocessed output dataset.
validate_and_transform(self, raw_data: dict) -> dict
Validate the data given by the user to initialize the model and return it with all the objects the model needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.