M2M100Transformer
M2M100 multilingual seq2seq model for configurable language-pair translation.
Fine-tunes the facebook/m2m100_418M checkpoint from Meta AI. The base
model supports direct translation across 100 languages using ISO 639-1
language codes (e.g. "en", "es", "fr"). Unlike pivot-based
systems, M2M100 translates directly between any supported pair.
Target language generation is guided by forced_bos_token_id obtained
from tokenizer.get_lang_id(target_language), identical in principle
to the NLLB approach but using simpler ISO codes.
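The forced-BOS mechanism described above can be illustrated with the Hugging Face transformers API. This is a minimal sketch, not the model's actual predict implementation: the function name and loading strategy are illustrative, and the checkpoint is downloaded on first use.

```python
def translate_m2m100(text, src_lang, tgt_lang):
    """Sketch of direct M2M100 translation between two ISO 639-1 codes.

    Assumes the transformers library is installed; the import is deferred
    so the function can be defined without it.
    """
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    # Tell the tokenizer which source language it is encoding.
    tokenizer.src_lang = src_lang  # e.g. "en"
    encoded = tokenizer(text, return_tensors="pt")

    generated = model.generate(
        **encoded,
        # Force the first decoded token to be the target-language id,
        # which steers generation into the requested language.
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),  # e.g. "es"
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Because M2M100 is many-to-many, the same call works for any supported pair (e.g. `translate_m2m100("Bonjour", "fr", "de")`) without pivoting through English.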
References
- [1] https://huggingface.co/facebook/m2m100_418M
- [2] Fan et al. (2021). "Beyond English-Centric Multilingual Machine Translation." JMLR 2021.
Parameters
- num_train_epochs : integer, default=1 - Total number of training epochs to perform.
- batch_size : integer, default=4 - The batch size per GPU/TPU core/CPU for training.
- learning_rate : number, default=2e-05 - The initial learning rate for the AdamW optimizer.
- device : string, default=CPU - Hardware on which training is run. GPU is recommended when available. If GPU is selected, all available GPUs are used.
- weight_decay : number, default=0.01 - L2 regularization coefficient applied via the AdamW optimizer to prevent overfitting.
- log_train_every_n_epochs : integer or None, default=1 - Log train metrics every N epochs. None disables per-epoch logging.
- log_train_every_n_steps : integer or None, default=None - Log train metrics every N steps. None disables per-step logging.
- log_validation_every_n_epochs : integer or None, default=1 - Log validation metrics every N epochs. None disables per-epoch logging.
- log_validation_every_n_steps : integer or None, default=None - Log validation metrics every N steps. None disables per-step logging.
- source_language : string, default=en - Source language ISO 639-1 code (e.g. 'en', 'es', 'fr', 'de'). Supports 100 languages.
- target_language : string, default=es - Target language ISO 639-1 code (e.g. 'en', 'es', 'fr', 'de'). Supports 100 languages.
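Collected together, the defaults above form a complete configuration. The dict below is a sketch; the keys mirror the schema fields, but passing them as constructor keyword arguments (e.g. `M2M100Transformer(**config)`) is hypothetical usage, not a documented call.

```python
# Default hyperparameters from the schema above, gathered as a plain dict.
# Whether M2M100Transformer(**config) accepts exactly these keywords is an
# assumption; consult the generated JSON Schema via get_schema().
config = {
    "num_train_epochs": 1,
    "batch_size": 4,
    "learning_rate": 2e-05,
    "device": "CPU",
    "weight_decay": 0.01,
    "log_train_every_n_epochs": 1,
    "log_train_every_n_steps": None,
    "log_validation_every_n_epochs": 1,
    "log_validation_every_n_steps": None,
    "source_language": "en",   # ISO 639-1 source code
    "target_language": "es",   # ISO 639-1 target code
}
```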
Methods
load(cls, filename: Union[str, Path])
Restore an M2M100Transformer instance from disk.
predict(self, x_pred: 'DashAIDataset') -> List
Translate using forced_bos_token_id for the target language.
prepare_dataset(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Return the dataset unchanged.
save(self, filename: Union[str, Path]) -> None
Persist model weights and hyperparameters to disk.
tokenize_data(self, x: 'DashAIDataset', y: Optional['DashAIDataset'] = None) -> 'DashAIDataset'
Tokenize inputs with src_lang set for M2M100.
train(self, x_train: 'DashAIDataset', y_train: 'DashAIDataset', x_validation: 'DashAIDataset' = None, y_validation: 'DashAIDataset' = None)
Fine-tune M2M100 on the configured language pair.
calculate_metrics(self, split: SplitEnum = SplitEnum.VALIDATION, level: LevelEnum = LevelEnum.LAST, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Calculate and save metrics for a given data split and level. (Inherited from BaseModel.)
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
get_metadata(cls) -> Dict[str, Any]
Get metadata values for the current model. (Inherited from BaseModel.)
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Generate the JSON Schema of the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
prepare_output(self, dataset: 'DashAIDataset', is_fit: bool = False) -> 'DashAIDataset'
Hook for model-specific preprocessing of output targets. (Inherited from BaseModel.)
Parameters
- dataset : DashAIDataset
- The output dataset (target labels) to preprocess.
- is_fit : bool
- Whether the call is part of a fitting phase. Defaults to False.
Returns
- DashAIDataset
- The preprocessed output dataset.
validate_and_transform(self, raw_data: dict) -> dict
Validate the data given by the user to initialize the model and return it with all the objects the model needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.