BagOfWordsTextClassificationModel
Text classification meta-model that combines a bag-of-words vectorizer with a DashAI tabular classifier.
The model converts raw text into a token-count matrix using scikit-learn's
CountVectorizer with a configurable n-gram range, then passes the
resulting sparse feature matrix to any DashAI TabularClassificationModel
for training and prediction. This decouples text featurisation from the
choice of classifier, allowing any registered DashAI tabular model (tree-based,
SVM, linear, etc.) to be applied to text classification without modification.
During training the vectorizer is fitted on the input text column and the
resulting token-count matrix is forwarded to the wrapped classifier's
train method. During inference the already-fitted vectorizer transforms
the text before calling the classifier's predict method.
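The vectorize-then-classify flow described above can be sketched with plain scikit-learn in place of DashAI's wrapper classes (the dataset and classifier here are illustrative, not DashAI's actual API):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

# Training: fit the bag-of-words vectorizer on the raw text column,
# producing a sparse token-count matrix.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

# Any tabular classifier can consume the count matrix unchanged.
clf = LogisticRegression()
clf.fit(X, labels)

# Inference: the already-fitted vectorizer transforms new text
# before it reaches the classifier.
pred = clf.predict(vectorizer.transform(["great movie"]))
```

Swapping `LogisticRegression` for a tree-based or SVM model requires no change to the vectorization step, which is the decoupling the class provides.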
Parameters
- tabular_classifier : object
- Tabular classification model wrapped by this class to perform the actual classification.
- ngram_min_n : integer, default=1
- The lower boundary of the range of n-values for word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
- ngram_max_n : integer, default=1
- The upper boundary of the range of n-values for word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
Methods
get_vectorizer(self, input_column: str, output_column: Optional[str] = None)
Factory that returns a function to transform a text classification dataset into a tabular classification dataset.
Parameters
- input_column : str
- Name of the input column of the dataset. This column will be vectorized.
- output_column : str, optional
- Name of the output column of the dataset.
Returns
- Function
- Function that vectorizes the dataset.
load(filename: Union[str, Path])
Deserialise a model from disk using joblib.
Parameters
- filename : str or Path
- Path to the file previously written by save.
Returns
- BagOfWordsTextClassificationModel
- The loaded model instance.
predict(self, x)
Generate class-probability predictions for the input text dataset.
Parameters
- x : DashAIDataset
- Input dataset containing the raw text column.
Returns
- numpy.ndarray
- Array of shape (n_samples, n_classes) with predicted probabilities for each class, as returned by the wrapped tabular classifier.
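A minimal sketch of the probability output, again using scikit-learn stand-ins rather than DashAI's own classes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["spam offer now", "meeting at noon"])
clf = LogisticRegression().fit(X, [1, 0])

# predict() transforms raw text with the fitted vectorizer, then
# delegates to the wrapped classifier's probability output.
proba = clf.predict_proba(vectorizer.transform(["free offer"]))
# proba.shape == (1, 2): one sample, two classes; each row sums to 1.
```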
prepare_dataset(self, dataset: 'DashAIDataset', is_fit=False)
Apply the model transformations to the dataset.
Parameters
- dataset : DashAIDataset
- The dataset to be transformed.
- is_fit : bool, optional
- If True, the method will apply transformations needed for fitting the model.
Returns
- DashAIDataset
- The prepared dataset, ready for conversion to a format the model accepts.
prepare_output(self, dataset, is_fit=False)
Prepare output targets by delegating to the wrapped classifier.
Parameters
- dataset : DashAIDataset
- The output dataset containing the target labels to be prepared.
- is_fit : bool, optional
- If True, fit the label encoder on the dataset before transforming. If False, apply existing encodings. Default is False.
Returns
- DashAIDataset
- The prepared output dataset with categorical labels encoded as integers.
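The fit-then-transform behavior of the is_fit flag resembles scikit-learn's LabelEncoder (used here purely as an illustration; the actual encoder is internal to DashAI):

```python
from sklearn.preprocessing import LabelEncoder

# is_fit=True: learn the label vocabulary from the training split.
encoder = LabelEncoder()
train_labels = ["spam", "ham", "spam", "ham"]
encoded = encoder.fit_transform(train_labels)  # classes sorted: ham=0, spam=1

# is_fit=False: reuse the same mapping on later splits.
decoded = encoder.inverse_transform(encoded)
```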
save(self, filename: Union[str, Path]) -> None
Serialise the model to disk using joblib.
Parameters
- filename : str or Path
- Destination file path where the model will be written.
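The save/load round-trip uses joblib; a sketch with a fitted vectorizer standing in for the full model:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Persist a fitted object to disk, then restore it in a fresh variable.
vectorizer = CountVectorizer().fit(["hello world"])
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(vectorizer, path)

restored = joblib.load(path)
# The restored vectorizer keeps its learned vocabulary.
```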
train(self, x, y, x_validation=None, y_validation=None)
Fit the bag-of-words vectorizer and the underlying tabular classifier.
Parameters
- x : DashAIDataset
- Input dataset containing the raw text column.
- y : DashAIDataset
- Target dataset containing the class labels.
- x_validation : DashAIDataset or None, optional
- Validation inputs. Not used by the default tabular classifiers but accepted for interface compatibility.
- y_validation : DashAIDataset or None, optional
- Validation targets. Not used by the default tabular classifiers but accepted for interface compatibility.
calculate_metrics(self, split: SplitEnum = SplitEnum.VALIDATION, level: LevelEnum = LevelEnum.LAST, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Calculate and save metrics for a given data split and level.
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
get_metadata(cls) -> Dict[str, Any]
Get metadata values for the current model.
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Generates the JSON Schema associated with the component.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Validates the data given by the user to initialize the model and returns it with all the objects that the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
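A sketch of schema-based validation of user-provided parameters, using the jsonschema package and a hypothetical schema resembling what get_schema might return for the n-gram parameters (the real component schema is defined by DashAI):

```python
from jsonschema import validate

# Hypothetical schema for the two n-gram parameters.
schema = {
    "type": "object",
    "properties": {
        "ngram_min_n": {"type": "integer", "minimum": 1},
        "ngram_max_n": {"type": "integer", "minimum": 1},
    },
    "required": ["ngram_min_n", "ngram_max_n"],
}

raw_data = {"ngram_min_n": 1, "ngram_max_n": 2}
validate(instance=raw_data, schema=schema)  # raises ValidationError if invalid
```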