BagOfWordsTextClassificationModel

Model
DashAI.back.models.scikit_learn.BagOfWordsTextClassificationModel

Text classification meta-model that combines a bag-of-words vectorizer with a DashAI tabular classifier.

The model converts raw text into a token-count matrix using scikit-learn's CountVectorizer with a configurable n-gram range, then passes the resulting sparse feature matrix to any DashAI TabularClassificationModel for training and prediction. This decouples text featurisation from the choice of classifier, allowing any registered DashAI tabular model (tree-based, SVM, linear, etc.) to be applied to text classification without modification.

During training the vectorizer is fitted on the input text column and the resulting token-count matrix is forwarded to the wrapped classifier's train method. During inference the already-fitted vectorizer transforms the text before calling the classifier's predict method.
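
The train/predict flow described above can be sketched with plain scikit-learn. This is only an illustration, not DashAI's implementation: LogisticRegression stands in for the wrapped tabular classifier, and the toy texts and labels are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): 1 = positive review, 0 = negative review.
train_texts = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
new_texts = ["great movie", "awful movie"]

# Training: fit the vectorizer on the text column, then fit the
# classifier on the resulting sparse token-count matrix.
vectorizer = CountVectorizer(ngram_range=(1, 1))
x_train = vectorizer.fit_transform(train_texts)
classifier = LogisticRegression()  # stand-in for the wrapped tabular model
classifier.fit(x_train, train_labels)

# Inference: reuse the already-fitted vectorizer (transform only),
# then ask the classifier for class probabilities.
x_new = vectorizer.transform(new_texts)
probabilities = classifier.predict_proba(x_new)
```

Calling transform rather than fit_transform at inference time keeps the vocabulary, and therefore the column layout of the feature matrix, identical to the one the classifier was trained on.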

Parameters

tabular_classifier : object
DashAI tabular classification model that is trained on the token-count features and used as the underlying classifier.
ngram_min_n : integer, default=1
The lower boundary of the range of n-values for the word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
ngram_max_n : integer, default=1
The upper boundary of the range of n-values for the word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
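
To make the effect of the two n-gram bounds concrete, here is a small pure-Python sketch. The helper word_ngrams is hypothetical and uses simplified whitespace tokenisation, so it does not reproduce CountVectorizer's tokenizer exactly. It also assumes ngram_min_n must not exceed ngram_max_n, which this page does not state explicitly.

```python
def word_ngrams(text, ngram_min_n=1, ngram_max_n=1):
    """Return every word n-gram with ngram_min_n <= n <= ngram_max_n."""
    # Assumption: the lower bound may not exceed the upper bound.
    if ngram_min_n < 1 or ngram_max_n < ngram_min_n:
        raise ValueError("expected 1 <= ngram_min_n <= ngram_max_n")
    tokens = text.lower().split()  # simplified whitespace tokenisation
    ngrams = []
    for n in range(ngram_min_n, ngram_max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


unigrams = word_ngrams("the cat sat")               # defaults: 1 and 1
uni_and_bigrams = word_ngrams("the cat sat", 1, 2)  # unigrams plus bigrams
```

With the defaults only single words become features. Raising ngram_max_n to 2 adds adjacent word pairs such as "the cat", which enlarges the vocabulary and captures some word order.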

Methods

get_vectorizer(self, input_column: str, output_column: Optional[str] = None)

Defined on BagOfWordsTextClassificationModel

Factory that returns a function to transform a text classification dataset into a tabular classification dataset.

Parameters

input_column : str
Name of the input column of the dataset. This column will be vectorized.
output_column : str, optional
Name of the output column of the dataset. Defaults to None.

Returns

Function
Function that vectorizes the dataset.

load(filename: Union[str, ForwardRef('Path')]) -> None

Defined on BagOfWordsTextClassificationModel

Deserialise a model from disk using joblib.

Parameters

filename : str or Path
Path to the file previously written by save.

Returns

BagOfWordsTextClassificationModel
The loaded model instance.

predict(self, x)

Defined on BagOfWordsTextClassificationModel

Generate class-probability predictions for the input text dataset.

Parameters

x : DashAIDataset
Input dataset containing the raw text column.

Returns

numpy.ndarray
Array of shape (n_samples, n_classes) with predicted probabilities for each class, as returned by the wrapped tabular classifier.
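
Because the returned array holds probabilities rather than labels, a caller typically reduces it along the class axis. A minimal NumPy sketch, using made-up probabilities for three samples and two classes:

```python
import numpy as np

# Hypothetical output of predict: shape (n_samples, n_classes).
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Index of the most probable class for each sample.
predicted_class_indices = probabilities.argmax(axis=1)
# Probability assigned to that class.
confidence = probabilities.max(axis=1)
```

The indices refer to the integer-encoded classes, so mapping them back to the original label names requires the encoding used for the targets.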

prepare_dataset(self, dataset: 'DashAIDataset', is_fit=False)

Defined on BagOfWordsTextClassificationModel

Apply the model transformations to the dataset.

Parameters

dataset : DashAIDataset
The dataset to be transformed.
is_fit : bool, optional
If True, apply the transformations needed for fitting the model. Default is False.

Returns

DashAIDataset
The prepared dataset, ready to be converted to a format accepted by the model.

prepare_output(self, dataset, is_fit=False)

Defined on BagOfWordsTextClassificationModel

Prepare output targets by delegating to the wrapped classifier.

Parameters

dataset : DashAIDataset
The output dataset containing the target labels to be prepared.
is_fit : bool, optional
If True, fit the label encoder on the dataset before transforming. If False, apply existing encodings. Default is False.

Returns

DashAIDataset
The prepared output dataset with categorical labels encoded as integers.
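
The is_fit distinction can be illustrated with a toy label encoder. SimpleLabelEncoder is a hypothetical class written for this sketch. It assumes classes are numbered in sorted order and is not the encoder used by the wrapped classifier.

```python
class SimpleLabelEncoder:
    """Toy encoder mapping categorical labels to integer codes."""

    def __init__(self):
        self.class_to_index = {}

    def encode(self, labels, is_fit=False):
        if is_fit:
            # Learn the mapping from the labels seen now (sorted order).
            classes = sorted(set(labels))
            self.class_to_index = {c: i for i, c in enumerate(classes)}
        # Apply the stored mapping; unseen labels raise KeyError.
        return [self.class_to_index[label] for label in labels]


encoder = SimpleLabelEncoder()
train_codes = encoder.encode(["spam", "ham", "spam"], is_fit=True)
test_codes = encoder.encode(["ham", "ham"])  # reuses the fitted mapping
```

Fitting only on the training targets and reusing the mapping afterwards keeps the integer codes consistent across the training, validation, and test splits.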

save(self, filename: Union[str, ForwardRef('Path')]) -> None

Defined on BagOfWordsTextClassificationModel

Serialise the model to disk using joblib.

Parameters

filename : str or Path
Destination file path where the model will be written.
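
A joblib round trip of the kind save and load perform can be sketched as follows. A plain dict stands in for the model object, and the file name model.joblib is made up for the example. The real methods serialise the model instance itself.

```python
import tempfile
from pathlib import Path

import joblib

# Stand-in for a trained model (hypothetical contents).
model_state = {"ngram_range": (1, 2), "vocabulary": {"good": 0, "movie": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(model_state, model_path)      # what save does with the model
    restored_state = joblib.load(model_path)  # what load does with the file
```

joblib uses pickle under the hood, so files should only be loaded from trusted sources.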

train(self, x, y, x_validation=None, y_validation=None)

Defined on BagOfWordsTextClassificationModel

Fit the bag-of-words vectorizer and the underlying tabular classifier.

Parameters

x : DashAIDataset
Input dataset containing the raw text column.
y : DashAIDataset
Target dataset containing the class labels.
x_validation : DashAIDataset or None, optional
Validation inputs. Not used by the default tabular classifiers but accepted for interface compatibility.
y_validation : DashAIDataset or None, optional
Validation targets. Not used by the default tabular classifiers but accepted for interface compatibility.

calculate_metrics(self, split: DashAI.back.core.enums.metrics.SplitEnum = <SplitEnum.VALIDATION: 'validation'>, level: DashAI.back.core.enums.metrics.LevelEnum = <LevelEnum.LAST: 'last'>, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)

Defined on BaseModel

Calculate and save metrics for a given data split and level.

Parameters

split : SplitEnum
The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
level : LevelEnum
The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
log_index : int, optional
Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
x_data : DashAIDataset, optional
Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
y_data : DashAIDataset, optional
Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.

get_metadata(cls) -> Dict[str, Any]

Defined on BaseModel

Get metadata values for the current model.

Returns

Dict[str, Any]
Dictionary containing UI metadata such as the model icon used in the DashAI frontend.

get_schema(cls) -> dict

Defined on ConfigObject

Generate the JSON Schema for the component.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Take the data supplied by the user to initialize the model and return it with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.

Compatible with