BagOfWordsTextClassificationModel
Text classification meta-model that combines a bag-of-words vectorizer with a DashAI tabular classifier.
The model converts raw text into a token-count matrix using scikit-learn's
CountVectorizer with a configurable n-gram range, then passes the
resulting sparse feature matrix to any DashAI TabularClassificationModel
for training and prediction. This decouples text featurisation from the
choice of classifier, allowing any registered DashAI tabular model (tree-based,
SVM, linear, etc.) to be applied to text classification without modification.
During training the vectorizer is fitted on the input text column and the
resulting token-count matrix is forwarded to the wrapped classifier's
train method. During inference the already-fitted vectorizer transforms
the text before calling the classifier's predict method.
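The vectorize-then-classify flow described above can be sketched with plain scikit-learn in place of DashAI's wrapper classes (the dataset and classifier here are illustrative, not DashAI's actual API):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

# Training: fit the bag-of-words vectorizer on the raw text column,
# producing a sparse token-count matrix.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

# Any tabular classifier can consume the count matrix unchanged.
clf = LogisticRegression()
clf.fit(X, labels)

# Inference: the already-fitted vectorizer transforms new text
# before it reaches the classifier.
pred = clf.predict(vectorizer.transform(["great movie"]))
```

Swapping `LogisticRegression` for a tree-based or SVM model requires no change to the vectorization step, which is the decoupling the class provides.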
Parameters
- tabular_classifier : object
- Tabular classification model wrapped by this class to perform the actual classification.
- ngram_min_n : integer, default=1
- The lower boundary of the range of n-values for word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
- ngram_max_n : integer, default=1
- The upper boundary of the range of n-values for word or character n-grams to be extracted. Must be an integer greater than or equal to 1.
Methods
get_vectorizer(self, input_column: str, output_column: Optional[str] = None)
Factory that returns a function to transform a text classification dataset into a tabular classification dataset.
Parameters
- input_column : str
- Name of the input column of the dataset. This column will be vectorized.
- output_column : str, optional
- Name of the output column of the dataset.
Returns
- Function
- Function that vectorizes the dataset.
load(filename: Union[str, Path])
Deserialise a model from disk using joblib.
Parameters
- filename : str or Path
- Path to the file previously written by save.
Returns
- BagOfWordsTextClassificationModel
- The loaded model instance.
predict(self, x)
Generate class-probability predictions for the input text dataset.
Parameters
- x : DashAIDataset
- Input dataset containing the raw text column.
Returns
- numpy.ndarray
- Array of shape (n_samples, n_classes) with predicted probabilities for each class, as returned by the wrapped tabular classifier.
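A minimal sketch of the probability output, again using scikit-learn stand-ins rather than DashAI's own classes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["spam offer now", "meeting at noon"])
clf = LogisticRegression().fit(X, [1, 0])

# predict() transforms raw text with the fitted vectorizer, then
# delegates to the wrapped classifier's probability output.
proba = clf.predict_proba(vectorizer.transform(["free offer"]))
# proba.shape == (1, 2): one sample, two classes; each row sums to 1.
```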
prepare_dataset(self, dataset: 'DashAIDataset', is_fit=False)
Apply the model transformations to the dataset.
Parameters
- dataset : DashAIDataset
- The dataset to be transformed.
- is_fit : bool, optional
- If True, the method will apply transformations needed for fitting the model.
Returns
- DashAIDataset
- The prepared dataset, ready for conversion to a format the model accepts.
prepare_output(self, dataset, is_fit=False)
Prepare output targets by delegating to the wrapped classifier.
Parameters
- dataset : DashAIDataset
- The output dataset containing the target labels to be prepared.
- is_fit : bool, optional
- If True, fit the label encoder on the dataset before transforming. If False, apply existing encodings. Default is False.
Returns
- DashAIDataset
- The prepared output dataset with categorical labels encoded as integers.
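The fit-then-transform behavior of the is_fit flag resembles scikit-learn's LabelEncoder (used here purely as an illustration; the actual encoder is internal to DashAI):

```python
from sklearn.preprocessing import LabelEncoder

# is_fit=True: learn the label vocabulary from the training split.
encoder = LabelEncoder()
train_labels = ["spam", "ham", "spam", "ham"]
encoded = encoder.fit_transform(train_labels)  # classes sorted: ham=0, spam=1

# is_fit=False: reuse the same mapping on later splits.
decoded = encoder.inverse_transform(encoded)
```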
save(self, filename: Union[str, Path]) -> None
Serialise the model to disk using joblib.
Parameters
- filename : str or Path
- Destination file path where the model will be written.
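The save/load round-trip uses joblib; a sketch with a fitted vectorizer standing in for the full model:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Persist a fitted object to disk, then restore it in a fresh variable.
vectorizer = CountVectorizer().fit(["hello world"])
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(vectorizer, path)

restored = joblib.load(path)
# The restored vectorizer keeps its learned vocabulary.
```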
train(self, x, y, x_validation=None, y_validation=None)
Fit the bag-of-words vectorizer and the underlying tabular classifier.
Parameters
- x : DashAIDataset
- Input dataset containing the raw text column.
- y : DashAIDataset
- Target dataset containing the class labels.
- x_validation : DashAIDataset or None, optional
- Validation inputs. Not used by the default tabular classifiers but accepted for interface compatibility.
- y_validation : DashAIDataset or None, optional
- Validation targets. Not used by the default tabular classifiers but accepted for interface compatibility.
calculate_metrics(self, split: SplitEnum = SplitEnum.VALIDATION, level: LevelEnum = LevelEnum.LAST, log_index: int = None, x_data: 'DashAIDataset' = None, y_data: 'DashAIDataset' = None)
Calculate and save metrics for a given data split and level.
Parameters
- split : SplitEnum
- The data split to evaluate (TRAIN, VALIDATION, or TEST). Defaults to SplitEnum.VALIDATION.
- level : LevelEnum
- The metric granularity level (LAST, TRIAL, STEP, or BATCH). Defaults to LevelEnum.LAST.
- log_index : int, optional
- Explicit step index for the metric entry. If None, the next step index is computed automatically. Defaults to None.
- x_data : DashAIDataset, optional
- Input features. If None, the dataset stored in the model for the given split is used. Defaults to None.
- y_data : DashAIDataset, optional
- Target labels. If None, the labels stored in the model for the given split are used. Defaults to None.
get_metadata(cls) -> Dict[str, Any]
Get metadata values for the current model.
Returns
- Dict[str, Any]
- Dictionary containing UI metadata such as the model icon used in the DashAI frontend.
get_schema(cls) -> dict
Generates the JSON Schema associated with the component.
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Validates the data given by the user to initialize the model and returns it with all the objects that the model needs to work.
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.
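A sketch of schema-based validation of user-provided parameters, using the jsonschema package and a hypothetical schema resembling what get_schema might return for the n-gram parameters (the real component schema is defined by DashAI):

```python
from jsonschema import validate

# Hypothetical schema for the two n-gram parameters.
schema = {
    "type": "object",
    "properties": {
        "ngram_min_n": {"type": "integer", "minimum": 1},
        "ngram_max_n": {"type": "integer", "minimum": 1},
    },
    "required": ["ngram_min_n", "ngram_max_n"],
}

raw_data = {"ngram_min_n": 1, "ngram_max_n": 2}
validate(instance=raw_data, schema=schema)  # raises ValidationError if invalid
```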