BagOfWordsConverter

Converter
DashAI.back.converters.scikit_learn.BagOfWordsConverter

Convert raw text documents into a matrix of token occurrence counts.

The Bag-of-Words (BoW) model represents each document as a fixed-length vector of word counts, discarding word order and grammar. During fit, a vocabulary of at most max_features terms is built from the training corpus. During transform, each document is mapped onto that vocabulary, producing one integer-valued column per token; terms outside the vocabulary are ignored.
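The two phases can be sketched in plain Python. This is a simplified illustration of the idea, not the converter's implementation: it assumes whitespace tokenization and always lower-cases, whereas CountVectorizer uses a regular-expression tokenizer, and the names fit_vocabulary and transform_counts are hypothetical.

```python
from collections import Counter


def fit_vocabulary(documents, max_features):
    """Map the max_features most frequent terms to column indices."""
    totals = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    kept = [term for term, _ in totals.most_common(max_features)]
    # Columns are ordered alphabetically, as CountVectorizer does.
    return {term: index for index, term in enumerate(sorted(kept))}


def transform_counts(documents, vocabulary):
    """Return one row of term counts per document."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Terms that are not in the vocabulary are simply dropped.
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


corpus = ["the cat sat on the mat", "the dog sat"]
vocabulary = fit_vocabulary(corpus, max_features=2)
train_rows = transform_counts(corpus, vocabulary)
new_rows = transform_counts(["the the bird"], vocabulary)
```

Here only "the" and "sat" survive the max_features cut, so every document becomes a two-element vector regardless of its length.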

Optional preprocessing steps include lower-casing, stop-word removal, and n-gram extraction (unigrams, bigrams, …). The resulting sparse integer matrix is converted to a DashAI dataset with one column per vocabulary term.

Internally wraps sklearn.feature_extraction.text.CountVectorizer.

The BoW representation is one of the foundational techniques in information retrieval described in Salton & McGill (1983) [2].

References

Parameters

max_features : integer, default=1000
Maximum vocabulary size. Only the max_features most frequent terms in the training corpus are kept.
lowercase : boolean, default=True
Whether to convert all characters to lowercase before tokenizing.
stop_words : string or None, default=None
Stop-word list to remove. Use 'english' for the built-in English list, or None to keep all words.
lower_bound_ngrams : integer, default=1
Smallest n-gram size to extract. Must be <= upper_bound_ngrams.
upper_bound_ngrams : integer, default=1
Largest n-gram size to extract. Must be >= lower_bound_ngrams.

Methods

fit(self, x: 'DashAIDataset', y=None) -> 'BagOfWordsConverter'

Defined on BagOfWordsConverter

Fit CountVectorizer on the first text column of the dataset.

Parameters

x : DashAIDataset
Input dataset. Only the first column is used for fitting.
y : ignored
Present for API compatibility.

Returns

BagOfWordsConverter
The fitted converter instance (self).

transform(self, x: 'DashAIDataset', y=None) -> 'DashAIDataset'

Defined on BagOfWordsConverter

Transform text into Bag-of-Words token-count columns.

Parameters

x : DashAIDataset
Input dataset. The first column is vectorised.
y : ignored
Present for API compatibility.

Returns

DashAIDataset
Dataset with one integer count column per vocabulary token.

changes_row_count(self) -> 'bool'

Defined on BaseConverter

Indicate whether this converter changes the number of dataset rows.

Returns

bool
True if the converter may add or remove rows, False otherwise.

get_metadata(cls) -> 'Dict[str, Any]'

Defined on BaseConverter

Get metadata for the converter, used by the DashAI frontend.

Parameters

cls : type
The converter class (injected automatically by Python for classmethods).

Returns

Dict[str, Any]
Dictionary containing display name, short description, image preview path, category, icon, color, and whether the converter is supervised.

get_output_type(self, column_name: 'str' = None) -> 'DashAIDataType'

Defined on BaseConverter

Return the DashAI data type produced by this converter for a given column.

Parameters

column_name : str, optional
The name of the output column. Useful for converters that may produce different types per column. Defaults to None.

Returns

DashAIDataType
The output data type for the specified column.

get_schema(cls) -> dict

Defined on ConfigObject

Generate the JSON Schema for the component.

Returns

dict
Dictionary representing the JSON Schema of the component.

validate_and_transform(self, raw_data: dict) -> dict

Defined on ConfigObject

Validate the data supplied by the user to initialize the model and return it together with all the objects the model needs to work.

Parameters

raw_data : dict
A dictionary with the data provided by the user to initialize the model.

Returns

dict
A validated dictionary with the necessary objects.