TFIDFConverter
Convert raw text documents into a matrix of TF-IDF weighted features.
TF-IDF (Term Frequency - Inverse Document Frequency) re-weights raw token counts so that terms that appear frequently in a specific document but rarely across the whole corpus receive a higher score, while common stop-like terms are down-weighted. Each document is represented as a floating-point vector of TF-IDF scores, one dimension per vocabulary term.
The TF-IDF score for term t in document d is:
tfidf(t, d) = tf(t, d) * (log((1 + n) / (1 + df(t))) + 1)
where n is the total number of documents and df(t) is the number of
documents containing t (scikit-learn's smooth_idf=True default).
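The formula above can be checked directly with a minimal, standard-library-only sketch (the function names here are illustrative, not part of DashAI or scikit-learn):

```python
import math

def smooth_idf(n_docs: int, df: int) -> float:
    # scikit-learn's smooth_idf=True formula: log((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(tf: float, n_docs: int, df: int) -> float:
    return tf * smooth_idf(n_docs, df)

# A term occurring in every document keeps weight tf * 1 rather than 0,
# because the "+ 1" is added to the IDF factor, not to the product.
print(tfidf(3, 10, 10))  # 3 * (log(11/11) + 1) = 3.0
```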
Optional preprocessing steps include lower-casing, stop-word removal, and n-gram extraction. The result is a floating-point DashAI dataset with one column per vocabulary term.
Internally wraps sklearn.feature_extraction.text.TfidfVectorizer.
The TF-IDF weighting scheme was introduced as a core technique in information retrieval by Salton & McGill (1983) [2].
References
- [1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- [2] Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Parameters
- max_features : integer, default=1000
  Maximum number of features (most frequent terms) to keep.
- lowercase : boolean, default=True
  Whether to convert all characters to lowercase before tokenizing.
- stop_words : string or None, default=None
  Stop word set to remove. Use 'english' or None.
- lower_bound_ngrams : integer, default=1
  Lower bound for n-grams to be extracted. Must be <= upper_bound_ngrams.
- upper_bound_ngrams : integer, default=1
  Upper bound for n-grams to be extracted. Must be >= lower_bound_ngrams.
Methods
fit(self, x: 'DashAIDataset', y=None) -> 'TFIDFConverter'
Fit the TfidfVectorizer on the first text column of the dataset.
Parameters
- x : DashAIDataset
- Input dataset. Only the first column is used for fitting.
- y : ignored
- Present for API compatibility.
Returns
- TFIDFConverter
- The fitted converter instance (self).
transform(self, x: 'DashAIDataset', y=None) -> 'DashAIDataset'
Transform text into TF-IDF weighted token columns.
Parameters
- x : DashAIDataset
- Input dataset. The first column is vectorised.
- y : ignored
- Present for API compatibility.
Returns
- DashAIDataset
- Dataset where each token becomes a numeric TF-IDF weight column.
changes_row_count(self) -> 'bool'
Indicate whether this converter changes the number of dataset rows. (Inherited from BaseConverter.)
Returns
- bool
- True if the converter may add or remove rows, False otherwise.
get_metadata(cls) -> 'Dict[str, Any]'
Get metadata for the converter, used by the DashAI frontend. (Inherited from BaseConverter.)
Parameters
- cls : type
- The converter class (injected automatically by Python for classmethods).
Returns
- Dict[str, Any]
- Dictionary containing display name, short description, image preview path, category, icon, color, and whether the converter is supervised.
get_output_type(self, column_name: 'str' = None) -> 'DashAIDataType'
Return the DashAI data type produced by this converter for a given column. (Inherited from BaseConverter.)
Parameters
- column_name : str, optional
- The name of the output column. Useful for converters that may produce different types per column. Defaults to None.
Returns
- DashAIDataType
- The output data type for the specified column.
get_schema(cls) -> dict
Generate the JSON Schema for the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Validate the user-provided initialization data and return it with all the objects the component needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.