TFIDFConverter
Convert raw text documents into a matrix of TF-IDF weighted features.
TF-IDF (Term Frequency - Inverse Document Frequency) re-weights raw token counts so that terms that appear frequently in a specific document but rarely across the whole corpus receive a higher score, while common stop-like terms are down-weighted. Each document is represented as a floating-point vector of TF-IDF scores, one dimension per vocabulary term.
The TF-IDF score for term t in document d is:
tfidf(t, d) = tf(t, d) * (log((1 + n) / (1 + df(t))) + 1)
where n is the total number of documents and df(t) is the number of
documents containing t (scikit-learn's smooth_idf=True default).
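The formula above can be checked directly with a minimal, standard-library-only sketch (the function names here are illustrative, not part of DashAI or scikit-learn):

```python
import math

def smooth_idf(n_docs: int, df: int) -> float:
    # scikit-learn's smooth_idf=True formula: log((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(tf: float, n_docs: int, df: int) -> float:
    return tf * smooth_idf(n_docs, df)

# A term occurring in every document keeps weight tf * 1 rather than 0,
# because the "+ 1" is added to the IDF factor, not to the product.
print(tfidf(3, 10, 10))  # 3 * (log(11/11) + 1) = 3.0
```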
Optional preprocessing steps include lower-casing, stop-word removal, and n-gram extraction. The result is a floating-point DashAI dataset with one column per vocabulary term.
Internally wraps sklearn.feature_extraction.text.TfidfVectorizer.
The TF-IDF weighting scheme was introduced as a core technique in information retrieval by Salton & McGill (1983) [2].
References
- [1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- [2] Salton, G. & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Parameters
- max_features : integer, default=1000
  Maximum number of features (most frequent terms) to keep.
- lowercase : boolean, default=True
  Whether to convert all characters to lowercase before tokenizing.
- stop_words : string or None, default=None
  Stop word set to remove. Use 'english' or None.
- lower_bound_ngrams : integer, default=1
  Lower bound for n-grams to be extracted. Must be <= upper_bound_ngrams.
- upper_bound_ngrams : integer, default=1
  Upper bound for n-grams to be extracted. Must be >= lower_bound_ngrams.
Methods
fit(self, x: 'DashAIDataset', y=None) -> 'TFIDFConverter'
Fit the TfidfVectorizer on the first text column of the dataset.
Parameters
- x : DashAIDataset
- Input dataset. Only the first column is used for fitting.
- y : ignored
- Present for API compatibility.
Returns
- TFIDFConverter
- The fitted converter instance (self).
transform(self, x: 'DashAIDataset', y=None) -> 'DashAIDataset'
Transform text into TF-IDF weighted token columns.
Parameters
- x : DashAIDataset
- Input dataset. The first column is vectorised.
- y : ignored
- Present for API compatibility.
Returns
- DashAIDataset
- Dataset where each token becomes a numeric TF-IDF weight column.
changes_row_count(self) -> 'bool'
Indicate whether this converter changes the number of dataset rows. (Inherited from BaseConverter.)
Returns
- bool
- True if the converter may add or remove rows, False otherwise.
get_metadata(cls) -> 'Dict[str, Any]'
Get metadata for the converter, used by the DashAI frontend. (Inherited from BaseConverter.)
Parameters
- cls : type
- The converter class (injected automatically by Python for classmethods).
Returns
- Dict[str, Any]
- Dictionary containing display name, short description, image preview path, category, icon, color, and whether the converter is supervised.
get_output_type(self, column_name: 'str' = None) -> 'DashAIDataType'
Return the DashAI data type produced by this converter for a given column. (Inherited from BaseConverter.)
Parameters
- column_name : str, optional
- The name of the output column. Useful for converters that may produce different types per column. Defaults to None.
Returns
- DashAIDataType
- The output data type for the specified column.
get_schema(cls) -> dict
Generate the JSON Schema for the component. (Inherited from ConfigObject.)
Returns
- dict
- Dictionary representing the JSON Schema of the component.
validate_and_transform(self, raw_data: dict) -> dict
Validate the user-provided initialization data and return it with all the objects the component needs to work. (Inherited from ConfigObject.)
Parameters
- raw_data : dict
- A dictionary with the data provided by the user to initialize the model.
Returns
- dict
- A validated dictionary with the necessary objects.