TfidfVectorizer

class cuml.feature_extraction.text.TfidfVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

Parameters:

lowercase : boolean, default=True

Convert all characters to lowercase before tokenizing.

preprocessor : callable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-gram generation steps.

stop_words : string {‘english’}, list, or None (default)

If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. In that case, max_df can be set to a value below 1.0 to automatically detect and filter stop words based on the intra-corpus document frequency of terms.

ngram_range : tuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for the word n-grams or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
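The ngram_range examples above can be sketched in plain Python. This is only an illustration of which n-grams each range selects; the helper name `word_ngrams` is hypothetical and this is not how cuML builds n-grams on the GPU.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return every word n-gram with min_n <= n <= max_n, joined by spaces.

    Hypothetical helper for illustration only (not part of cuML).
    """
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "the quick brown fox".split()
unigrams = word_ngrams(tokens, (1, 1))         # only unigrams
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # unigrams, then bigrams
bigrams = word_ngrams(tokens, (2, 2))          # only bigrams
```

With four tokens, (1, 1) yields four features, (2, 2) yields three, and (1, 2) yields all seven.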

analyzer : string, {‘word’, ‘char’, ‘char_wb’}, default=‘word’

Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
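As a rough illustration of the ‘char_wb’ description, the sketch below pads each word with one space on each side and takes character n-grams inside each padded word, so no n-gram spans two words. The helper `char_wb_ngrams` is hypothetical, handles a single n, and simply skips words shorter than the window; cuML's actual handling of such edge cases may differ.

```python
def char_wb_ngrams(text, n):
    """Character n-grams taken inside word boundaries.

    Hypothetical helper for illustration only (not part of cuML).
    Each word is padded with one space per side before windowing.
    """
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        # Slide a window of n characters across the padded word only,
        # so no n-gram crosses from one word into the next.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


trigrams = char_wb_ngrams("hi you", 3)
# trigrams == [" hi", "hi ", " yo", "you", "ou "]
```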

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
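The max_df and min_df thresholds described above can be sketched as a document-frequency filter. This is a minimal pure-Python illustration, assuming pre-tokenized documents; the function name `filter_by_document_frequency` is hypothetical and the code is not cuML's implementation.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Split terms into (kept, dropped) using document-frequency bounds.

    Hypothetical helper for illustration only (not part of cuML).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents containing the term,
    # not the total number of occurrences.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # A float is a proportion of documents; an int is an absolute count.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    # Terms strictly below `low` or strictly above `high` are ignored.
    kept = sorted(t for t, count in df.items() if low <= count <= high)
    dropped = sorted(t for t in df if t not in kept)
    return kept, dropped


docs = [
    ["gpu", "text", "the"],
    ["gpu", "the", "sparse"],
    ["the", "matrix"],
    ["the", "gpu"],
]
# min_df=2 is an absolute count; max_df=0.75 means 75% of 4 documents = 3.
kept, dropped = filter_by_document_frequency(docs, min_df=2, max_df=0.75)
```

Here "the" appears in all four documents and exceeds max_df, while "text", "sparse", and "matrix" each appear once and fall below min_df, so only "gpu" is kept.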

max_features : int or None, default=None

If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabulary : cudf.Series, optional

If not given, a vocabulary is determined from the input documents.

binary : boolean, default=False

If True, all nonzero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtype : type, optional

Type of the matrix returned by fit_transform() or transform().

delimiter : str, default=' '

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

norm : {‘l1’, ‘l2’}, default=‘l2’

Each output row will have unit norm, either:

‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
‘l1’: Sum of absolute values of vector elements is 1.
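The two normalization options can be illustrated on dense Python lists. This is a sketch only: the helper `normalize_row` is hypothetical, and cuML operates on sparse GPU matrices rather than lists.

```python
import math


def normalize_row(row, norm="l2"):
    """Scale a row to unit l1 or l2 norm.

    Hypothetical helper for illustration only (not part of cuML).
    """
    if norm == "l2":
        scale = math.sqrt(sum(v * v for v in row))
    elif norm == "l1":
        scale = sum(abs(v) for v in row)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    # In this sketch, an all-zero row is returned unchanged to avoid
    # dividing by zero.
    return [v / scale for v in row] if scale else list(row)


a = normalize_row([3.0, 4.0, 0.0], "l2")   # [0.6, 0.8, 0.0]
b = normalize_row([4.0, 3.0, 0.0], "l2")   # [0.8, 0.6, 0.0]
# After l2 normalization, cosine similarity is just the dot product.
cosine = sum(x * y for x, y in zip(a, b))  # 0.6*0.8 + 0.8*0.6 = 0.96
l1_row = normalize_row([1.0, 3.0, 0.0], "l1")
```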

use_idf : bool, default=True

Enable inverse-document-frequency reweighting.

smooth_idf : bool, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
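A small sketch of the smoothing effect, assuming the IDF formula used by scikit-learn's TfidfTransformer: ln((1 + n) / (1 + df)) + 1 with smoothing, and ln(n / df) + 1 without. Because this class is described as largely based on scikit-learn, the same formula is assumed here; the helper `idf_weight` is hypothetical.

```python
import math


def idf_weight(n_docs, doc_freq, smooth_idf=True):
    """IDF for a single term, assuming scikit-learn's formula.

    Hypothetical helper for illustration only (not part of cuML).
    """
    if smooth_idf:
        # Pretend one extra document contains every term exactly once.
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
    return math.log(n_docs / doc_freq) + 1.0


common = idf_weight(n_docs=4, doc_freq=4)   # term in every document -> 1.0
rare = idf_weight(n_docs=4, doc_freq=1)     # rarer term gets a larger weight
unseen = idf_weight(n_docs=4, doc_freq=0)   # still finite with smoothing
```

Without smoothing, a term with a document frequency of zero would trigger a division by zero, which is the case the added "extra document" guards against.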

sublinear_tf : bool, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
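The effect of sublinear scaling on raw counts can be sketched as follows. The helper `sublinear_tf_value` is hypothetical, and leaving zero counts at zero is an assumption of this sketch (zeros are typically not stored in a sparse matrix).

```python
import math


def sublinear_tf_value(tf):
    """Replace a raw term count with 1 + log(tf).

    Hypothetical helper for illustration only (not part of cuML).
    Zero counts are left at zero in this sketch.
    """
    return 1.0 + math.log(tf) if tf > 0 else 0.0


raw_counts = [0, 1, 10, 100]
scaled = [sublinear_tf_value(tf) for tf in raw_counts]
# A 100x increase in raw count raises the scaled value from 1.0 to
# 1 + ln(100), roughly 5.6, so very frequent terms dominate less.
```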

Attributes:

idf_ : array of shape (n_features,)

The inverse document frequency (IDF) vector; only defined if use_idf is True.

vocabulary_ : cudf.Series[str]

Array mapping from feature integer indices to feature names.

stop_words_ : cudf.Series[str]

Terms that were ignored because they either:

occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

`fit`(raw_documents)	Learn vocabulary and idf from training set.
`fit_transform`(raw_documents[, y])	Learn vocabulary and idf, return document-term matrix.
`get_feature_names`()	Array mapping from feature integer indices to feature name.
`transform`(raw_documents)	Transform documents to document-term matrix.

Notes

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.
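The pickling advice above can be sketched with a stand-in object. The `SimpleNamespace` below is a hypothetical placeholder for a fitted vectorizer, used only so the example runs without a GPU; with a real fitted TfidfVectorizer, the same assignment would be applied to its stop_words_ attribute.

```python
import pickle
from types import SimpleNamespace

# Hypothetical stand-in for a fitted vectorizer (not a cuML object).
model = SimpleNamespace(
    vocabulary_=["gpu", "matrix"],
    stop_words_=["term%d" % i for i in range(10000)],  # large, introspection only
)

size_with_stop_words = len(pickle.dumps(model))

# Drop the introspection-only attribute before pickling.
model.stop_words_ = None
payload = pickle.dumps(model)
size_without_stop_words = len(payload)

restored = pickle.loads(payload)
```

The restored object keeps its vocabulary while the serialized payload is much smaller.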

This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.

fit(raw_documents)

Learn vocabulary and idf from training set.

Parameters:

raw_documents : cudf.Series or pd.Series

A Series of string documents.

Returns:

self : object

Fitted vectorizer.

fit_transform(raw_documents, y=None)

Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:

raw_documents : cudf.Series or pd.Series

A Series of string documents.

y : None

Ignored.

Returns:

X : cupy csr array of shape (n_samples, n_features)

Tf-idf-weighted document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

Returns:

feature_names : Series

A list of feature names.

transform(raw_documents)

Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Parameters:

raw_documents : cudf.Series or pd.Series

A Series of string documents.

Returns:

X : cupy csr array of shape (n_samples, n_features)

Tf-idf-weighted document-term matrix.