TfidfVectorizer
- class cuml.feature_extraction.text.TfidfVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
- Parameters:
- lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.
- preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-gram generation steps.
- stop_words : string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words are used. max_df can be set to automatically detect and filter stop words based on the intra-corpus document frequency of terms.
- ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzer : string, {'word', 'char', 'char_wb'}, default='word'
Whether features should be made of word n-grams or character n-grams. The 'char_wb' option creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
- max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
- min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
- max_features : int or None, default=None
If not None, build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
- vocabulary : cudf.Series, optional
If not given, a vocabulary is determined from the input documents.
- binary : boolean, default=False
If True, all nonzero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
- delimiter : str, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
- norm : {'l1', 'l2'}, default='l2'
Each output row will have unit norm, either:
'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
'l1': Sum of absolute values of vector elements is 1.
- use_idf : bool, default=True
Enable inverse-document-frequency reweighting.
- smooth_idf : bool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- sublinear_tf : bool, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
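How smooth_idf, sublinear_tf and norm interact can be sketched in plain Python. This is a CPU-only illustration under stated assumptions, not cuML's implementation: the helper `tfidf_rows` and its whitespace tokenizer are hypothetical, and the formula idf = ln((n + 1) / (df + 1)) + 1 is the scikit-learn convention, assumed here to carry over.

```python
import math
from collections import Counter


def tfidf_rows(docs, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Hypothetical helper: return (vocabulary, dense rows of tf-idf weights)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothing adds one to each document frequency and to the document
    # count, as if one extra document contained every term once.
    extra = 1 if smooth_idf else 0
    idf = {t: math.log((n_docs + extra) / (df[t] + extra)) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = []
        for t in vocab:
            tf = float(counts[t])
            if sublinear_tf and tf > 0:
                tf = 1.0 + math.log(tf)  # dampen repeated terms
            row.append(tf * idf[t])
        if norm == "l2":
            scale = math.sqrt(sum(w * w for w in row))
        elif norm == "l1":
            scale = sum(abs(w) for w in row)
        else:
            scale = 1.0
        rows.append([w / scale for w in row] if scale else row)
    return vocab, rows


docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
vocab, rows = tfidf_rows(docs)
```

With the default 'l2' norm every row has unit length, so the dot product of two rows is their cosine similarity; here the first and third documents share no terms and their dot product is zero.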
- Attributes:
- idf_ : array of shape (n_features)
The inverse document frequency (IDF) vector; only defined if use_idf is True.
- vocabulary_ : cudf.Series[str]
Array mapping from feature integer indices to feature name.
- stop_words_ : cudf.Series[str]
Terms that were ignored because they either:
occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
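The three reasons a term can end up ignored can be sketched as follows. This is a plain-Python illustration with a hypothetical helper `split_vocabulary` and a whitespace tokenizer; the alphabetical tie-break for max_features is an assumption, and cuML's GPU implementation may order ties differently.

```python
from collections import Counter


def split_vocabulary(docs, max_df=1.0, min_df=1, max_features=None):
    """Hypothetical helper: return (kept_terms, ignored_terms)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Floats are proportions of documents, integers are absolute counts.
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df

    df = Counter(tok for toks in tokenized for tok in set(toks))  # document frequency
    tf = Counter(tok for toks in tokenized for tok in toks)       # corpus term frequency

    # Drop terms strictly above max_count or strictly below min_count.
    kept = [t for t in df if min_count <= df[t] <= max_count]

    if max_features is not None:
        # Highest corpus-wide term frequency first; ties broken alphabetically
        # (the tie-break is an assumption of this sketch).
        kept = sorted(kept, key=lambda t: (-tf[t], t))[:max_features]

    kept = sorted(kept)
    ignored = sorted(set(df) - set(kept))
    return kept, ignored


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat ran"]
kept, ignored = split_vocabulary(docs, max_df=0.75, min_df=2)
print(kept)     # terms that form the vocabulary
print(ignored)  # terms that a stop_words_-style attribute would hold
```

In this toy corpus "the" appears in all four documents and exceeds max_df=0.75, while the terms seen in a single document fall below min_df=2.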
Methods
fit(raw_documents): Learn vocabulary and idf from training set.
fit_transform(raw_documents[, y]): Learn vocabulary and idf, return document-term matrix.
get_feature_names(): Array mapping from feature integer indices to feature name.
transform(raw_documents): Transform documents to document-term matrix.
Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.
This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.
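The effect of dropping stop_words_ before pickling can be illustrated with a stand-in object. The sketch below uses `types.SimpleNamespace` as a hypothetical substitute for a fitted vectorizer, with made-up attribute contents, so it runs without a GPU; it only demonstrates the size difference and does not exercise cuML itself.

```python
import pickle
from types import SimpleNamespace

# Hypothetical stand-in for a fitted vectorizer (not the cuML class).
model = SimpleNamespace(
    vocabulary_=["cat", "sat"],
    # Pretend many terms were filtered out during fitting.
    stop_words_=[f"rare_term_{i}" for i in range(10_000)],
)

size_before = len(pickle.dumps(model))

# The attribute is only for introspection, so remove it before pickling.
delattr(model, "stop_words_")
size_after = len(pickle.dumps(model))

print(size_before, size_after)
```

Setting the attribute to None instead of deleting it gives a similar reduction while keeping the attribute name present on the object.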
- fit(raw_documents)
Learn vocabulary and idf from training set.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- Returns:
- self : object
Fitted vectorizer.
- fit_transform(raw_documents, y=None)
Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- y : None
Ignored.
- Returns:
- X : cupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
- get_feature_names()
Array mapping from feature integer indices to feature name.
- Returns:
- feature_names : Series
A Series of feature names.
- transform(raw_documents)
Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- Returns:
- X : cupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.