CountVectorizer#

class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#

Convert a collection of text documents to a matrix of token counts

If you do not provide an a-priori dictionary then the number of features will be equal to the vocabulary size found by analyzing the data.

Parameters:
lowercaseboolean, True by default

Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, or None (default)

If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.

ngram_rangetuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}

Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

max_dffloat in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_dffloat in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

max_featuresint or None, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabularycudf.Series, optional

If not given, a vocabulary is determined from the input documents.

binaryboolean, default=False

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtypetype, optional

Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

Attributes:
vocabulary_cudf.Series[str]

Array mapping from feature integer indices to feature name.

stop_words_cudf.Series[str]
Terms that were ignored because they either:
  • occurred in too many documents (max_df)

  • occurred in too few documents (min_df)

  • were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

fit(raw_documents[, y])

Build a vocabulary of all tokens in the raw documents.

fit_transform(raw_documents[, y])

Build the vocabulary and return document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

inverse_transform(X)

Return terms per document with nonzero entries in X.

transform(raw_documents)

Transform documents to document-term matrix.

fit(raw_documents, y=None)[source]#

Build a vocabulary of all tokens in the raw documents.

Parameters:
raw_documentscudf.Series or pd.Series

A Series of string documents

yNone

Ignored.

Returns:
self
fit_transform(raw_documents, y=None)[source]#

Build the vocabulary and return document-term matrix.

Equivalent to self.fit(X).transform(X) but preprocess X only once.

Parameters:
raw_documentscudf.Series or pd.Series

A Series of string documents

yNone

Ignored.

Returns:
Xcupy csr array of shape (n_samples, n_features)

Document-term matrix.

get_feature_names()[source]#

Array mapping from feature integer indices to feature name.

Returns:
feature_namesSeries

A list of feature names.

inverse_transform(X)[source]#

Return terms per document with nonzero entries in X.

Parameters:
Xarray-like of shape (n_samples, n_features)

Document-term matrix.

Returns:
X_invlist of cudf.Series of shape (n_samples,)

List of Series of terms.

transform(raw_documents)[source]#

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters:
raw_documentscudf.Series or pd.Series

A Series of string documents

Returns:
Xcupy csr array of shape (n_samples, n_features)

Document-term matrix.