CountVectorizer

class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')

Convert a collection of text documents to a matrix of token counts.

If you do not provide an a priori dictionary through the vocabulary parameter, the number of features equals the size of the vocabulary found by analyzing the data.

Parameters:

lowercase (boolean, True by default): Convert all characters to lowercase before tokenizing.
preprocessor (callable or None (default)): Override the preprocessing (string transformation) stage while preserving the tokenizing and n-gram generation steps.
stop_words (string {‘english’}, list, or None (default)): If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can also be set to automatically detect and filter stop words based on the intra-corpus document frequency of terms.
ngram_range (tuple (min_n, max_n), default=(1, 1)): The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
analyzer (string, {‘word’, ‘char’, ‘char_wb’}, default=‘word’): Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces.
max_df (float in range [0.0, 1.0] or int, default=1.0): When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
min_df (float in range [0.0, 1.0] or int, default=1): When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
max_features (int or None, default=None): If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
vocabulary (cudf.Series, optional): If not given, a vocabulary is determined from the input documents.
binary (boolean, default=False): If True, all nonzero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
dtype (type, optional): Type of the matrix returned by fit_transform() or transform().
delimiter (str, whitespace by default): String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
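
The ngram_range semantics described above can be illustrated with a small pure-Python sketch. It is not the cuML implementation (which runs on the GPU over cudf string columns); the helper name `word_ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption about tokenization.

```python
def word_ngrams(text, ngram_range=(1, 1), lowercase=True):
    """Return word n-grams for every n with min_n <= n <= max_n.

    Hypothetical helper for illustration only; not part of cuML.
    """
    min_n, max_n = ngram_range
    if lowercase:
        text = text.lower()
    tokens = text.split()  # assumption: tokens are separated by whitespace
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


unigrams = word_ngrams("The quick fox", ngram_range=(1, 1))
uni_and_bi = word_ngrams("The quick fox", ngram_range=(1, 2))
bigrams = word_ngrams("The quick fox", ngram_range=(2, 2))
```

With (1, 2) the result contains the three unigrams followed by the two bigrams, which is why widening ngram_range increases the number of features.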

Attributes:

vocabulary_ (cudf.Series[str])

Array mapping from feature integer indices to feature name.

stop_words_ (cudf.Series[str])

Terms that were ignored because they either:

occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).

This is only available if no vocabulary was given.
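
How the three criteria partition terms into a vocabulary and a set of ignored terms can be sketched in plain Python. This is an illustration of the documented thresholds, not cuML's code: the function name `split_vocabulary` is hypothetical, documents are assumed to be pre-tokenized lists, and the alphabetical tie-break for max_features is an arbitrary choice made here.

```python
from collections import Counter


def split_vocabulary(tokenized_docs, max_df=1.0, min_df=1, max_features=None):
    """Return (vocabulary, stop_words) as sorted lists. Illustrative only."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
        term_freq.update(tokens)

    # A float is a proportion of documents; an int is an absolute count.
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    low = min_df * n_docs if isinstance(min_df, float) else min_df

    # Terms strictly above max_df or strictly below min_df are ignored.
    kept = {term for term, df in doc_freq.items() if low <= df <= high}
    if max_features is not None:
        # Keep the most frequent terms; ties broken alphabetically (assumption).
        ranked = sorted(kept, key=lambda term: (-term_freq[term], term))
        kept = set(ranked[:max_features])

    vocabulary = sorted(kept)
    stop_words = sorted(set(doc_freq) - kept)
    return vocabulary, stop_words


docs = [
    ["gpu", "fast", "the"],
    ["gpu", "the"],
    ["cpu", "the"],
    ["gpu", "rare", "the"],
]
vocabulary, stop_words = split_vocabulary(docs, max_df=0.75, min_df=2)
```

In this toy corpus "the" appears in all four documents and exceeds max_df=0.75, while "cpu", "fast", and "rare" each appear in one document and fall below min_df=2, so only "gpu" survives.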

Methods

`fit`(raw_documents[, y])	Build a vocabulary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Build the vocabulary and return document-term matrix.
`get_feature_names`()	Array mapping from feature integer indices to feature name.
`inverse_transform`(X)	Return terms per document with nonzero entries in X.
`transform`(raw_documents)	Transform documents to document-term matrix.

fit(raw_documents, y=None)

Build a vocabulary of all tokens in the raw documents.

Parameters:

raw_documents (cudf.Series or pd.Series): A Series of string documents.
y (None): Ignored.

Returns:

self

fit_transform(raw_documents, y=None)

Build the vocabulary and return document-term matrix.

Equivalent to self.fit(raw_documents).transform(raw_documents), but preprocesses the documents only once.

Parameters:

raw_documents (cudf.Series or pd.Series): A Series of string documents.
y (None): Ignored.

Returns:

X (cupy CSR array of shape (n_samples, n_features)): Document-term matrix.
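
To make the shape of the returned document-term matrix concrete, the following sketch builds the three CSR arrays (data, indices, indptr) by hand from pre-tokenized documents. It is a CPU illustration under stated assumptions, not cuML's implementation: `fit_transform_sketch` is a hypothetical name, and ordering the vocabulary alphabetically is an assumption made for determinism.

```python
def fit_transform_sketch(tokenized_docs, binary=False):
    """Build a vocabulary and the CSR arrays of the count matrix."""
    # Vocabulary: every distinct token, sorted so column indices are stable.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {term: j for j, term in enumerate(vocabulary)}

    data, indices, indptr = [], [], [0]
    for doc in tokenized_docs:
        counts = {}
        for tok in doc:
            j = column[tok]
            counts[j] = counts.get(j, 0) + 1
        for j in sorted(counts):
            indices.append(j)                         # column of the entry
            data.append(1 if binary else counts[j])   # count, or 1 if binary
        indptr.append(len(indices))                   # end of this row
    return vocabulary, (data, indices, indptr)


docs = [["a", "b", "a"], ["b", "c"]]
vocab, (data, indices, indptr) = fit_transform_sketch(docs)
_, (binary_data, _, _) = fit_transform_sketch(docs, binary=True)
```

Row i of the matrix is stored in `data[indptr[i]:indptr[i + 1]]`, with matching column positions in `indices`. Setting binary=True changes only `data`, replacing every stored count with 1.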

get_feature_names()

Array mapping from feature integer indices to feature name.

Returns:

feature_names (Series): The feature names.

inverse_transform(X)

Return terms per document with nonzero entries in X.

Parameters:

X (array-like of shape (n_samples, n_features)): Document-term matrix.

Returns:

X_inv (list of cudf.Series of shape (n_samples,)): List of Series of terms.
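
The mapping from a document-term matrix back to terms can be sketched as follows. This is a plain-Python illustration that uses dense lists of counts and returns lists rather than cudf.Series; the name `inverse_transform_sketch` is hypothetical.

```python
def inverse_transform_sketch(rows, feature_names):
    """For each row, list the feature names whose entry is nonzero."""
    return [
        [feature_names[j] for j, count in enumerate(row) if count != 0]
        for row in rows
    ]


feature_names = ["a", "b", "c"]
rows = [
    [2, 1, 0],  # document 0 contains "a" twice and "b" once
    [0, 1, 1],  # document 1 contains "b" and "c"
]
terms_per_doc = inverse_transform_sketch(rows, feature_names)
```

Note that the counts themselves and the original word order are not recoverable; only the set of terms present in each document is returned.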

transform(raw_documents)

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters:

raw_documents (cudf.Series or pd.Series): A Series of string documents.

Returns:

X (cupy CSR array of shape (n_samples, n_features)): Document-term matrix.
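
Counting against a fixed vocabulary can be sketched in plain Python as below. This is an illustration, not cuML's code: `transform_sketch` is a hypothetical name, the output is a dense list of rows rather than a cupy CSR array, and skipping tokens that are absent from the vocabulary is an assumption based on the usual behavior of count vectorizers.

```python
def transform_sketch(tokenized_docs, vocabulary):
    """Count occurrences of each vocabulary term in each document."""
    column = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for tok in doc:
            j = column.get(tok)
            if j is not None:  # assumption: out-of-vocabulary tokens are skipped
                row[j] += 1
        rows.append(row)
    return rows


vocabulary = ["cpu", "gpu"]
new_docs = [["gpu", "gpu", "tpu"], ["cpu"]]
counts = transform_sketch(new_docs, vocabulary)
```

The number of columns is fixed by the vocabulary, so documents seen at transform time never add features; here "tpu" contributes nothing to the first row.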