CountVectorizer
- class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')
Convert a collection of text documents to a matrix of token counts.
If you do not provide an a-priori dictionary, the number of features will be equal to the vocabulary size found by analyzing the data.
- Parameters:
- lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.
- preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- stop_words : string {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on the intra-corpus document frequency of terms.
- ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzer : string, {‘word’, ‘char’, ‘char_wb’}, default=‘word’
Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
- min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. This parameter is ignored if vocabulary is not None.
- max_features : int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
- vocabulary : cudf.Series, optional
If not given, a vocabulary is determined from the input documents.
- binary : boolean, default=False
If True, all nonzero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
- delimiter : str, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
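The ngram_range rule above (every n with min_n <= n <= max_n) can be illustrated with a minimal pure-Python sketch. This is not cuML's implementation: the helper word_ngrams is hypothetical, and plain whitespace splitting is assumed in place of the library's tokenizer.

```python
def word_ngrams(text, ngram_range=(1, 1), lowercase=True):
    """Return word n-grams for every n with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    if lowercase:
        text = text.lower()
    tokens = text.split()  # assumption: whitespace tokenization
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


unigrams = word_ngrams("The quick brown fox", (1, 1))
uni_and_bi = word_ngrams("The quick brown fox", (1, 2))
bigrams = word_ngrams("The quick brown fox", (2, 2))
```

Here (1, 2) yields the unigrams followed by the bigrams, matching the description of ngram_range for analyzer=‘word’.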
- Attributes:
- vocabulary_ : cudf.Series[str]
Array mapping from feature integer indices to feature name.
- stop_words_ : cudf.Series[str]
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
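As a rough illustration of how max_df and min_df decide which terms end up in the vocabulary and which in stop_words_, the following pure-Python sketch applies the two thresholds to pre-tokenized documents. The helper df_filter is hypothetical, not part of cuML, and it leaves out max_features.

```python
from collections import Counter


def df_filter(tokenized_docs, min_df=1, max_df=1.0):
    """Split terms into (kept, dropped) by document frequency."""
    n_docs = len(tokenized_docs)
    # Floats are proportions of documents; ints are absolute counts.
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Terms strictly above max_count or strictly below min_count are ignored.
    kept = sorted(t for t, c in df.items() if min_count <= c <= max_count)
    dropped = sorted(t for t in df if t not in kept)
    return kept, dropped


docs = [
    ["gpu", "text", "counts"],
    ["gpu", "text"],
    ["gpu", "sparse"],
    ["gpu", "counts"],
]
kept, dropped = df_filter(docs, min_df=2, max_df=0.75)
```

With four documents, max_df=0.75 corresponds to a threshold of three documents, so "gpu" (present in all four) is dropped as a corpus-specific stop word, and "sparse" (present in one) falls below min_df=2.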
Methods
fit(raw_documents[, y]): Build a vocabulary of all tokens in the raw documents.
fit_transform(raw_documents[, y]): Build the vocabulary and return document-term matrix.
get_feature_names(): Array mapping from feature integer indices to feature name.
inverse_transform(X): Return terms per document with nonzero entries in X.
transform(raw_documents): Transform documents to document-term matrix.
- fit(raw_documents, y=None)
Build a vocabulary of all tokens in the raw documents.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- y : None
Ignored.
- Returns:
- self
- fit_transform(raw_documents, y=None)
Build the vocabulary and return document-term matrix.
Equivalent to self.fit(X).transform(X) but preprocesses X only once.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- y : None
Ignored.
- Returns:
- X : cupy csr array of shape (n_samples, n_features)
Document-term matrix.
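To make the shape and meaning of the returned document-term matrix concrete, here is a minimal pure-Python sketch that builds a vocabulary and a dense count matrix in one pass. It is an illustration only: fit_transform_counts is a hypothetical helper, whitespace tokenization and a sorted vocabulary are assumptions, and nested lists stand in for the CuPy CSR matrix.

```python
def fit_transform_counts(raw_documents, binary=False):
    """Return (vocabulary, matrix) with one row per document."""
    # Tokenize each document once and reuse the result for both steps.
    tokenized = [doc.lower().split() for doc in raw_documents]
    # Assumption: feature indices follow the sorted order of the terms.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[index[tok]] += 1
        if binary:
            # binary=True clips every nonzero count to 1.
            row = [min(c, 1) for c in row]
        matrix.append(row)
    return vocabulary, matrix


docs = ["apple banana apple", "banana cherry"]
vocab, counts = fit_transform_counts(docs)
_, flags = fit_transform_counts(docs, binary=True)
```

With cuML and a GPU available, the corresponding call would be CountVectorizer().fit_transform(cudf.Series(docs)), which returns a sparse matrix of shape (n_samples, n_features) rather than nested lists.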
- get_feature_names()
Array mapping from feature integer indices to feature name.
- Returns:
- feature_names : Series
A list of feature names.
- inverse_transform(X)
Return terms per document with nonzero entries in X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Document-term matrix.
- Returns:
- X_inv : list of cudf.Series of shape (n_samples,)
List of Series of terms.
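The mapping from a document-term matrix back to terms can be sketched in pure Python as follows. The helper inverse_transform_counts is hypothetical and works on nested lists rather than the sparse matrix and cudf.Series used by cuML.

```python
def inverse_transform_counts(X, feature_names):
    """For each row of X, list the terms whose count is nonzero."""
    return [
        [feature_names[j] for j, count in enumerate(row) if count != 0]
        for row in X
    ]


feature_names = ["apple", "banana", "cherry"]
X = [[2, 1, 0], [0, 0, 3]]
terms = inverse_transform_counts(X, feature_names)
```

Only the presence of a term is recovered; the counts themselves and the original word order are not.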
- transform(raw_documents)
Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- Returns:
- X : cupy csr array of shape (n_samples, n_features)
Document-term matrix.
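Counting against an already fixed vocabulary can be sketched as below. This is a pure-Python illustration with a hypothetical helper, transform_counts. It assumes whitespace tokenization, and it assumes that tokens absent from the vocabulary are simply skipped, as in scikit-learn's CountVectorizer. That behavior is not stated in the text above.

```python
def transform_counts(raw_documents, vocabulary):
    """Count occurrences of known vocabulary terms in each document."""
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in raw_documents:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            j = index.get(tok)
            if j is not None:  # assumption: out-of-vocabulary tokens are skipped
                row[j] += 1
        matrix.append(row)
    return matrix


vocabulary = ["apple", "banana"]
result = transform_counts(["Apple apple kiwi", "banana"], vocabulary)
```

The number of columns is fixed by the vocabulary, so new documents never add features.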