HashingVectorizer#

class cuml.feature_extraction.text.HashingVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#

Convert a collection of text documents to a matrix of token occurrences

It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

  • it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory which is even more important as GPU’s that are often memory constrained

  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

  • no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.

Parameters:
lowercasebool, default=True

Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, default=None

If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

ngram_rangetuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}

Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

n_featuresint, default=(2 ** 20)

The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.

binarybool, default=False.

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

norm{‘l1’, ‘l2’}, default=’l2’

Norm used to normalize term vectors. None for no normalization.

alternate_signbool, default=True

When True, an alternating sign is added to the features as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.

dtypetype, optional

Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

Methods

fit(X[, y])

This method only checks the input type and the model parameter.

fit_transform(X[, y])

Transform a sequence of documents to a document-term matrix.

partial_fit(X[, y])

Does nothing: This transformer is stateless This method is just there to mark the fact that this transformer can work in a streaming setup.

transform(raw_documents)

Transform documents to document-term matrix.

Examples

>>> from cuml.feature_extraction.text import HashingVectorizer
>>> import pandas as pd
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(pd.Series(corpus))
>>> print(X.shape)
(4, 16)
fit(X, y=None)[source]#

This method only checks the input type and the model parameter. It does not do anything meaningful as this transformer is stateless

Parameters:
Xcudf.Series or pd.Series

A Series of string documents

fit_transform(X, y=None)[source]#

Transform a sequence of documents to a document-term matrix.

Parameters:
Xiterable over raw text documents, length = n_samples

Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

yany

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
Xsparse CuPy CSR matrix of shape (n_samples, n_features)

Document-term matrix.

partial_fit(X, y=None)[source]#

Does nothing: This transformer is stateless This method is just there to mark the fact that this transformer can work in a streaming setup.

Parameters:
Xcudf.Series(A Series of string documents).
transform(raw_documents)[source]#

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters:
raw_documentscudf.Series or pd.Series

A Series of string documents

Returns:
Xsparse CuPy CSR matrix of shape (n_samples, n_features)

Document-term matrix.