HashingVectorizer
- class cuml.feature_extraction.text.HashingVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')
Convert a collection of text documents to a matrix of token occurrences.
It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected onto the Euclidean unit sphere if norm='l2'.
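To make the two normalization options concrete, here is a minimal pure-Python sketch applied to one hypothetical row of counts. It is not cuML code: `normalize_row` is an illustrative helper invented for this example, and the handling of all-zero rows is an assumption.

```python
import math

def normalize_row(counts, norm):
    """Scale one row of counts; norm is 'l1', 'l2', or None (illustrative helper)."""
    if norm is None:
        return list(counts)
    if norm == "l1":
        scale = sum(abs(c) for c in counts)
    elif norm == "l2":
        scale = math.sqrt(sum(c * c for c in counts))
    else:
        raise ValueError(f"unsupported norm: {norm!r}")
    # Assumption: an all-zero row is left unchanged to avoid dividing by zero.
    return [c / scale for c in counts] if scale else list(counts)

row = [3, 0, 4]                      # hypothetical token counts for one document
l1_row = normalize_row(row, "l1")    # absolute values sum to 1 (token frequencies)
l2_row = normalize_row(row, "l2")    # row has unit Euclidean length
print(l1_row)
print(l2_row)
```

With 'l1' the entries can be read as token frequencies; with 'l2' the dot product of two rows becomes their cosine similarity.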
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
- it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory; this matters even more on GPUs, which are often memory constrained
- it is fast to pickle and un-pickle, as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit.
There are also some drawbacks (compared with using a CountVectorizer with an in-memory vocabulary):
- there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
- there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
- there is no IDF weighting, as this would render the transformer stateful.
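As a rough illustration of how the collision rate depends on n_features, the sketch below estimates the expected number of colliding tokens under an idealized uniform hash. The uniformity assumption and the vocabulary size of 50,000 are hypothetical choices for this example, not measurements of the library.

```python
def expected_collisions(n_tokens, n_features):
    """Expected number of distinct tokens that land on an index already taken
    by another token, assuming an idealized uniform hash (an assumption)."""
    # Expected number of distinct indices occupied after hashing n_tokens tokens.
    occupied = n_features * (1.0 - (1.0 - 1.0 / n_features) ** n_tokens)
    return n_tokens - occupied

vocab_size = 50_000  # hypothetical number of distinct tokens in a corpus
for exponent in (10, 14, 18, 20):
    n_features = 2 ** exponent
    lost = expected_collisions(vocab_size, n_features)
    print(f"n_features=2**{exponent}: about {lost / vocab_size:.1%} of tokens collide")
```

The fraction drops quickly as n_features grows, which is the trade-off against the larger coefficient vectors that a wide feature space implies.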
The hash function employed is the signed 32-bit version of Murmurhash3.
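The hashing trick described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the cuML implementation: CRC32 from the standard library stands in for MurmurHash3 (so the indices differ from the library's), and deriving the index from the absolute value and the sign from the hash's sign is one common scheme that cuML's internals may or may not follow exactly.

```python
import zlib

def signed_hash32(token):
    """Stand-in signed 32-bit hash. CRC32 is used only because it ships with
    Python; it is NOT MurmurHash3 and yields different indices."""
    h = zlib.crc32(token.encode("utf-8"))   # unsigned, 0 .. 2**32 - 1
    return h - 2**32 if h >= 2**31 else h   # reinterpret as signed 32-bit

def hash_document(tokens, n_features, alternate_sign=True):
    """Return {feature_index: value} for one tokenized document."""
    row = {}
    for token in tokens:
        h = signed_hash32(token)
        index = abs(h) % n_features
        # With alternate_sign, the sign of the hash decides whether the token
        # adds to or subtracts from its feature.
        value = -1.0 if (alternate_sign and h < 0) else 1.0
        row[index] = row.get(index, 0.0) + value
    return row

tokens = "this document is the second document".split()
print(hash_document(tokens, n_features=16))
print(hash_document(tokens, n_features=16, alternate_sign=False))
```

No vocabulary is built at any point: the mapping from token to column is recomputed from the token itself, which is why the transformer has no fitted state.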
- Parameters:
- lowercase : bool, default=True
Convert all characters to lowercase before tokenizing.
- preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- stop_words : string {'english'}, list, default=None
If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
- ngram_range : tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzer : {'word', 'char', 'char_wb'}, default='word'
Whether the features should be made of word n-grams or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- n_features : int, default=(2 ** 20)
The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
- binary : bool, default=False
If True, all nonzero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- norm : {'l1', 'l2'} or None, default='l2'
Norm used to normalize term vectors. None for no normalization.
- alternate_sign : bool, default=True
When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space, even for small n_features. This approach is similar to sparse random projection.
- dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
- delimiter : str, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
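To illustrate how ngram_range and analyzer interact, here is a small pure-Python sketch of word n-gram and 'char_wb' extraction. The helpers `word_ngrams` and `char_wb_ngrams` are hypothetical names for this example; the library's own tokenizer may order n-grams differently and treat edge cases (such as words shorter than n) in its own way.

```python
def word_ngrams(tokens, ngram_range):
    """All word n-grams with min_n <= n <= max_n, joined by a single space."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def char_wb_ngrams(text, ngram_range):
    """Character n-grams taken inside each word, with the word padded by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for word in text.split():
        padded = f" {word} "  # padding marks the word boundaries
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams

tokens = "is this the first".split()
print(word_ngrams(tokens, (1, 1)))   # unigrams only
print(word_ngrams(tokens, (1, 2)))   # unigrams and bigrams
print(word_ngrams(tokens, (2, 2)))   # bigrams only
print(char_wb_ngrams("the cat", (3, 3)))
```

Each n-gram produced this way is then treated as a single token and hashed to a feature index.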
Methods
fit(X[, y]): This method only checks the input type and the model parameter.
fit_transform(X[, y]): Transform a sequence of documents to a document-term matrix.
partial_fit(X[, y]): Does nothing, as this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
transform(raw_documents): Transform documents to document-term matrix.
Examples
>>> from cuml.feature_extraction.text import HashingVectorizer
>>> import pandas as pd
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(pd.Series(corpus))
>>> print(X.shape)
(4, 16)
- fit(X, y=None)
This method only checks the input type and the model parameter. It does not do anything meaningful, as this transformer is stateless.
- Parameters:
- X : cudf.Series or pd.Series
A Series of string documents.
- fit_transform(X, y=None)
Transform a sequence of documents to a document-term matrix.
- Parameters:
- X : iterable over raw text documents, length = n_samples
Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns:
- X : sparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
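For readers unfamiliar with the CSR layout of the returned matrix, the following pure-Python sketch shows how per-document rows of hashed features map onto the three CSR arrays. The rows are hypothetical values chosen by hand, and `rows_to_csr` is an illustrative helper rather than anything the library exposes.

```python
def rows_to_csr(rows, n_features):
    """Pack a list of {column_index: value} rows into CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for index in sorted(row):
            if not 0 <= index < n_features:
                raise ValueError(f"column index {index} out of range")
            data.append(row[index])      # nonzero values, row by row
            indices.append(index)        # column of each stored value
        indptr.append(len(data))         # where the next row starts in data
    return data, indices, indptr, (len(rows), n_features)

# Hypothetical hashed rows for three documents; the second one is empty.
rows = [{3: 1.0, 7: -1.0}, {}, {0: 2.0}]
data, indices, indptr, shape = rows_to_csr(rows, n_features=16)
print(data, indices, indptr, shape)
```

A `(data, indices, indptr)` triple with a `shape` is the form accepted by the `csr_matrix` constructors in SciPy and in cupyx.scipy.sparse; only nonzero entries are stored, so memory grows with the number of tokens rather than with n_features.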
- partial_fit(X, y=None)
Does nothing, as this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
- Parameters:
- X : cudf.Series
A Series of string documents.
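Because nothing is learned during fitting, a corpus can be transformed batch by batch with the same result as transforming it all at once. The sketch below demonstrates this property with a stand-in transform written in pure Python; `transform_batch` and `batches` are hypothetical helpers, and CRC32 again replaces MurmurHash3 purely for illustration.

```python
import zlib

def transform_batch(documents, n_features):
    """Stateless stand-in transform: each document becomes {index: count}."""
    rows = []
    for doc in documents:
        row = {}
        for token in doc.lower().split():
            # Assumption: CRC32 as the hash; the library uses MurmurHash3.
            index = zlib.crc32(token.encode("utf-8")) % n_features
            row[index] = row.get(index, 0) + 1
        rows.append(row)
    return rows

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

streamed = []
for batch in batches(corpus, 2):
    streamed.extend(transform_batch(batch, n_features=16))

whole = transform_batch(corpus, n_features=16)
print(streamed == whole)
```

The same reasoning applies to parallel pipelines: workers can transform disjoint chunks independently, and the resulting rows can simply be stacked.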
- transform(raw_documents)
Transform documents to document-term matrix.
Extract token counts out of raw text documents by hashing each token to a feature index. No fitted vocabulary is used, as this transformer is stateless.
- Parameters:
- raw_documents : cudf.Series or pd.Series
A Series of string documents.
- Returns:
- X : sparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.