cudf.factorize
- cudf.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)
Encode the input values as integer labels.
- Parameters:
- values: Series, Index, or CuPy array
The data to be factorized.
- sort: bool, default False
Sort uniques and shuffle codes to maintain the relationship.
- use_na_sentinel: bool, default True
If True, the sentinel -1 is used for NA values. If False, NA values are encoded as non-negative integers and NA is kept in the uniques of the values.
- Returns:
- (labels, cats): (cupy.ndarray, cupy.ndarray or Index)
- labels contains the encoded values.
- cats contains the categories, ordered so that the N-th item corresponds to code N-1.
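The relationship between labels and cats can be illustrated without a GPU. The following is a minimal pure-Python sketch, not cudf code: decode is a hypothetical helper, and the plain lists stand in for the arrays that cudf.factorize returns (the values are taken from the Examples on this page).

```python
# Hypothetical helper (not part of cudf): rebuild the original values
# from codes and uniques, treating negative codes as the NA sentinel.
def decode(codes, uniques, na_value=None):
    return [uniques[c] if c >= 0 else na_value for c in codes]

# Plain lists standing in for the (labels, cats) pair from the Examples.
codes = [1, -1, 0, 2, 1]
uniques = ['a', 'b', 'c']

decoded = decode(codes, uniques)
print(decoded)  # ['b', None, 'a', 'c', 'b']
```

Each non-negative code is simply a position in the uniques, which is why the first category corresponds to code 0.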
See also
cudf.Series.factorize
Encode the input values of Series.
Examples
>>> import cudf
>>> import numpy as np
>>> data = cudf.Series(['a', 'c', 'c'])
>>> codes, uniques = cudf.factorize(data)
>>> codes
array([0, 1, 1], dtype=int8)
>>> uniques
Index(['a', 'c'], dtype='object')
When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1, and missing values are not included in uniques.
>>> codes, uniques = cudf.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 1, -1, 0, 2, 1], dtype=int8)
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
If NA is in the values and we want to include NA in the uniques, set use_na_sentinel=False.
>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = cudf.factorize(values)
>>> codes
array([ 0, 1, 0, -1], dtype=int8)
>>> uniques
Index([1.0, 2.0], dtype='float64')
>>> codes, uniques = cudf.factorize(values, use_na_sentinel=False)
>>> codes
array([1, 2, 1, 0], dtype=int8)
>>> uniques
Index([<NA>, 1.0, 2.0], dtype='float64')
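The effect of use_na_sentinel can be modeled in pure Python. This sketch is an illustration under stated assumptions, not cudf's implementation: factorize_sketch is a hypothetical name, None stands in for NA, and sorting the uniques with NA placed first is chosen only to mirror the output shown above.

```python
def factorize_sketch(values, use_na_sentinel=True):
    """Toy model of factorize semantics on a plain list (None means NA)."""
    # Distinct non-missing values; sorted only to mirror the example output.
    uniques = sorted({v for v in values if v is not None})
    has_na = any(v is None for v in values)
    if has_na and not use_na_sentinel:
        # Assumption: NA goes first among the uniques, as in the example.
        uniques = [None] + uniques
    position = {u: i for i, u in enumerate(uniques)}
    # Missing values map to -1 unless NA was added to the uniques above.
    codes = [position.get(v, -1) for v in values]
    return codes, uniques

values = [1.0, 2.0, 1.0, None]
with_sentinel = factorize_sketch(values)
without_sentinel = factorize_sketch(values, use_na_sentinel=False)
print(with_sentinel)     # ([0, 1, 0, -1], [1.0, 2.0])
print(without_sentinel)  # ([1, 2, 1, 0], [None, 1.0, 2.0])
```

With the sentinel disabled, every code is a valid position in the uniques, so no special handling of -1 is needed when mapping codes back to values.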