cudf.factorize

cudf.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)

Encode the input values as integer labels.

Parameters:
values: Series, Index, or CuPy array

The data to be factorized.

sort: bool, default False

Sort uniques and remap codes so that each code still refers to the same value.

use_na_sentinel: bool, default True

If True, the sentinel -1 is used for NA values and NA is excluded from the uniques. If False, NA values are encoded as a non-negative integer and NA is kept in the uniques.

Returns:
(labels, cats): (cupy.ndarray, cupy.ndarray or Index)
  • labels contains the encoded values

  • cats contains the categories, ordered so that the N-th item corresponds to code N-1.

See also

cudf.Series.factorize

Encode the input values of Series.

Examples

>>> import cudf
>>> import numpy as np
>>> data = cudf.Series(['a', 'c', 'c'])
>>> codes, uniques = cudf.factorize(data)
>>> codes
array([0, 1, 1], dtype=int8)
>>> uniques
Index(['a', 'c'], dtype='object')

When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1 and missing values are not included in uniques.

>>> codes, uniques = cudf.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 1, -1,  0,  2,  1], dtype=int8)
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
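
The contract between codes and uniques can be sketched in plain Python. This is only an illustration, not cuDF's implementation: `factorize_sketch` is a hypothetical helper, `None` stands in for NA, and uniques are assumed to be kept in order of first appearance, which may differ from the order cuDF returns.

```python
def factorize_sketch(values, use_na_sentinel=True):
    """Pure-Python illustration of the codes/uniques contract (not cuDF code)."""
    uniques = []    # distinct values, in order of first appearance (assumption)
    position = {}   # value -> code assigned to it
    codes = []
    for v in values:
        if v is None and use_na_sentinel:
            codes.append(-1)  # NA gets the sentinel and is left out of uniques
            continue
        if v not in position:
            position[v] = len(uniques)
            uniques.append(v)
        codes.append(position[v])
    return codes, uniques


labels, cats = factorize_sketch(['b', None, 'a', 'c', 'b'])

# Every non-negative code indexes into the uniques to recover the original value.
restored = [cats[code] if code >= 0 else None for code in labels]
```

Whatever order the uniques come back in, the same round trip should hold: indexing the uniques with a non-negative code gives back the value that produced it.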

To include NA in the uniques when the values contain NA, set use_na_sentinel=False.

>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = cudf.factorize(values)
>>> codes
array([ 0,  1,  0, -1], dtype=int8)
>>> uniques
Index([1.0, 2.0], dtype='float64')
>>> codes, uniques = cudf.factorize(values, use_na_sentinel=False)
>>> codes
array([1, 2, 1, 0], dtype=int8)
>>> uniques
Index([<NA>, 1.0, 2.0], dtype='float64')
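
What sorting the uniques while keeping codes consistent involves can also be sketched in plain Python. This is a hedged illustration of the idea behind the sort parameter, not cuDF's implementation: `sort_factorization` is a hypothetical helper, and it assumes the uniques contain no NA and that -1 is the NA sentinel.

```python
def sort_factorization(codes, uniques):
    """Sort uniques and remap codes so each code still names the same value."""
    # order[new_code] is the old code of the value that moves to new_code.
    order = sorted(range(len(uniques)), key=lambda i: uniques[i])
    sorted_uniques = [uniques[i] for i in order]
    # Invert the permutation: old code -> new code.
    remap = {old: new for new, old in enumerate(order)}
    # The NA sentinel (-1) is passed through unchanged.
    new_codes = [remap[c] if c >= 0 else -1 for c in codes]
    return new_codes, sorted_uniques


# Codes and uniques in first-appearance order for ['b', None, 'a', 'c', 'b'].
codes, uniques = [0, -1, 1, 2, 0], ['b', 'a', 'c']
new_codes, sorted_uniques = sort_factorization(codes, uniques)
```

The remapped codes decode to the same values as before; only the numbering of the categories changes.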