cudf.get_dummies

cudf.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, cats=None, sparse=False, drop_first=False, dtype='bool')

Returns a DataFrame whose columns are the one-hot encodings of the columns in data.

Parameters:
data : array-like, Series, or DataFrame

Data of which to get dummy indicators.

prefix : str, dict, or sequence, optional

Prefix to prepend to the encoded column names. Either a str (to apply a constant prefix), a dict mapping column names to prefixes, or a sequence of prefixes with the same length as the number of columns. If not supplied, defaults to the empty string.

prefix_sep : str, dict, or sequence, optional, default ‘_’

Separator placed between the prefix and the category value in each output column name.

dummy_na : boolean, optional

Add a column to indicate nulls (None). If False, nulls are ignored.

cats : dict, optional

Dictionary mapping column names to sequences of values representing that column’s categories. If not supplied, the categories are computed as the unique values of the column.

sparse : boolean, optional

This argument is currently non-functional in RAPIDS.

drop_first : boolean, optional

Whether to get k-1 dummies out of k categorical levels by removing the first level.

columns : sequence of str, optional

Names of columns to encode. If not provided, will attempt to encode all columns. Note that this differs from the pandas default behavior, which encodes all columns with dtype object or categorical.

dtype : str, optional

Output dtype. Defaults to ‘bool’.

Examples

>>> import cudf
>>> df = cudf.DataFrame({"a": ["value1", "value2", None], "b": [0, 0, 0]})
>>> cudf.get_dummies(df)
   b  a_value1  a_value2
0  0      True     False
1  0     False      True
2  0     False     False
>>> cudf.get_dummies(df, dummy_na=True)
   b  a_<NA>  a_value1  a_value2
0  0   False      True     False
1  0   False     False      True
2  0    True     False     False
>>> import numpy as np
>>> df = cudf.DataFrame({"a": cudf.Series([1, 2, np.nan, None],
...                                       nan_as_null=False)})
>>> df
      a
0   1.0
1   2.0
2   NaN
3  <NA>
>>> cudf.get_dummies(df, dummy_na=True, columns=["a"])
   a_<NA>  a_1.0  a_2.0  a_nan
0   False   True  False  False
1   False  False   True  False
2   False  False  False   True
3    True  False  False  False
>>> series = cudf.Series([1, 2, None, 2, 4])
>>> series
0       1
1       2
2    <NA>
3       2
4       4
dtype: int64
>>> cudf.get_dummies(series, dummy_na=True)
    <NA>      1      2      4
0  False   True  False  False
1  False  False   True  False
2   True  False  False  False
3  False  False   True  False
4  False  False  False   True
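
The way cats, dummy_na, drop_first, prefix and prefix_sep shape the result can be illustrated with a small plain-Python sketch. This is not how cudf implements get_dummies: one_hot_sketch is a hypothetical helper that runs on the CPU over a single list of values. Placing the null indicator first mirrors the outputs shown above. How drop_first interacts with dummy_na is an assumption of this sketch.

```python
def one_hot_sketch(values, cats=None, dummy_na=False, drop_first=False,
                   prefix=None, prefix_sep="_"):
    """Conceptual one-hot encoding of one column (hypothetical, not cudf).

    Returns a dict mapping each output column name to a list of bools.
    """
    # Category levels: caller-supplied, else the sorted unique non-null values.
    if cats is None:
        cats = sorted({v for v in values if v is not None})
    levels = list(cats)

    # drop_first keeps k-1 of the k category levels by removing the first one.
    if drop_first:
        levels = levels[1:]

    # Assumption: the null indicator comes first and is never dropped.
    if dummy_na:
        levels = [None] + levels

    def column_name(level):
        label = "<NA>" if level is None else str(level)
        return label if prefix is None else f"{prefix}{prefix_sep}{label}"

    def matches(value, level):
        return value is None if level is None else value == level

    return {column_name(lv): [matches(v, lv) for v in values] for lv in levels}


# Mirrors the Series example above: one indicator column per level.
for name, flags in one_hot_sketch([1, 2, None, 2, 4], dummy_na=True).items():
    print(name, flags)
```

Passing prefix="a" to the sketch produces names such as a_value1, matching the column naming seen in the DataFrame examples.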