OneHotEncoder
- class cuml.dask.preprocessing.OneHotEncoder(*, client=None, verbose=False, **kwargs)
Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cuDF.DataFrame or cupy dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
- Parameters:
- categories : ‘auto’, cupy.ndarray or cudf.DataFrame, default=‘auto’
Categories (unique values) per feature. All categories are expected to fit on one GPU.
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
- drop : ‘first’, None or a dict, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
Dict : drop[col] is the category in feature col that should be dropped.
- sparse : bool, default=False
This feature is deactivated and raises an exception when set to True. Sparse matrices are not yet fully supported by CuPy, which causes incorrect values when computing one-hot encodings. See https://github.com/cupy/cupy/issues/3223
- dtype : number type, default=np.float
Desired datatype of transform’s output.
- handle_unknown : {‘error’, ‘ignore’}, default=‘error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
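The interaction of categories, drop and handle_unknown can be easier to see on a tiny example. The following is a minimal pure-Python sketch of the semantics described above, not cuML's implementation: the helper names fit_categories and one_hot are hypothetical, the data is made up, and only the ‘auto’, None/‘first’ and ‘error’/‘ignore’ options are modelled.

```python
# Illustrative sketch only; this is NOT how cuML implements the encoder.

def fit_categories(rows):
    """Return the sorted unique values of each column ('auto' behaviour)."""
    n_features = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_features)]


def one_hot(rows, categories, drop=None, handle_unknown="error"):
    """Encode rows as dense one-hot lists.

    drop may be None or 'first'; handle_unknown may be 'error' or 'ignore'.
    """
    # With drop='first', the first category of every feature gets no column.
    kept = [cats[1:] if drop == "first" else cats for cats in categories]
    encoded = []
    for row in rows:
        out = []
        for value, cats, kept_cats in zip(row, categories, kept):
            if value not in cats and handle_unknown == "error":
                raise ValueError(f"unknown category: {value!r}")
            # An unknown or dropped category yields an all-zero block.
            out.extend(1.0 if value == c else 0.0 for c in kept_cats)
        encoded.append(out)
    return encoded


train = [["red", "S"], ["green", "M"], ["blue", "S"]]
categories = fit_categories(train)   # [['blue', 'green', 'red'], ['M', 'S']]

full = one_hot(train, categories)                   # 3 + 2 columns
dropped = one_hot(train, categories, drop="first")  # 2 + 1 columns
ignored = one_hot([["purple", "M"]], categories, handle_unknown="ignore")
```

In this sketch the unseen value "purple" produces three zeros for the first feature, while the default handle_unknown="error" would raise instead.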
Methods
fit(X) : Fit a multi-node multi-gpu OneHotEncoder to X.
inverse_transform(X[, delayed]) : Convert the data back to the original representation.
transform(X[, delayed]) : Transform X using one-hot encoding.
- fit(X)
Fit a multi-node multi-gpu OneHotEncoder to X.
- Parameters:
- X : Dask cuDF DataFrame or CuPy backed Dask Array
The data to determine the categories of each feature.
- Returns:
- self
- inverse_transform(X, delayed=True)
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
- Parameters:
- X : CuPy backed Dask Array, shape [n_samples, n_encoded_features]
The transformed data.
- delayed : bool (default = True)
Whether to execute as a delayed task or eagerly.
- Returns:
- X_tr : Dask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the inverse transformed array.
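The None convention for unknown categories can be sketched in plain Python. This is a hypothetical illustration rather than cuML code: the helper inverse_one_hot and the sample data are assumptions, and the sketch assumes a dense encoding with no dropped categories.

```python
# Illustrative sketch only; this is NOT how cuML implements inverse_transform.

def inverse_one_hot(encoded, categories):
    """Map dense one-hot rows back to category values.

    An all-zero block for a feature is decoded as None.
    """
    decoded = []
    for row in encoded:
        out, start = [], 0
        for cats in categories:
            # Each feature owns one contiguous block of len(cats) columns.
            block = row[start:start + len(cats)]
            out.append(cats[block.index(1.0)] if 1.0 in block else None)
            start += len(cats)
        decoded.append(out)
    return decoded


categories = [["blue", "green", "red"], ["M", "S"]]
encoded = [
    [0.0, 0.0, 1.0, 0.0, 1.0],  # red, S
    [0.0, 0.0, 0.0, 1.0, 0.0],  # unknown first feature, M
]
decoded = inverse_one_hot(encoded, categories)
```

The second row has no active column in its first block, so that feature decodes to None while the second feature is still recovered.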
- transform(X, delayed=True)
Transform X using one-hot encoding.
- Parameters:
- X : Dask cuDF DataFrame or CuPy backed Dask Array
The data to encode.
- delayed : bool (default = True)
Whether to execute as a delayed task or eagerly.
- Returns:
- out : Dask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the transformed input.