OneHotEncoder
- class cuml.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse_output=True, dtype=<class 'numpy.float32'>, handle_unknown='error', verbose=False, output_type=None)
Encode categorical features as a one-hot numeric array. The input to this estimator should be a cudf.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
Note
A one-hot encoding of y labels should use a LabelBinarizer instead.
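To make the encoding scheme concrete, here is a minimal NumPy-only sketch. It illustrates the one-hot scheme described above and is not cuML's GPU implementation; the helper name `one_hot_dense` is hypothetical. Categories are derived from the unique values of each column, and one binary column is produced per category.

```python
import numpy as np

def one_hot_dense(X, dtype=np.float32):
    """Hypothetical helper: dense one-hot encoding of a 2-D array of categories."""
    X = np.asarray(X)
    # One sorted array of unique values per input feature (the 'auto' behaviour).
    categories = [np.unique(X[:, j]) for j in range(X.shape[1])]
    blocks = []
    for j, cats in enumerate(categories):
        # Compare every sample against every category: shape (n_samples, n_cats).
        blocks.append((X[:, j][:, None] == cats[None, :]).astype(dtype))
    return categories, np.hstack(blocks)

X = np.array([[0, 10], [1, 20], [0, 30]])
categories, encoded = one_hot_dense(X)
print(encoded)
```

Feature 0 has two categories and feature 1 has three, so the result has five columns, with exactly one nonzero entry per feature in each row.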
- Parameters:
- categories : ‘auto’, a cupy.ndarray or a cudf.DataFrame, default=‘auto’
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
- drop : ‘first’, None, a dict or a list, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
dict/list : drop[col] is the category in feature col that should be dropped.
- sparse_output : bool, default=True
Return a sparse matrix if True, otherwise a dense array. Sparse output is not fully supported by cupy yet, which can cause incorrect values when computing one-hot encodings. See cupy/cupy#3223
Added in version 24.06: sparse was renamed to sparse_output
- dtype : number type, default=np.float32
Desired datatype of transform’s output.
- handle_unknown : {‘error’, ‘ignore’}, default=‘error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
- verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
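The effect of handle_unknown can be sketched in plain NumPy. This illustrates the documented behaviour and is not cuML's implementation; the helper `transform_with_categories` is hypothetical. With ‘ignore’, a value that is not among the learned categories yields an all-zero block for that feature; with ‘error’, the sketch raises a ValueError.

```python
import numpy as np

def transform_with_categories(X, categories, handle_unknown="error", dtype=np.float32):
    """Hypothetical helper: one-hot encode X against fixed, previously learned categories."""
    X = np.asarray(X)
    blocks = []
    for j, cats in enumerate(categories):
        cats = np.asarray(cats)
        col = X[:, j]
        known = np.isin(col, cats)
        if handle_unknown == "error" and not known.all():
            raise ValueError(f"unknown categories in feature {j}: {np.unique(col[~known])}")
        # Unknown values match no category, so their rows stay all zeros.
        blocks.append((col[:, None] == cats[None, :]).astype(dtype))
    return np.hstack(blocks)

categories = [np.array(["a", "b"])]   # categories seen during fit
X_new = np.array([["a"], ["c"]])      # "c" was not seen during fit
encoded = transform_with_categories(X_new, categories, handle_unknown="ignore")
print(encoded)
```

The second row is all zeros because "c" is unknown; calling the same helper with handle_unknown="error" raises instead.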
- Attributes:
- drop_idx_ : array of shape (n_features,)
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.
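As an illustration of how drop=‘first’ relates to drop_idx_, the following NumPy-only sketch (not cuML's implementation; `one_hot_drop_first` is a hypothetical helper) removes the column of the first category from each feature's one-hot block. In this case the dropped index is 0 for every feature.

```python
import numpy as np

def one_hot_drop_first(X, dtype=np.float32):
    """Hypothetical helper: dense one-hot encoding that drops the first category per feature."""
    X = np.asarray(X)
    categories = [np.unique(X[:, j]) for j in range(X.shape[1])]
    # With drop='first', the dropped category sits at index 0 of each feature.
    drop_idx = np.zeros(len(categories), dtype=np.int64)
    blocks = []
    for j, cats in enumerate(categories):
        full = (X[:, j][:, None] == cats[None, :]).astype(dtype)
        # Remove the column that corresponds to the dropped category.
        blocks.append(np.delete(full, drop_idx[j], axis=1))
    return categories, drop_idx, np.hstack(blocks)

X = np.array([[0, 10], [1, 20], [0, 30]])
categories, drop_idx, encoded = one_hot_drop_first(X)
print(drop_idx)
print(encoded)
```

A row whose block is all zeros now denotes the dropped category, which removes the perfect collinearity between the columns of a feature.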
Methods
fit(X[, y]) : Fit OneHotEncoder to X.
fit_transform(X[, y]) : Fit OneHotEncoder to X, then transform X.
get_feature_names([input_features]) : Return feature names for output features.
inverse_transform(X) : Convert the data back to the original representation.
transform(X) : Transform X using one-hot encoding.
- fit(X, y=None)
Fit OneHotEncoder to X.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : None
Ignored. This parameter exists for compatibility only.
- fit_transform(X, y=None)
Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : None
Ignored. This parameter exists for compatibility only.
- Returns:
- X_out : sparse matrix if sparse_output=True else a 2-d array
Transformed input.
- get_feature_names(input_features=None)
Return feature names for output features.
- Parameters:
- input_features : list of str of shape (n_features,)
String names for input features if available. By default, “x0”, “x1”, … “xn_features” are used.
- Returns:
- output_feature_names : ndarray of shape (n_output_features,)
Array of feature names.
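The page does not spell out the format of the returned names. The sketch below assumes the scikit-learn convention of joining the input feature name and the category with an underscore (`<feature>_<category>`); that format and the helper `output_feature_names` are assumptions, not confirmed cuML behaviour. The default input names “x0”, “x1”, … follow the description above.

```python
import numpy as np

def output_feature_names(categories, input_features=None):
    """Hypothetical helper; the '<feature>_<category>' format is an assumption."""
    if input_features is None:
        # Default input names: x0, x1, ...
        input_features = [f"x{i}" for i in range(len(categories))]
    names = [
        f"{name}_{cat}"
        for name, cats in zip(input_features, categories)
        for cat in cats
    ]
    return np.array(names, dtype=object)

categories = [["a", "b"], [10, 20, 30]]
default_names = output_feature_names(categories)
custom_names = output_feature_names(categories, input_features=["color", "size"])
print(default_names)
print(custom_names)
```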
- inverse_transform(X)
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
The return type is the same as the type of the input used by the first call to fit on this estimator instance.
- Parameters:
- X : array-like or sparse matrix, shape [n_samples, n_encoded_features]
The transformed data.
- Returns:
- X_tr : cudf.DataFrame or cupy.ndarray
Inverse transformed array.
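The inverse mapping can be sketched in NumPy as follows. This is an illustration that assumes drop=None and a dense input, not cuML's implementation; `inverse_one_hot` is a hypothetical helper. Each feature's block of columns is decoded by locating its nonzero entry, and an all-zero block is mapped to None.

```python
import numpy as np

def inverse_one_hot(encoded, categories):
    """Hypothetical helper: decode a dense one-hot array back to category values."""
    encoded = np.asarray(encoded)
    out = np.empty((encoded.shape[0], len(categories)), dtype=object)
    start = 0
    for j, cats in enumerate(categories):
        # Columns belonging to feature j.
        block = encoded[:, start:start + len(cats)]
        start += len(cats)
        idx = block.argmax(axis=1)
        for i in range(block.shape[0]):
            # An all-zero block means the category was unknown at transform time.
            out[i, j] = cats[idx[i]] if block[i].any() else None
    return out

categories = [["a", "b"], [10, 20, 30]]
encoded = np.array([[1, 0, 0, 1, 0],
                    [0, 0, 0, 0, 1]], dtype=np.float32)
decoded = inverse_one_hot(encoded, categories)
print(decoded)
```

The second row has no active column for the first feature, so that entry decodes to None.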
- transform(X)
Transform X using one-hot encoding.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns:
- X_out : sparse matrix if sparse_output=True else a 2-d array
Transformed input.