TargetEncoder

class cuml.preprocessing.TargetEncoder(*, n_folds=4, smooth=0, seed=42, split_method='interleaved', verbose=False, output_type=None, stat='mean', multi_feature_mode='combination')

A cuDF-based implementation of target encoding [1], which replaces one or more categorical variables, ‘Xs’, with the average of the corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs, and the mean of Y within each group replaces each value of Xs. Several optimizations are applied to prevent label leakage and parallelize the execution.

Parameters:

n_foldsint (default=4)

Number of folds used when fitting training data. To prevent label leakage in fit, the data is split into n_folds and each fold is encoded using the target values of the remaining folds.

smoothint or float (default=0)

Count of samples to smooth the encoding. 0 means no smoothing.

seedint (default=42)

Random seed

split_method{‘random’, ‘continuous’, ‘interleaved’, ‘customize’}, (default=’interleaved’)

Method to split train data into n_folds. ‘random’: random split. ‘continuous’: consecutive samples are grouped into one fold. ‘interleaved’: samples are assigned to each fold in a round-robin way. ‘customize’: customize splitting by providing a fold_ids array in fit() or fit_transform() functions.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

stat{‘mean’, ‘var’, ‘median’}, default=’mean’

The statistic used for encoding: the mean, variance, or median of the target.

multi_feature_mode{‘combination’, ‘independent’}, default=’combination’

How to handle multiple input features:

'combination': Encode all feature combinations together (cuML native behavior). Produces a single output column with encodings based on the joint distribution of all features.
'independent': Encode each feature independently (sklearn behavior). Produces N output columns, one per input feature, each containing encodings based only on that feature’s relationship with the target.

For single-feature input, both modes produce identical results.
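
The difference between the two modes can be sketched in plain Python. This is only an illustration of the grouping keys, not cuML's implementation: it ignores folds and smoothing, and the helper `group_means` and the toy columns `color` and `size` are hypothetical names introduced for this example.

```python
from collections import defaultdict


def group_means(keys, y):
    """Return {key: mean of y over the rows that share that key}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, target in zip(keys, y):
        sums[key] += target
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy data: two categorical features and a binary target.
color = ["red", "red", "blue", "blue"]
size = ["S", "L", "S", "L"]
y = [1, 0, 1, 1]

# 'combination': one key per row built from all features, one output column.
joint_keys = list(zip(color, size))
joint_means = group_means(joint_keys, y)
combination = [[joint_means[key]] for key in joint_keys]

# 'independent': one lookup table per feature, one output column per feature.
color_means = group_means(color, y)
size_means = group_means(size, y)
independent = [[color_means[c], size_means[s]] for c, s in zip(color, size)]

print(combination)  # 4 rows x 1 column
print(independent)  # 4 rows x 2 columns
```

In this toy data every (color, size) pair is unique, so the joint encoding simply echoes the target, while the independent encoding averages over each feature separately.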

Attributes:

categories_list of cupy.ndarray: The categories of each input feature determined during fitting. Each element is an array of unique category values for that feature, sorted in ascending order.
n_features_in_int: Number of features seen during fit().
encode_allcudf.DataFrame: DataFrame containing the learned encodings for all category combinations. Used internally for transforming new data.
meanfloat: The overall mean of the target variable, computed during fitting. Used for smoothing and imputing unseen categories.
y_stat_valfloat: The statistic value (mean, variance, or median) of the target variable, depending on the stat parameter. Used to impute encodings for unseen categories.
traincudf.DataFrame or None: The training DataFrame used during fitting, containing the original features, target values, and fold assignments. Set to None if the encoder was loaded from a sklearn model via from_sklearn().
train_encodecuml.internals.array.CumlArray or None: The encoded values for the training data, computed during fit() or fit_transform(). Set to None if the encoder was loaded from a sklearn model via from_sklearn().
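
The role of the overall mean in smoothing can be illustrated with the additive smoothing form described in reference [1]. This is a minimal sketch, assuming the encoder blends a group's statistics with the global mean as if `smooth` extra samples at the global mean had been observed; the exact expression cuML uses is not stated here, and `smoothed_mean` is a hypothetical helper.

```python
def smoothed_mean(group_sum, group_count, global_mean, smooth=0.0):
    """Blend a group's mean with the global mean (additive smoothing).

    Assumption: `smooth` acts as a count of pseudo-samples located at the
    global mean. With smooth=0 this reduces to the plain group mean.
    """
    denominator = group_count + smooth
    if denominator == 0:
        # No samples and no smoothing: fall back to the global mean.
        return global_mean
    return (group_sum + smooth * global_mean) / denominator


y = [1, 0, 1, 1]
global_mean = sum(y) / len(y)  # 0.75

# Suppose one category covers two rows whose targets are 0 and 1.
group_sum, group_count = 1, 2

no_smoothing = smoothed_mean(group_sum, group_count, global_mean)
light_smoothing = smoothed_mean(group_sum, group_count, global_mean, smooth=2)

print(no_smoothing)     # plain group mean
print(light_smoothing)  # pulled toward the global mean
```

Larger values of `smooth` pull small groups more strongly toward the global mean, which is the usual motivation for smoothing rare categories.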

Methods

`fit`(X, y, *[, fold_ids])	Fit a TargetEncoder instance to a set of categories
`fit_transform`(X, y, *[, fold_ids])	Simultaneously fit and transform an input
`transform`(X)	Transform an input into its target-encoded values.

Notes

sklearn Conversion Limitations

When converting between cuML and sklearn via as_sklearn() and from_sklearn(), be aware of the following semantic differences:

Training data behavior: cuML’s transform() returns cross-validated (regularized) encodings when called on training data to prevent data leakage. sklearn’s transform always returns global mean encodings regardless of whether the input is training or test data. After roundtrip conversion, the cuML model will return global encodings for all data since the training data reference is not preserved.
Multi-feature encoding: cuML’s default multi_feature_mode='combination' encodes feature combinations jointly, while sklearn always encodes features independently. Multi-feature models fitted with 'combination' mode cannot be converted to sklearn; use multi_feature_mode='independent' for sklearn compatibility.

Cross-Validation Differences

cuML and sklearn use different cross-validation fold assignment strategies during fit_transform. Both are valid target encoding implementations, but they produce different encoded values for the same input:

sklearn: Uses KFold/StratifiedKFold with specific sample-to-fold assignments based on random_state.
cuML: Uses configurable split_method ('interleaved', 'random', 'continuous', or 'customize') with different fold assignment logic.

Because samples are assigned to different folds, the leave-fold-out encoding for each sample is computed from different data subsets. For example:

# Same data, same random_state, different encoded values:
# sklearn fit_transform: [0.52, 0.48, 0.51, 0.49, ...]
# cuML fit_transform:    [0.50, 0.51, 0.49, 0.52, ...]

This difference only affects fit_transform on training data. The transform method on test data produces equivalent results since it uses global statistics computed from all training samples.
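
The effect of fold assignment on training encodings can be sketched in plain Python. This is an illustration of leave-fold-out mean encoding without smoothing, not cuML's or sklearn's code; `leave_fold_out_means` is a hypothetical helper, and falling back to the global mean when no other fold contains the category is an assumption.

```python
def leave_fold_out_means(categories, y, fold_ids):
    """Encode each row with the mean target of same-category rows in other folds."""
    global_mean = sum(y) / len(y)
    encoded = []
    for cat, fold in zip(categories, fold_ids):
        others = [
            y[j]
            for j in range(len(y))
            if categories[j] == cat and fold_ids[j] != fold
        ]
        # Assumption: use the global mean if no other fold has this category.
        encoded.append(sum(others) / len(others) if others else global_mean)
    return encoded


# Data from the Examples section, with round-robin assignment into 4 folds.
doc_example = leave_fold_out_means(
    ["a", "b", "b", "a"], [1, 0, 1, 1], [i % 4 for i in range(4)]
)

# Same (hypothetical) data, two different fold assignments.
categories = ["a", "a", "a", "b", "b", "b"]
y = [1, 0, 0, 1, 1, 0]
enc_three_folds = leave_fold_out_means(categories, y, [0, 1, 2, 0, 1, 2])
enc_two_folds = leave_fold_out_means(categories, y, [0, 1, 0, 1, 0, 1])

print(doc_example)
print(enc_three_folds)
print(enc_two_folds)
```

The first result is consistent with the fit_transform output shown in the Examples section. The last two results differ from each other even though the data is identical, because each row is encoded from a different subset of the other rows.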

References

[1]

https://maxhalford.github.io/blog/target-encoding/

Examples

Converting a categorical column to a numerical one

>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import TargetEncoder
>>> train = DataFrame({'category': ['a', 'b', 'b', 'a'],
...                    'label': [1, 0, 1, 1]})
>>> test = DataFrame({'category': ['a', 'c', 'b', 'a']})

>>> encoder = TargetEncoder(output_type='numpy')
>>> encoded = encoder.fit_transform(train[["category"]], train.label)
>>> encoded
array([[1.],
       [1.],
       [0.],
       [1.]])

fit(X, y, *, fold_ids=None)

Fit a TargetEncoder instance to a set of categories

Parameters:

Xcudf.Series or cudf.DataFrame or cupy.ndarray: categories to be encoded. Its elements may or may not be unique
ycudf.Series or cupy.ndarray: Series containing the target variable.
fold_idscudf.Series or cupy.ndarray: Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns:

selfTargetEncoder: A fitted instance of itself to allow method chaining
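
The fold assignments that fold_ids replaces can be sketched as follows. This is a plain-Python illustration of the three automatic strategies named under split_method, plus a basic check for custom fold ids. The helpers `make_fold_ids` and `check_fold_ids` are hypothetical, and the exact arithmetic cuML uses for 'continuous' and 'random' is an assumption.

```python
import random


def make_fold_ids(n_samples, n_folds=4, split_method="interleaved", seed=42):
    """Assign each sample index to a fold id in [0, n_folds - 1]."""
    if split_method == "interleaved":
        # Round robin: 0, 1, ..., n_folds - 1, 0, 1, ...
        return [i % n_folds for i in range(n_samples)]
    if split_method == "continuous":
        # Consecutive samples share a fold (assumed near-equal block sizes).
        return [i * n_folds // n_samples for i in range(n_samples)]
    if split_method == "random":
        # Assumed: independent uniform draws from a seeded generator.
        rng = random.Random(seed)
        return [rng.randrange(n_folds) for _ in range(n_samples)]
    raise ValueError(f"unsupported split_method: {split_method!r}")


def check_fold_ids(fold_ids):
    """Validate custom fold ids and return the implied number of folds N."""
    if not fold_ids:
        raise ValueError("fold_ids must not be empty")
    for fold in fold_ids:
        if not isinstance(fold, int) or fold < 0:
            raise ValueError("fold_ids must be non-negative integers")
    return max(fold_ids) + 1


print(make_fold_ids(8, 4, "interleaved"))
print(make_fold_ids(8, 4, "continuous"))
print(check_fold_ids([0, 1, 2, 0, 1, 2]))
```

A custom assignment is useful when rows are not independent, for example when all rows from one group should land in the same fold.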

fit_transform(X, y, *, fold_ids=None) → CumlArray

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) TargetEncoder().fit(X, y).transform(X)

Parameters:

Xcudf.Series or cudf.DataFrame or cupy.ndarray: categories to be encoded. Its elements may or may not be unique
ycudf.Series or cupy.ndarray: Series containing the target variable.
fold_idscudf.Series or cupy.ndarray: Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns:

encodedcupy.ndarray: The target-encoded input

transform(X) → CumlArray

Transform an input into its target-encoded values.

This is intended for test data. For fitting and transforming the training data, prefer fit_transform.

Parameters:

Xcudf.Series: Input keys to be transformed. Its values do not have to match the categories given to fit

Returns:

encodedcupy.ndarray: The target-encoded input
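
How test data might be encoded from global statistics can be sketched in plain Python. This is an illustration for stat='mean' without smoothing, using the train and test columns from the Examples section. The helpers `fit_global_means` and `transform_with_means` are hypothetical, and the sketch assumes unseen categories receive the overall target mean, as described for y_stat_val.

```python
from collections import defaultdict


def fit_global_means(categories, y):
    """Return ({category: mean target over all rows}, overall target mean)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, y):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return means, sum(y) / len(y)


def transform_with_means(categories, means, fallback):
    """Look up each category; unseen categories get the fallback value."""
    return [means.get(cat, fallback) for cat in categories]


train_categories = ["a", "b", "b", "a"]
train_labels = [1, 0, 1, 1]
test_categories = ["a", "c", "b", "a"]  # 'c' never appears in training

means, global_mean = fit_global_means(train_categories, train_labels)
encoded_test = transform_with_means(test_categories, means, global_mean)

print(encoded_test)
```

Here 'a' maps to the mean of its two training targets, 'b' to the mean of its two, and the unseen 'c' to the overall mean of the training labels.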