TargetEncoder
- class cuml.preprocessing.TargetEncoder(*, n_folds=4, smooth=0, seed=42, split_method='interleaved', verbose=False, output_type=None, stat='mean', multi_feature_mode='combination')
A cuDF-based implementation of target encoding [1], which replaces one or more categorical variables, ‘Xs’, with the average of the corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs, and the mean of Y within each group replaces each value of Xs. Several optimizations are applied to prevent label leakage and to parallelize execution.
- Parameters:
- n_folds : int (default=4)
Default number of folds for fitting training data. To prevent label leakage in fit, the data is split into n_folds folds and each fold is encoded using the target values of the remaining folds.
- smooth : int or float (default=0)
Count of samples used to smooth the encoding. 0 means no smoothing.
- seed : int (default=42)
Random seed.
- split_method : {‘random’, ‘continuous’, ‘interleaved’, ‘customize’}, (default=’interleaved’)
Method used to split the training data into n_folds folds. ‘random’: random split. ‘continuous’: consecutive samples are grouped into one fold. ‘interleaved’: samples are assigned to folds in a round-robin way. ‘customize’: custom splitting, by providing a fold_ids array to fit() or fit_transform().
- verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- stat : {‘mean’, ‘var’, ‘median’}, default=‘mean’
The statistic used for encoding: the mean, variance, or median of the target.
- multi_feature_mode : {‘combination’, ‘independent’}, default=’combination’
How to handle multiple input features:
'combination': Encode all feature combinations together (cuML native behavior). Produces a single output column with encodings based on the joint distribution of all features.
'independent': Encode each feature independently (sklearn behavior). Produces N output columns, one per input feature, each containing encodings based only on that feature’s relationship with the target.
For single-feature input, both modes produce identical results.
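As a rough illustration of the n_folds, split_method, and smooth parameters above, the pure-Python sketch below shows one plausible reading of the 'interleaved' and 'continuous' fold assignments, plus a common additive smoothing formula. The helper names and the exact formulas are assumptions made for illustration; they are not taken from the cuML source.

```python
def interleaved_folds(n_samples, n_folds):
    """Round-robin assignment: sample i goes to fold i % n_folds (assumed)."""
    return [i % n_folds for i in range(n_samples)]


def continuous_folds(n_samples, n_folds):
    """Consecutive samples share a fold; fold sizes are as even as possible (assumed)."""
    return [i * n_folds // n_samples for i in range(n_samples)]


def smoothed_mean(total, count, prior, smooth=0):
    """Additive smoothing: behaves as if `smooth` extra samples with value `prior` were seen.

    With smooth=0 this is the plain group mean. This formula is a common
    choice for target encoding and is an assumption here, not cuML's code.
    """
    return (total + smooth * prior) / (count + smooth)


interleaved = interleaved_folds(6, 3)   # [0, 1, 2, 0, 1, 2]
continuous = continuous_folds(6, 3)     # [0, 0, 1, 1, 2, 2]

# A group with two samples summing to 2.0, and an overall target mean of 0.75:
plain = smoothed_mean(2.0, 2, prior=0.75, smooth=0)      # 1.0
shrunk = smoothed_mean(2.0, 2, prior=0.75, smooth=2)     # 0.875
```

Larger smooth values pull small groups toward the overall target statistic, which is the usual motivation for smoothing in target encoding.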
- Attributes:
- categories_ : list of cupy.ndarray
The categories of each input feature determined during fitting. Each element is an array of unique category values for that feature, sorted in ascending order.
- n_features_in_ : int
Number of features seen during fit().
- encode_all : cudf.DataFrame
DataFrame containing the learned encodings for all category combinations. Used internally for transforming new data.
- mean : float
The overall mean of the target variable, computed during fitting. Used for smoothing and imputing unseen categories.
- y_stat_val : float
The statistic value (mean, variance, or median) of the target variable, depending on the stat parameter. Used to impute encodings for unseen categories.
- train : cudf.DataFrame or None
The training DataFrame used during fitting, containing the original features, target values, and fold assignments. Set to None if the encoder was loaded from a sklearn model via from_sklearn().
- train_encode : cuml.internals.array.CumlArray or None
The encoded values for the training data, computed during fit() or fit_transform(). Set to None if the encoder was loaded from a sklearn model via from_sklearn().
Methods
fit(X, y, *[, fold_ids]): Fit a TargetEncoder instance to a set of categories.
fit_transform(X, y, *[, fold_ids]): Simultaneously fit and transform an input.
transform(X): Transform an input into its target-encoded values.
Notes
sklearn Conversion Limitations
When converting between cuML and sklearn via as_sklearn() and from_sklearn(), be aware of the following semantic differences:
- Training data behavior: cuML’s transform() returns cross-validated (regularized) encodings when called on training data to prevent data leakage. sklearn’s transform always returns global mean encodings regardless of whether the input is training or test data. After roundtrip conversion, the cuML model will return global encodings for all data since the training data reference is not preserved.
- Multi-feature encoding: cuML’s default multi_feature_mode='combination' encodes feature combinations jointly, while sklearn always encodes features independently. Multi-feature models fitted with 'combination' mode cannot be converted to sklearn; use multi_feature_mode='independent' for sklearn compatibility.
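The difference between the two multi-feature modes can be sketched in plain Python. The example below computes global (not cross-validated) mean encodings on a tiny made-up dataset, once keyed on the joint feature tuple and once per feature. The data and the group_means helper are hypothetical and only illustrate the output shapes described above.

```python
from collections import defaultdict


def group_means(keys, targets):
    """Return {key: mean of the targets that share that key}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, target in zip(keys, targets):
        sums[key] += target
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Two categorical features per row, plus a binary target (made-up data).
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "x")]
targets = [1, 0, 1, 0]

# 'combination': group by the joint value of all features -> one output column.
joint_means = group_means(rows, targets)
combination = [[joint_means[row]] for row in rows]

# 'independent': group by each feature separately -> one output column per feature.
columns = []
for j in range(len(rows[0])):
    feature = [row[j] for row in rows]
    means = group_means(feature, targets)
    columns.append([means[value] for value in feature])
independent = [list(values) for values in zip(*columns)]

print(combination)   # 4 rows x 1 column
print(independent)   # 4 rows x 2 columns
```

In this sketch the joint key ("b", "x") gets 0.5, while feature "x" on its own gets 2/3, which shows why the two modes are not interchangeable for multi-feature input.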
Cross-Validation Differences
cuML and sklearn use different cross-validation fold assignment strategies during fit_transform. Both are valid target encoding implementations, but they produce different encoded values for the same input:
- sklearn: Uses KFold/StratifiedKFold with specific sample-to-fold assignments based on random_state.
- cuML: Uses configurable split_method ('interleaved', 'random', 'continuous', or 'customize') with different fold assignment logic.
Because samples are assigned to different folds, the leave-fold-out encoding for each sample is computed from different data subsets. For example:
# Same data, same random_state, different encoded values:
# sklearn fit_transform: [0.52, 0.48, 0.51, 0.49, ...]
# cuML fit_transform: [0.50, 0.51, 0.49, 0.52, ...]
This difference only affects fit_transform on training data. The transform method on test data produces equivalent results since it uses global statistics computed from all training samples.
References
Examples
Converting a categorical column to a numerical one
>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import TargetEncoder
>>> train = DataFrame({'category': ['a', 'b', 'b', 'a'],
...                    'label': [1, 0, 1, 1]})
>>> test = DataFrame({'category': ['a', 'c', 'b', 'a']})
>>> encoder = TargetEncoder(output_type='numpy')
>>> encoded = encoder.fit_transform(train[["category"]], train.label)
>>> encoded
array([[1.],
       [1.],
       [0.],
       [1.]])
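The encoded values in this example can be checked by hand. The sketch below is a pure-Python approximation of leave-fold-out mean encoding, assuming that 'interleaved' means sample i goes to fold i % n_folds and that a sample with no same-category rows in other folds falls back to the overall target mean. Both assumptions and the function name are illustrative, not cuML's implementation.

```python
def leave_fold_out_mean(categories, targets, n_folds=4):
    """Encode each sample with the mean target of same-category rows in other folds."""
    fold_ids = [i % n_folds for i in range(len(categories))]  # assumed 'interleaved'
    overall_mean = sum(targets) / len(targets)
    encoded = []
    for category, fold in zip(categories, fold_ids):
        # Targets of rows with the same category that live in a different fold.
        others = [
            t
            for c, t, f in zip(categories, targets, fold_ids)
            if c == category and f != fold
        ]
        # Assumed fallback when no other fold contains this category.
        encoded.append(sum(others) / len(others) if others else overall_mean)
    return encoded


encoded = leave_fold_out_mean(['a', 'b', 'b', 'a'], [1, 0, 1, 1], n_folds=4)
print(encoded)  # [1.0, 1.0, 0.0, 1.0]
```

With four samples and four folds, every sample is alone in its fold, so each row is encoded from the single other row that shares its category. That matches the values shown in the doctest output.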
- fit(X, y, *, fold_ids=None)
Fit a TargetEncoder instance to a set of categories
- Parameters:
- X : cudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique.
- y : cudf.Series or cupy.ndarray
Series containing the target variable.
- fold_ids : cudf.Series or cupy.ndarray (default=None)
Series containing the indices of the customized folds. Its values should be integers in the range [0, N-1] to split the data into N folds. If None, fold_ids is generated based on split_method.
- Returns:
- self : TargetEncoder
A fitted instance of itself to allow method chaining.
- fit_transform(X, y, *, fold_ids=None) → CumlArray
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) TargetEncoder().fit(X, y).transform(X)
- Parameters:
- X : cudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique.
- y : cudf.Series or cupy.ndarray
Series containing the target variable.
- fold_ids : cudf.Series or cupy.ndarray (default=None)
Series containing the indices of the customized folds. Its values should be integers in the range [0, N-1] to split the data into N folds. If None, fold_ids is generated based on split_method.
- Returns:
- encoded : cupy.ndarray
The target-encoded input.
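For split_method='customize', the fold_ids argument has to contain integers in [0, N-1]. One way to build such an array, sketched below in plain Python, is to map a grouping key to a fold so that all rows of a group land in the same fold. The groups column and the fold_ids_from_groups helper are hypothetical and serve only to illustrate the expected value range.

```python
def fold_ids_from_groups(groups, n_folds):
    """Map each distinct group label to a fold id in [0, n_folds - 1]."""
    distinct = sorted(set(groups))
    fold_of_group = {group: i % n_folds for i, group in enumerate(distinct)}
    return [fold_of_group[group] for group in groups]


# Hypothetical grouping key, e.g. a user id per row.
groups = ["u1", "u2", "u1", "u3", "u2", "u4"]
fold_ids = fold_ids_from_groups(groups, n_folds=2)
print(fold_ids)  # [0, 1, 0, 0, 1, 1]

# The resulting list could then be converted to a cupy.ndarray or cudf.Series
# and passed as fold_ids to fit() or fit_transform() (not run here).
```

Keeping a group inside one fold means its encodings never see its own targets, which is one reason to prefer custom folds over the built-in split methods.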
- transform(X) → CumlArray
Transform an input into its target-encoded values.
This is intended for test data. For fitting and transforming the training data, prefer fit_transform.
- Parameters:
- X : cudf.Series
Input keys to be transformed. Its values do not have to match the categories given to fit.
- Returns:
- encoded : cupy.ndarray
The target-encoded input.
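The behavior described for test data (global statistics from all training samples, with unseen categories imputed from the overall target statistic) can be approximated in plain Python. The sketch below assumes stat='mean' and no smoothing; the function names are hypothetical and this is not the cuML implementation.

```python
from collections import defaultdict


def fit_global_means(categories, targets):
    """Return ({category: mean target}, overall mean) over all training rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    category_means = {c: sums[c] / counts[c] for c in sums}
    overall_mean = sum(targets) / len(targets)
    return category_means, overall_mean


def transform_with_fallback(categories, category_means, overall_mean):
    """Look up each category; unseen categories get the overall mean."""
    return [category_means.get(c, overall_mean) for c in categories]


# Same train/test data as in the Examples section.
category_means, overall_mean = fit_global_means(['a', 'b', 'b', 'a'], [1, 0, 1, 1])
test_encoded = transform_with_fallback(['a', 'c', 'b', 'a'], category_means, overall_mean)
print(test_encoded)  # [1.0, 0.75, 0.5, 1.0]
```

Category 'c' never appears in the training data, so in this sketch it receives the overall training mean of 0.75.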