KBinsDiscretizer

class cuml.preprocessing.KBinsDiscretizer(*args, **kwargs)

Bin continuous data into intervals.

Parameters:

n_bins : int or array-like, shape (n_features,) (default=5)

The number of bins to produce. Raises ValueError if n_bins < 2.

encode : {'onehot', 'onehot-dense', 'ordinal'}, (default='onehot')

Method used to encode the transformed result.

onehot: Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
onehot-dense: Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
ordinal: Return the bin identifier encoded as an integer value.
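To make the relationship between these encodings concrete, the following NumPy sketch expands ordinal bin identifiers into the dense one-hot layout. It runs on the CPU purely as an illustration and is not cuML's implementation; the helper name `ordinal_to_onehot_dense` is hypothetical.

```python
import numpy as np

def ordinal_to_onehot_dense(Xt, n_bins):
    """Expand ordinal bin ids into a dense one-hot array (illustrative helper).

    Xt: integer array of shape (n_samples, n_features) holding bin ids.
    n_bins: sequence of length n_features with the bin count per feature.
    """
    Xt = np.asarray(Xt, dtype=np.int64)
    blocks = []
    for j, k in enumerate(n_bins):
        # Indexing the k x k identity matrix by bin id yields one one-hot row per sample.
        blocks.append(np.eye(k, dtype=np.float64)[Xt[:, j]])
    # One block of columns per feature, laid out left to right.
    return np.hstack(blocks)

Xt = np.array([[0, 0], [1, 2], [2, 1]])
onehot = ordinal_to_onehot_dense(Xt, n_bins=[3, 3])
print(onehot)
```

Each feature contributes as many columns as it has bins, so the output width is the sum of the per-feature bin counts.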

strategy : {'uniform', 'quantile', 'kmeans'}, (default='quantile')

Strategy used to define the widths of the bins.

uniform: All bins in each feature have identical widths.
quantile: All bins in each feature have the same number of points.
kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.
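The difference between the uniform and quantile strategies can be sketched with plain NumPy on a single feature column. This is a CPU illustration under the assumption that edges are derived from the column minimum and maximum (uniform) or from evenly spaced percentiles (quantile); the function names are hypothetical, and the kmeans strategy is omitted for brevity.

```python
import numpy as np

def uniform_edges(col, n_bins):
    # Equal-width bins spanning the column minimum to its maximum.
    return np.linspace(col.min(), col.max(), n_bins + 1)

def quantile_edges(col, n_bins):
    # Edges at evenly spaced percentiles, so bins hold similar sample counts.
    return np.percentile(col, np.linspace(0, 100, n_bins + 1))

# A column with one large outlier makes the contrast visible.
col = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 100.0])
u = uniform_edges(col, 2)
q = quantile_edges(col, 2)
print("uniform: ", u)
print("quantile:", q)
```

With the outlier present, the uniform edges put five of the six samples into the first bin, while the quantile edges split the samples three and three.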

Attributes:

n_bins_ : int array, shape (n_features,): Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.
bin_edges_ : array of arrays, shape (n_features,): The edges of each bin. Contains arrays of varying shapes (n_bins_[i] + 1,). Ignored features will have empty arrays.

Methods

`fit`(X[, y])	Fit the estimator.
`inverse_transform`(Xt)	Transform discretized data back to original feature space.
`transform`(X)	Discretize the data.

See also

cuml.preprocessing.Binarizer: Class used to bin values as 0 or 1 based on a parameter threshold.

Notes

In the bin edges for feature i, the first and last values are used only for inverse_transform. During transform, the bin edges are extended to:

np.concatenate([[-np.inf], bin_edges_[i][1:-1], [np.inf]])
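As a rough illustration of what the extended edges imply, the NumPy sketch below assigns bin indices for a single feature. It assumes half-open intervals that are closed on the left, which matches the example output in this page, and it uses `np.searchsorted` as a stand-in; it is not cuML's actual code path. The infinities are wrapped in one-element lists because `np.concatenate` expects sequences rather than scalars.

```python
import numpy as np

# Fitted edges for one feature (three bins).
bin_edges = np.array([-2.0, -1.0, 0.0, 1.0])

# Replace the outer edges with -inf and +inf.
extended = np.concatenate([[-np.inf], bin_edges[1:-1], [np.inf]])

values = np.array([-10.0, -1.5, -1.0, 0.5, 1.0, 7.0])
# Index k such that extended[k] <= value < extended[k + 1].
bin_ids = np.searchsorted(extended, values, side="right") - 1
print(bin_ids)
```

Values below the fitted minimum land in the first bin and values above the fitted maximum land in the last bin, so data outside the training range never produces an out-of-range bin index.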

You can combine KBinsDiscretizer with cuml.compose.ColumnTransformer if you only want to preprocess part of the features.

KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold).
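A minimal NumPy sketch of the idea behind such filtering: columns whose variance is zero carry no information and can be dropped. The array below is made-up illustration data for one feature with three bins where the middle bin received no samples; the filter mimics a zero-variance threshold and is not a call into any feature selection library.

```python
import numpy as np

# Dense one-hot block: the middle bin is empty, so its column is all zeros.
onehot = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1]], dtype=np.float64)

# Keep only columns that vary across samples.
keep = onehot.var(axis=0) > 0.0
reduced = onehot[:, keep]
print(keep)
print(reduced.shape)
```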

Examples

>>> from cuml.preprocessing import KBinsDiscretizer
>>> import cupy as cp
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> X = cp.array(X)
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt
array([[0, 0, 0, 0],
       [1, 1, 1, 0],
       [2, 2, 2, 1],
       [2, 2, 2, 2]], dtype=int32)

Sometimes it is useful to convert the data back into the original feature space. The inverse_transform method maps each bin identifier back to a value in that space: the midpoint (mean) of the two edges of its bin.

>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
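The midpoint rule can be checked by hand for the first feature. The following NumPy sketch recomputes the first column of the result above from the printed bin edges; it is a CPU illustration of the arithmetic, not the library's implementation.

```python
import numpy as np

bin_edges = np.array([-2.0, -1.0, 0.0, 1.0])  # est.bin_edges_[0] from above
bin_ids = np.array([0, 1, 2, 2])              # first column of Xt

# Midpoint of each pair of neighbouring edges.
centers = (bin_edges[:-1] + bin_edges[1:]) * 0.5
# Look up the midpoint of the bin each sample fell into.
restored = centers[bin_ids]
print(restored)
```

The last two samples share a bin, so both map to the same midpoint even though their original values (0 and 1) differed.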

fit(X, y=None) → KBinsDiscretizer

Fit the estimator.

Parameters:

X : numeric array-like, shape (n_samples, n_features): Data to be discretized.
y : None: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:

self

inverse_transform(Xt) → SparseCumlArray

Transform discretized data back to original feature space.

Note that this method cannot recover the original data exactly, because discretization discards the position of each value within its bin.

Parameters:

Xt : numeric array-like, shape (n_samples, n_features): Transformed data in the binned space.

Returns:

Xinv : numeric array-like: Data in the original feature space.

transform(X) → SparseCumlArray

Discretize the data.

Parameters:

X : numeric array-like, shape (n_samples, n_features): Data to be discretized.

Returns:

Xt : numeric array-like or sparse matrix: Data in the binned space.