KernelExplainer

class cuml.explainer.KernelExplainer(*, model, data, nsamples='auto', link='identity', verbose=False, random_state=None, is_gpu_model=None, dtype=<class 'numpy.float32'>, output_type=None)

GPU-accelerated version of SHAP’s kernel explainer.

cuML’s SHAP-based explainers accelerate the algorithmic part of SHAP. They are optimized for fast GPU-based models, like those in cuML. By creating the datasets and performing the internal calculations on the GPU, while minimizing data copies and transfers, they can accelerate explanations significantly. They can also be used with CPU-based models, where speedups are still possible but can be capped by factors such as data transfers and the speed of the model.

KernelExplainer is based on the Python SHAP package’s KernelExplainer class: slundberg/shap

Current characteristics of the GPU version:

  • Unlike the SHAP package, nsamples is a parameter at the initialization of the explainer and there is a small initialization time.

  • Only tabular data is supported for now, via passing the background dataset explicitly.

  • Sparse data support is planned for the near future.

  • Further optimizations are in progress. For example, if the background dataset has constant value columns and the observation has the same value in some entries, the number of evaluations of the function can be reduced (this will come in the next version).

Parameters:
model: function

Function that takes a matrix of samples with shape (n_samples, n_features) and computes the output for those samples with shape (n_samples,). The function must accept and return either CuPy or NumPy arrays.
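As a rough illustration of the contract described for model, the sketch below uses a hypothetical NumPy-only function (linear_model is not part of cuML) that maps an (n_samples, n_features) matrix to an (n_samples,) vector. A real use would typically pass something like a fitted estimator's predict method instead.

```python
import numpy as np


def linear_model(X):
    """Hypothetical model: maps (n_samples, n_features) to (n_samples,)."""
    X = np.asarray(X, dtype=np.float32)
    # Illustrative weights 1, 2, ..., n_features; any callable with the
    # same input/output shapes would satisfy the contract.
    weights = np.arange(1, X.shape[1] + 1, dtype=np.float32)
    return X @ weights


batch = np.ones((4, 3), dtype=np.float32)
out = linear_model(batch)
print(out.shape)  # one output value per input row
```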

data: Dense matrix containing floats or doubles.

The background dataset to use for integrating out features. To determine the impact of a feature, that feature is set to “missing” and the change in the model output is observed. cuML’s kernel SHAP supports only tabular data for now, so it expects a background dataset rather than a shap.masker object. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
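The idea of setting a feature to “missing” by integrating it out over the background dataset can be sketched in plain NumPy. This is only a conceptual illustration under assumed names (model, expected_output, and the toy data are all hypothetical), not cuML's actual implementation, and it measures the effect of a single feature for one coalition rather than a full SHAP value.

```python
import numpy as np

weights = np.array([2.0, -1.0, 0.5])


def model(X):
    # Hypothetical linear model used only for this illustration.
    return np.asarray(X) @ weights


background = np.array([[0.0, 0.0, 0.0],
                       [1.0, 4.0, 6.0]])
x = np.array([1.0, 1.0, 1.0])


def expected_output(x, background, present):
    """Average model output when absent features take background values."""
    samples = background.copy()
    # Features marked present are fixed to the observation's values;
    # the remaining columns keep the values of each background row.
    samples[:, present] = x[present]
    return model(samples).mean()


full = expected_output(x, background, np.array([True, True, True]))
without_f0 = expected_output(x, background, np.array([False, True, True]))
effect_f0 = full - without_f0
print(effect_f0)
```

Kernel SHAP evaluates many such coalitions and combines them with a weighted regression, so the number of sampled coalitions drives the cost of each explanation.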

nsamples: int or ‘auto’ (default = ‘auto’)

Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower-variance estimates of the SHAP values. The ‘auto’ setting uses nsamples = 2 * data.shape[1] + 2048.
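The arithmetic behind the “auto” setting is easy to check by hand. The helper below is a hypothetical stand-in that only reproduces the documented formula; it is not a cuML function.

```python
def auto_nsamples(n_features):
    """Documented 'auto' rule: 2 * n_features + 2048 (hypothetical helper)."""
    return 2 * n_features + 2048


# A background dataset with 10 columns, as in the example on this page.
print(auto_nsamples(10))  # 2068
```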

link: function or str (default = ‘identity’)

The link function used to map between the output units of the model and the SHAP value units. By default it is identity, but logit can be useful so that expectations are computed in probability units while explanations remain in the (more naturally additive) log-odds units. For more details on how link functions work, see any overview of link functions for generalized linear models.
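To make the difference between the two link choices concrete, the following NumPy sketch applies identity and logit to a few probabilities. The function names here are written out for illustration and are not cuML's internal implementations.

```python
import numpy as np


def identity(p):
    # Leaves model outputs in their original units.
    return p


def logit(p):
    # Maps probabilities in (0, 1) to log-odds.
    return np.log(p / (1.0 - p))


def inverse_logit(z):
    # Sigmoid: maps log-odds back to probabilities.
    return 1.0 / (1.0 + np.exp(-z))


probs = np.array([0.1, 0.5, 0.9])
log_odds = logit(probs)
print(identity(probs))
print(log_odds)
print(inverse_logit(log_odds))  # recovers the original probabilities
```

With the logit link, contributions add up in log-odds space, which is why it tends to suit models that output probabilities.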

random_state: int, RandomState instance or None (default = None)

Seed for the random number generator used for dataset creation. Note: because of the design of the sampling algorithm, concurrency can affect results, so fully deterministic execution is currently not guaranteed.

is_gpu_model: bool or None (default = None)

If None, the explainer will try to infer whether the model can take GPU data (as CuPy arrays); otherwise it will call the model with NumPy arrays. Set to True to force the explainer to use GPU data, or to False to force it to use NumPy data.

dtype: np.float32 or np.float64 (default = np.float32)

Precision of the data generated to call the model.

output_type: ‘cupy’ or ‘numpy’ (default = ‘numpy’)

Type of data to output. If not specified, the explainer defaults to ‘numpy’ for the time being to improve compatibility.

Methods

shap_values(self, X[, l1_reg, as_list])

Interface to estimate the SHAP values for a set of samples.

Examples

>>> from cuml import SVR
>>> from cuml import make_regression
>>> from cuml import train_test_split
>>>
>>> from cuml.explainer import KernelExplainer
>>>
>>> X, y = make_regression(
...     n_samples=102,
...     n_features=10,
...     noise=0.1,
...     random_state=42)
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X,
...     y,
...     test_size=2,
...     random_state=42)
>>>
>>> model = SVR().fit(X_train, y_train)
>>>
>>> cu_explainer = KernelExplainer(
...     model=model.predict,
...     data=X_train,
...     is_gpu_model=True,
...     random_state=42)
>>>
>>> cu_shap_values = cu_explainer.shap_values(X_test)
>>> cu_shap_values
array([[-0.41163236, -0.29839307, -0.31082764, -0.21910861, 0.20798518,
      1.525831  , -0.07726735, -0.23897147, -0.5901833 , -0.03319931],
    [-0.37491834, -0.22581327, -1.2146976 ,  0.03793442, -0.24420738,
      -0.4875331 , -0.05438256, 0.16568947, -1.9978098 , -0.19110584]],
    dtype=float32)

shap_values(self, X, l1_reg='auto', as_list=True)

Interface to estimate the SHAP values for a set of samples. Corresponds to the SHAP package’s legacy interface, and is our main API currently.

Parameters:
X: Dense matrix containing floats or doubles.

Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

l1_reg: str (default = ‘auto’)

The l1 regularization to use for feature selection.

as_list: bool (default = True)

Set to True to return a list of arrays for multi-dimensional models (like predict_proba functions) to match the SHAP package behavior. Set to False to return them as an array of arrays.

Returns:
shap_values: array or list
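One way to sanity-check a set of SHAP values is the additivity property: for each explained row, the values should sum to the model output minus the average output over the background dataset. The sketch below illustrates this with a hypothetical linear model in plain NumPy, where the exact values are known in closed form as w_i * (x_i - mean(background_i)). It does not call cuML, and sampled kernel SHAP estimates will generally satisfy the identity only approximately.

```python
import numpy as np

weights = np.array([2.0, -1.0, 0.5])


def model(X):
    # Hypothetical linear model used only for this illustration.
    return np.asarray(X) @ weights


background = np.array([[0.0, 0.0, 0.0],
                       [1.0, 4.0, 6.0]])
x = np.array([1.0, 1.0, 1.0])

# Exact SHAP values for a linear model with an interventional background.
phi = weights * (x - background.mean(axis=0))

# Average model output over the background rows (the base value).
base_value = model(background).mean()

print(phi)
print(phi.sum() + base_value, model(x))  # the two numbers should agree
```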