Hyperparameter Search with RandomizedSearchCV

This notebook demonstrates how cuml.accel speeds up a hyperparameter search workflow. Long-running steps interrupt your train of thought. With cuml.accel, a workflow that is tedious because it takes minutes to complete can finish in about 30 seconds to a minute.

In this example we build a preprocessing + classification pipeline and use RandomizedSearchCV to find the best configuration. The same principle applies to many other tasks: speeding up a step with cuml.accel can turn “requires a coffee break per iteration” into “it is fun to iterate on ideas”.

Pipeline: StandardScaler → PCA → KNeighborsClassifier

KNN is distance-based, so the preprocessing steps are essential:

StandardScaler normalises features that span very different ranges (elevation 0–3800 vs binary soil-type indicators 0/1).
PCA reduces the 54-dimensional feature space (40 of which are sparse one-hot columns) to a compact representation where distances are more informative.
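To make the scaling argument concrete, here is a minimal sketch on three made-up rows (an elevation-like column and a binary indicator; the values are hypothetical, not taken from the dataset). It measures how much of the squared Euclidean distance between two rows comes from the binary column, before and after the zero-mean, unit-variance standardisation that StandardScaler applies by default.

```python
import numpy as np

# Hypothetical toy rows: [elevation-like value, binary indicator].
X_toy = np.array([
    [2000.0, 0.0],
    [2010.0, 1.0],   # close in elevation, different indicator
    [2600.0, 0.0],   # far in elevation, same indicator
])


def distances_from_first(A):
    """Euclidean distance from row 0 to every other row."""
    return np.linalg.norm(A[1:] - A[0], axis=1)


raw = distances_from_first(X_toy)

# Standardise each column to zero mean and unit variance (population std),
# which mirrors StandardScaler with its default settings.
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = distances_from_first(X_std)

# Share of the squared distance between rows 0 and 1 due to the binary column.
raw_share = (X_toy[1, 1] - X_toy[0, 1]) ** 2 / raw[0] ** 2
scaled_share = (X_std[1, 1] - X_std[0, 1]) ** 2 / scaled[0] ** 2
print(f"indicator share of squared distance: raw={raw_share:.3f}, scaled={scaled_share:.3f}")
```

On the raw values the indicator barely registers next to the elevation-like column; after standardisation both columns contribute on comparable terms.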

Dataset: Forest Cover Type (300K subsample, 54 features, 7 classes).

Without cuml.accel, this search takes several minutes (CPU, n_jobs=10). With cuml.accel enabled the same search completes in about a minute or less.

All three pipeline steps (StandardScaler, PCA, KNeighborsClassifier) are GPU-accelerated by cuml.accel.

[1]:

%load_ext cuml.accel
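The `%load_ext` magic only works inside IPython or Jupyter. For a plain Python script, the cuML documentation describes running `python -m cuml.accel script.py` or calling `cuml.accel.install()` before scikit-learn is imported. The sketch below assumes that programmatic entry point exists in the installed cuML version and falls back to plain scikit-learn otherwise; treat it as an illustration rather than the notebook's own setup.

```python
# Programmatic alternative to the %load_ext magic.
# Assumption: cuml.accel.install() is available in the installed cuML version.
# It should run before scikit-learn is imported so estimators can be intercepted.
try:
    import cuml.accel

    cuml.accel.install()
    accelerated = True
except Exception:  # cuML missing or no usable GPU: fall back to plain scikit-learn
    accelerated = False

print("cuml.accel active:", accelerated)
```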

Load and prepare the dataset

We use the Forest Cover Type dataset (581K samples, 54 features, 7 cover-type classes). To keep runtimes manageable we subsample to 300K rows and split 80/20 into train and test sets.

[2]:

import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

X_full, y_full = fetch_covtype(return_X_y=True)

N_SUBSAMPLE = 300_000
rng = np.random.RandomState(42)
idx = rng.choice(len(X_full), size=N_SUBSAMPLE, replace=False)
X, y = X_full[idx], y_full[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

print(f"Full dataset:  {X_full.shape[0]:,} samples, {X_full.shape[1]} features")
print(f"Subsample:     {N_SUBSAMPLE:,}")
print(f"Train:         {X_train.shape[0]:,}")
print(f"Test:          {X_test.shape[0]:,}")
print(f"Classes:       {len(np.unique(y_train))}")

Full dataset:  581,012 samples, 54 features
Subsample:     300,000
Train:         240,000
Test:          60,000
Classes:       7
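The split above passes `stratify=y`, which keeps the class proportions the same in the train and test sets. A small check on synthetic, imbalanced labels (the proportions and the names `X_demo` and `y_demo` are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng_demo = np.random.RandomState(0)
# Synthetic, imbalanced labels (hypothetical proportions, not the Covertype ones).
y_demo = rng_demo.choice([1, 2, 3], size=10_000, p=[0.7, 0.25, 0.05])
X_demo = rng_demo.normal(size=(10_000, 4))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo,
)


def class_fractions(labels):
    """Return the fraction of samples in each class, ordered by label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()


full, train, test = map(class_fractions, (y_demo, y_tr, y_te))
print("full: ", np.round(full, 3))
print("train:", np.round(train, 3))
print("test: ", np.round(test, 3))
```

This matters for the rare classes in particular: without stratification, a random split can leave them over- or under-represented in the test set.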

Define the pipeline and search space

The pipeline chains three steps, each GPU-accelerated by cuml.accel:

StandardScaler — normalise feature scales so that distance computations are not dominated by high-magnitude features like elevation.
PCA — project the 54 features (many of which are sparse one-hot indicators) into a lower-dimensional space.
KNeighborsClassifier — classify based on nearest neighbours in the PCA-reduced space.

We search over PCA dimensionality, number of neighbours, distance weighting, and distance metric.

[3]:

from scipy.stats import randint
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

param_distributions = {
    "pca__n_components": [10, 20, 30, 40],
    "knn__n_neighbors": randint(3, 30),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}
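Two details of the search space are easy to miss: the `<step>__<param>` keys route each value to the named pipeline step, and `scipy.stats.randint(3, 30)` excludes its upper bound, so it draws from 3 to 29. The sketch below rebuilds the same pipeline under the illustrative names `demo_pipe` and `space`, then previews candidates with scikit-learn's `ParameterSampler`. The draws are only meant to show the shape of the candidates, not to reproduce the search's exact sequence.

```python
from scipy.stats import randint
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

space = {
    "pca__n_components": [10, 20, 30, 40],
    "knn__n_neighbors": randint(3, 30),   # upper bound is exclusive: 3..29
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

# "<step>__<param>" sets a parameter on the named pipeline step.
demo_pipe.set_params(pca__n_components=20, knn__n_neighbors=7)
print(demo_pipe.named_steps["pca"].n_components,
      demo_pipe.named_steps["knn"].n_neighbors)

# Draw 20 candidate settings from the space and show the first few.
candidates = list(ParameterSampler(space, n_iter=20, random_state=42))
for params in candidates[:3]:
    print(params)

# Number of distinct settings in this discrete space.
n_combinations = 4 * (30 - 3) * 2 * 2
print("distinct settings:", n_combinations)
```

Sampling 20 of these settings covers only a small fraction of the space, which is the trade-off a randomized search makes against an exhaustive grid.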

Run the search

We sample 20 random parameter combinations and evaluate each with 5-fold cross-validation, for a total of 100 pipeline fits. With cuml.accel active this takes ~30 seconds to about a minute (the run recorded below reports a wall time of 1min 5s); without it (CPU, n_jobs=10) the same search takes ~4.5 minutes.
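The arithmetic behind these figures can be spelled out as a quick sanity check. The timings are the ones quoted in this notebook; the variable names are illustrative only.

```python
n_iter, n_folds = 20, 5
cv_fits = n_iter * n_folds        # fits performed during cross-validation
total_fits = cv_fits + 1          # refit=True adds one final fit on the full training set

cpu_seconds = 4.5 * 60            # "~4.5 minutes" quoted for the CPU run (n_jobs=10)
gpu_seconds_quoted = 30           # "~30 seconds" quoted for cuml.accel
gpu_seconds_recorded = 65         # wall time of the recorded run (1min 5s)

speedup_quoted = cpu_seconds / gpu_seconds_quoted
speedup_recorded = cpu_seconds / gpu_seconds_recorded
print(f"cross-validation fits: {cv_fits}, including refit: {total_fits}")
print(f"speedup: {speedup_quoted:.1f}x (quoted), {speedup_recorded:.1f}x (recorded run)")
```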

[4]:

%%time

from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=42,
    # n_jobs=1 for the GPU run; for a CPU-only run, raise n_jobs
    # (the CPU timings quoted in this notebook used n_jobs=10)
    n_jobs=1,
    refit=True,
)
search.fit(X_train, y_train)

CPU times: user 18.3 s, sys: 54 s, total: 1min 12s
Wall time: 1min 5s

[4]:

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                             ('pca', PCA()),
                                             ('knn', KNeighborsClassifier())]),
                   n_iter=20, n_jobs=1,
                   param_distributions={'knn__metric': ['euclidean',
                                                        'manhattan'],
                                        'knn__n_neighbors': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7c42158aaa50>,
                                        'knn__weights': ['uniform', 'distance'],
                                        'pca__n_components': [10, 20, 30, 40]},
                   random_state=42, scoring='accuracy')

Inspect the results

Let’s look at the best hyperparameters found by the search and how the top configurations compare.

[5]:

print("Best parameters:")
for param, val in sorted(search.best_params_.items()):
    print(f"  {param}: {val}")
print(f"\nBest CV accuracy: {search.best_score_:.4f}")

Best parameters:
  knn__metric: manhattan
  knn__n_neighbors: 8
  knn__weights: distance
  pca__n_components: 40

Best CV accuracy: 0.9001

[6]:

import pandas as pd

cv = pd.DataFrame(search.cv_results_)
cv = cv.sort_values("rank_test_score")
cv[["param_pca__n_components", "param_knn__n_neighbors",
    "param_knn__weights", "param_knn__metric",
    "mean_test_score", "std_test_score", "mean_fit_time"]].head(10)

[6]:

	param_pca__n_components	param_knn__n_neighbors	param_knn__weights	param_knn__metric	mean_test_score	std_test_score	mean_fit_time
6	40	8	distance	manhattan	0.900050	0.001693	0.172919
4	30	10	distance	manhattan	0.896242	0.000351	0.171193
16	40	6	uniform	manhattan	0.888579	0.001919	0.174930
12	30	17	distance	manhattan	0.885387	0.000704	0.172005
19	30	17	distance	manhattan	0.885387	0.000704	0.172753
18	40	23	distance	manhattan	0.877742	0.001438	0.175768
5	40	23	distance	manhattan	0.877742	0.001438	0.173877
13	40	25	distance	manhattan	0.875308	0.001065	0.173403
14	30	5	uniform	euclidean	0.874437	0.001196	0.170027
10	40	12	distance	euclidean	0.873946	0.001695	0.175108
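One way to read this table is to group it by a single hyperparameter. The sketch below copies the ten displayed rows by hand into a DataFrame named `top10` (an illustrative name) and summarises them per distance metric. Because these are only the top 10 of 20 candidates, the summary describes the best configurations rather than the whole search.

```python
import pandas as pd

# The ten rows displayed above, copied by hand (only the top 10 of 20 candidates).
top10 = pd.DataFrame({
    "n_components": [40, 30, 40, 30, 30, 40, 40, 40, 30, 40],
    "n_neighbors": [8, 10, 6, 17, 17, 23, 23, 25, 5, 12],
    "weights": ["distance", "distance", "uniform", "distance", "distance",
                "distance", "distance", "distance", "uniform", "distance"],
    "metric": ["manhattan"] * 8 + ["euclidean"] * 2,
    "mean_test_score": [0.900050, 0.896242, 0.888579, 0.885387, 0.885387,
                        0.877742, 0.877742, 0.875308, 0.874437, 0.873946],
})

# Random sampling can draw the same configuration twice; drop the repeats.
unique_rows = top10.drop_duplicates()

# Row count and best score per distance metric among the displayed rows.
by_metric = unique_rows.groupby("metric")["mean_test_score"].agg(["count", "max"])
print(by_metric)
```

On the full results, the same `groupby` can be applied to `pd.DataFrame(search.cv_results_)` using the `param_knn__metric` column.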

Evaluate on the test set

RandomizedSearchCV with refit=True automatically refits the best model on the full training set. We can use it directly to score on held-out data.

[7]:

test_acc = search.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

Test accuracy: 0.9076
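With `refit=True` and `scoring="accuracy"`, calling `score` on the search object is equivalent to scoring the refitted `best_estimator_`. The sketch below checks this on a small synthetic dataset; the data sizes, the reduced search space and the names (`small_search`, `Xs_train`, and so on) are illustrative assumptions, chosen so that it runs in a few seconds on a CPU.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real data (hypothetical sizes).
X_s, y_s = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.2, random_state=0, stratify=y_s,
)

small_search = RandomizedSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA()),
        ("knn", KNeighborsClassifier()),
    ]),
    {"pca__n_components": [3, 5, 8], "knn__n_neighbors": randint(3, 15)},
    n_iter=5, cv=3, scoring="accuracy", random_state=0, refit=True,
)
small_search.fit(Xs_train, ys_train)

# Three routes to the same held-out accuracy.
via_search = small_search.score(Xs_test, ys_test)
via_best = small_search.best_estimator_.score(Xs_test, ys_test)
via_metric = accuracy_score(ys_test, small_search.predict(Xs_test))
print(via_search, via_best, via_metric)
```

Keeping the test set out of the search entirely is what makes this number a fair estimate: the cross-validated score was already used to pick the winning configuration.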