Hyperparameter Search with RandomizedSearchCV
This notebook demonstrates how cuml.accel speeds up a hyperparameter search workflow. Long-running steps interrupt your train of thought. With cuml.accel, a search that is tedious because it takes several minutes on CPU can finish in roughly 30 seconds to a minute.
In this example we build a preprocessing + classification pipeline and use RandomizedSearchCV to find the best configuration. The same principle applies to many other tasks: speeding up each iteration with cuml.accel turns a workflow that “requires a coffee break per iteration” into one where “it is fun to iterate on ideas”.
Pipeline: StandardScaler → PCA → KNeighborsClassifier
KNN is distance-based, so the preprocessing steps are essential:
- StandardScaler normalises features that span very different ranges (elevation 0–3800 vs binary soil-type indicators 0/1).
- PCA reduces the 54-dimensional feature space (40 of which are sparse one-hot columns) to a compact representation where distances are more informative.
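To make the scaling argument concrete, here is a minimal sketch with two made-up samples. The feature values and the per-feature mean and standard deviation are illustrative assumptions, not statistics computed from the Forest Cover Type data. It measures how much of the squared Euclidean distance comes from elevation before and after standardisation.

```python
import numpy as np

# Two hypothetical samples: [elevation in metres, binary soil-type indicator].
a = np.array([2500.0, 0.0])
b = np.array([2510.0, 1.0])

# Assumed per-feature mean and standard deviation (illustrative values only).
mean = np.array([2900.0, 0.1])
std = np.array([280.0, 0.3])


def elevation_share(u, v):
    """Fraction of the squared Euclidean distance contributed by elevation."""
    sq = (u - v) ** 2
    return sq[0] / sq.sum()


raw_share = elevation_share(a, b)
scaled_share = elevation_share((a - mean) / std, (b - mean) / std)

print(f"elevation share of squared distance, raw:    {raw_share:.3f}")
print(f"elevation share of squared distance, scaled: {scaled_share:.5f}")
```

With these numbers a 10 m elevation difference accounts for about 99% of the raw squared distance, even though the two samples differ in soil type. After standardisation the soil-type difference dominates instead.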
Dataset: Forest Cover Type (300K subsample, 54 features, 7 classes).
Without cuml.accel, this search takes several minutes (CPU, n_jobs=10). With cuml.accel enabled, the same search completes in about a minute or less (the run recorded in this notebook took 1 min 5 s wall time).
All three pipeline steps (StandardScaler, PCA, KNeighborsClassifier) are GPU-accelerated by cuml.accel.
[1]:
%load_ext cuml.accel
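The `%load_ext` magic only works inside IPython or Jupyter. For a plain Python script, the cuML documentation describes two routes as far as we know: running the script with `python -m cuml.accel script.py`, or calling `cuml.accel.install()` before scikit-learn is imported. The sketch below uses the second route and falls back to CPU when cuML is not installed. The `gpu_enabled` flag is a hypothetical name for illustration; check the calls against your installed cuML version.

```python
# Must run before scikit-learn is imported so that its estimators can be patched.
try:
    import cuml.accel

    cuml.accel.install()
    gpu_enabled = True
except ImportError:
    # cuML is not installed: the unmodified CPU scikit-learn code path is used.
    gpu_enabled = False

print(f"cuml.accel active: {gpu_enabled}")
```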
Load and prepare the dataset
We use the Forest Cover Type dataset (581K samples, 54 features, 7 cover-type classes). To keep runtimes manageable we subsample to 300K rows and split 80/20 into train and test sets.
[2]:
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
X_full, y_full = fetch_covtype(return_X_y=True)
N_SUBSAMPLE = 300_000
rng = np.random.RandomState(42)
idx = rng.choice(len(X_full), size=N_SUBSAMPLE, replace=False)
X, y = X_full[idx], y_full[idx]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)
print(f"Full dataset: {X_full.shape[0]:,} samples, {X_full.shape[1]} features")
print(f"Subsample: {N_SUBSAMPLE:,}")
print(f"Train: {X_train.shape[0]:,}")
print(f"Test: {X_test.shape[0]:,}")
print(f"Classes: {len(np.unique(y_train))}")
Full dataset: 581,012 samples, 54 features
Subsample: 300,000
Train: 240,000
Test: 60,000
Classes: 7
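The split above passes `stratify=y`, which keeps the class proportions the same in the train and test sets. This matters when classes are imbalanced. A minimal sketch on synthetic labels (hypothetical data, not the cover-type set) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70% / 20% / 10% across three classes.
y_demo = np.repeat([0, 1, 2], [700, 200, 100])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo,
)

# Per-class fractions in each split.
train_frac = np.bincount(y_tr) / len(y_tr)
test_frac = np.bincount(y_te) / len(y_te)
print("train:", train_frac.round(3))
print("test: ", test_frac.round(3))
```

The same check can be run on `y_train` and `y_test` from the cell above with `np.unique(..., return_counts=True)`, since the cover-type labels do not start at 0.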
Define the pipeline and search space
The pipeline chains three steps, each GPU-accelerated by cuml.accel:
1. StandardScaler: normalise feature scales so that distance computations are not dominated by high-magnitude features like elevation.
2. PCA: project the 54 features (many of which are sparse one-hot indicators) into a lower-dimensional space.
3. KNeighborsClassifier: classify based on nearest neighbours in the PCA-reduced space.
We search over PCA dimensionality, number of neighbours, distance weighting, and distance metric.
[3]:
from scipy.stats import randint
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
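pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])
# Keys follow the "<step name>__<parameter>" convention for Pipeline steps.
param_distributions = {
    "pca__n_components": [10, 20, 30, 40],
    # randint's upper bound is exclusive, so this samples k from 3 to 29.
    "knn__n_neighbors": randint(3, 30),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}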
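To preview which configurations a randomized search would draw from this space, without fitting anything, scikit-learn's `ParameterSampler` can be used directly. The sketch below repeats the search space under the hypothetical name `space` so that it is self-contained. With the same `random_state`, the draws should match what `RandomizedSearchCV` evaluates on the same scikit-learn version, but treat that as an assumption rather than a guarantee.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

space = {
    "pca__n_components": [10, 20, 30, 40],
    "knn__n_neighbors": randint(3, 30),  # upper bound is exclusive
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

# Draw 20 candidate configurations; each one is a plain dict of parameters.
candidates = list(ParameterSampler(space, n_iter=20, random_state=42))
for params in candidates[:3]:
    print(params)

k_values = [c["knn__n_neighbors"] for c in candidates]
print("sampled k range:", min(k_values), "to", max(k_values))
```

Because the neighbour count is sampled independently for each candidate, the same configuration can be drawn more than once.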
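Run the search
We sample 20 random parameter combinations and evaluate each with 5-fold cross-validation, for a total of 100 pipeline fits, plus one final refit of the best configuration on the full training set. With cuml.accel active this takes roughly 30 seconds to a minute (the run recorded below took 1 min 5 s wall time); without it (CPU, n_jobs=10) the same search takes ~4.5 minutes.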
[4]:
%%time
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=42,
    # For CPU, set n_jobs to a higher number
    n_jobs=1,
    refit=True,
)
search.fit(X_train, y_train)
CPU times: user 17.5 s, sys: 54 s, total: 1min 11s
Wall time: 1min 5s
[4]:
RandomizedSearchCV(cv=5,
estimator=Pipeline(steps=[('scaler', StandardScaler()),
('pca', PCA()),
('knn', KNeighborsClassifier())]),
n_iter=20, n_jobs=1,
param_distributions={'knn__metric': ['euclidean',
'manhattan'],
'knn__n_neighbors': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7ac8f17f0590>,
'knn__weights': ['uniform', 'distance'],
'pca__n_components': [10, 20, 30, 40]},
random_state=42, scoring='accuracy')
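The fit count can be checked on a scaled-down version of the search that runs in a few seconds on CPU. This is a sketch on synthetic data from `make_classification`. The dataset sizes, the reduced search space, and the `small_*` names are assumptions chosen for illustration, not the notebook's configuration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real data (hypothetical sizes).
X_small, y_small = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0,
)

small_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

small_search = RandomizedSearchCV(
    small_pipe,
    {"pca__n_components": [5, 10], "knn__n_neighbors": randint(3, 10)},
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
    refit=True,
)
small_search.fit(X_small, y_small)

# One entry per sampled candidate; each candidate is fitted once per fold.
n_candidates = len(small_search.cv_results_["params"])
cv_fits = n_candidates * small_search.n_splits_
print(f"{n_candidates} candidates x {small_search.n_splits_} folds = {cv_fits} CV fits, plus 1 refit")
```

Applying the same arithmetic to the search above gives 20 candidates times 5 folds, or 100 cross-validation fits, followed by one refit.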
Inspect the results
Let’s look at the best hyperparameters found by the search and how the top configurations compare.
[5]:
print("Best parameters:")
for param, val in sorted(search.best_params_.items()):
    print(f" {param}: {val}")
print(f"\nBest CV accuracy: {search.best_score_:.4f}")
Best parameters:
knn__metric: manhattan
knn__n_neighbors: 8
knn__weights: distance
pca__n_components: 40
Best CV accuracy: 0.9001
[6]:
import pandas as pd
cv = pd.DataFrame(search.cv_results_)
cv = cv.sort_values("rank_test_score")
cv[["param_pca__n_components", "param_knn__n_neighbors",
"param_knn__weights", "param_knn__metric",
"mean_test_score", "std_test_score", "mean_fit_time"]].head(10)
[6]:
| | param_pca__n_components | param_knn__n_neighbors | param_knn__weights | param_knn__metric | mean_test_score | std_test_score | mean_fit_time |
|---|---|---|---|---|---|---|---|
| 6 | 40 | 8 | distance | manhattan | 0.900050 | 0.001693 | 0.170435 |
| 4 | 30 | 10 | distance | manhattan | 0.896242 | 0.000351 | 0.167082 |
| 16 | 40 | 6 | uniform | manhattan | 0.888579 | 0.001919 | 0.175955 |
| 12 | 30 | 17 | distance | manhattan | 0.885387 | 0.000704 | 0.172816 |
| 19 | 30 | 17 | distance | manhattan | 0.885387 | 0.000704 | 0.170839 |
| 18 | 40 | 23 | distance | manhattan | 0.877742 | 0.001438 | 0.175124 |
| 5 | 40 | 23 | distance | manhattan | 0.877742 | 0.001438 | 0.170652 |
| 13 | 40 | 25 | distance | manhattan | 0.875308 | 0.001065 | 0.176696 |
| 14 | 30 | 5 | uniform | euclidean | 0.874437 | 0.001196 | 0.172954 |
| 10 | 40 | 12 | distance | euclidean | 0.873946 | 0.001695 | 0.169286 |
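One way to read such a table is to aggregate scores by a single hyperparameter. The sketch below does this for the distance metric, using the ten rows shown above copied by hand into a DataFrame. The short column names are hypothetical. Because these are only the top ten candidates, the summary is selection-biased and should be read as a hint rather than a conclusion.

```python
import pandas as pd

# The ten rows shown in the table above, copied by hand.
top10 = pd.DataFrame({
    "n_components": [40, 30, 40, 30, 30, 40, 40, 40, 30, 40],
    "metric": ["manhattan"] * 8 + ["euclidean"] * 2,
    "mean_test_score": [0.900050, 0.896242, 0.888579, 0.885387, 0.885387,
                        0.877742, 0.877742, 0.875308, 0.874437, 0.873946],
})

# Count, mean, and best score per distance metric.
summary = top10.groupby("metric")["mean_test_score"].agg(["count", "mean", "max"])
print(summary)
```

On the full results, the same aggregation can be run on the `cv` DataFrame by grouping on `param_knn__metric`.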
Evaluate on the test set
RandomizedSearchCV with refit=True automatically refits the best model on the full training set. We can use it directly to score on held-out data.
[7]:
test_acc = search.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
Test accuracy: 0.9076
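With `refit=True`, calling `score` on the search object delegates to the refitted best estimator and uses the scorer passed as `scoring`. The sketch below illustrates this equivalence on a small synthetic problem. The data and the `demo_search` name are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_d, y_d = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=0, stratify=y_d,
)

demo_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9]},
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=0,
    refit=True,
)
demo_search.fit(X_tr, y_tr)

# Two routes to the same held-out accuracy.
via_search = demo_search.score(X_te, y_te)
via_estimator = accuracy_score(y_te, demo_search.best_estimator_.predict(X_te))
print(via_search, via_estimator)
```

In the notebook above, `search.best_estimator_` is the fitted Pipeline, so it can also be used on its own for prediction once the search has finished.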