Hyperparameter Search with RandomizedSearchCV

This notebook demonstrates how cuml.accel speeds up a hyperparameter search workflow. Long-running steps interrupt your train of thought. With cuml.accel, a workflow that is tedious because it takes minutes to complete can finish in about 30 seconds.

In this example we build a preprocessing + classification pipeline and use RandomizedSearchCV to find the best configuration. The same principle applies to many other tasks: speeding up a slow step with cuml.accel can turn “requires a coffee break per iteration” into “it is fun to iterate on ideas”.

Pipeline: StandardScaler → PCA → KNeighborsClassifier

KNN is distance-based, so the preprocessing steps are essential:

  • StandardScaler normalises features that span very different ranges (elevation values in the thousands vs binary soil-type indicators 0/1).

  • PCA reduces the 54-dimensional feature space (40 of which are sparse one-hot columns) to a compact representation where distances are more informative.
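
To make the scaling argument concrete, here is a minimal sketch on hypothetical two-column data (an elevation-like column and a binary indicator, not the real dataset). It measures how much of the squared Euclidean distance between random pairs of rows comes from each column, before and after StandardScaler. The helper name squared_distance_share is an illustrative choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: one large-magnitude column and one 0/1 indicator column.
elevation_like = rng.uniform(2000, 3800, size=1000)
indicator = rng.randint(0, 2, size=1000).astype(float)
X_demo = np.column_stack([elevation_like, indicator])


def squared_distance_share(X):
    """Fraction of the summed squared row-to-row difference due to each column."""
    # Pair every row with a randomly chosen partner row.
    diffs = X - X[rng.permutation(len(X))]
    per_feature = (diffs ** 2).sum(axis=0)
    return per_feature / per_feature.sum()


raw_share = squared_distance_share(X_demo)
scaled_share = squared_distance_share(StandardScaler().fit_transform(X_demo))

print("share of squared distance (raw):   ", raw_share.round(4))
print("share of squared distance (scaled):", scaled_share.round(4))
```

On the raw data the large-magnitude column accounts for almost all of the distance, so the indicator is effectively ignored by a nearest-neighbour search. After scaling, both columns contribute on a comparable footing.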

Dataset: Forest Cover Type (300K subsample, 54 features, 7 classes).

Without cuml.accel, this search takes several minutes (CPU, n_jobs=10). With cuml.accel enabled the same search completes in under a minute.

All three pipeline steps (StandardScaler, PCA, KNeighborsClassifier) are GPU-accelerated by cuml.accel.

[1]:
%load_ext cuml.accel
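
In a notebook, the %load_ext magic above is all that is needed. For scripts, cuML also documents a command-line form (python -m cuml.accel script.py) and a programmatic call, cuml.accel.install(). The sketch below uses the latter, on the assumption that it runs before scikit-learn is imported. The ImportError fallback and the accelerated flag are illustrative additions, not part of cuml.accel.

```python
# Enable the accelerator before scikit-learn is imported. If cuML is not
# installed, this sketch falls back to plain CPU scikit-learn (the fallback
# is an assumption of this example, not a cuml.accel feature).
try:
    import cuml.accel

    cuml.accel.install()
    accelerated = True
except ImportError:
    accelerated = False

# Imported only after the install() attempt above.
from sklearn.neighbors import KNeighborsClassifier

print("cuml.accel active:", accelerated)
```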

Load and prepare the dataset

We use the Forest Cover Type dataset (581K samples, 54 features, 7 cover-type classes). To keep runtimes manageable we subsample to 300K rows and split 80/20 into train and test sets.

[2]:
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

X_full, y_full = fetch_covtype(return_X_y=True)

N_SUBSAMPLE = 300_000
rng = np.random.RandomState(42)
idx = rng.choice(len(X_full), size=N_SUBSAMPLE, replace=False)
X, y = X_full[idx], y_full[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

print(f"Full dataset:  {X_full.shape[0]:,} samples, {X_full.shape[1]} features")
print(f"Subsample:     {N_SUBSAMPLE:,}")
print(f"Train:         {X_train.shape[0]:,}")
print(f"Test:          {X_test.shape[0]:,}")
print(f"Classes:       {len(np.unique(y_train))}")
Full dataset:  581,012 samples, 54 features
Subsample:     300,000
Train:         240,000
Test:          60,000
Classes:       7
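
The stratify=y argument keeps the class proportions the same in the train and test sets, which matters when some classes are rare. A minimal sketch on hypothetical imbalanced labels (three classes at roughly 70/25/5 percent, not the real dataset) shows the effect. The helper class_fractions is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Hypothetical imbalanced labels and placeholder features.
y_demo = rng.choice([0, 1, 2], size=10_000, p=[0.70, 0.25, 0.05])
X_demo = rng.normal(size=(10_000, 4))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo,
)


def class_fractions(labels):
    """Return the fraction of samples that fall in each of the three classes."""
    counts = np.bincount(labels, minlength=3)
    return counts / counts.sum()


print("full: ", class_fractions(y_demo).round(3))
print("train:", class_fractions(y_tr).round(3))
print("test: ", class_fractions(y_te).round(3))
```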

Define the pipeline and search space

The pipeline chains three steps, each GPU-accelerated by cuml.accel:

  1. StandardScaler — normalise feature scales so that distance computations are not dominated by high-magnitude features like elevation.

  2. PCA — project the 54 features (many of which are sparse one-hot indicators) into a lower-dimensional space.

  3. KNeighborsClassifier — classify based on nearest neighbours in the PCA-reduced space.

We search over PCA dimensionality, number of neighbours, distance weighting, and distance metric.

[3]:
from scipy.stats import randint
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

param_distributions = {
    "pca__n_components": [10, 20, 30, 40],
    "knn__n_neighbors": randint(3, 30),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}
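
The cell that actually runs the search is not shown in this section. As a rough sketch of what such a call can look like, the example below runs RandomizedSearchCV over the same kind of pipeline on a small synthetic dataset. Everything specific here is an assumption made for illustration: the make_classification data, the smaller PCA grid, and the values of n_iter, cv and random_state. The names are prefixed with demo_ so they do not clash with the notebook's own variables.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the training data (not Forest Cover Type).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0,
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

demo_distributions = {
    "pca__n_components": [5, 10, 15],      # must not exceed the 20 features
    "knn__n_neighbors": randint(3, 30),    # integers 3..29, upper bound excluded
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

demo_search = RandomizedSearchCV(
    demo_pipe,
    demo_distributions,
    n_iter=10,           # assumed number of sampled configurations
    cv=3,                # assumed number of cross-validation folds
    scoring="accuracy",
    random_state=42,
    refit=True,          # refit the best configuration on all training data
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"Best CV accuracy: {demo_search.best_score_:.4f}")
```

Note that scipy's randint(3, 30) excludes its upper bound, so the number of neighbours is drawn from 3 to 29.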

Inspect the results

Let’s look at the best hyperparameters found by the search and how the top configurations compare.

[5]:
print("Best parameters:")
for param, val in sorted(search.best_params_.items()):
    print(f"  {param}: {val}")
print(f"\nBest CV accuracy: {search.best_score_:.4f}")
Best parameters:
  knn__metric: manhattan
  knn__n_neighbors: 8
  knn__weights: distance
  pca__n_components: 40

Best CV accuracy: 0.9001
[6]:
import pandas as pd

cv = pd.DataFrame(search.cv_results_)
cv = cv.sort_values("rank_test_score")
cv[["param_pca__n_components", "param_knn__n_neighbors",
    "param_knn__weights", "param_knn__metric",
    "mean_test_score", "std_test_score", "mean_fit_time"]].head(10)
[6]:
    param_pca__n_components  param_knn__n_neighbors  param_knn__weights  param_knn__metric  mean_test_score  std_test_score  mean_fit_time
 6                       40                       8            distance          manhattan         0.900050        0.001693       0.170435
 4                       30                      10            distance          manhattan         0.896242        0.000351       0.167082
16                       40                       6             uniform          manhattan         0.888579        0.001919       0.175955
12                       30                      17            distance          manhattan         0.885387        0.000704       0.172816
19                       30                      17            distance          manhattan         0.885387        0.000704       0.170839
18                       40                      23            distance          manhattan         0.877742        0.001438       0.175124
 5                       40                      23            distance          manhattan         0.877742        0.001438       0.170652
13                       40                      25            distance          manhattan         0.875308        0.001065       0.176696
14                       30                       5             uniform          euclidean         0.874437        0.001196       0.172954
10                       40                      12            distance          euclidean         0.873946        0.001695       0.169286
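
Rows 12 and 19, and rows 18 and 5, have identical parameters and scores. This is expected: according to the scikit-learn documentation, when at least one parameter is given as a distribution (here randint), configurations are sampled with replacement, so the same configuration can be drawn more than once. The sketch below uses ParameterSampler, the sampling helper scikit-learn exposes, to count repeats in a batch of draws. The values n_iter=20 and random_state=0 are illustrative, so the draws will not match the table above.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

space = {
    "pca__n_components": [10, 20, 30, 40],
    "knn__n_neighbors": randint(3, 30),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

# Draw 20 configurations (illustrative n_iter and seed).
samples = list(ParameterSampler(space, n_iter=20, random_state=0))

# Turn each parameter dict into a hashable key so repeats can be counted.
keys = [tuple(sorted(s.items())) for s in samples]
n_unique = len(set(keys))

print(f"{len(samples)} draws, {n_unique} unique configurations")
```

Each repeated draw is fitted and scored again, so a search space with few distinct configurations spends part of its n_iter budget on duplicates.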

Evaluate on the test set

RandomizedSearchCV with refit=True automatically refits the best model on the full training set. We can use it directly to score on held-out data.

[7]:
test_acc = search.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")
Test accuracy: 0.9076
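
As a check on what search.score reports: with refit=True and no custom scoring, the call delegates to the refitted best_estimator_, so it should match accuracy computed by hand from that estimator's predictions. The sketch below verifies this on a small synthetic problem. The data, the n_neighbors grid and the demo_ names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic binary classification problem (not Forest Cover Type).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo,
)

demo_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9]},
    n_iter=4,
    cv=3,
    random_state=0,
    refit=True,
)
demo_search.fit(X_tr, y_tr)

# Score through the search object and through the refitted estimator directly.
via_search = demo_search.score(X_te, y_te)
via_estimator = accuracy_score(y_te, demo_search.best_estimator_.predict(X_te))

print(f"search.score:         {via_search:.4f}")
print(f"best_estimator_ path: {via_estimator:.4f}")
```

The best_estimator_ attribute is also what you would save or deploy once the search is finished.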