{ "cells": [ { "cell_type": "markdown", "id": "ea791931", "metadata": {}, "source": [ "# Hyperparameter Search with RandomizedSearchCV\n", "\n", "This notebook demonstrates how `cuml.accel` speeds up a hyperparameter search\n", "workflow. Long-running steps interrupt your train of thought. With\n", "`cuml.accel`, a search that is tedious because it takes minutes on the CPU\n", "can complete in about 30 seconds.\n", "\n", "In this example we build a preprocessing + classification pipeline and use\n", "`RandomizedSearchCV` to find the best configuration. The same principle\n", "applies to many other tasks: speeding up a step with `cuml.accel` takes it\n", "from \"requires a coffee break per iteration\" to \"it is fun to iterate on\n", "ideas\".\n", "\n", "**Pipeline:** `StandardScaler` → `PCA` → `KNeighborsClassifier`\n", "\n", "KNN is distance-based, so the preprocessing steps are essential:\n", "- `StandardScaler` normalises features that span very different ranges\n", " (elevation in the thousands of metres vs binary soil-type indicators 0/1).\n", "- `PCA` reduces the 54-dimensional feature space (44 of which are sparse\n", " one-hot columns: 40 soil types and 4 wilderness areas) to a compact\n", " representation where distances are more informative.\n", "\n", "**Dataset:** Forest Cover Type (300K subsample, 54 features, 7 classes).\n", "\n", "Without `cuml.accel`, this search takes about 4.5 minutes (CPU,\n", "`n_jobs=10`). With `cuml.accel` enabled the same search completes in\n", "about 30 seconds.\n", "\n", "All three pipeline steps (`StandardScaler`, `PCA`, `KNeighborsClassifier`)\n", "are GPU-accelerated by `cuml.accel`." 
] }, { "cell_type": "code", "execution_count": null, "id": "557d6f89", "metadata": {}, "outputs": [], "source": [ "%load_ext cuml.accel" ] }, { "cell_type": "markdown", "id": "18f941fe", "metadata": {}, "source": [ "## Load and prepare the dataset\n", "\n", "We use the [Forest Cover Type](https://archive.ics.uci.edu/dataset/31/covertype)\n", "dataset (581K samples, 54 features, 7 cover-type classes). To keep runtimes\n", "manageable we subsample to 300K rows and split 80/20 into train and test sets." ] }, { "cell_type": "code", "execution_count": null, "id": "7ae45ccc", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import fetch_covtype\n", "from sklearn.model_selection import train_test_split\n", "\n", "X_full, y_full = fetch_covtype(return_X_y=True)\n", "\n", "N_SUBSAMPLE = 300_000\n", "rng = np.random.RandomState(42)\n", "idx = rng.choice(len(X_full), size=N_SUBSAMPLE, replace=False)\n", "X, y = X_full[idx], y_full[idx]\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.2, random_state=42, stratify=y,\n", ")\n", "\n", "print(f\"Full dataset: {X_full.shape[0]:,} samples, {X_full.shape[1]} features\")\n", "print(f\"Subsample: {N_SUBSAMPLE:,}\")\n", "print(f\"Train: {X_train.shape[0]:,}\")\n", "print(f\"Test: {X_test.shape[0]:,}\")\n", "print(f\"Classes: {len(np.unique(y_train))}\")" ] }, { "cell_type": "markdown", "id": "0f6b0b40", "metadata": {}, "source": [ "## Define the pipeline and search space\n", "\n", "The pipeline chains three steps, each GPU-accelerated by `cuml.accel`:\n", "\n", "1. `StandardScaler` — normalise feature scales so that distance computations\n", " are not dominated by high-magnitude features like elevation.\n", "2. `PCA` — project the 54 features (many of which are sparse one-hot\n", " indicators) into a lower-dimensional space.\n", "3. 
`KNeighborsClassifier` — classify based on nearest neighbours in the\n", " PCA-reduced space.\n", "\n", "We search over PCA dimensionality, number of neighbours, distance weighting,\n", "and distance metric." ] }, { "cell_type": "code", "execution_count": null, "id": "8fe0d3fd", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import randint\n", "from sklearn.decomposition import PCA\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "pipe = Pipeline([\n", "    (\"scaler\", StandardScaler()),\n", "    (\"pca\", PCA()),\n", "    (\"knn\", KNeighborsClassifier()),\n", "])\n", "\n", "# Keys use the <step name>__<parameter> convention to target pipeline steps.\n", "param_distributions = {\n", "    \"pca__n_components\": [10, 20, 30, 40],\n", "    \"knn__n_neighbors\": randint(3, 30),  # integers 3..29 (upper bound is exclusive)\n", "    \"knn__weights\": [\"uniform\", \"distance\"],\n", "    \"knn__metric\": [\"euclidean\", \"manhattan\"],\n", "}" ] }, { "cell_type": "markdown", "id": "daf61609", "metadata": {}, "source": [ "## Run the search\n", "\n", "We sample 20 random parameter combinations and evaluate each with 5-fold\n", "cross-validation, for a total of 100 pipeline fits, plus one final refit of\n", "the best configuration. With `cuml.accel` active this takes ~30 seconds;\n", "without it (CPU, `n_jobs=10`) the same search takes ~4.5 minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "1163fa12", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "search = RandomizedSearchCV(\n", "    pipe,\n", "    param_distributions,\n", "    n_iter=20,\n", "    cv=5,\n", "    scoring=\"accuracy\",\n", "    random_state=42,\n", "    # n_jobs=1 for the GPU run; for a CPU-only run, raise it\n", "    # (the CPU timing quoted above used n_jobs=10)\n", "    n_jobs=1,\n", "    refit=True,\n", ")\n", "search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "id": "110d5cba", "metadata": {}, "source": [ "## Inspect the results\n", "\n", "Let's look at the best hyperparameters found by the search and how the\n", "top configurations compare." 
] }, { "cell_type": "code", "execution_count": null, "id": "ce51ad16", "metadata": {}, "outputs": [], "source": [ "print(\"Best parameters:\")\n", "for param, val in sorted(search.best_params_.items()):\n", " print(f\" {param}: {val}\")\n", "print(f\"\\nBest CV accuracy: {search.best_score_:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "02167a16", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "cv = pd.DataFrame(search.cv_results_)\n", "cv = cv.sort_values(\"rank_test_score\")\n", "cv[[\"param_pca__n_components\", \"param_knn__n_neighbors\",\n", " \"param_knn__weights\", \"param_knn__metric\",\n", " \"mean_test_score\", \"std_test_score\", \"mean_fit_time\"]].head(10)" ] }, { "cell_type": "markdown", "id": "d3f26add", "metadata": {}, "source": [ "## Evaluate on the test set\n", "\n", "`RandomizedSearchCV` with `refit=True` automatically refits the best model on\n", "the full training set. We can use it directly to score on held-out data." ] }, { "cell_type": "code", "execution_count": null, "id": "20befa55", "metadata": {}, "outputs": [], "source": [ "test_acc = search.score(X_test, y_test)\n", "print(f\"Test accuracy: {test_acc:.4f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }