{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "ea791931",
      "metadata": {},
      "source": [
        "# Hyperparameter Search with RandomizedSearchCV\n",
        "\n",
        "This notebook demonstrates how `cuml.accel` speeds up a hyperparameter search\n",
        "workflow. Long-running steps interrupt your train of thought. With\n",
        "`cuml.accel`, a search that takes minutes on the CPU completes in about\n",
        "30 seconds.\n",
        "\n",
        "In this example we build a preprocessing + classification pipeline and use\n",
        "`RandomizedSearchCV` to find the best configuration. The same principle\n",
        "applies to many other tasks: speeding up each run with `cuml.accel` turns a\n",
        "task that \"requires a coffee break per iteration\" into one where \"it is fun\n",
        "to iterate on ideas\".\n",
        "\n",
        "**Pipeline:** `StandardScaler` → `PCA` → `KNeighborsClassifier`\n",
        "\n",
        "KNN is distance-based, so the preprocessing steps are essential:\n",
        "- `StandardScaler` normalises features that span very different ranges\n",
        "  (elevation in the thousands of metres vs binary indicators 0/1).\n",
        "- `PCA` reduces the 54-dimensional feature space (44 of which are sparse\n",
        "  binary indicator columns: 4 wilderness-area and 40 soil-type) to a compact\n",
        "  representation where distances are more informative.\n",
        "\n",
        "**Dataset:** Forest Cover Type (300K subsample, 54 features, 7 classes).\n",
        "\n",
        "Without `cuml.accel`, this search takes about 4.5 minutes (CPU,\n",
        "`n_jobs=10`). With `cuml.accel` enabled the same search completes in\n",
        "about 30 seconds.\n",
        "\n",
        "All three pipeline steps (`StandardScaler`, `PCA`, `KNeighborsClassifier`)\n",
        "are GPU-accelerated by `cuml.accel`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "557d6f89",
      "metadata": {},
      "outputs": [],
      "source": [
        "%load_ext cuml.accel"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "18f941fe",
      "metadata": {},
      "source": [
        "## Load and prepare the dataset\n",
        "\n",
        "We use the [Forest Cover Type](https://archive.ics.uci.edu/dataset/31/covertype)\n",
        "dataset (581K samples, 54 features, 7 cover-type classes). To keep runtimes\n",
        "manageable we subsample to 300K rows and split 80/20 into train and test sets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7ae45ccc",
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "from sklearn.datasets import fetch_covtype\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "X_full, y_full = fetch_covtype(return_X_y=True)\n",
        "\n",
        "N_SUBSAMPLE = 300_000\n",
        "rng = np.random.RandomState(42)\n",
        "idx = rng.choice(len(X_full), size=N_SUBSAMPLE, replace=False)\n",
        "X, y = X_full[idx], y_full[idx]\n",
        "\n",
        "X_train, X_test, y_train, y_test = train_test_split(\n",
        "    X, y, test_size=0.2, random_state=42, stratify=y,\n",
        ")\n",
        "\n",
        "print(f\"Full dataset:  {X_full.shape[0]:,} samples, {X_full.shape[1]} features\")\n",
        "print(f\"Subsample:     {N_SUBSAMPLE:,}\")\n",
        "print(f\"Train:         {X_train.shape[0]:,}\")\n",
        "print(f\"Test:          {X_test.shape[0]:,}\")\n",
        "print(f\"Classes:       {len(np.unique(y_train))}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0f6b0b40",
      "metadata": {},
      "source": [
        "## Define the pipeline and search space\n",
        "\n",
        "The pipeline chains three steps, each GPU-accelerated by `cuml.accel`:\n",
        "\n",
        "1. `StandardScaler` — normalise feature scales so that distance computations\n",
        "   are not dominated by high-magnitude features like elevation.\n",
        "2. `PCA` — project the 54 features (many of which are sparse one-hot\n",
        "   indicators) into a lower-dimensional space.\n",
        "3. `KNeighborsClassifier` — classify based on nearest neighbours in the\n",
        "   PCA-reduced space.\n",
        "\n",
        "We search over PCA dimensionality, number of neighbours, distance weighting,\n",
        "and distance metric."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8fe0d3fd",
      "metadata": {},
      "outputs": [],
      "source": [
        "from scipy.stats import randint\n",
        "from sklearn.decomposition import PCA\n",
        "from sklearn.neighbors import KNeighborsClassifier\n",
        "from sklearn.pipeline import Pipeline\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "\n",
        "pipe = Pipeline([\n",
        "    (\"scaler\", StandardScaler()),\n",
        "    (\"pca\", PCA()),\n",
        "    (\"knn\", KNeighborsClassifier()),\n",
        "])\n",
        "\n",
        "param_distributions = {\n",
        "    \"pca__n_components\": [10, 20, 30, 40],\n",
        "    \"knn__n_neighbors\": randint(3, 30),\n",
        "    \"knn__weights\": [\"uniform\", \"distance\"],\n",
        "    \"knn__metric\": [\"euclidean\", \"manhattan\"],\n",
        "}"
      ]
    },
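    {
      "cell_type": "markdown",
      "id": "3b9d41c7",
      "metadata": {},
      "source": [
        "To make the scaling argument concrete, the cell below is a small sketch in\n",
        "pure Python, independent of the pipeline above. It uses four hypothetical\n",
        "rows with one elevation-like column and one binary indicator, and the helper\n",
        "names (`standardize`, `nearest`) are illustrative. On the raw values the\n",
        "elevation gap alone decides the nearest neighbour; after standardising each\n",
        "column to zero mean and unit variance, which is what `StandardScaler`\n",
        "computes, the indicator column contributes as well."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c84e2a15",
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "import statistics\n",
        "\n",
        "# Hypothetical (elevation in metres, binary indicator) rows, for illustration only.\n",
        "rows = [(2500.0, 0.0), (2600.0, 1.0), (2900.0, 0.0), (3200.0, 1.0)]\n",
        "\n",
        "\n",
        "def standardize(data):\n",
        "    # Rescale each column to zero mean and unit variance (population std).\n",
        "    cols = list(zip(*data))\n",
        "    means = [statistics.fmean(c) for c in cols]\n",
        "    stds = [statistics.pstdev(c) for c in cols]\n",
        "    return [\n",
        "        tuple((v - m) / s for v, m, s in zip(row, means, stds))\n",
        "        for row in data\n",
        "    ]\n",
        "\n",
        "\n",
        "def nearest(data, i):\n",
        "    # Index of the row closest to row i in Euclidean distance, excluding i.\n",
        "    others = [j for j in range(len(data)) if j != i]\n",
        "    return min(others, key=lambda j: math.dist(data[i], data[j]))\n",
        "\n",
        "\n",
        "raw_nn = nearest(rows, 0)\n",
        "scaled_nn = nearest(standardize(rows), 0)\n",
        "print(f\"Nearest to row 0 on raw features:    row {raw_nn}\")\n",
        "print(f\"Nearest to row 0 on scaled features: row {scaled_nn}\")"
      ]
    },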
    {
      "cell_type": "markdown",
      "id": "daf61609",
      "metadata": {},
      "source": [
        "## Run the search\n",
        "\n",
        "We sample 20 random parameter combinations and evaluate each with 5-fold\n",
        "cross-validation, for a total of 100 pipeline fits. With `cuml.accel` active\n",
        "this takes ~30 seconds; without it (CPU, `n_jobs=10`) the same search takes\n",
        "~4.5 minutes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1163fa12",
      "metadata": {},
      "outputs": [],
      "source": [
        "%%time\n",
        "\n",
        "from sklearn.model_selection import RandomizedSearchCV\n",
        "\n",
        "search = RandomizedSearchCV(\n",
        "    pipe,\n",
        "    param_distributions,\n",
        "    n_iter=20,\n",
        "    cv=5,\n",
        "    scoring=\"accuracy\",\n",
        "    random_state=42,\n",
        "    # For CPU, set n_jobs to a higher number\n",
        "    n_jobs=1,\n",
        "    refit=True,\n",
        ")\n",
        "search.fit(X_train, y_train)"
      ]
    },
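    {
      "cell_type": "markdown",
      "id": "5e07b2d9",
      "metadata": {},
      "source": [
        "As a quick sanity check on the numbers quoted for this search, the cell\n",
        "below works out the fit count and the size of the search space with plain\n",
        "arithmetic. It is a sketch that hard-codes the settings used above rather\n",
        "than reading them from `search`, and it assumes that `randint(3, 30)` draws\n",
        "integers from 3 to 29 (SciPy excludes the upper bound). All variable names\n",
        "are illustrative."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9f16d3a8",
      "metadata": {},
      "outputs": [],
      "source": [
        "# Settings copied by hand from the cells above (not read from `search`).\n",
        "n_iter = 20        # sampled parameter combinations\n",
        "n_folds = 5        # cv=5\n",
        "n_train = 240_000  # 80% of the 300K subsample\n",
        "\n",
        "# One fit per (candidate, fold) pair, plus one final refit (refit=True).\n",
        "cv_fits = n_iter * n_folds\n",
        "total_fits = cv_fits + 1\n",
        "\n",
        "# Size of the discrete search space (assumes randint(3, 30) covers 3..29).\n",
        "n_pca = len([10, 20, 30, 40])\n",
        "n_neighbors = len(range(3, 30))\n",
        "n_weights = len([\"uniform\", \"distance\"])\n",
        "n_metrics = len([\"euclidean\", \"manhattan\"])\n",
        "space_size = n_pca * n_neighbors * n_weights * n_metrics\n",
        "\n",
        "# Rows used for fitting and for validation within each fold.\n",
        "fold_val = n_train // n_folds\n",
        "fold_fit = n_train - fold_val\n",
        "\n",
        "print(f\"CV fits:            {cv_fits} (+1 refit = {total_fits})\")\n",
        "print(f\"Search space size:  {space_size}\")\n",
        "print(f\"Share sampled:      at most {n_iter / space_size:.1%}\")\n",
        "print(f\"Rows per fold:      {fold_fit:,} fit / {fold_val:,} validate\")"
      ]
    },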
    {
      "cell_type": "markdown",
      "id": "110d5cba",
      "metadata": {},
      "source": [
        "## Inspect the results\n",
        "\n",
        "Let's look at the best hyperparameters found by the search and how the\n",
        "top configurations compare."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ce51ad16",
      "metadata": {},
      "outputs": [],
      "source": [
        "print(\"Best parameters:\")\n",
        "for param, val in sorted(search.best_params_.items()):\n",
        "    print(f\"  {param}: {val}\")\n",
        "print(f\"\\nBest CV accuracy: {search.best_score_:.4f}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "02167a16",
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "cv = pd.DataFrame(search.cv_results_)\n",
        "cv = cv.sort_values(\"rank_test_score\")\n",
        "cv[[\"param_pca__n_components\", \"param_knn__n_neighbors\",\n",
        "    \"param_knn__weights\", \"param_knn__metric\",\n",
        "    \"mean_test_score\", \"std_test_score\", \"mean_fit_time\"]].head(10)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d3f26add",
      "metadata": {},
      "source": [
        "## Evaluate on the test set\n",
        "\n",
        "`RandomizedSearchCV` with `refit=True` automatically refits the best model on\n",
        "the full training set. We can use it directly to score on held-out data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "20befa55",
      "metadata": {},
      "outputs": [],
      "source": [
        "test_acc = search.score(X_test, y_test)\n",
        "print(f\"Test accuracy: {test_acc:.4f}\")"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}