HPO with dask-ml and cuml#

Introduction#

Hyperparameter optimization is the task of picking the hyperparameter values of a model that give the best results for the problem, as measured on a specific test dataset. This is often a crucial step and, done correctly, can significantly boost model accuracy. Cross-validation is often used to estimate the performance of the models in the search more accurately. Cross-validation splits the training set into complementary subsets, trains on one subset, and then evaluates the model’s performance on the other. This gives an indication of how the model will generalize to data it has not seen before.

Despite its theoretical importance, HPO has been difficult to implement in practical applications because of the resources needed to run so many distinct training jobs.

The two approaches that we will be exploring in this notebook are:

1. GridSearch#

As the name suggests, the “search” is done over each possible combination in a grid of parameters that the user provides; the user must define this grid manually. For each parameter that needs to be tuned, a set of values is given, and the search is performed over every tuple with one element from each set, i.e. the Cartesian product of the sets.

For example, assume we want to perform HPO on XGBoost. For simplicity, let’s tune only n_estimators and max_depth:

  • n_estimators: [50, 100, 150]

  • max_depth: [6, 7, 8]

The grid search will take place over |n_estimators| x |max_depth| combinations, which is 3 x 3 = 9, as sketched in the snippet below. As you have probably guessed, the grid size grows rapidly as the number of parameters and the size of their search spaces increase.
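As a quick illustration (this is not part of the HPO pipeline used later), the nine combinations from the example grid can be enumerated with Python’s itertools.product:

# Illustration only: enumerate the example grid of parameters
from itertools import product

n_estimators = [50, 100, 150]
max_depth = [6, 7, 8]

grid = list(product(n_estimators, max_depth))
print(len(grid))  # 9 combinations
print(grid[:3])  # [(50, 6), (50, 7), (50, 8)]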

2. RandomSearch#

Random Search replaces the exhaustive search above with a random selection of parameter settings over the specified space. This method can outperform Grid Search when only a small number of parameters affect the model’s performance (low-dimensional optimization problems). Since it does not evaluate every tuple in the Cartesian product, it tends to yield results faster, and the performance can be comparable to that of the Grid Search approach. Keep in mind that the random nature of this search means the results may differ from run to run.
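To contrast this with grid search, here is a minimal sketch that samples a handful of settings from the same example space using scikit-learn’s ParameterSampler; the dask-ml RandomizedSearchCV used later performs this kind of sampling internally, so this is for illustration only.

# Illustration only: randomly sample a few settings instead of trying all nine
from sklearn.model_selection import ParameterSampler

param_space = {"n_estimators": [50, 100, 150], "max_depth": [6, 7, 8]}
sampled = list(ParameterSampler(param_space, n_iter=4, random_state=42))
print(sampled)  # 4 randomly chosen parameter settings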

Some of the other methods used for HPO include:

  1. Bayesian Optimization

  2. Gradient-based Optimization

  3. Evolutionary Optimization

To learn more about HPO, some papers are linked to at the end of the notebook for further reading.

Now that we have a basic understanding of what HPO is, let’s discuss what we wish to achieve with this demo. The aim of this notebook is to show the importance of hyperparameter optimization and to demonstrate GPU-accelerated search with dask-ml for XGBoost and the cuML Random Forest.

For this demo, we will be using the Airline dataset. The task is to predict the arrival delay. The full dataset has about 116 million entries, with 13 attributes used to determine the delay for a given flight. We have modified this into a binary classification problem: will the flight be delayed (True) or not.

Let’s get started!

import warnings

warnings.filterwarnings("ignore")  # Reduce number of messages/warnings displayed
import time

import cudf
import cuml
import numpy as np
import pandas as pd
import xgboost as xgb

import dask_ml.model_selection as dcv
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

from sklearn import datasets
from sklearn.metrics import make_scorer
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split
from cuml.metrics.accuracy import accuracy_score

import os
from urllib.request import urlretrieve
import gzip

Spinning up a CUDA Cluster#

This notebook is designed to run on a single node with multiple GPUs; you can get multi-GPU VMs from AWS, GCP, Azure, IBM, and more.

We start a local cluster and keep it ready for running distributed tasks with dask.

Below, LocalCUDACluster launches one Dask worker for each GPU in the current system. It is developed as part of the RAPIDS project. Learn more in the Dask-CUDA documentation.

cluster = LocalCUDACluster()
client = Client(cluster)

client

Data Preparation#

We download the Airline dataset and save it to the local path specified by data_dir and file_name. In this step, we also convert the input data to the appropriate dtypes. For this, we will use the prepare_dataset function.

Note: To ensure that this example runs quickly on a modest machine, we default to using a small subset of the airline dataset. To use the full dataset, pass the argument use_full_dataset=True to the prepare_dataset function.

data_dir = "./rapids_hpo/data/"
file_name = "airlines.parquet"
parquet_name = os.path.join(data_dir, file_name)
parquet_name
def prepare_dataset(use_full_dataset=False):
    global file_path, data_dir

    if use_full_dataset:
        url = "https://data.rapids.ai/cloud-ml/airline_20000000.parquet"
    else:
        url = "https://data.rapids.ai/cloud-ml/airline_small.parquet"

    if os.path.isfile(parquet_name):
        print(f" > File already exists. Ready to load at {parquet_name}")
    else:
        # Ensure folder exists
        os.makedirs(data_dir, exist_ok=True)

        def data_progress_hook(block_number, read_size, total_filesize):
            if (block_number % 1000) == 0:
                print(
                    f" > percent complete: { 100 * ( block_number * read_size ) / total_filesize:.2f}\r",
                    end="",
                )
            return

        urlretrieve(
            url=url,
            filename=parquet_name,
            reporthook=data_progress_hook,
        )

        print(f" > Download complete {file_name}")

    input_cols = [
        "Year",
        "Month",
        "DayofMonth",
        "DayofWeek",
        "CRSDepTime",
        "CRSArrTime",
        "UniqueCarrier",
        "FlightNum",
        "ActualElapsedTime",
        "Origin",
        "Dest",
        "Distance",
        "Diverted",
    ]

    dataset = cudf.read_parquet(parquet_name)

    # encode categoricals as numeric
    for col in dataset.select_dtypes(["object"]).columns:
        dataset[col] = dataset[col].astype("category").cat.codes.astype(np.int32)

    # cast all columns to float32
    for col in dataset.columns:
        dataset[col] = dataset[col].astype(np.float32)  # needed for random forest

    # put target/label column first [ classic XGBoost standard ]
    output_cols = ["ArrDelayBinary"] + input_cols

    dataset = dataset.reindex(columns=output_cols)
    return dataset
df = prepare_dataset()
import time
from contextlib import contextmanager


# Helper to time blocks of code
@contextmanager
def timed(txt):
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time:  %8.5f" % (txt, t1 - t0))
# Define some default values to make use of across the notebook for a fair comparison
N_FOLDS = 5
N_ITER = 25
label = "ArrDelayBinary"

Splitting Data#

We split the data randomly into train and test sets using cuML’s train_test_split and create CPU copies of the data.

X_train, X_test, y_train, y_test = train_test_split(df, label, test_size=0.2)
X_cpu = X_train.to_pandas()
y_cpu = y_train.to_numpy()

X_test_cpu = X_test.to_pandas()
y_test_cpu = y_test.to_numpy()

Setup Custom cuML scorers#

The search functions (such as GridSearchCV) in scikit-learn and dask-ml expect the metric functions (such as accuracy_score) to match the “scorer” API. This can be achieved using scikit-learn’s make_scorer function.

We will generate a cuml_accuracy_scorer with the cuML accuracy_score function. You’ll also notice an accuracy_score_wrapper, which primarily converts the y labels to float32; some cuML models currently accept only this type, so we perform the conversion to keep them compatible.

We also create helper functions for performing HPO in 2 different modes:

  1. gpu-grid: Perform GPU based GridSearchCV

  2. gpu-random: Perform GPU based RandomizedSearchCV

def accuracy_score_wrapper(y, y_hat):
    """
    A wrapper function to convert labels to float32,
    and pass it to accuracy_score.

    Params:
    - y: The y labels that need to be converted
    - y_hat: The predictions made by the model
    """
    y = y.astype("float32")  # cuML RandomForest needs the y labels to be float32
    return accuracy_score(y, y_hat, convert_dtype=True)


accuracy_wrapper_scorer = make_scorer(accuracy_score_wrapper)
cuml_accuracy_scorer = make_scorer(accuracy_score, convert_dtype=True)
def do_HPO(model, gridsearch_params, scorer, X, y, mode="gpu-Grid", n_iter=10):
    """
    Perform HPO based on the mode specified

    mode: default gpu-Grid. The possible options are:
    1. gpu-grid: Perform GPU based GridSearchCV
    2. gpu-random: Perform GPU based RandomizedSearchCV

    n_iter: specified with Random option for number of parameter settings sampled

    Returns the best estimator and the results of the search
    """
    if mode == "gpu-grid":
        print("gpu-grid selected")
        clf = dcv.GridSearchCV(model, gridsearch_params, cv=N_FOLDS, scoring=scorer)
    elif mode == "gpu-random":
        print("gpu-random selected")
        clf = dcv.RandomizedSearchCV(
            model, gridsearch_params, cv=N_FOLDS, scoring=scorer, n_iter=n_iter
        )

    else:
        print("Unknown Option, please choose one of [gpu-grid, gpu-random]")
        return None, None
    res = clf.fit(X, y)
    print(
        "Best clf and score {} {}\n---\n".format(res.best_estimator_, res.best_score_)
    )
    return res.best_estimator_, res
def print_acc(model, X_train, y_train, X_test, y_test, mode_str="Default"):
    """
    Trains a model on the train data provided, and prints the accuracy of the trained model.
    mode_str: a label describing the model, used when printing the accuracy
    """
    y_pred = model.fit(X_train, y_train).predict(X_test)
    score = accuracy_score(y_pred, y_test.astype("float32"), convert_dtype=True)
    print("{} model accuracy: {}".format(mode_str, score))
X_train.shape

Launch HPO#

We will first look at the model’s performance without any search and then compare it with the performance after searching.

XGBoost#

To perform the hyperparameter optimization, we make use of the scikit-learn version of the XGBClassifier. We use this version so that it stays compatible with, and easily comparable to, the scikit-learn workflow. The model takes a set of parameters that can be found in the documentation. We’re primarily interested in max_depth, learning_rate, min_child_weight, alpha (reg_alpha) and n_estimators (num_round), as these affect the performance of XGBoost the most.

Read more about what these parameters are useful for here

Default Performance#

We first use the model with its default parameters and check the accuracy of the model. In this case, it is 84%.

model_gpu_xgb_ = xgb.XGBClassifier(tree_method="gpu_hist")

print_acc(model_gpu_xgb_, X_train, y_cpu, X_test, y_test_cpu)

Parameter Distributions#

We define the grid for the search by specifying ranges of values for each parameter to be tuned. In this example we make use of np.arange, which returns an ndarray of evenly spaced values, and np.logspace, which returns a specified number of samples evenly spaced on the log scale. We can also specify the values as lists or NumPy arrays, or use any distribution object that produces a random variate when sampled; SciPy provides various distributions for this too.

# For xgb_model
model_gpu_xgb = xgb.XGBClassifier(tree_method="gpu_hist")

# Parameter search space
params_xgb = {
    "max_depth": np.arange(start=3, stop=12, step=3),  # Default = 6
    "alpha": np.logspace(-3, -1, 5),  # default = 0
    "learning_rate": [0.05, 0.1, 0.15],  # default = 0.3
    "min_child_weight": np.arange(start=2, stop=10, step=3),  # default = 1
    "n_estimators": [100, 200, 1000],
}

RandomizedSearchCV#

We’ll now try RandomizedSearchCV. n_iter specifies the number of parameter settings that the search should sample. Here we will sample N_ITER (defined earlier) settings and pick the best performer.

mode = "gpu-random"

with timed("XGB-" + mode):
    res, results = do_HPO(
        model_gpu_xgb,
        params_xgb,
        cuml_accuracy_scorer,
        X_train,
        y_cpu,
        mode=mode,
        n_iter=N_ITER,
    )
print("Searched over {} parameters".format(len(results.cv_results_["mean_test_score"])))
print_acc(res, X_train, y_cpu, X_test, y_test_cpu, mode_str=mode)
mode = "gpu-grid"

with timed("XGB-" + mode):
    res, results = do_HPO(
        model_gpu_xgb, params_xgb, cuml_accuracy_scorer, X_train, y_cpu, mode=mode
    )
print("Searched over {} parameters".format(len(results.cv_results_["mean_test_score"])))
print_acc(res, X_train, y_cpu, X_test, y_test_cpu, mode_str=mode)

Improved performance#

There’s a 5% improvement in the performance.

We notice that grid search and random search yield similar performance improvements even though random search evaluated just 25 combinations of parameters. We will stick to random search for the rest of the notebook with the Random Forest model, on the assumption that the difference in performance is minor when the ranges are large enough.

RandomForest#

Let’s now use the Random Forest classifier for a hyperparameter search. We’ll make use of the cuML RandomForestClassifier and visualize the results with a heatmap.

## Random Forest
model_rf_ = RandomForestClassifier()

params_rf = {
    "max_depth": np.arange(start=3, stop=15, step=2),  # Default = 6
    "max_features": [0.1, 0.50, 0.75, "auto"],  # default = 0.3
    "n_estimators": [100, 200, 500, 1000],
}

for col in X_train.columns:
    X_train[col] = X_train[col].astype("float32")
y_train = y_train.astype("int32")
print(
    "Default acc: ",
    accuracy_score(model_rf_.fit(X_train, y_train).predict(X_test), y_test),
)
mode = "gpu-random"
model_rf = RandomForestClassifier()


with timed("RF-" + mode):
    res, results = do_HPO(
        model_rf,
        params_rf,
        cuml_accuracy_scorer,
        X_train,
        y_cpu,
        mode=mode,
        n_iter=N_ITER,
    )
print("Searched over {} parameters".format(len(results.cv_results_["mean_test_score"])))
print("Improved acc: ", accuracy_score(res.predict(X_test), y_test))
df_gridsearch = pd.DataFrame(results.cv_results_)

plotting_utils.plot_heatmap(df_gridsearch, "param_max_depth", "param_n_estimators")

Conclusion and Next Steps#

We see improvements in performance even with these fairly basic grid and randomized searches. Generally, the more data we use, the better the model performs, so you are encouraged to try larger datasets and broader parameter ranges.

This experiment can also be repeated with different classifiers and different parameter ranges to see how HPO helps improve the performance metric. In this example we chose a basic metric, accuracy, but you can use more interesting metrics that better capture a model’s usefulness. You can even pass additional parameters to the scoring function, as shown below. This makes HPO really powerful, and it can add a significant boost to the model we generate.
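For example, scikit-learn’s make_scorer forwards any extra keyword arguments to the underlying metric, so a parameterized scorer takes only a couple of lines (a sketch using fbeta_score as a stand-in metric):

from sklearn.metrics import fbeta_score, make_scorer

# Keyword arguments passed to make_scorer are forwarded to the metric;
# here beta=2 weights recall more heavily than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)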

Further Reading#