cuml.benchmark#

Algorithms#

class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)[source]#

Wraps a cuML algorithm and (optionally) a cpu-based algorithm (typically scikit-learn, but does not need to be as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating

Parameters:
cpu_classclass

Class for CPU version of algorithm. Set to None if not available.

cuml_classclass

Class for cuML algorithm. Can be None for CPU-only algorithms.

shared_argsdict

Arguments passed to both implementations’s initializer

cuml_argsdict

Arguments only passed to cuml’s initializer

cpu_args dict

Arguments only passed to sklearn’s initializer

accepts_labelsboolean

If True, the fit methods expects both X and y inputs. Otherwise, it expects only an X input.

data_prep_hookfunction (data -> data)

Optional function to run on input data before passing to fit

accuracy_functionfunction (y_test, y_pred)

Function that returns a scalar representing accuracy

bench_funccustom function to perform fit/predict/transform

calls.

Attributes:
tmpdir

Methods

has_cpu()

Check if this algorithm has a CPU implementation.

has_cuml()

Check if this algorithm has a cuML implementation.

run_cpu(data[, bench_args])

Runs the cpu-based algorithm's fit method on specified data

run_cuml(data[, bench_args])

Runs the cuml-based algorithm's fit method on specified data

cleanup

setup_cpu

setup_cuml

has_cpu()[source]#

Check if this algorithm has a CPU implementation.

has_cuml()[source]#

Check if this algorithm has a cuML implementation.

run_cpu(data, bench_args={}, **override_setup_args)[source]#

Runs the cpu-based algorithm’s fit method on specified data

run_cuml(data, bench_args={}, **override_setup_args)[source]#

Runs the cuml-based algorithm’s fit method on specified data

cuml.benchmark.algorithms.algorithm_by_name(name)[source]#

Returns the algorithm pair with the name ‘name’ (case-insensitive)

cuml.benchmark.algorithms.all_algorithms()[source]#

Returns all defined AlgorithmPair objects.

Each AlgorithmPair contains both CPU (sklearn) and GPU (cuML) implementations. When cuML is not installed, cuml_class will be None but cpu_class will still be available for CPU-only benchmarking.

Runners#

Wrappers to run ML benchmarks

class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)[source]#

Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.

In CPU-only mode, only runs CPU benchmarks.

class cuml.benchmark.runners.BenchmarkTimer(reps=1)[source]#

Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:

timer = BenchmarkTimer(reps=5)
for _ in timer.benchmark_runs():
    ... do something ...
print(np.min(timer.timings))

Methods

benchmark_runs

class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)[source]#

Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.

In CPU-only mode, only runs CPU benchmarks.

Methods

run

cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], dtype=<class 'numpy.float32'>, input_type='numpy', test_fraction=0.1, run_cpu=True, run_cuml=True, raise_on_error=False, n_reps=1)[source]#

Runs each algo in algos once per bench_rows X bench_dims X params_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.

Parameters:
algosstr or list

Name of algorithms to run and evaluate

dataset_namestr

Name of dataset to use

bench_rowslist of int

Dataset row counts to test

bench_dimslist of int

Dataset column counts to test

param_override_listlist of dict

Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.

cuml_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cuml algo only.

cpu_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cpu algo only.

dataset_param_override_listdict

Dicts containing parameters to pass to dataset generator function

dtype: [np.float32|np.float64]

Specifies the dataset precision to be used for benchmarking.

test_fractionfloat

The fraction of data to use for testing.

run_cpuboolean

If True, run the cpu-based algorithm for comparison

run_cumlboolean

If True, run the cuml-based algorithm (requires GPU)

Data Generation#

Data generators for cuML benchmarks

The main entry point for consumers is gen_data, which wraps the underlying data generators.

Notes when writing new generators:

Each generator is a function that accepts:
  • n_samples (set to 0 for ‘default’)

  • n_features (set to 0 for ‘default’)

  • random_state

  • (and optional generator-specific parameters)

The function should return a 2-tuple (X, y), where X is a Pandas dataframe and y is a Pandas series. If the generator does not produce labels, it can return (X, None)

A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.

cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=None, n_features=None, test_fraction=0.0, datasets_root_dir='.', dtype=<class 'numpy.float32'>, **kwargs)[source]#

Returns a tuple of data from the specified generator.

Parameters:
dataset_namestr

Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)

dataset_formatstr

Type of data to return. (One of cudf, numpy, pandas, gpuarray)

n_samplesint, optional

Total number of samples. If None, uses generator default.

n_featuresint, optional

Number of features. If None, uses generator default.

test_fractionfloat

Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.

Returns:
(train_features, train_labels, test_features, test_labels) tuple
containing matrices or dataframes of the requested format.
test_features and test_labels may be None if no splitting was done.