cuml.benchmark#
Algorithms#
- class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)
Wraps a cuML algorithm and (optionally) a cpu-based algorithm (typically scikit-learn, but does not need to be as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating.
- Parameters:
- cpu_class : class
Class for CPU version of algorithm. Set to None if not available.
- cuml_class : class
Class for cuML algorithm. Can be None for CPU-only algorithms.
- shared_args : dict
Arguments passed to both implementations' initializers
- cuml_args : dict
Arguments only passed to cuml’s initializer
- cpu_args : dict
Arguments only passed to sklearn’s initializer
- accepts_labels : boolean
If True, the fit method expects both X and y inputs. Otherwise, it expects only an X input.
- cpu_data_prep_hook, cuml_data_prep_hook : function (data -> data)
Optional functions to run on input data before passing to fit
- accuracy_function : function (y_test, y_pred)
Function that returns a scalar representing accuracy
- bench_func : function
Custom function to perform fit/predict/transform calls.
- Attributes:
- tmpdir
Methods
has_cpu(): Check if this algorithm has a CPU implementation.
has_cuml(): Check if this algorithm has a cuML implementation.
run_cpu(data[, bench_args]): Runs the cpu-based algorithm's fit method on specified data.
run_cuml(data[, bench_args]): Runs the cuml-based algorithm's fit method on specified data.
cleanup
setup_cpu
setup_cuml
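The argument handling described above can be sketched without a GPU. The block below is a minimal illustration, not the cuML implementation: PairSketch and DummyEstimator are hypothetical names, and the rule that implementation-specific arguments override shared ones on a key collision is an assumption.

```python
class PairSketch:
    """Hypothetical stand-in for AlgorithmPair; not the cuML implementation."""

    def __init__(self, cpu_class, cuml_class, shared_args,
                 cuml_args=None, cpu_args=None):
        self.cpu_class = cpu_class
        self.cuml_class = cuml_class
        self.shared_args = shared_args
        self.cuml_args = cuml_args or {}
        self.cpu_args = cpu_args or {}

    def has_cpu(self):
        # A missing implementation is signalled by passing None.
        return self.cpu_class is not None

    def has_cuml(self):
        return self.cuml_class is not None

    def build_cpu(self):
        # Assumption: CPU-only arguments override shared ones on collision.
        return self.cpu_class(**{**self.shared_args, **self.cpu_args})

    def build_cuml(self):
        # Assumption: cuML-only arguments override shared ones on collision.
        return self.cuml_class(**{**self.shared_args, **self.cuml_args})


class DummyEstimator:
    """Toy class that only records its constructor arguments."""

    def __init__(self, **params):
        self.params = params


# A CPU-only pair: cuml_class is None, so only the CPU side can be built.
pair = PairSketch(
    DummyEstimator,
    None,
    shared_args={"n_clusters": 8},
    cpu_args={"n_init": 10},
)
print(pair.has_cpu(), pair.has_cuml())
print(pair.build_cpu().params)
```

With a real pairing, cpu_class and cuml_class would be estimator classes offering fit and predict or transform methods, as the class description requires.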
Runners#
Wrappers to run ML benchmarks
- class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)
Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.
In CPU-only mode, only runs CPU benchmarks.
- class cuml.benchmark.runners.BenchmarkTimer(reps=1)
Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:

    timer = BenchmarkTimer(reps=5)
    for _ in timer.benchmark_runs():
        ... do something ...
    print(np.min(timer.timings))
Methods
benchmark_runs
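To make the usage pattern concrete, here is a minimal sketch of how a timer with this interface could be written as a generator. TimerSketch is a hypothetical name, and the block illustrates the documented behaviour rather than cuML's own source.

```python
import time


class TimerSketch:
    """Hypothetical re-creation of the documented interface; not cuML's code."""

    def __init__(self, reps=1):
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for rep in range(self.reps):
            start = time.perf_counter()
            # Control returns to the caller's loop body here; the generator
            # resumes when the loop asks for the next item.
            yield rep
            self.timings.append(time.perf_counter() - start)


timer = TimerSketch(reps=3)
for _ in timer.benchmark_runs():
    sum(range(10_000))  # placeholder workload

print(len(timer.timings), min(timer.timings))
```

In this sketch a timing is recorded only when the loop advances past an iteration, so breaking out of the loop early would leave the last run unrecorded.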
- class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)
Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.
In CPU-only mode, only runs CPU benchmarks.
Methods
run
- cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], dtype=<class 'numpy.float32'>, input_type='numpy', test_fraction=0.1, run_cpu=True, run_cuml=True, raise_on_error=False, n_reps=1)
Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.
- Parameters:
- algos : str or list
Name of algorithms to run and evaluate
- dataset_name : str
Name of dataset to use
- bench_rows : list of int
Dataset row counts to test
- bench_dims : list of int
Dataset column counts to test
- param_override_list : list of dict
Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.
- cuml_param_override_list : list of dict
Dicts containing parameters to pass to __init__ of the cuml algo only.
- cpu_param_override_list : list of dict
Dicts containing parameters to pass to __init__ of the cpu algo only.
- dataset_param_override_list : list of dict
Dicts containing parameters to pass to dataset generator function
- dtype : [np.float32|np.float64]
Specifies the dataset precision to be used for benchmarking.
- test_fraction : float
The fraction of data to use for testing.
- run_cpu : boolean
If True, run the cpu-based algorithm for comparison
- run_cuml : boolean
If True, run the cuml-based algorithm (requires GPU)
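The number of runs grows as the product of the list lengths. The sketch below enumerates the four-way combination named in the description using itertools.product. It does not call run_variations, the n_clusters override is a hypothetical parameter, and the iteration order inside the library is not implied.

```python
import itertools

bench_rows = [1000, 10000]
bench_dims = [16, 64]
param_override_list = [{}, {"n_clusters": 4}]  # hypothetical override
cuml_param_override_list = [{}]

# One entry per bench_rows X bench_dims X param_override_list
# X cuml_param_override_list combination.
combos = list(
    itertools.product(
        bench_rows, bench_dims, param_override_list, cuml_param_override_list
    )
)

for n_rows, n_dims, params, cuml_params in combos:
    print(n_rows, n_dims, params, cuml_params)

print(len(combos), "runs per algorithm")
```

With these inputs each algorithm would be run 2 x 2 x 2 x 1 = 8 times, so keeping the override lists short limits total benchmark time.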
Data Generation#
Data generators for cuML benchmarks
The main entry point for consumers is gen_data, which wraps the underlying data generators.
Notes when writing new generators:
- Each generator is a function that accepts:
n_samples (set to 0 for ‘default’)
n_features (set to 0 for ‘default’)
random_state
(and optional generator-specific parameters)
- The function should return a 2-tuple (X, y), where X is a pandas DataFrame and y is a pandas Series. If the generator does not produce labels, it can return (X, None).
A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.
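A generator following the notes above might look like the sketch below. The function name, the default sizes (1000 samples, 10 features), the column names, and the labelling rule are all hypothetical, and how a new generator is registered with gen_data is not shown because the notes do not describe it.

```python
import numpy as np
import pandas as pd


def gen_uniform_example(n_samples=0, n_features=0, random_state=42):
    """Hypothetical generator: uniform features with binary labels."""
    # Per the notes, 0 means "use the generator's default size".
    n_samples = n_samples or 1000
    n_features = n_features or 10

    rng = np.random.default_rng(random_state)
    X = pd.DataFrame(
        rng.random((n_samples, n_features), dtype=np.float32),
        columns=[f"fea{i}" for i in range(n_features)],
    )
    # Label a row 1 when its feature sum exceeds the expected sum.
    y = pd.Series((X.sum(axis=1) > n_features / 2).astype(np.int32))
    return X, y


X, y = gen_uniform_example(n_samples=50, n_features=4, random_state=0)
print(X.shape, y.shape)
```

A generator that produces no labels would return (X, None) instead.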
- cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=None, n_features=None, test_fraction=0.0, datasets_root_dir='.', dtype=<class 'numpy.float32'>, **kwargs)
Returns a tuple of data from the specified generator.
- Parameters:
- dataset_name : str
Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)
- dataset_format : str
Type of data to return. (One of cudf, numpy, pandas, gpuarray)
- n_samples : int, optional
Total number of samples. If None, uses generator default.
- n_features : int, optional
Number of features. If None, uses generator default.
- test_fraction : float
Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.
- Returns:
- (train_features, train_labels, test_features, test_labels) : tuple
Tuple containing matrices or dataframes of the requested format. test_features and test_labels may be None if no splitting was done.