Performance Profiling and Debugging with cuml.accel#

This notebook demonstrates how to use the profiling capabilities in cuml.accel to understand which operations are being accelerated on GPU and which are falling back to CPU execution. This can be particularly useful for debugging performance issues or understanding why certain operations might not be accelerated.

cuml.accel provides two types of profilers:

  1. Function Profiler: Shows statistics about potentially accelerated function and method calls

  2. Line Profiler: Shows per-line statistics on your script with GPU utilization percentages

Let’s explore both profilers with practical examples.

Setup#

First, let’s load the cuml.accel extension and import the necessary libraries.

[1]:
# Load the cuml.accel extension
%load_ext cuml.accel

[2]:
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Function Profiler#

The function profiler gathers statistics about potentially accelerated function and method calls. It can show:

  • Which method calls cuml.accel had the potential to accelerate

  • Which methods were accelerated on GPU, and their total runtime

  • Which methods required a CPU fallback, their total runtime, and why a fallback was needed

Example 1: Ridge Regression with Mixed GPU/CPU Execution#

Let’s start with a simple example that demonstrates both GPU acceleration and CPU fallback using Ridge regression.

[3]:
# Generate sample data
X, y = make_regression(n_samples=1000, n_features=100, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[4]:
%%cuml.accel.profile

# Fit and predict on GPU (supported parameters)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
predictions_gpu = ridge.predict(X_test)

# Retry, using a hyperparameter that isn't supported on GPU
ridge_cpu = Ridge(positive=True)  # positive=True is not supported on GPU
ridge_cpu.fit(X_train, y_train)
predictions_cpu = ridge_cpu.predict(X_test)

cuml.accel profile                                             
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Function      ┃ GPU calls ┃ GPU time ┃ CPU calls ┃ CPU time ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Ridge.fit     │         1 │  109.5ms │         1 │    4.1ms │
│ Ridge.predict │         1 │    2.1ms │         1 │  147.8µs │
├───────────────┼───────────┼──────────┼───────────┼──────────┤
│ Total         │         2 │  111.6ms │         2 │    4.2ms │
└───────────────┴───────────┴──────────┴───────────┴──────────┘
Not all operations ran on the GPU. The following functions
required CPU fallback for the following reasons:

* Ridge.fit
  - `positive=True` is not supported
* Ridge.predict
  - Estimator not fit on GPU

The function profiler output above shows:

  • GPU calls: Methods that ran successfully on GPU

  • GPU time: Total time spent on GPU operations

  • CPU calls: Methods that fell back to CPU execution

  • CPU time: Total time spent on CPU operations

  • Fallback reasons: Why certain operations couldn’t run on GPU

Example 2: Random Forest Classification#

Let’s try a more complex example with Random Forest classification.

[5]:
# Generate classification data
from sklearn.datasets import make_classification
X_class, y_class = make_classification(n_samples=2000, n_features=20, n_informative=15,
                                      n_redundant=5, n_classes=3, random_state=42)
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42)

[6]:
%%cuml.accel.profile

# Random Forest with supported parameters
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train_class, y_train_class)
rf_predictions = rf.predict(X_test_class)
rf_probabilities = rf.predict_proba(X_test_class)

cuml.accel profile                                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Function                             ┃ GPU calls ┃ GPU time ┃ CPU calls ┃ CPU time ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ RandomForestClassifier.fit           │         1 │  390.3ms │         0 │       0s │
│ RandomForestClassifier.predict       │         1 │   36.4ms │         0 │       0s │
│ RandomForestClassifier.predict_proba │         1 │    1.5ms │         0 │       0s │
├──────────────────────────────────────┼───────────┼──────────┼───────────┼──────────┤
│ Total                                │         3 │  428.2ms │         0 │       0s │
└──────────────────────────────────────┴───────────┴──────────┴───────────┴──────────┘

Line Profiler#

The line profiler collects per-line statistics on your script. It can show:

  • Which lines took the most cumulative time

  • Which lines (if any) were able to benefit from acceleration

  • The percentage of each line’s runtime that was spent on GPU through cuml.accel

⚠️ Warning: The line profiler can add non-negligible overhead. It’s useful for understanding which parts of your code were accelerated, but don’t compare runtimes measured with the line profiler enabled against runs made without it.

Example 3: Line Profiling with Ridge Regression#

Let’s use the line profiler to see detailed per-line statistics.

[7]:
%%cuml.accel.line_profile

# Generate data
X, y = make_regression(n_samples=1000, n_features=100, noise=0.1, random_state=42)

# Fit and predict on GPU
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
predictions = ridge.predict(X)

# Retry, using a hyperparameter that isn't supported on GPU
ridge_cpu = Ridge(positive=True)
ridge_cpu.fit(X, y)
predictions_cpu = ridge_cpu.predict(X)

cuml.accel line profile                                                                                          
┏━━━━┳━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  # ┃ N ┃    Time ┃ GPU % ┃ Source                                                                              ┃
┡━━━━╇━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│  1 │   │         │       │ # Generate data                                                                    │
│  2 │ 1 │   3.6ms │     - │ X, y = make_regression(n_samples=1000, n_features=100, noise=0.1, random_state=42) │
│  3 │   │         │       │                                                                                    │
│  4 │   │         │       │ # Fit and predict on GPU                                                           │
│  5 │ 1 │ 948.2µs │     - │ ridge = Ridge(alpha=1.0)                                                           │
│  6 │ 1 │   9.7ms │  93.0 │ ridge.fit(X, y)                                                                    │
│  7 │ 1 │   1.8ms │  98.0 │ predictions = ridge.predict(X)                                                     │
│  8 │   │         │       │                                                                                    │
│  9 │   │         │       │ # Retry, using a hyperparameter that isn't supported on GPU                        │
│ 10 │ 1 │       - │     - │ ridge_cpu = Ridge(positive=True)                                                   │
│ 11 │ 1 │   4.9ms │   0.0 │ ridge_cpu.fit(X, y)                                                                │
│ 12 │ 1 │ 235.8µs │   0.0 │ predictions_cpu = ridge_cpu.predict(X)                                             │
└────┴───┴─────────┴───────┴────────────────────────────────────────────────────────────────────────────────────┘
Ran in 21.6ms, 50.1% on GPU

The line profiler output shows:

  • #: Line number

  • N: Number of times the line was executed

  • Time: Total time spent on that line

  • GPU %: Percentage of time spent on GPU for that line

  • Source: The actual code line

At the bottom, you’ll see the total runtime and the percentage of time spent on GPU.

Example 4: Line Profiling with Multiple Algorithms#

Let’s try a more comprehensive example with multiple machine learning algorithms.

[8]:
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

[9]:
%%cuml.accel.line_profile

# Generate data for multiple tasks
X_reg, y_reg = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=42)
X_class, y_class = make_classification(n_samples=500, n_features=20, n_classes=2, random_state=42)

# Regression task
ridge = Ridge(alpha=1.0)
ridge.fit(X_reg, y_reg)
ridge_pred = ridge.predict(X_reg)

# Classification task
logreg = LogisticRegression(random_state=42)
logreg.fit(X_class, y_class)
logreg_pred = logreg.predict(X_class)

# Clustering task
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_class)
kmeans_pred = kmeans.predict(X_class)

cuml.accel line profile                                                                                            
┏━━━━┳━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  # ┃ N ┃    Time ┃ GPU % ┃ Source                                                                                ┃
┡━━━━╇━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│  1 │   │         │       │ # Generate data for multiple tasks                                                   │
│  2 │ 1 │   1.3ms │     - │ X_reg, y_reg = make_regression(n_samples=500, n_features=50, noise=0.1, random_stat… │
│  3 │ 1 │   1.2ms │     - │ X_class, y_class = make_classification(n_samples=500, n_features=20, n_classes=2, r… │
│  4 │   │         │       │                                                                                      │
│  5 │   │         │       │ # Regression task                                                                    │
│  6 │ 1 │ 849.3µs │     - │ ridge = Ridge(alpha=1.0)                                                             │
│  7 │ 1 │   5.3ms │  90.0 │ ridge.fit(X_reg, y_reg)                                                              │
│  8 │ 1 │   1.2ms │  93.0 │ ridge_pred = ridge.predict(X_reg)                                                    │
│  9 │   │         │       │                                                                                      │
│ 10 │   │         │       │ # Classification task                                                                │
│ 11 │ 1 │       - │     - │ logreg = LogisticRegression(random_state=42)                                         │
│ 12 │ 1 │ 595.1ms │  99.0 │ logreg.fit(X_class, y_class)                                                         │
│ 13 │ 1 │  93.9ms │  99.0 │ logreg_pred = logreg.predict(X_class)                                                │
│ 14 │   │         │       │                                                                                      │
│ 15 │   │         │       │ # Clustering task                                                                    │
│ 16 │ 1 │       - │     - │ kmeans = KMeans(n_clusters=3, random_state=42)                                       │
│ 17 │ 1 │  74.9ms │  99.0 │ kmeans.fit(X_class)                                                                  │
│ 18 │ 1 │     2ms │  98.0 │ kmeans_pred = kmeans.predict(X_class)                                                │
└────┴───┴─────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────┘
Ran in 775.9ms, 99.3% on GPU

Programmatic Profiling#

You can also use the profilers programmatically with context managers. This is useful when you want to profile specific sections of code rather than entire cells.

[10]:
# Generate data
X, y = make_regression(n_samples=1000, n_features=100, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[11]:
# Using function profiler programmatically
# Note: this requires importing cuml explicitly, which is typically not needed for zero-code-change acceleration
import cuml

with cuml.accel.profile():
    # This section will be profiled
    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train)
    predictions = ridge.predict(X_test)

    # This will fall back to CPU
    ridge_cpu = Ridge(positive=True)
    ridge_cpu.fit(X_train, y_train)
    predictions_cpu = ridge_cpu.predict(X_test)

# This section will NOT be profiled
print("Profiling complete!")

cuml.accel profile                                             
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Function      ┃ GPU calls ┃ GPU time ┃ CPU calls ┃ CPU time ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Ridge.fit     │         1 │    9.2ms │         1 │    3.8ms │
│ Ridge.predict │         1 │    1.1ms │         1 │    155µs │
├───────────────┼───────────┼──────────┼───────────┼──────────┤
│ Total         │         2 │   10.3ms │         2 │      4ms │
└───────────────┴───────────┴──────────┴───────────┴──────────┘
Not all operations ran on the GPU. The following functions
required CPU fallback for the following reasons:

* Ridge.fit
  - `positive=True` is not supported
* Ridge.predict
  - Estimator not fit on GPU
Profiling complete!
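
Because the context manager works in ordinary Python code, you can also scope profiling to a single helper function so that data loading and other setup stay out of the report. Below is a minimal sketch reusing Ridge and the train/test split from the cells above; the train_and_predict helper is a hypothetical name, not part of cuml.accel.

[ ]:
# A minimal sketch: restrict profiling to the body of one helper function.
# Only the calls made inside the `with` block appear in the printed report,
# which prints when the block exits.
def train_and_predict(X_train, y_train, X_test):
    with cuml.accel.profile():
        model = Ridge(alpha=1.0)
        model.fit(X_train, y_train)
        return model.predict(X_test)

predictions = train_and_predict(X_train, y_train, X_test)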

Logging#

In addition to profiling, cuml.accel also provides logging capabilities. You can enable different levels of logging to see what’s happening behind the scenes.

Setting Log Levels#

You can set the logging level when installing cuml.accel programmatically:

[12]:
# Note: This needs to be done before loading the extension
# Uncomment and restart kernel to try different log levels

# import cuml
# cuml.accel.install(log_level="debug")  # Most verbose
# cuml.accel.install(log_level="info")   # Shows GPU/CPU dispatch info
# cuml.accel.install(log_level="warn")   # Default - warnings only
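
The same call works at the top of a plain Python script, where notebook magics aren’t available. Here’s a minimal sketch under that assumption; the dataset and estimator are illustrative only.

[ ]:
# A minimal sketch of enabling cuml.accel with info-level logging in a script.
# Install the accelerator (safest before importing the estimators you want
# intercepted), then use scikit-learn as usual.
import cuml.accel

cuml.accel.install(log_level="info")

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=100, noise=0.1, random_state=42)
Ridge(alpha=1.0).fit(X, y)  # dispatch decisions are logged at info level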

Example with Info Logging#

Let’s demonstrate what info-level logging looks like. First, let’s reinstall cuml.accel with info logging:

[13]:
# Reinstall with info logging
cuml.accel.install(log_level="info")

[14]:
# Now let's run some code and see the logging output
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

# This should run on GPU
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
ridge.predict(X)

# This should fall back to CPU
ridge_cpu = Ridge(positive=True)
ridge_cpu.fit(X, y)
ridge_cpu.predict(X)

[14]:
array([-4.96828185e+02,  4.13259834e+02,  2.29693047e+02, -2.01945381e+02,
       -1.82165742e+02, -1.27461725e+02, -1.84216790e+02,  6.41643935e+01,
        2.81977988e+01, -9.85732803e+01, -2.48726688e+02, -4.59059179e+01,
       -3.62515595e+02, -1.97116640e+02, -7.59989954e+01, -1.46745797e+02,
        1.83729766e+02,  1.52333807e+02,  1.58614336e+02, -2.80348072e+02,
        1.20902174e+02, -1.17809033e+02, -9.97851311e+01, -1.12611893e+02,
        3.63757770e+02, -9.44789347e+01,  3.73068707e+02, -1.63022890e+02,
       -1.79148538e+02, -4.76543795e+01, -1.22622512e+02,  2.53736026e+02,
       -6.01694025e+01, -1.36405943e+01,  4.60700767e+02,  2.92104314e+00,
       -1.23464126e+01,  3.97667623e+01, -5.95730592e+01,  2.53842230e+02,
       -6.30289062e+01,  4.94020282e-01,  2.02603324e+01, -1.42016832e+02,
        3.67620494e+02,  1.61207400e+02,  1.77934285e+02, -3.49704150e+02,
        1.48142903e+01, -9.68361205e+01,  3.95261071e+01,  1.98248208e+02,
        1.47737919e+02, -3.57354347e+02, -3.85222830e+01, -5.95684855e+01,
        1.27330414e+02,  2.62530067e+02,  7.13475452e+01,  3.69909633e+01,
        1.27019908e+02, -2.68814318e+00,  1.64428940e+02,  6.70256697e+01,
        2.92337172e+02,  2.49242376e+02, -4.03311726e+02, -3.85953399e+01,
        6.14672781e+01,  1.30952366e+02,  2.87280719e+02, -2.58860504e+02,
        1.10095893e+01,  7.70383502e+01, -1.57451567e+02, -3.51116613e+02,
        1.33489115e+02,  1.39837447e+02, -2.91451101e+02,  7.65135528e+01,
        2.36832431e+02, -4.84163633e+01,  1.86140659e+02,  2.10714217e+02,
        7.20502477e+01,  1.10458158e+02,  8.35301830e+01,  2.04801832e+02,
        1.87477964e+02,  2.18505503e+01, -1.09905765e+02, -1.03149695e+02,
       -5.95948543e+01,  1.66125538e+01, -1.59187302e+02, -1.55099546e+02,
       -5.64796237e+01,  1.95535745e+02,  5.09002161e+01,  1.57947532e+01])

Key Takeaways#

  1. Function Profiler (%%cuml.accel.profile): Best for understanding which methods were accelerated and why some fell back to CPU

  2. Line Profiler (%%cuml.accel.line_profile): Best for understanding which specific lines of code benefited from acceleration and the overall GPU utilization percentage

  3. Logging: Useful for real-time feedback on what’s happening during execution

  4. Performance Insights:

    • High GPU utilization percentages indicate good acceleration

    • CPU fallbacks are clearly identified with reasons

    • Small datasets may show relatively high GPU times because of data-transfer overhead

    • Larger datasets typically benefit more from GPU acceleration (see the sketch after this list)

  5. Debugging: Use these tools to identify why certain operations aren’t being accelerated and optimize your code accordingly.
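
To see the dataset-size effect from the list above for yourself, you can re-run the function profiler at a couple of sizes and compare the reports. A minimal sketch (the sizes are arbitrary and the exact numbers depend on your hardware):

[ ]:
# A minimal sketch: profile the same estimator at two dataset sizes.
# Small inputs are dominated by transfer and launch overhead, while larger
# inputs amortize it; compare the GPU time in the two printed reports.
for n_samples in (1_000, 200_000):
    X, y = make_regression(n_samples=n_samples, n_features=100, noise=0.1, random_state=42)
    print(f"n_samples = {n_samples}")
    with cuml.accel.profile():
        model = Ridge(alpha=1.0)
        model.fit(X, y)
        model.predict(X)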

The profiling tools in cuml.accel are essential for understanding and optimizing your GPU-accelerated machine learning workflows!