Databricks#

You can install RAPIDS on Databricks in a few different ways:

  1. Accelerate machine learning workflows in a single-node GPU notebook environment

  2. Install the RAPIDS Accelerator for Apache Spark 3.x on Databricks, if you are a Spark user

  3. Install Dask alongside Spark and then use libraries like dask-cudf for multi-node workloads

Single-node GPU Notebook environment#

Create init-script#

To get started, you must first configure an initialization script to install RAPIDS libraries and all other dependencies for your project.

Databricks recommends cluster-scoped init scripts stored in workspace files.

In the Workspace tab at the top left, open your Home directory and select Add > File from the menu. Create a script named init.sh with the following contents:

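#!/bin/bash
set -e

# Install RAPIDS libraries from the nightly wheel index.
# "==25.12.*" pins the release series; ">=0.0.0a0" lets pip pick up pre-release (nightly) builds.
pip install \
    --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    "cudf-cu12==25.12.*,>=0.0.0a0" "cuml-cu12==25.12.*,>=0.0.0a0" \
    "dask-cuda==25.12.*,>=0.0.0a0"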

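If you maintain several clusters or RAPIDS versions, the init script can be generated rather than edited by hand. The following is a minimal sketch, not part of RAPIDS or Databricks: `build_init_script` is a hypothetical helper that only assembles text, and it assumes each requirement is written as `==<version>.*,>=0.0.0a0` (pin the release series, allow nightly pre-releases).

```python
from typing import Sequence

# Nightly wheel index used by the init script in this guide.
NIGHTLY_INDEX = "https://pypi.anaconda.org/rapidsai-wheels-nightly/simple"


def build_init_script(packages: Sequence[str], version: str) -> str:
    """Return the text of an init script that pip-installs nightly RAPIDS wheels.

    Hypothetical helper for illustration; it builds a string and installs nothing.
    """
    # "==<version>.*" pins the release series; ">=0.0.0a0" allows pre-releases.
    specs = [f'"{name}=={version}.*,>=0.0.0a0"' for name in packages]
    lines = [
        "#!/bin/bash",
        "set -e",
        "",
        "# Install RAPIDS libraries",
        "pip install \\",
        f"    --extra-index-url={NIGHTLY_INDEX} \\",
        "    " + " ".join(specs),
    ]
    return "\n".join(lines) + "\n"


script = build_init_script(["cudf-cu12", "cuml-cu12", "dask-cuda"], "25.12")
print(script)
```

The printed text can be pasted into the init.sh workspace file described above.
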
Launch cluster#

Navigate to the All Purpose Compute tab of the Compute section in Databricks and select Create Compute. Name your cluster and choose “Single node”.

Screenshot of the Databricks compute page

To launch a GPU node, uncheck Use Photon Acceleration and select any 15.x, 16.x or 17.x LTS ML runtime with GPU support. For example, you could select the 15.4 LTS ML (includes Apache Spark 3.5.0, GPU, Scala 2.12) runtime version, which is a long-term support release.

The “GPU accelerated” nodes should now be available in the Node type dropdown.

Screenshot of selecting a g4dn.xlarge node type

Then expand the Advanced Options section, open the Init Scripts tab, and enter the path to the init script in your Workspace directory, in the form /Users/<user-name>/<script-name>.sh. Click “Add”.

Screenshot of init script path

Select Create Compute.
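
The same choices can be written down as a cluster specification, which is useful if you create clusters through the Databricks Clusters API instead of the UI. This is a hedged sketch: `single_node_gpu_cluster_spec` is a hypothetical helper, the field names are assumed to follow the Clusters API, and `15.4.x-gpu-ml-scala2.12` is an assumed runtime key for the 15.4 LTS ML GPU runtime. Verify all of them against your workspace before use.

```python
import json


def single_node_gpu_cluster_spec(name: str, init_script_path: str) -> dict:
    """Build a request body mirroring the UI choices described above.

    Hypothetical helper; field names and values are assumptions to verify.
    """
    return {
        "cluster_name": name,
        # Assumed key for the 15.4 LTS ML runtime with GPU support.
        "spark_version": "15.4.x-gpu-ml-scala2.12",
        # GPU node type shown in the screenshot above (AWS).
        "node_type_id": "g4dn.xlarge",
        # Single node: driver only, no workers.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
        # Equivalent of unchecking "Use Photon Acceleration".
        "runtime_engine": "STANDARD",
        # Cluster-scoped init script stored as a workspace file.
        "init_scripts": [{"workspace": {"destination": init_script_path}}],
    }


spec = single_node_gpu_cluster_spec(
    "rapids-single-node", "/Users/<user-name>/<script-name>.sh"
)
print(json.dumps(spec, indent=2))
```

The placeholders in the init script path are left as in the text above; substitute your own user name and script name.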

Test RAPIDS#

Once your cluster has started, create a new notebook or open an existing one from the /Workspace directory, then attach it to your running cluster. Run the following to confirm that cuDF is working:

import cudf

gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
gdf
    a   b
0   1   4
1   2   5
2   3   6
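
If the import fails, it helps to check whether the init script actually installed the packages. The sketch below uses only the Python standard library; `installed_versions` is a hypothetical helper, and the distribution names are the ones from the init script in this guide.

```python
from importlib import metadata

# Distribution names taken from the init script in this guide.
RAPIDS_PACKAGES = ("cudf-cu12", "cuml-cu12", "dask-cuda")


def installed_versions(packages=RAPIDS_PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A package reported as missing usually means the init script did not run or failed; the cluster's init script logs are the place to look next.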

Quickstart with cuDF Pandas#

cuDF’s pandas accelerator mode (cudf.pandas) accelerates existing pandas workflows with zero changes to code.

Using cudf.pandas on a single-node Databricks cluster can offer significant performance improvements over standard pandas when dealing with large datasets. Operations run on the GPU (cuDF) whenever possible and fall back seamlessly to the CPU (pandas) when necessary, with data synchronized between the two in the background.
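
The "try the GPU, fall back to the CPU" behavior can be pictured with a toy dispatcher. This is a conceptual illustration only, not how cudf.pandas is implemented: `with_fallback`, `UnsupportedOnGPU`, `fast_sum` and `slow_sum` are all hypothetical names, and no GPU is involved.

```python
from typing import Any, Callable


class UnsupportedOnGPU(Exception):
    """Raised by the hypothetical fast path when it cannot handle the input."""


def with_fallback(fast: Callable[..., Any], slow: Callable[..., Any]) -> Callable[..., tuple]:
    """Return a function that tries `fast` first and falls back to `slow`."""

    def call(*args, **kwargs):
        try:
            return "fast", fast(*args, **kwargs)
        except UnsupportedOnGPU:
            # The fast path declined, so run the general implementation instead.
            return "slow", slow(*args, **kwargs)

    return call


def fast_sum(values):
    # Stand-in for an accelerated routine that only supports numeric data.
    if not all(isinstance(v, (int, float)) for v in values):
        raise UnsupportedOnGPU("non-numeric input")
    return sum(values)


def slow_sum(values):
    # Stand-in for the general CPU routine: converts each item to float first.
    return sum(float(v) for v in values)


total = with_fallback(fast_sum, slow_sum)
print(total([1, 2, 3]))        # handled by the fast path
print(total(["1", "2", "3"]))  # falls back to the slow path
```

The caller uses one function and never needs to know which path ran, which is the property that lets existing pandas code run unchanged.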

Below is a quick example of how to load the cudf.pandas extension in a Jupyter notebook:


%load_ext cudf.pandas

%%time

import pandas as pd

df = pd.read_parquet(
    "nyc_parking_violations_2022.parquet",
    columns=[
        "Registration State",
        "Violation Description",
        "Vehicle Body Type",
        "Issue Date",
        "Summons Number",
    ],
)

(df[["Registration State", "Violation Description"]]
 .value_counts()
 .groupby("Registration State")
 .head(1)
 .sort_index()
 .reset_index()
)

Upload the 10 Minutes to RAPIDS cuDF Pandas notebook to your workspace, attach it to your single-node Databricks cluster, and run through the cells.

NOTE: cuDF pandas is in open beta and under active development. You can learn more from the documentation and the release blog.