Interoperability between cuDF and CuPy#
This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations).
import timeit

import cupy as cp
from packaging import version

import cudf

# CuPy 10 adopted the standard DLPack name from_dlpack; older releases used fromDlpack.
if version.parse(cp.__version__) >= version.parse("10.0.0"):
    cupy_from_dlpack = cp.from_dlpack
else:
    cupy_from_dlpack = cp.fromDlpack
Converting a cuDF DataFrame to a CuPy Array#
If we want to convert a cuDF DataFrame to a CuPy ndarray, there are multiple ways to do it:

- We can use the dlpack interface.
- We can use DataFrame.values.
- We can convert via the CUDA array interface by using cuDF's to_cupy functionality.
nelem = 10000
df = cudf.DataFrame(
    {
        "a": range(nelem),
        "b": range(500, nelem + 500),
        "c": range(1000, nelem + 1000),
    }
)
%timeit arr_cupy = cupy_from_dlpack(df.to_dlpack())
%timeit arr_cupy = df.values
%timeit arr_cupy = df.to_cupy()
329 μs ± 1.35 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
656 μs ± 3.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
646 μs ± 4.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
arr_cupy = cupy_from_dlpack(df.to_dlpack())
arr_cupy
array([[ 0, 500, 1000],
[ 1, 501, 1001],
[ 2, 502, 1002],
...,
[ 9997, 10497, 10997],
[ 9998, 10498, 10998],
[ 9999, 10499, 10999]])
Converting a cuDF Series to a CuPy Array#
There are also multiple ways to convert a cuDF Series to a CuPy array:
- We can pass the Series to cupy.asarray, since a cuDF Series exposes __cuda_array_interface__.
- We can leverage the dlpack interface via to_dlpack().
- We can use Series.values.
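As a quick illustration of the first option, a cuDF Series without nulls exposes the __cuda_array_interface__ protocol directly (an illustrative peek, not part of the benchmarks below):

df["a"].__cuda_array_interface__["shape"]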
col = "a"
%timeit cola_cupy = cp.asarray(df[col])
%timeit cola_cupy = cupy_from_dlpack(df[col].to_dlpack())
%timeit cola_cupy = df[col].values
417 μs ± 7.84 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
737 μs ± 5.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
505 μs ± 4.88 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
cola_cupy = cp.asarray(df[col])
cola_cupy
array([ 0, 1, 2, ..., 9997, 9998, 9999])
From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm.
reshaped_arr = cola_cupy.reshape(50, 200)
reshaped_arr
array([[ 0, 1, 2, ..., 197, 198, 199],
[ 200, 201, 202, ..., 397, 398, 399],
[ 400, 401, 402, ..., 597, 598, 599],
...,
[9400, 9401, 9402, ..., 9597, 9598, 9599],
[9600, 9601, 9602, ..., 9797, 9798, 9799],
[9800, 9801, 9802, ..., 9997, 9998, 9999]])
reshaped_arr.diagonal()
array([ 0, 201, 402, 603, 804, 1005, 1206, 1407, 1608, 1809, 2010,
2211, 2412, 2613, 2814, 3015, 3216, 3417, 3618, 3819, 4020, 4221,
4422, 4623, 4824, 5025, 5226, 5427, 5628, 5829, 6030, 6231, 6432,
6633, 6834, 7035, 7236, 7437, 7638, 7839, 8040, 8241, 8442, 8643,
8844, 9045, 9246, 9447, 9648, 9849])
cp.linalg.norm(reshaped_arr)
array(577306.967739)
Converting a CuPy Array to a cuDF DataFrame#
We can also convert a CuPy ndarray to a cuDF DataFrame. Like before, there are multiple ways to do it:
- Easiest: we can directly use the DataFrame constructor.
- We can use the CUDA array interface with the DataFrame constructor.
- We can use the dlpack interface.
For the latter two cases, we’ll need to make sure that our CuPy array is Fortran contiguous in memory (if it’s not already). We can either transpose the array or simply coerce it to be Fortran contiguous beforehand.
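As a small sketch of those two options (using a hypothetical C-contiguous array c_arr rather than one of the arrays above):

c_arr = cp.arange(6).reshape(2, 3)       # C contiguous by default
cp.isfortran(c_arr)                      # False
cp.isfortran(c_arr.T)                    # True: the transpose view is Fortran contiguous
cp.isfortran(cp.asfortranarray(c_arr))   # True: an explicit Fortran-ordered copy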
%timeit reshaped_df = cudf.DataFrame(reshaped_arr)
11.9 ms ± 76.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
reshaped_df = cudf.DataFrame(reshaped_arr)
reshaped_df.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
We can check whether our array is Fortran contiguous by using cupy.isfortran or looking at the flags of the array.
cp.isfortran(reshaped_arr)
False
In this case, we’ll need to convert it before creating the cuDF DataFrame. In the next two cells, we create the DataFrame by leveraging the CUDA array interface and dlpack, respectively.
%%timeit
fortran_arr = cp.asfortranarray(reshaped_arr)
reshaped_df = cudf.DataFrame(fortran_arr)
11.9 ms ± 116 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
fortran_arr = cp.asfortranarray(reshaped_arr)
reshaped_df = cudf.from_dlpack(fortran_arr.toDlpack())
10.2 ms ± 84.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fortran_arr = cp.asfortranarray(reshaped_arr)
reshaped_df = cudf.DataFrame(fortran_arr)
reshaped_df.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
Converting a CuPy Array to a cuDF Series#
To convert an array to a Series, we can directly pass the array to the Series constructor.
cudf.Series(reshaped_arr.diagonal()).head()
0 0
1 201
2 402
3 603
4 804
dtype: int64
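The constructor also accepts the usual pandas-style keyword arguments; for example, a small sketch that names the resulting Series:

cudf.Series(reshaped_arr.diagonal(), name="diag").head()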
Interweaving cuDF and CuPy for Smooth PyData Workflows#
RAPIDS libraries and the entire GPU PyData ecosystem are developing quickly, but sometimes one library may not have the functionality you need. One example of this might be taking the row-wise sum (or mean) of a DataFrame, an operation that is straightforward in Pandas. cuDF’s support for row-wise operations isn’t mature, so you’d need to either transpose the DataFrame or write a UDF and explicitly calculate the sum across each row. Depending on your data’s shape, transposing could produce hundreds of thousands of columns (which cuDF wouldn’t handle well), and writing a UDF can be time intensive.
By leveraging the interoperability of the GPU PyData ecosystem, this operation becomes very easy. Let’s take the row-wise sum of our previously reshaped cuDF DataFrame.
reshaped_df.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
We can just transform it into a CuPy array and use the axis argument of sum.
new_arr = cupy_from_dlpack(reshaped_df.to_dlpack())
new_arr.sum(axis=1)
array([ 19900, 59900, 99900, 139900, 179900, 219900, 259900,
299900, 339900, 379900, 419900, 459900, 499900, 539900,
579900, 619900, 659900, 699900, 739900, 779900, 819900,
859900, 899900, 939900, 979900, 1019900, 1059900, 1099900,
1139900, 1179900, 1219900, 1259900, 1299900, 1339900, 1379900,
1419900, 1459900, 1499900, 1539900, 1579900, 1619900, 1659900,
1699900, 1739900, 1779900, 1819900, 1859900, 1899900, 1939900,
1979900])
With just that single line, we’re able to seamlessly move between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed.
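And if we want that result back on the cuDF side, we can hand the CuPy array straight to a Series (a quick sketch reusing new_arr from above):

row_sums = cudf.Series(new_arr.sum(axis=1))
row_sums.head()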
Converting a cuDF DataFrame to a CuPy Sparse Matrix#
We can also convert a DataFrame or Series to a CuPy sparse matrix. We might want to do this if downstream processes expect CuPy sparse matrices as an input.
The sparse matrix data structure is defined by three dense arrays. We’ll define a small helper function for cleanliness.
def cudf_to_cupy_sparse_matrix(data, sparseformat="column"):
    """Converts a cuDF object to a CuPy sparse matrix (CSC by default, CSR if sparseformat="row")."""
    if sparseformat not in ("row", "column"):
        raise ValueError("Let's focus on column and row formats for now.")

    _sparse_constructor = cp.sparse.csc_matrix
    if sparseformat == "row":
        _sparse_constructor = cp.sparse.csr_matrix

    return _sparse_constructor(cupy_from_dlpack(data.to_dlpack()))
We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format.
df = cudf.DataFrame()
nelem = 10000
nonzero = 1000
for i in range(20):
    arr = cp.random.normal(5, 5, nelem)
    # Zero out all but `nonzero` randomly chosen entries in each column
    arr[cp.random.choice(arr.shape[0], nelem - nonzero, replace=False)] = 0
    df["a" + str(i)] = arr
df.head()
 | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | a10 | a11 | a12 | a13 | a14 | a15 | a16 | a17 | a18 | a19
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 2.594997 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
1 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 12.298933 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.365414 | 0.0 | 0.0 | 0.0 | 9.570602 |
2 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 7.694972 |
3 | 0.0 | 3.133451 | 0.0 | 5.547976 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
4 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
sparse_data = cudf_to_cupy_sparse_matrix(df)
print(sparse_data)
<Compressed Sparse Column sparse matrix of dtype 'float64'
with 20000 stored elements and shape (10000, 20)>
Coords Values
(256, 0) 17.50353888434092
(513, 0) 6.613259173884363
(641, 0) 5.236991865559398
(514, 0) 2.084534287250942
(1026, 0) 4.780794956908807
(899, 0) 0.15927569132029307
(134, 0) 8.79610336994913
(390, 0) -0.09960099214576523
(518, 0) 14.36043457391467
(1542, 0) 2.9619363633208575
(7, 0) 2.1168396228702853
(1159, 0) 2.6548234466061
(1287, 0) 3.3950019985690147
(904, 0) 2.8316835838774064
(1416, 0) 12.715256374292272
(139, 0) 5.843177034117902
(12, 0) 10.397694789372576
(140, 0) 8.214364222079977
(1038, 0) 4.565444506171
(1422, 0) 5.259042534946423
(16, 0) 4.428437177198361
(1168, 0) 9.142273171189185
(1296, 0) 11.842402399561852
(1041, 0) 7.75958314787127
(1425, 0) 7.379142504404486
: :
(9063, 19) 7.783265378162702
(8937, 19) 0.2784017198783989
(9705, 19) 4.058218000312452
(9450, 19) 4.456135894113211
(8811, 19) 7.2693594099268966
(9067, 19) 9.411595648359489
(9323, 19) 5.480853690054376
(9836, 19) 2.777381234413877
(9838, 19) -1.7916405777965294
(8816, 19) 1.971504645688868
(9328, 19) 6.1381859221538315
(9457, 19) 2.34924299918275
(8818, 19) 6.342804034792675
(9330, 19) 1.9077031456341524
(9971, 19) 5.628091448168788
(9844, 19) 8.343738759860887
(9335, 19) 8.355615399157207
(9847, 19) 12.16858586387871
(9592, 19) 0.3794812011484011
(9082, 19) 7.729984113971071
(9722, 19) 8.535108095921773
(9978, 19) 5.496714912431453
(9852, 19) -1.7458870863480969
(8831, 19) 2.091460235233761
(9215, 19) 9.46158427616279
From here, we could continue our workflow with a CuPy sparse matrix.
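For instance, the CSC matrix is backed by three dense CuPy arrays and plugs directly into CuPy's sparse linear algebra. A minimal, illustrative sketch using the sparse_data object from above:

sparse_data.data     # nonzero values
sparse_data.indices  # row index of each nonzero value
sparse_data.indptr   # offsets marking where each column starts in data/indices

# One possible downstream step: a sparse matrix-vector product that sums each row
ones = cp.ones(sparse_data.shape[1], dtype=sparse_data.dtype)
row_sums = sparse_data.dot(ones)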
For a full list of the functionality built into these libraries, we encourage you to check out the API docs for cuDF and CuPy.