10 Minutes to cuDF and CuPy
This notebook provides introductory examples of how you can use cuDF and CuPy together to take advantage of CuPy array functionality (such as advanced linear algebra operations).
[1]:
import time
from numba import cuda
import cupy as cp
import cudf
Converting a cuDF DataFrame to a CuPy Array
If we want to convert a cuDF DataFrame to a CuPy ndarray, there are multiple ways to do it:
The best way is to use the dlpack interface.
We can also convert via the CUDA array interface by using cuDF’s as_gpu_matrix and CuPy’s asarray functionality.
Because CuPy arrays have a single dtype, each column in our DataFrame must have the same dtype, regardless of which method we use.
[2]:
nelem = 10000
df = cudf.DataFrame({'a': range(nelem),
                     'b': range(500, nelem + 500),
                     'c': range(1000, nelem + 1000)})
%time arr_cupy = cp.fromDlpack(df.to_dlpack())
arr_cupy
CPU times: user 346 µs, sys: 400 µs, total: 746 µs
Wall time: 646 µs
/conda/envs/cudf/lib/python3.7/site-packages/cudf/io/dlpack.py:74: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
return libdlpack.to_dlpack(gdf_cols)
[2]:
array([[ 0, 500, 1000],
[ 1, 501, 1001],
[ 2, 502, 1002],
...,
[ 9997, 10497, 10997],
[ 9998, 10498, 10998],
[ 9999, 10499, 10999]])
[3]:
cp.asarray(df.as_gpu_matrix())
[3]:
array([[ 0, 500, 1000],
[ 1, 501, 1001],
[ 2, 502, 1002],
...,
[ 9997, 10497, 10997],
[ 9998, 10498, 10998],
[ 9999, 10499, 10999]])
Converting a cuDF Series to a CuPy Array
There are multiple ways to convert a cuDF Series to a CuPy array:
Easiest & Preferred: We can convert a cuDF Series to a CuPy array by passing the Series to cupy.asarray, as cuDF Series exposes __cuda_array_interface__ (https://docs-cupy.chainer.org/en/stable/reference/interoperability.html).
We can also pass the underlying Numba DeviceNDArray to cupy.asarray.
We can also leverage the dlpack interface via to_dlpack().
[4]:
col = 'a'
%time cola_cupy = cp.asarray(df[col])
%time cola_cupy = cp.asarray(df[col].data)
%time cola_cupy = cp.fromDlpack(df[col].to_dlpack())
type(cola_cupy)
CPU times: user 145 µs, sys: 113 µs, total: 258 µs
Wall time: 265 µs
CPU times: user 42 µs, sys: 47 µs, total: 89 µs
Wall time: 93 µs
CPU times: user 385 µs, sys: 436 µs, total: 821 µs
Wall time: 512 µs
[4]:
cupy.core.core.ndarray
From here, we can proceed with normal CuPy workflows, such as reshaping the array, getting the diagonal, or calculating the norm.
[5]:
reshaped_arr = cola_cupy.reshape(50, 200)
reshaped_arr
[5]:
array([[ 0, 1, 2, ..., 197, 198, 199],
[ 200, 201, 202, ..., 397, 398, 399],
[ 400, 401, 402, ..., 597, 598, 599],
...,
[9400, 9401, 9402, ..., 9597, 9598, 9599],
[9600, 9601, 9602, ..., 9797, 9798, 9799],
[9800, 9801, 9802, ..., 9997, 9998, 9999]])
[6]:
reshaped_arr.diagonal()
[6]:
array([ 0, 201, 402, 603, 804, 1005, 1206, 1407, 1608, 1809, 2010,
2211, 2412, 2613, 2814, 3015, 3216, 3417, 3618, 3819, 4020, 4221,
4422, 4623, 4824, 5025, 5226, 5427, 5628, 5829, 6030, 6231, 6432,
6633, 6834, 7035, 7236, 7437, 7638, 7839, 8040, 8241, 8442, 8643,
8844, 9045, 9246, 9447, 9648, 9849])
[7]:
cp.linalg.norm(reshaped_arr)
[7]:
array(577306.967739)
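As a rough sanity check on the three results above, the same reshape, diagonal, and norm can be reproduced without a GPU. The sketch below is a pure-Python stand-in, not CuPy code; it assumes row-major reshaping and that the norm of a 2-D array defaults to the Frobenius norm (the square root of the sum of squared entries), as it does in NumPy.

```python
import math

# Pure-Python stand-in for the values of column 'a' (0..9999); no GPU needed.
values = list(range(10000))

# reshape(50, 200) in row-major order: row r holds values[200*r : 200*(r+1)].
n_rows, n_cols = 50, 200
rows = [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# diagonal(): element (k, k) for every k that fits in both dimensions.
diagonal = [rows[k][k] for k in range(min(n_rows, n_cols))]

# Frobenius norm: square root of the sum of squared entries.
frobenius = math.sqrt(sum(v * v for v in values))

print(diagonal[:5])
print(round(frobenius, 6))
```

The diagonal advances by 201 per step because moving one row down skips 200 values and moving one column right skips one more.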
Converting a CuPy Array to a cuDF DataFrame
We can also convert a CuPy ndarray to a cuDF DataFrame. As above, we can use either the dlpack interface or the CUDA array interface with cuDF’s cudf.DataFrame. Either way, we’ll need to make sure that our CuPy array is Fortran contiguous in memory (if it’s not already). We can either transpose the array or simply coerce it to be Fortran contiguous beforehand.
We can check whether our array is Fortran contiguous by using cupy.isfortran or looking at the flags of the array.
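To make "Fortran contiguous" concrete, here is a small library-free sketch of how the same 2 x 3 table is laid out in flat memory under row-major (C) and column-major (Fortran) order. The helper names c_offset and f_offset are hypothetical and exist only for this illustration; they are not part of cuDF or CuPy.

```python
def c_offset(i, j, n_rows, n_cols):
    """Row-major (C order): rows are stored one after another."""
    return i * n_cols + j


def f_offset(i, j, n_rows, n_cols):
    """Column-major (Fortran order): columns are stored one after another."""
    return j * n_rows + i


n_rows, n_cols = 2, 3
table = [[0, 1, 2],
         [3, 4, 5]]

# Write every element into its flat position under each layout.
c_flat = [None] * (n_rows * n_cols)
f_flat = [None] * (n_rows * n_cols)
for i in range(n_rows):
    for j in range(n_cols):
        c_flat[c_offset(i, j, n_rows, n_cols)] = table[i][j]
        f_flat[f_offset(i, j, n_rows, n_cols)] = table[i][j]

print(c_flat)  # [0, 1, 2, 3, 4, 5]
print(f_flat)  # [0, 3, 1, 4, 2, 5]
```

In the column-major layout each column occupies one unbroken run of memory, which is the arrangement a column-oriented DataFrame expects.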
[8]:
cp.isfortran(reshaped_arr)
[8]:
False
In this case, we’ll need to convert it before going to a cuDF DataFrame. In the next two cells, we create the DataFrame by leveraging dlpack and the CUDA array interface, respectively.
[9]:
reshaped_arr = cp.asfortranarray(reshaped_arr)
reshaped_df = cudf.from_dlpack(reshaped_arr.toDlpack())
reshaped_df.head()
/conda/envs/cudf/lib/python3.7/site-packages/cudf/io/dlpack.py:33: UserWarning: WARNING: cuDF from_dlpack() assumes column-major (Fortran order) input. If the input tensor is row-major, transpose it before passing it to this function.
res = libdlpack.from_dlpack(pycapsule_obj)
[9]:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
[10]:
reshaped_df = cudf.DataFrame(reshaped_arr)
reshaped_df.head()
[10]:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
Converting a CuPy Array to a cuDF Series
To convert an array to a Series, we can directly pass the array to the constructor. We just need to make sure that the array is stored in contiguous memory. If it’s not, we need to create a contiguous array with ascontiguousarray. We could also use asfortranarray, but it won’t matter in the case of this one-dimensional array.
[11]:
diag_data = cp.ascontiguousarray(reshaped_arr.diagonal())
cudf.Series(diag_data).head()
[11]:
0 0
1 201
2 402
3 603
4 804
dtype: int64
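One way to see why the diagonal needs the contiguity step: its elements are spread through the parent array’s memory rather than sitting side by side. The library-free sketch below computes the flat offsets of the diagonal of a 50 x 200 array under both memory layouts. It assumes diagonal() hands back a strided view of the parent array (as NumPy does) rather than a fresh copy.

```python
# Library-free sketch: flat offsets of the diagonal of a 50 x 200 array.
n_rows, n_cols = 50, 200

# Row-major (C order): element (k, k) sits at offset k * n_cols + k.
c_offsets = [k * n_cols + k for k in range(n_rows)]

# Column-major (Fortran order): element (k, k) sits at offset k * n_rows + k.
f_offsets = [k * n_rows + k for k in range(n_rows)]

# Distance in elements between consecutive diagonal entries.
c_step = c_offsets[1] - c_offsets[0]
f_step = f_offsets[1] - f_offsets[0]

print(c_step, f_step)  # 201 51
```

Under either layout the step is larger than one element, so the diagonal is not contiguous until it is copied into its own buffer.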
Interweaving cuDF and CuPy for Smooth PyData Workflows
RAPIDS libraries and the entire GPU PyData ecosystem are developing quickly, but sometimes a single library may not have the functionality you need. One example of this might be taking the row-wise sum (or mean) of a Pandas DataFrame. cuDF’s support for row-wise operations isn’t mature, so you’d need to either transpose the DataFrame or write a UDF and explicitly calculate the sum across each row. Depending on your data’s shape, transposing could lead to hundreds of thousands of columns (which cuDF wouldn’t perform well with), and writing a UDF can be time intensive.
By leveraging the interoperability of the GPU PyData ecosystem, this operation becomes very easy. Let’s take the row-wise sum of our previously reshaped cuDF DataFrame.
[12]:
reshaped_df.head()
[12]:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
1 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | ... | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 |
2 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | ... | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 |
3 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | ... | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 |
4 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | ... | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 |
5 rows × 200 columns
We can just transform it into a CuPy array via dlpack and use the axis argument of sum.
[13]:
new_arr = cp.fromDlpack(reshaped_df.to_dlpack())
new_arr.sum(axis=1)
/conda/envs/cudf/lib/python3.7/site-packages/cudf/io/dlpack.py:74: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
return libdlpack.to_dlpack(gdf_cols)
[13]:
array([ 19900, 59900, 99900, 139900, 179900, 219900, 259900,
299900, 339900, 379900, 419900, 459900, 499900, 539900,
579900, 619900, 659900, 699900, 739900, 779900, 819900,
859900, 899900, 939900, 979900, 1019900, 1059900, 1099900,
1139900, 1179900, 1219900, 1259900, 1299900, 1339900, 1379900,
1419900, 1459900, 1499900, 1539900, 1579900, 1619900, 1659900,
1699900, 1739900, 1779900, 1819900, 1859900, 1899900, 1939900,
1979900])
With just that single line, we’re able to seamlessly move between data structures in this ecosystem, giving us enormous flexibility without sacrificing speed.
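For readers unsure what axis=1 selects, the following pure-Python sketch reproduces the row-wise totals on the same 50 x 200 table of values. It is a CPU stand-in for illustration only and uses no cuDF or CuPy calls.

```python
# Library-free sketch of a row-wise sum over the 50 x 200 table of 0..9999.
n_rows, n_cols = 50, 200
rows = [list(range(r * n_cols, (r + 1) * n_cols)) for r in range(n_rows)]

# axis=1 collapses the column axis: one total per row.
row_sums = [sum(row) for row in rows]

# axis=0 would collapse the row axis instead: one total per column.
col_sums = [sum(col) for col in zip(*rows)]

print(row_sums[:3])
print(len(row_sums), len(col_sums))
```

Each row total is 40000 * r + 19900 for row r, which matches the first and last values in the output above (19900 and 1979900).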
Converting a cuDF DataFrame to a CuPy Sparse Matrix
We can also convert a DataFrame or Series to a CuPy sparse matrix. We might want to do this if downstream processes expect CuPy sparse matrices as an input.
The sparse matrix data structure is defined by three dense arrays, which we could create manually from an existing cuDF DataFrame or Series. Luckily, we don’t need to do that. We can simply leverage dlpack again. We’ll define a small helper function for cleanliness.
[14]:
def cudf_to_cupy_sparse_matrix(data, sparseformat='column'):
    """Converts a cuDF object to a CuPy sparse matrix in CSC ('column') or CSR ('row') format."""
    if sparseformat not in ('row', 'column',):
        raise ValueError("Let's focus on column and row formats for now.")

    # Default to compressed sparse column; switch to compressed sparse row on request.
    _sparse_constructor = cp.sparse.csc_matrix
    if sparseformat == 'row':
        _sparse_constructor = cp.sparse.csr_matrix

    return _sparse_constructor(cp.fromDlpack(data.to_dlpack()))
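For reference, the three dense arrays behind the compressed sparse column (CSC) format can be sketched in plain Python. The function dense_to_csc below is a hypothetical helper written only for this illustration; it is not part of CuPy or cuDF, and it assumes the usual CSC convention of data (nonzero values), indices (their row indices), and indptr (where each column starts within data).

```python
def dense_to_csc(dense):
    """Build the three CSC arrays (data, indices, indptr) from a list of rows."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    data, indices, indptr = [], [], [0]
    for j in range(n_cols):               # walk the matrix column by column
        for i in range(n_rows):
            if dense[i][j] != 0:
                data.append(dense[i][j])  # the nonzero value
                indices.append(i)         # its row index
        indptr.append(len(data))          # offset where the next column starts
    return data, indices, indptr


dense = [[0.0, 2.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 3.0, 4.0]]
data, indices, indptr = dense_to_csc(dense)

print(data)     # [1.0, 2.0, 3.0, 4.0]
print(indices)  # [1, 0, 2, 2]
print(indptr)   # [0, 1, 3, 4]
```

Column j’s nonzero entries are data[indptr[j]:indptr[j + 1]]. The row format (CSR) uses the same three arrays with the roles of rows and columns swapped.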
We can define a sparsely populated DataFrame to illustrate this conversion to either sparse matrix format.
[15]:
df = cudf.DataFrame()
nelem = 10000
nonzero = 1000
for i in range(20):
    arr = cp.random.normal(5, 5, nelem)
    arr[cp.random.choice(arr.shape[0], nelem-nonzero, replace=False)] = 0
    df['a' + str(i)] = cp.ascontiguousarray(arr)
[16]:
df.head()
[16]:
 | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | a10 | a11 | a12 | a13 | a14 | a15 | a16 | a17 | a18 | a19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.712374 | 8.767093 | 0.0 | 0.200284 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
1 | 0.0 | 0.0 | 0.0 | 8.846976 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 8.297748 | 0.0 | 0.0 | 0.000000 | 0.000000 |
2 | 0.0 | 0.0 | 0.0 | 7.201353 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
3 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 9.195801 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.16354 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
4 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 3.340914 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | 0.000000 | 0.0 | 0.0 | 8.887779 | 4.562149 |
[17]:
sparse_data = cudf_to_cupy_sparse_matrix(df)
sparse_data
/conda/envs/cudf/lib/python3.7/site-packages/cudf/io/dlpack.py:74: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
return libdlpack.to_dlpack(gdf_cols)
[17]:
<cupyx.scipy.sparse.csc.csc_matrix at 0x7f25814012d0>
From here, we could continue our workflow with a CuPy sparse matrix.
For a full list of the functionality built into these libraries, we encourage you to check out the API docs for cuDF and CuPy.