Overview of User Defined Functions with cuDF

Like many tabular data processing APIs, cuDF provides a range of composable, DataFrame-style operators. While out-of-the-box functions are flexible and useful, it is sometimes necessary to write custom code, or user-defined functions (UDFs), that can be applied to rows, columns, and other groupings of the cells making up the DataFrame.

In conjunction with the broader GPU PyData ecosystem, cuDF provides interfaces to run UDFs on a variety of data structures. Currently, we can only execute UDFs on numeric and Boolean typed data (support for strings is being planned). This guide covers writing and executing UDFs on the following data structures:

  • Series

  • DataFrame

  • Rolling Windows Series

  • Groupby DataFrames

  • CuPy NDArrays

  • Numba DeviceNDArrays

It also demonstrates cuDF’s default null handling behavior and how to write UDFs that can interact with null values in a limited fashion. Finally, it demonstrates newer, more general null handling via the DataFrame.apply API.

Overview

When cuDF executes a UDF, it gets just-in-time (JIT) compiled into a CUDA kernel (either explicitly or implicitly) and is run on the GPU. Exploring CUDA and GPU architecture in-depth is out of scope for this guide. At a high level:

  • Compute is spread across multiple “blocks”, which have access to both global memory and their own block local memory

  • Within each block, many “threads” operate independently and simultaneously access their block-specific shared memory with low latency

This guide covers APIs that automatically handle dividing columns into chunks and assigning them into different GPU blocks for parallel computation (see apply_chunks or the numba CUDA JIT API if you need to control this yourself).
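
As a rough illustration of that chunking, the sketch below uses plain Python (no GPU required) to show how a one-dimensional launch maps each block and thread to a column index. The helper name global_indices and the block size of 4 are assumptions made for illustration only; the formula block * block_size + thread is the standard one-dimensional CUDA index, the same value cuda.grid(1) returns inside a kernel.

```python
def global_indices(n_elements, block_size):
    """Return {block: [column indices]} for a 1D launch covering n_elements.

    Hypothetical helper for illustration; cuDF and Numba do this internally.
    """
    # Ceiling division: enough blocks so that every element gets a thread.
    n_blocks = (n_elements + block_size - 1) // block_size
    mapping = {}
    for block in range(n_blocks):
        indices = []
        for thread in range(block_size):
            i = block * block_size + thread  # same value cuda.grid(1) would give
            if i < n_elements:               # surplus threads in the last block do nothing
                indices.append(i)
        mapping[block] = indices
    return mapping

chunks = global_indices(n_elements=10, block_size=4)
print(chunks)  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
```

On a real GPU the blocks run concurrently rather than one after another; the loop order here only serves to enumerate the mapping.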

Series UDFs

You can execute UDFs on Series in two ways:

  • Writing a standard Python function and using applymap

  • Writing a Numba kernel and using Numba’s forall syntax

Using applymap is simpler, but writing a Numba kernel offers the flexibility to build more complex functions (we’ll be writing only simple kernels in this guide).

Let’s start by importing a few libraries and creating a DataFrame of several Series.

[1]:
import numpy as np

import cudf
from cudf.datasets import randomdata

df = randomdata(nrows=10, dtypes={'a':float, 'b':bool, 'c':str}, seed=12)
df.head()
[1]:
a b c
0 -0.691674 True Dan
1 0.480099 False Bob
2 -0.473370 True Xavier
3 0.067479 True Alice
4 -0.970850 False Sarah

Next, we’ll define a basic Python function and call it as a UDF with applymap.

[2]:
def udf(x):
    if x > 0:
        return x + 5
    else:
        return x - 5
[3]:
df['a'].applymap(udf)
[3]:
0   -5.691674
1    5.480099
2   -5.473370
3    5.067479
4   -5.970850
5    5.837494
6    5.801430
7   -5.933157
8    5.913899
9   -5.725581
Name: a, dtype: float64

That’s all there is to it.

For more complex logic (for instance, accessing values from multiple input columns or rows), you’ll need to use a more complex API. There are several types. First, we’ll cover writing and running a Numba JITed CUDA kernel.

The easiest way to write a Numba kernel is to use cuda.grid(1) to manage our thread indices, and then leverage Numba’s forall method to configure the kernel for us. Below, we define a basic multiplication kernel as an example and use @cuda.jit to compile it.

[4]:
from numba import cuda

@cuda.jit
def multiply(in_col, out_col, multiplier):
    i = cuda.grid(1)
    if i < in_col.size: # boundary guard
        out_col[i] = in_col[i] * multiplier

This kernel will take an input array, multiply it by a configurable value (supplied at runtime), and store the result in an output array. Notice that we wrapped our logic in an if statement. Because we can launch more threads than the size of our array, we need to make sure that we don’t use threads with an index that would be out of bounds. Leaving this out can result in undefined behavior.
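
To make the role of the guard concrete, here is a CPU-only emulation of a launch, assuming a hypothetical configuration of 2 blocks of 4 threads for a 5-element array. The function multiply_emulated is not part of Numba or cuDF; it simply replays the kernel body once per simulated thread and counts the threads that have no element to process.

```python
def multiply_emulated(in_col, multiplier, n_blocks, block_size):
    """Replay the multiply kernel body once per simulated thread (CPU only)."""
    out_col = [0.0] * len(in_col)
    skipped = 0
    for block in range(n_blocks):
        for thread in range(block_size):
            i = block * block_size + thread   # what cuda.grid(1) returns
            if i < len(in_col):               # boundary guard
                out_col[i] = in_col[i] * multiplier
            else:
                skipped += 1                  # this thread has nothing to do
    return out_col, skipped

result, idle_threads = multiply_emulated(
    [1.0, 2.0, 3.0, 4.0, 5.0], 10.0, n_blocks=2, block_size=4
)
print(result)        # [10.0, 20.0, 30.0, 40.0, 50.0]
print(idle_threads)  # 3
```

In Python, removing the guard would raise an IndexError for indices 5 through 7. On the GPU there is no such check, which is why the out-of-bounds access is undefined behavior instead.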

To execute our kernel, we just need to pre-allocate an output array and leverage the forall method mentioned above. First, we create a Series of all 0.0 in our DataFrame, since we want float64 output. Next, we run the kernel with forall. forall requires us to specify our desired number of tasks, so we’ll supply the length of our Series (which we store in size). The __cuda_array_interface__ is what allows us to directly call our Numba kernel on our Series.

[5]:
size = len(df['a'])
df['e'] = 0.0
multiply.forall(size)(df['a'], df['e'], 10.0)

After calling our kernel, our DataFrame is now populated with the result.

[6]:
df.head()
[6]:
a b c e
0 -0.691674 True Dan -6.916743
1 0.480099 False Bob 4.800994
2 -0.473370 True Xavier -4.733700
3 0.067479 True Alice 0.674788
4 -0.970850 False Sarah -9.708501

Note that, while we’re operating on the Series df['e'], the kernel executes on the DeviceNDArray “underneath” the Series. If you ever need to access the underlying DeviceNDArray of a Series, you can do so with Series.data.mem.

DataFrame UDFs

We could apply a UDF on a DataFrame like we did above with forall. We’d need to write a kernel that expects multiple inputs, and pass multiple Series as arguments when we execute our kernel. Because this is fairly common and can be difficult to manage, cuDF provides two APIs to streamline this: apply_rows and apply_chunks. Below, we walk through an example of using apply_rows. apply_chunks works in a similar way, but also offers more control over low-level kernel behavior.

Now that we have two numeric columns in our DataFrame, let’s write a kernel that uses both of them.

[7]:
def conditional_add(x, y, out):
    for i, (a, e) in enumerate(zip(x, y)):
        if a > 0:
            out[i] = a + e
        else:
            out[i] = a

Notice that we need to enumerate through our zipped function arguments (which either match or are mapped to our input column names). We can pass this kernel to apply_rows. We’ll need to specify a few arguments:

  • incols - A list of names of input columns that match the function arguments, or a dictionary mapping input column names to their corresponding function arguments, such as {'col1': 'arg1'}.

  • outcols - A dictionary defining our output column names and their data types. These names must match our function arguments.

  • kwargs (optional) - We can optionally pass keyword arguments as a dictionary. Since we don’t need any, we pass an empty one.

While it looks like our function is looping sequentially through our columns, it actually executes in parallel in multiple threads on the GPU. This parallelism is the heart of GPU-accelerated computing. With that background, we’re ready to use our UDF.

[8]:
df = df.apply_rows(conditional_add,
                   incols={'a':'x', 'e':'y'},
                   outcols={'out': np.float64},
                   kwargs={}
                  )
df.head()
[8]:
a b c e out
0 -0.691674 True Dan -6.916743 -0.691674
1 0.480099 False Bob 4.800994 5.281093
2 -0.473370 True Xavier -4.733700 -0.473370
3 0.067479 True Alice 0.674788 0.742267
4 -0.970850 False Sarah -9.708501 -0.970850

As expected, we see our conditional addition worked. At this point, we’ve successfully executed UDFs on the core data structures of cuDF.
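
Because the body of conditional_add is ordinary Python, one way to sanity-check its logic before handing it to apply_rows is to call it on plain lists. The sketch below assumes nothing about cuDF; the input values are the first two rows of columns a and e shown above, and the pre-allocated list plays the role of the output column.

```python
import math

def conditional_add(x, y, out):
    for i, (a, e) in enumerate(zip(x, y)):
        if a > 0:
            out[i] = a + e
        else:
            out[i] = a

a_vals = [-0.691674, 0.480099]   # column a, rows 0 and 1
e_vals = [-6.916743, 4.800994]   # column e, rows 0 and 1
out = [0.0] * len(a_vals)        # pre-allocated output, like outcols
conditional_add(a_vals, e_vals, out)

print(out[0])                             # -0.691674 (a <= 0, so a is copied)
print(math.isclose(out[1], 5.281093))     # True (a > 0, so a + e)
```

On the CPU this loop runs sequentially. Under apply_rows, the same iterations are distributed across GPU threads.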

Rolling Window UDFs

For time-series data, we may need to operate on a small “window” of our column at a time, processing each portion independently. We could slide (“roll”) this window over the entire column to answer questions like “What is the 3-day moving average of a stock price over the past year?”

We can apply more complex functions to rolling windows over Series and DataFrames using apply. This example is adapted from cuDF’s API documentation. First, we’ll create an example Series and then create a rolling object from the Series.

[9]:
ser = cudf.Series([16, 25, 36, 49, 64, 81], dtype='float64')
ser
[9]:
0    16.0
1    25.0
2    36.0
3    49.0
4    64.0
5    81.0
dtype: float64
[10]:
rolling = ser.rolling(window=3, min_periods=3, center=False)
rolling
[10]:
Rolling [window=3,min_periods=3,center=False]

Next, we’ll define a function to use on our rolling windows. We created this one to highlight how you can include things like loops, mathematical functions, and conditionals. Rolling window UDFs do not yet support null values.

[11]:
import math

def example_func(window):
    b = 0
    for a in window:
        b = max(b, math.sqrt(a))
    if b == 8:
        return 100
    return b

We can execute the function by passing it to apply. With window=3, min_periods=3, and center=False, our first two values are null.

[12]:
rolling.apply(example_func)
[12]:
0     <NA>
1     <NA>
2      6.0
3      7.0
4    100.0
5      9.0
dtype: float64
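
To check the windowing by hand, here is a CPU-only reference written in plain Python. The helper rolling_apply is hypothetical and only models the configuration used above (a trailing window with min_periods equal to the window size and center=False); None stands in for a null.

```python
import math

def example_func(window):
    b = 0
    for a in window:
        b = max(b, math.sqrt(a))
    if b == 8:
        return 100
    return b

def rolling_apply(values, window, func):
    """CPU reference: trailing window, min_periods == window, center=False."""
    out = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(None)  # not enough observations yet, so the result is null
        else:
            out.append(func(values[end - window:end]))
    return out

rolled = rolling_apply([16, 25, 36, 49, 64, 81], 3, example_func)
print(rolled)  # [None, None, 6.0, 7.0, 100, 9.0]
```

The window ending at index 4 covers 36, 49, and 64, whose largest square root is exactly 8, so the function returns 100 there. cuDF stores the result in a float64 column, which is why it appears as 100.0 above.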

We can apply this function to every column in a DataFrame, too.

[13]:
df2 = cudf.DataFrame()
df2['a'] = np.arange(55, 65, dtype='float64')
df2['b'] = np.arange(55, 65, dtype='float64')
df2.head()
[13]:
a b
0 55.0 55.0
1 56.0 56.0
2 57.0 57.0
3 58.0 58.0
4 59.0 59.0
[14]:
rolling = df2.rolling(window=3, min_periods=3, center=False)
rolling.apply(example_func)
[14]:
a b
0 <NA> <NA>
1 <NA> <NA>
2 7.549834435 7.549834435
3 7.615773106 7.615773106
4 7.681145748 7.681145748
5 7.745966692 7.745966692
6 7.810249676 7.810249676
7 7.874007874 7.874007874
8 7.937253933 7.937253933
9 100.0 100.0

GroupBy DataFrame UDFs

We can also apply UDFs to grouped DataFrames using apply_grouped. This example is also drawn and adapted from the RAPIDS API documentation.

First, we’ll group our DataFrame based on column b, which is either True or False. Note that some older cuDF releases required passing method="cudf" to groupby in order to use UDFs with GroupBy objects; the call below does not pass it.

[15]:
df.head()
[15]:
a b c e out
0 -0.691674 True Dan -6.916743 -0.691674
1 0.480099 False Bob 4.800994 5.281093
2 -0.473370 True Xavier -4.733700 -0.473370
3 0.067479 True Alice 0.674788 0.742267
4 -0.970850 False Sarah -9.708501 -0.970850
[16]:
grouped = df.groupby(['b'])

Next we’ll define a function to apply to each group independently. In this case, we’ll take the rolling average of column e, and call that new column rolling_avg_e.

[17]:
def rolling_avg(e, rolling_avg_e):
    win_size = 3
    for i in range(cuda.threadIdx.x, len(e), cuda.blockDim.x):
        if i < win_size - 1:
            # If there is not enough data to fill the window,
            # take the average to be NaN
            rolling_avg_e[i] = np.nan
        else:
            total = 0
            for j in range(i - win_size + 1, i + 1):
                total += e[j]
            rolling_avg_e[i] = total / win_size

We can execute this with a very similar API to apply_rows. This time, though, it’s going to execute independently for each group.

[18]:
results = grouped.apply_grouped(rolling_avg,
                               incols=['e'],
                               outcols=dict(rolling_avg_e=np.float64))
results
[18]:
a b c e out rolling_avg_e
1 0.480099 False Bob 4.800994 5.281093 NaN
4 -0.970850 False Sarah -9.708501 -0.970850 NaN
6 0.801430 False Sarah 8.014297 8.815727 1.035597
7 -0.933157 False Quinn -9.331571 -0.933157 -3.675258
0 -0.691674 True Dan -6.916743 -0.691674 NaN
2 -0.473370 True Xavier -4.733700 -0.473370 NaN
3 0.067479 True Alice 0.674788 0.742267 -3.658552
5 0.837494 True Wendy 8.374940 9.212434 1.438676
8 0.913899 True Ursula 9.138987 10.052885 6.062905
9 -0.725581 True George -7.255814 -0.725581 3.419371

Notice how, with a window size of three in the kernel, the first two values in each group for our output column are NaN.
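
The loop over range(cuda.threadIdx.x, len(e), cuda.blockDim.x) in rolling_avg is a block-stride loop: each thread in the block starts at its own index and steps forward by the number of threads in the block. The CPU-only sketch below replays that pattern with a hypothetical block of 2 threads on the e values of the b == False group, and records how often each row is visited. The function name rolling_avg_emulated and the block size are assumptions for illustration.

```python
import math

def rolling_avg_emulated(e, block_dim):
    """Replay rolling_avg with block_dim simulated threads (CPU only)."""
    win_size = 3
    rolling_avg_e = [0.0] * len(e)
    visits = [0] * len(e)
    for thread_idx in range(block_dim):                 # stands in for cuda.threadIdx.x
        for i in range(thread_idx, len(e), block_dim):  # block-stride loop
            visits[i] += 1
            if i < win_size - 1:
                # Not enough data to fill the window
                rolling_avg_e[i] = math.nan
            else:
                rolling_avg_e[i] = sum(e[i - win_size + 1:i + 1]) / win_size
    return rolling_avg_e, visits

# Column e for the b == False group shown above.
group_e = [4.800994, -9.708501, 8.014297, -9.331571]
avg, visits = rolling_avg_emulated(group_e, block_dim=2)
print(avg)
print(visits)  # [1, 1, 1, 1]
```

Every row is visited exactly once regardless of how many threads the block has, so the kernel produces the same result for any group size.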

Numba Kernels on CuPy Arrays

We can also execute Numba kernels on CuPy NDArrays, again thanks to the __cuda_array_interface__. We can even run the same UDF on the Series and the CuPy array. First, we define a Series and then create a CuPy array from that Series.

[19]:
import cupy as cp

s = cudf.Series([1.0, 2, 3, 4, 10])
arr = cp.asarray(s)
arr
[19]:
array([ 1.,  2.,  3.,  4., 10.])

Next, we define a UDF and execute it on our Series. We need to allocate a Series of the same size for our output, which we’ll call out.

[20]:
from cudf.utils import cudautils

@cuda.jit
def multiply_by_5(x, out):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] * 5

out = cudf.Series(cp.zeros(len(s), dtype='int32'))
multiply_by_5.forall(s.shape[0])(s, out)
out
[20]:
0     5
1    10
2    15
3    20
4    50
dtype: int32

Finally, we execute the same function on our array. We allocate an empty array out to store our results.

[21]:
out = cp.empty_like(arr)
multiply_by_5.forall(arr.size)(arr, out)
out
[21]:
array([ 5., 10., 15., 20., 50.])

Null Handling in UDFs

Above, we covered most basic usage of UDFs with cuDF.

The remainder of the guide focuses on considerations for executing UDFs on DataFrames containing null values. If your UDFs will read or write any column containing nulls, you should read this section carefully.

Writing UDFs that can handle null values is complicated by the fact that a separate bitmask is used to identify when a value is valid and when it’s null. By default, DataFrame methods for applying UDFs like apply_rows will handle nulls pessimistically (any row that has a null in a column used by the kernel will be null in the output). Exploring how non-pessimistic null handling can lead to undefined behavior is outside the scope of this guide. Suffice it to say, pessimistic null handling is the safe and consistent approach. You can see an example below.

[22]:
def gpu_add(a, b, out):
    for i, (x, y) in enumerate(zip(a, b)):
        out[i] = x + y

df = randomdata(nrows=5, dtypes={'a':int, 'b':int, 'c':int}, seed=12)
df.loc[2, 'a'] = None
df.loc[3, 'b'] = None
df.loc[1, 'c'] = None
df.head()
[22]:
a b c
0 963 1005 997
1 977 1026 <NA>
2 <NA> 1026 1019
3 1078 <NA> 985
4 979 982 1011

In the DataFrame above, there are three null values. Each column has a null in a different row. When we use our UDF with apply_rows, our output should have two nulls due to pessimistic null handling (because we’re not using column c, the null value there does not matter to us).

[23]:
df = df.apply_rows(gpu_add,
              incols=['a', 'b'],
              outcols={'out':np.float64},
              kwargs={})
df.head()
[23]:
a b c out
0 963 1005 997 1968.0
1 977 1026 <NA> 2003.0
2 <NA> 1026 1019 <NA>
3 1078 <NA> 985 <NA>
4 979 982 1011 1961.0

As expected, we end up with two nulls in our output. The null values from the columns we used propagated to our output, but the null from the column we ignored did not.
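
One way to picture pessimistic null handling is as a logical AND over the validity masks of the input columns. The sketch below is a CPU-only model under that assumption, not cuDF’s actual implementation: each column is a hypothetical (values, validity_mask) pair, the helper apply_pessimistic is made up for illustration, and the value stored under a null is an arbitrary placeholder.

```python
def gpu_add(a, b, out):
    for i, (x, y) in enumerate(zip(a, b)):
        out[i] = x + y

def apply_pessimistic(func, in_cols):
    """Hypothetical CPU model: each column is (values, validity_mask)."""
    n = len(in_cols[0][0])
    # A row is valid only if it is valid in every input column.
    out_mask = [all(mask[i] for _, mask in in_cols) for i in range(n)]
    out_vals = [0.0] * n
    func(*[vals for vals, _ in in_cols], out_vals)  # the kernel sees raw values only
    return [v if ok else None for v, ok in zip(out_vals, out_mask)]

# Columns a and b from above; slots under a null hold a placeholder (0).
col_a = ([963, 977, 0, 1078, 979], [True, True, False, True, True])
col_b = ([1005, 1026, 1026, 0, 982], [True, True, True, False, True])
summed = apply_pessimistic(gpu_add, [col_a, col_b])
print(summed)  # [1968, 2003, None, None, 1961]
```

Column c never enters the mask computation, which matches the observation that its null has no effect on the output.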

Generalized NA Support

More general support for NA handling is provided on an experimental basis. While the details of the way this works are outside the scope of this guide, the broad strokes of the pipeline are similar to those of Series.applymap: Numba is used to translate a standard Python function into an operation on the data columns and their masks, and then the reduced and optimized version of this function is runtime compiled and called using the data.

Apart from the ability to handle nulls generally in an intuitive manner, one advantage of this approach is that it results in an API that is very familiar to Pandas users. Let’s see how this works with an example.

The key to accessing this API is the decorator cudf.core.udf.pipeline.nulludf:

[24]:
from cudf.core.udf.pipeline import nulludf

Let’s create a simple example DataFrame for demonstration purposes.

[25]:
df = cudf.DataFrame({
    'A': [1,2,3],
    'B': [4,cudf.NA,6]
})
df
[25]:
A B
0 1 4
1 2 <NA>
2 3 6

The entrypoint for UDFs used in this manner is cudf.DataFrame.apply. To use it, start by defining a completely standard Python function decorated with nulludf:

[26]:
@nulludf
def f(x, y):
    return x + y

Finally, call the function as you would in pandas - by using a lambda function to map the UDF onto “rows” of the DataFrame:

[27]:
df.apply(
    lambda row: f(
        row['A'],
        row['B']
    ),
    axis=1
)
[27]:
0       5
1    <NA>
2       9
dtype: int64

Advanced users might recognize that cuDF does not actually have a row object (a special type of Pandas series that behaves like a dict). The nulludf decorator is the key to making this work - it really just rearranges things nicely such that the API works in this way. The same function works the same way in pandas, except without the decorator of course:

[28]:
def g(x, y):
    return x + y

df.to_pandas(nullable=True).apply(
    lambda row: g(
        row['A'],
        row['B']
    ),
    axis=1
)
[28]:
0       5
1    <NA>
2       9
dtype: object

Notice that Pandas returns object dtype - see notes on this in the caveats section.

This API supports UDFs that interact with nulls in more complex ways, and leverages the cudf.NA singleton object in much the same manner as Pandas, allowing for more flexible functions. As a basic example, this function conditions on whether or not a value is NA and returns a scalar in that case:

[29]:
@nulludf
def f(x):
    if x is cudf.NA:
        return 0
    else:
        return x + 1

df = cudf.DataFrame({'a': [1, cudf.NA, 3]})
df
[29]:
a
0 1
1 <NA>
2 3
[30]:
df.apply(lambda row: f(row['a']), axis=1)
[30]:
0    2
1    0
2    4
dtype: int64

cudf.NA can also be directly returned from a function, resulting in data that has the correct nulls in the end, just as if it were run in Pandas. For the following data, the last row fulfills the condition that 3 + 1 > 3 and returns NA for that row:

[31]:
@nulludf
def f(x, y):
    if x + y > 3:
        return cudf.NA
    else:
        return x + y

df = cudf.DataFrame({
    'a': [1, 2, 3],
    'b': [2, 1, 1]
})
df
[31]:
a b
0 1 2
1 2 1
2 3 1
[32]:
df.apply(lambda row: f(row['a'], row['b']), axis=1)
[32]:
0       3
1       3
2    <NA>
dtype: int64

Mixed types are allowed, but will return the common type, rather than object as in Pandas. Here’s a null-aware op between an int and a float column:

[33]:
@nulludf
def f(x, y):
    return x + y

df = cudf.DataFrame({
    'a': [1, 2, 3],
    'b': [0.5, cudf.NA, 3.14]
})
df
[33]:
a b
0 1 0.5
1 2 <NA>
2 3 3.14
[34]:
df.apply(lambda row: f(row['a'], row['b']), axis=1)
[34]:
0     1.5
1    <NA>
2    6.14
dtype: float64

Functions may also return scalar values; however, the result will be promoted to a safe type regardless of the data. This means that even if you have a function like:

def f(x):
    if x > 1000:
        return 1.5
    else:
        return 2

And your data is:

[1,2,3,4,5]

You will get floats in the final data even though a float is never returned. This is because Numba ultimately needs to produce one function that can handle any data, which means if there’s any possibility a float could result, you must always assume it will happen. Here’s an example of a function that returns a scalar in some cases:

[35]:
@nulludf
def f(x):
    if x > 3:
        return x
    else:
        return 1.5

df = cudf.DataFrame({
    'a': [1, 3, 5]
})
df
[35]:
a
0 1
1 3
2 5
[36]:
df.apply(lambda row: f(row['a']), axis=1)
[36]:
0    1.5
1    1.5
2    5.0
dtype: float64
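
The promotion rule can be pictured with a toy model in plain Python. This is not Numba’s type inference; the helper unify is hypothetical and assumes only int and float can occur. The point it illustrates is that the output type is chosen from the return types of all branches before any data is seen, so even data that only ever takes the integer branch comes back as floats.

```python
def unify(branch_types):
    """Pick one output type able to hold every branch's result (toy model)."""
    # Assumption for this sketch: only int and float occur, and float wins.
    return float if float in branch_types else int

def f(x):
    if x > 3:
        return x      # int for integer input
    else:
        return 1.5    # float literal

# Both branches are considered, whatever the data turns out to be.
out_type = unify([int, float])

# Data that only ever takes the integer branch still yields floats.
result = [out_type(f(x)) for x in [4, 5, 6]]
print(out_type.__name__, result)  # float [4.0, 5.0, 6.0]
```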

Any number of columns and many arithmetic operators are supported, allowing for complex UDFs:

[37]:
@nulludf
def f(v, w, x, y, z):
    return x + (y - (z / w)) % v

df = cudf.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [cudf.NA, 4, 4],
    'd': [8, 7, 8],
    'e': [7, 1, 6]
})
df
[37]:
a b c d e
0 1 4 <NA> 8 7
1 2 5 4 7 1
2 3 6 4 8 6
[38]:
df.apply(
    lambda row: f(
        row['a'],
        row['b'],
        row['c'],
        row['d'],
        row['e']
    ),
    axis=1
)
[38]:
0    <NA>
1     4.8
2     5.0
dtype: float64
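
To see how a null in any operand carries through a chain of operators, here is a CPU-only toy model in which every scalar is a (value, valid) pair. The Masked class below is a sketch written for this guide under that assumption; it is not cuDF’s internal representation, and None stands in for cudf.NA in the input rows.

```python
import math

class Masked:
    """Toy (value, valid) pair; not cuDF's internal representation."""
    def __init__(self, value, valid=True):
        self.value, self.valid = value, valid

    def _binary(self, other, op):
        if not (self.valid and other.valid):
            return Masked(0, valid=False)     # any null operand gives a null result
        return Masked(op(self.value, other.value))

    def __add__(self, o): return self._binary(o, lambda a, b: a + b)
    def __sub__(self, o): return self._binary(o, lambda a, b: a - b)
    def __truediv__(self, o): return self._binary(o, lambda a, b: a / b)
    def __mod__(self, o): return self._binary(o, lambda a, b: a % b)

def f(v, w, x, y, z):
    return x + (y - (z / w)) % v

# Rows of columns a..e from above; None marks the null in column c.
rows = [(1, 4, None, 8, 7), (2, 5, 4, 7, 1), (3, 6, 4, 8, 6)]
results = []
for row in rows:
    args = [Masked(0, False) if c is None else Masked(c) for c in row]
    r = f(*args)
    results.append(r.value if r.valid else None)

print(results[0])                       # None
print(math.isclose(results[1], 4.8))    # True
print(results[2])                       # 5.0
```

The function f itself is unchanged from the UDF above. Only the operand type differs, which mirrors how the same Python source can be compiled to operate on data and masks together.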

Caveats

  • Only numeric nondecimal scalar types are currently supported, but strings and structured types are in planning. Attempting to use this API with those types will throw a TypeError.

  • Due to some more recent CUDA features being leveraged in the pipeline, support for CUDA 11.0 is currently unavailable. In particular, the 11.1+ toolkit will be needed; otherwise the API will raise an error.

  • We do not yet fully support all arithmetic operators. Certain ops like bitwise operations are not currently implemented, but are planned for future releases. If an operator is needed, a GitHub issue should be raised so that it can be properly prioritized and implemented.

  • Due to limitations in the way Numba’s output is currently runtime compiled, we can’t yet support certain functions:

    • pow

    • sin

    • cos

    • tan

Attempting to use these functions inside a UDF will result in an NVRTC error.

Summary

This guide has covered a lot of content. At this point, you should hopefully feel comfortable writing UDFs (with or without null values) that operate on

  • Series

  • DataFrame

  • Rolling Windows

  • GroupBy DataFrames

  • CuPy NDArrays

  • Numba DeviceNDArrays

  • Generalized NA UDFs

For more information, please see the cuDF, Numba.cuda, and CuPy documentation.