cudf.DataFrame.apply
- DataFrame.apply(func, axis=1, raw=False, result_type=None, args=(), by_row: Literal[False, 'compat'] = 'compat', engine: Literal['python', 'numba'] = 'python', engine_kwargs: dict[str, bool] | None = None, **kwargs)
Apply a function along an axis of the DataFrame.
apply relies on Numba to JIT compile func. Thus the allowed operations within func are limited to those supported by the CUDA Python Numba target. For more information, see the cuDF guide to user defined functions.

Some string functions and methods are supported. Refer to the guide to UDFs for details.
- Parameters:
- func: function
Function to apply to each row.
- axis: {0 or ‘index’, 1 or ‘columns’}, default 1
Axis along which the function is applied.
  - 0 or ‘index’: apply function to each column (not yet supported).
  - 1 or ‘columns’: apply function to each row.
- raw: bool, default False
Not yet supported
- result_type: {‘expand’, ‘reduce’, ‘broadcast’, None}, default None
Not yet supported
- args: tuple
Positional arguments to pass to func in addition to the dataframe.
- by_row: False or “compat”, default “compat”
Only has an effect when func is a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will if possible first translate the func into pandas methods (e.g. Series().apply(np.sum) will be translated to Series().sum()). If that doesn’t work, will try calling apply again with by_row=True, and if that fails, will call apply again with by_row=False (backward compatible). If False, the funcs will be passed the whole Series at once.

Currently not supported.
- engine: {‘python’, ‘numba’}, default ‘python’
Unused. Added for compatibility with pandas.
- engine_kwargs: dict
Unused. Added for compatibility with pandas.
- **kwargs
Additional keyword arguments to pass to func.
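To make the role of args concrete, here is a minimal pure-Python model of row-wise application with extra positional arguments. It is a sketch only: apply_rows and add_offset are hypothetical names, rows are modeled as plain dicts, and nothing here reflects how cuDF compiles or executes the function. Going by the signature above, the corresponding cuDF call would presumably be df.apply(add_offset, axis=1, args=(10,)).

```python
# Pure-Python model of row-wise apply with extra positional arguments.
# apply_rows is a hypothetical helper, not part of cuDF.
def apply_rows(columns, func, args=()):
    """Call func(row, *args) once per row; row maps column name to value."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    return [
        func({name: columns[name][i] for name in names}, *args)
        for i in range(n_rows)
    ]


def add_offset(row, offset):
    # The extra argument arrives after the row itself.
    return row["a"] + offset


result = apply_rows({"a": [1, 2, 3]}, add_offset, args=(10,))
print(result)  # [11, 12, 13]
```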
Examples
Simple function of a single variable which could be NA:
>>> def f(row):
...     if row['a'] is cudf.NA:
...         return 0
...     else:
...         return row['a'] + 1
...
>>> df = cudf.DataFrame({'a': [1, cudf.NA, 3]})
>>> df.apply(f, axis=1)
0    2
1    0
2    4
dtype: int64
Functions of multiple variables operate in a null-aware manner:
>>> def f(row):
...     return row['a'] - row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, cudf.NA, 3, cudf.NA],
...     'b': [5, 6, cudf.NA, cudf.NA]
... })
>>> df.apply(f)
0      -4
1    <NA>
2    <NA>
3    <NA>
dtype: int64
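The null-aware behavior in this example can be modeled in plain Python. The sketch below is not cuDF code: NA is a hypothetical stand-in for cudf.NA, and null_aware_sub is an illustrative helper that spells out the rule "if either operand is missing, the result is missing".

```python
# Pure-Python model of null-aware subtraction; this is NOT cuDF code.
NA = None  # hypothetical stand-in for cudf.NA


def null_aware_sub(a, b):
    """Return a - b, or NA if either operand is NA."""
    if a is NA or b is NA:
        return NA
    return a - b


columns = {
    "a": [1, NA, 3, NA],
    "b": [5, 6, NA, NA],
}

# Walk the two columns row by row, as axis=1 does conceptually.
result = [null_aware_sub(a, b) for a, b in zip(columns["a"], columns["b"])]
print(result)  # [-4, None, None, None]
```

Only the first row has both operands present, so it is the only row with a non-null result.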
Functions may conditionally return NA as in pandas:
>>> def f(row):
...     if row['a'] + row['b'] > 3:
...         return cudf.NA
...     else:
...         return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [2, 1, 1]
... })
>>> df.apply(f, axis=1)
0       3
1       3
2    <NA>
dtype: int64
Mixed types are allowed, but will return the common type, rather than object as in pandas:
>>> def f(row):
...     return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [0.5, cudf.NA, 3.14]
... })
>>> df.apply(f, axis=1)
0     1.5
1    <NA>
2    6.14
dtype: float64
Functions may also return scalar values; however, the result will be promoted to a safe type regardless of the data:
>>> def f(row):
...     if row['a'] > 3:
...         return row['a']
...     else:
...         return 1.5
...
>>> df = cudf.DataFrame({
...     'a': [1, 3, 5]
... })
>>> df.apply(f, axis=1)
0    1.5
1    1.5
2    5.0
dtype: float64
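One way to read "regardless of the data" is that the output type is decided by the branches of the function, not by which branch each row happens to take. The following pure-Python sketch models that idea under this assumption; unify is a hypothetical helper, and the code does not reproduce how Numba or cuDF actually infer types.

```python
# Pure-Python model of return-type unification; NOT how cuDF or Numba work internally.
def unify(branch_types):
    """Pick one output type for all rows: float if any branch can return a float."""
    return float if float in branch_types else int


def f(a):
    if a > 3:
        return a    # int branch
    else:
        return 1.5  # float branch


# The set of branch types is a property of the function, not of the data.
out_type = unify({int, float})

result = [out_type(f(a)) for a in [1, 3, 5]]
print(result)  # [1.5, 1.5, 5.0]

# Even when every row takes the int branch, the output is still float.
all_int_branch = [out_type(f(a)) for a in [4, 5]]
print(all_int_branch)  # [4.0, 5.0]
```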
Operations against N columns are generally supported:
>>> def f(row):
...     v, w, x, y, z = (
...         row['a'], row['b'], row['c'], row['d'], row['e']
...     )
...     return x + (y - (z / w)) % v
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [4, 5, 6],
...     'c': [cudf.NA, 4, 4],
...     'd': [8, 7, 8],
...     'e': [7, 1, 6]
... })
>>> df.apply(f, axis=1)
0    <NA>
1     4.8
2     5.0
dtype: float64
UDFs manipulating string data are allowed, as long as they neither modify strings in place nor create new strings. For example, the following UDF is allowed:
>>> def f(row):
...     st = row['str_col']
...     scale = row['scale']
...     if len(st) == 0:
...         return -1
...     elif st.startswith('a'):
...         return 1 - scale
...     elif 'example' in st:
...         return 1 + scale
...     else:
...         return 42
...
>>> df = cudf.DataFrame({
...     'str_col': ['', 'abc', 'some_example'],
...     'scale': [1, 2, 3]
... })
>>> df.apply(f, axis=1)
0   -1
1   -1
2    4
dtype: int64
However, the following UDF is not allowed since it includes an operation that requires the creation of a new string: a call to the upper method. Methods that are not supported in this manner will raise an AttributeError.
>>> def f(row):
...     st = row['str_col'].upper()
...     return 'ABC' in st
>>> df.apply(f, axis=1)
For a complete list of supported functions and methods that may be used to manipulate string data, see the UDF guide, <https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>