API Reference

DataFrame

class cudf.core.dataframe.DataFrame(data=None, index=None, columns=None)

A GPU Dataframe object.

Examples

Build dataframe with __setitem__ :

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build dataframe with initializer:

>>> import cudf
>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> ids = np.arange(5)

Create some datetime data

>>> t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
>>> datetimes = [(t0+ timedelta(seconds=x)) for x in range(5)]
>>> dts = np.array(datetimes, dtype='datetime64')

Create the GPU DataFrame

>>> df = cudf.DataFrame([('id', ids), ('datetimes', dts)])
>>> df
    id                datetimes
0    0  2018-10-07T12:00:00.000
1    1  2018-10-07T12:00:01.000
2    2  2018-10-07T12:00:02.000
3    3  2018-10-07T12:00:03.000
4    4  2018-10-07T12:00:04.000

Convert from a Pandas DataFrame:

>>> import pandas as pd
>>> import cudf
>>> pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
>>> df = cudf.from_pandas(pdf)
>>> df
  a b
0 0 0.1
1 1 0.2
2 2 nan
3 3 0.3
Attributes
T
columns

Returns a tuple of columns

dtypes

Return the dtypes in this object.

empty
iloc

Selecting rows and columns by position.

index

Returns the index of the DataFrame

loc

Selecting rows and columns by label or boolean mask.

ndim

Dimension of the data.

shape

Returns a tuple representing the dimensionality of the DataFrame.

values

Methods

add_column (self, name, data[, forceindex])

Add a column

apply_chunks (self, func, incols, outcols[, …])

Transform user-specified chunks using the user-provided function.

apply_rows (self, func, incols, outcols, kwargs)

Apply a row-wise user defined function.

as_gpu_matrix (self[, columns, order])

Convert to a matrix in device memory.

as_matrix (self[, columns])

Convert to a matrix in host memory.

assign (self, **kwargs)

Assign columns to DataFrame from keyword arguments.

at (self)

Alias for DataFrame.loc ; provided for compatibility with Pandas.

copy (self[, deep])

Returns a copy of this dataframe

describe (self[, percentiles, include, exclude])

Compute summary statistics of a DataFrame’s columns.

drop (self[, labels, axis, columns, errors])

Drop column(s)

drop_column (self, name)

Drop a column by name

drop_duplicates (self[, subset, keep, inplace])

Return DataFrame with duplicate rows removed, optionally only considering certain subset of columns.

dropna (self[, axis, how, subset, thresh])

Drops rows (or columns) containing nulls.

fillna (self, value[, method, axis, inplace, …])

Fill null values with value .

from_arrow (table)

Convert from a PyArrow Table.

from_gpu_matrix (data[, index, columns, …])

Convert from a numba gpu ndarray.

from_pandas (dataframe[, nan_as_null])

Convert from a Pandas DataFrame.

from_records (data[, index, columns, nan_as_null])

Convert from a numpy recarray or structured array.

groupby (self[, by, sort, as_index, method, …])

Groupby

hash_columns (self[, columns])

Hash the given columns and return a new Series

head (self[, n])

Returns the first n rows as a new DataFrame

iat (self)

Alias for DataFrame.iloc ; provided for compatibility with Pandas.

isna (self)

Identify missing values in a DataFrame.

isnull (self)

Identify missing values in a DataFrame.

iteritems (self)

Iterate over column names and series pairs

join (self, other[, on, how, lsuffix, …])

Join columns with other DataFrame on index or on a key column.

label_encoding (self, column, prefix, cats[, …])

Encode labels in a column with label encoding.

mean (self[, numeric_only])

Return the mean of the values for the requested axis.

melt (self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

merge (self, right[, on, how, left_on, …])

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

nans_to_nulls (self)

Convert nans (if any) to nulls.

nlargest (self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n largest values of columns

notna (self)

Identify non-missing values in a DataFrame.

notnull (self)

Identify non-missing values in a DataFrame.

nsmallest (self, n, columns[, keep])

Get the rows of the DataFrame sorted by the n smallest values of columns

one_hot_encoding (self, column, prefix, cats)

Expand a column with one-hot-encoding.

partition_by_hash (self, columns, nparts)

Partition the dataframe by the hashed value of data in columns .

pop (self, item)

Return a column and drop it from the DataFrame.

quantile (self[, q, axis, numeric_only, …])

Return values at the given quantile.

query (self, expr[, local_dict])

Query with a boolean expression using Numba to compile a GPU kernel.

reindex (self[, labels, axis, index, …])

Return a new DataFrame whose axes conform to a new index

rename (self[, mapper, columns, copy, inplace])

Alter column labels.

replace (self, to_replace, replacement)

Replace values given in to_replace with replacement .

rolling (self, window[, min_periods, center, …])

Rolling window calculations.

scatter_by_map (self, map_index[, map_size])

Scatter to a list of dataframes.

select_dtypes (self[, include, exclude])

Return a subset of the DataFrame’s columns based on the column dtypes.

set_index (self, index[, drop])

Return a new DataFrame with a new index

sort_index (self[, ascending])

Sort by the index

sort_values (self, by[, ascending, na_position])

Sort by the values row-wise.

tail (self[, n])

Returns the last n rows as a new DataFrame

to_arrow (self[, preserve_index])

Convert to a PyArrow Table.

to_csv (self[, path, sep, na_rep, columns, …])

Write a dataframe to csv file format.

to_dlpack (self)

Converts a cuDF object into a DLPack tensor.

to_feather (self, path, *args, **kwargs)

Write a DataFrame to the feather format.

to_gpu_matrix (self)

Convert to a numba gpu ndarray

to_hdf (self, path_or_buf, key, *args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json (self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_orc (self, fname[, compression])

Write a DataFrame to the ORC format.

to_pandas (self)

Convert to a Pandas DataFrame.

to_parquet (self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

to_records (self[, index])

Convert to a numpy recarray

to_string (self)

Convert to string

transpose (self)

Transpose index and columns.

acos

add

all

any

argsort

asin

atan

cos

count

cummax

cummin

cumprod

cumsum

deserialize

equals

exp

floordiv

get_renderable_dataframe

kurtosis

log

mask

max

min

mod

mul

pow

product

radd

repeat

reset_index

rfloordiv

rmod

rmul

rpow

rsub

rtruediv

serialize

sin

skew

sqrt

std

sub

sum

take

tan

truediv

var

add_column(self, name, data, forceindex=False)

Add a column

Parameters
name str

Name of column to be added.

data Series, array-like

Values to be added.
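
Examples

A minimal sketch (output omitted), assuming a plain list is accepted as array-like data:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [0, 1, 2]
>>> df.add_column('b', [10.0, 11.0, 12.0])  # add column 'b' with the given values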

apply_chunks(self, func, incols, outcols, kwargs={}, pessimistic_nulls=True, chunks=None, blkct=None, tpb=None)

Transform user-specified chunks using the user-provided function.

Parameters
df DataFrame

The source dataframe.

func function

The transformation function that will be executed on the CUDA GPU.

incols: list or dict

A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as {‘col1’: ‘arg1’}.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

Name-value pairs of extra arguments. These values are passed directly into the function.

pessimistic_nulls bool

Whether or not apply_rows output should be null when any corresponding input is null. If False, all outputs will be non-null, but will be the result of applying func against the underlying column data, which may be garbage.

chunks int or Series-like

If it is an int, it is the chunk size. If it is an array, it contains the integer offset of the start of each chunk. The span of the i-th chunk is data[chunks[i] : chunks[i + 1]] for any i + 1 < chunks.size, or data[chunks[i]:] for i == len(chunks) - 1.

tpb int; optional

The threads-per-block for the underlying kernel. If not specified (default), uses Numba's .forall(...) built-in to query the CUDA Driver API for an optimal kernel launch configuration. Specify 1 to emulate serial execution for each chunk; this is a good starting point but inefficient. The maximum possible value is limited by the available CUDA GPU resources.

blkct int; optional

The number of blocks for the underlying kernel. If neither blkct nor tpb is specified (default), uses Numba's .forall(...) built-in to query the CUDA Driver API for an optimal kernel launch configuration. If blkct is not specified but tpb is, uses the number of chunks as the number of blocks.

Examples

For tpb > 1 , func is executed by tpb threads concurrently. To access the thread id and count, use numba.cuda.threadIdx.x and numba.cuda.blockDim.x, respectively (see the Numba CUDA kernel documentation).

In the example below, the kernel is invoked concurrently on each specified chunk. The kernel computes the corresponding output for the chunk.

By looping over the range range(cuda.threadIdx.x, in1.size, cuda.blockDim.x) , the kernel function can be used with any tpb in an efficient manner.

>>> from numba import cuda
>>> @cuda.jit
... def kernel(in1, in2, in3, out1):
...      for i in range(cuda.threadIdx.x, in1.size, cuda.blockDim.x):
...          x = in1[i]
...          y = in2[i]
...          z = in3[i]
...          out1[i] = x * y + z
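
The kernel above can then be launched over chunks of a frame. A minimal sketch (output omitted); the column names, chunk size, and tpb value are illustrative:

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> df['in1'] = [float(i) for i in range(8)]
>>> df['in2'] = [float(i) for i in range(8)]
>>> df['in3'] = [float(i) for i in range(8)]
>>> outdf = df.apply_chunks(kernel,
...                         incols=['in1', 'in2', 'in3'],
...                         outcols=dict(out1=np.float64),
...                         chunks=4,   # two chunks of 4 rows each
...                         tpb=4)      # 4 threads per chunk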
apply_rows(self, func, incols, outcols, kwargs, pessimistic_nulls=True, cache_key=None)

Apply a row-wise user defined function.

Parameters
df DataFrame

The source dataframe.

func function

The transformation function that will be executed on the CUDA GPU.

incols: list or dict

A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as {‘col1’: ‘arg1’}.

outcols: dict

A dictionary of output column names and their dtype.

kwargs: dict

Name-value pairs of extra arguments. These values are passed directly into the function.

pessimistic_nulls bool

Whether or not apply_rows output should be null when any corresponding input is null. If False, all outputs will be non-null, but will be the result of applying func against the underlying column data, which may be garbage.

Examples

The user function should loop over the columns and set the output for each row. Loop execution order is arbitrary, so each iteration of the loop MUST be independent of the others.

When func is invoked, the array arguments corresponding to the input/output columns are strided so as to improve GPU parallelism. The loop in the function resembles serial code, but executes concurrently in multiple threads.

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> nelem = 3
>>> df['in1'] = np.arange(nelem)
>>> df['in2'] = np.arange(nelem)
>>> df['in3'] = np.arange(nelem)

Define input columns for the kernel

>>> in1 = df['in1']
>>> in2 = df['in2']
>>> in3 = df['in3']
>>> def kernel(in1, in2, in3, out1, out2, kwarg1, kwarg2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = kwarg2 * x - kwarg1 * y
...         out2[i] = y - kwarg1 * z

Call .apply_rows with the names of the input columns, the names and dtypes of the output columns, and, optionally, a dict of extra arguments.

>>> df.apply_rows(kernel,
...               incols=['in1', 'in2', 'in3'],
...               outcols=dict(out1=np.float64, out2=np.float64),
...               kwargs=dict(kwarg1=3, kwarg2=4))
   in1  in2  in3 out1 out2
0    0    0    0  0.0  0.0
1    1    1    1  1.0 -2.0
2    2    2    2  2.0 -4.0
as_gpu_matrix(self, columns=None, order='F')

Convert to a matrix in device memory.

Parameters
columns sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

order ‘F’ or ‘C’

Optional argument to determine whether to return a column major (Fortran) matrix or a row major (C) matrix.

Returns
A (nrow x ncol) numba device ndarray in “F” order.
as_matrix(self, columns=None)

Convert to a matrix in host memory.

Parameters
columns sequence of str

List of column names to be extracted. The order is preserved. If None is specified, all columns are used.

Returns
A (nrow x ncol) numpy ndarray in “F” order.
assign(self, **kwargs)

Assign columns to DataFrame from keyword arguments.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df = df.assign(a=[0, 1, 2], b=[3, 4, 5])
>>> print(df)
   a  b
0  0  3
1  1  4
2  2  5
at(self)

Alias for DataFrame.loc ; provided for compatibility with Pandas.

property columns

Returns a tuple of columns

copy(self, deep=True)

Returns a copy of this dataframe

Parameters
deep: bool

Make a full copy of Series columns and Index at the GPU level, or create a new allocation with references.

describe(self, percentiles=None, include=None, exclude=None)

Compute summary statistics of a DataFrame’s columns. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

include: str, list-like, optional

The dtypes to be included in the output summary statistics. Columns of dtypes not included in this list will not be part of the output. If include=’all’, all dtypes are included. Default of None includes all numeric columns.

exclude: str, list-like, optional

The dtypes to be excluded from the output summary statistics. Columns of dtypes included in this list will not be part of the output. Default of None excludes no columns.

Returns
output_frame DataFrame

Summary statistics of relevant columns in the original dataframe.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats   values
0  count     10.0
1   mean      5.5
2    std  3.02765
3    min      1.0
4    25%      2.5
5    50%      5.5
6    75%      7.5
7    max     10.0

Describing a DataFrame . By default all numeric fields are returned.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1,2,3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> gdf['d'] = [1.0, 2.0, 3.0]
>>> gdf['d'] = gdf['d'].astype('float32')
>>> print(gdf.describe())
   stats    a    b    d
0  count  3.0  3.0  3.0
1   mean  2.0  2.0  2.0
2    std  1.0  1.0  1.0
3    min  1.0  1.0  1.0
4    25%  1.5  1.5  1.5
5    50%  1.5  1.5  1.5
6    75%  2.5  2.5  2.5
7    max  3.0  3.0  3.0

Using the include keyword to describe only specific dtypes.

>>> gdf = cudf.DataFrame()
>>> gdf['a'] = [1,2,3]
>>> gdf['b'] = [1.0, 2.0, 3.0]
>>> gdf['c'] = ['x', 'y', 'z']
>>> print(gdf.describe(include='int'))
   stats    a
0  count  3.0
1   mean  2.0
2    std  1.0
3    min  1.0
4    25%  1.5
5    50%  1.5
6    75%  2.5
7    max  3.0
drop(self, labels=None, axis=None, columns=None, errors='raise')

Drop column(s)

Parameters
labels str or sequence of strings

Name of column(s) to be dropped.

axis {0 or ‘index’, 1 or ‘columns’}, default 0

Only axis=1 is currently supported.

columns: array of column names, the same as using labels and axis=1
errors {‘ignore’, ‘raise’}, default ‘raise’

This parameter is currently ignored.

Returns
A dataframe without dropped column(s)

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]
>>> df_new = df.drop('val')
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0
>>> print(df_new)
   key
0    0
1    1
2    2
3    3
4    4
drop_column(self, name)

Drop a column by name

drop_duplicates(self, subset=None, keep='first', inplace=False)

Return DataFrame with duplicate rows removed, optionally only considering certain subset of columns.
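
Examples

A minimal sketch (output omitted), using the default keep='first':

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
>>> deduped = df.drop_duplicates()           # keeps the first of each duplicate row
>>> by_a = df.drop_duplicates(subset=['a'])  # considers only column 'a'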

dropna(self, axis=0, how='any', subset=None, thresh=None)

Drops rows (or columns) containing nulls.

Parameters
axis {0, 1}, optional

Whether to drop rows (axis=0, default) or columns (axis=1) containing nulls.

how {“any”, “all”}, optional

Specifies how to decide whether to drop a row (or column). any (default) drops rows (or columns) containing at least one null value. all drops only rows (or columns) containing all null values.

subset list, optional

List of columns to consider when dropping rows (all columns are considered by default). Alternatively, when dropping columns, subset is a list of rows to consider.

thresh: int, optional

If specified, drops every row (or column) containing fewer than thresh non-null values.

Returns
Copy of the DataFrame with rows/columns containing nulls dropped.
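
Examples

A minimal sketch (output omitted), assuming a frame with nulls like the one in the fillna example below:

>>> import cudf
>>> gdf = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> kept_rows = gdf.dropna()        # drops rows 1 and 2, which each contain a null
>>> kept_cols = gdf.dropna(axis=1)  # drops both columns, since each contains a null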
property dtypes

Return the dtypes in this object.

fillna(self, value, method=None, axis=None, inplace=False, limit=None)

Fill null values with value .

Parameters
value scalar, Series-like or dict

Value to use to fill nulls. If Series-like, null values are filled with values in corresponding indices. A dict can be used to provide different values to fill nulls in different columns.

Returns
result DataFrame

Copy with nulls filled.

Examples

>>> import cudf
>>> gdf = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> gdf.fillna(4).to_pandas()
   a  b
0  1  3
1  2  4
2  4  5
>>> gdf.fillna({'a': 3, 'b': 4}).to_pandas()
   a  b
0  1  3
1  2  4
2  3  5
classmethod from_arrow(table)

Convert from a PyArrow Table.

Raises
TypeError for invalid input type.
Notes

Does not support automatically setting index column(s) similar to how to_pandas works for PyArrow Tables.

Examples

>>> import pyarrow as pa
>>> import cudf
>>> data = [pa.array([1, 2, 3]), pa.array([4, 5, 6])]
>>> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1'])
>>> table = pa.Table.from_batches([batch])
>>> cudf.DataFrame.from_arrow(table)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_gpu_matrix(data, index=None, columns=None, nan_as_null=False)

Convert from a numba gpu ndarray.

Parameters
data numba gpu ndarray
index str

The name of the index column in data . If None, the default index is used.

columns list of str

List of column names to include.

Returns
DataFrame
classmethod from_pandas(dataframe, nan_as_null=True)

Convert from a Pandas DataFrame.

Raises
TypeError for invalid input type.

Examples

>>> import cudf
>>> import pandas as pd
>>> data = [[0,1], [1,2], [3,4]]
>>> pdf = pd.DataFrame(data, columns=['a', 'b'], dtype=int)
>>> cudf.from_pandas(pdf)
<cudf.DataFrame ncols=2 nrows=3 >
classmethod from_records(data, index=None, columns=None, nan_as_null=False)

Convert from a numpy recarray or structured array.

Parameters
data numpy structured dtype or recarray of ndim=2
index str

The name of the index column in data . If None, the default index is used.

columns list of str

List of column names to include.

Returns
DataFrame
groupby(self, by=None, sort=True, as_index=True, method='hash', level=None, group_keys=True, dropna=True)

Groupby

Parameters
by list-of-str or str

Column name(s) to group by.

sort bool, default True

Force sorting group keys.

as_index bool, default True

Indicates whether the grouped-by columns become the index of the returned DataFrame.

method str, optional

A string indicating the method to use to perform the group by. Valid values are “hash” or “cudf”. “cudf” method may be deprecated in the future, but is currently the only method supporting group UDFs via the apply function.

dropna bool, optional

If True (default), drop null keys. If False, perform grouping by keys containing null(s).

Returns
The groupby object

Notes

No empty rows are returned. (For categorical keys, pandas returns rows for all categories even if there are no corresponding values.)
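
Examples

A minimal sketch (output omitted); aggregation methods such as mean() on the groupby object are assumed to behave as in pandas:

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 0, 1, 1], 'val': [1.0, 2.0, 3.0, 4.0]})
>>> means = df.groupby('key').mean()  # per-key mean of 'val'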

hash_columns(self, columns=None)

Hash the given columns and return a new Series

Parameters
columns sequence of str; optional

Sequence of column names. If columns is None (unspecified), all columns in the frame are used.

head(self, n=5)

Returns the first n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.head(2))
   key   val
0    0  10.0
1    1  11.0
iat(self)

Alias for DataFrame.iloc ; provided for compatibility with Pandas.

property iloc

Selecting rows and columns by position.

Examples

>>> import cudf
>>> df = cudf.DataFrame([('a', list(range(20))),
...                 ('b', list(range(20))),
...                 ('c', list(range(20)))])

Select a single row using an integer index.

>>> print(df.iloc[1])
a    1
b    1
c    1

Select multiple rows using a list of integers.

>>> print(df.iloc[[0, 2, 9, 18]])
      a    b    c
 0    0    0    0
 2    2    2    2
 9    9    9    9
18   18   18   18

Select rows using a slice.

>>> print(df.iloc[3:10:2])
     a    b    c
3    3    3    3
5    5    5    5
7    7    7    7
9    9    9    9

Select both rows and columns.

>>> print(df.iloc[[1, 3, 5, 7], 2])
1    1
3    3
5    5
7    7
Name: c, dtype: int64

Setting values using iloc.

>>> df.iloc[:4] = 0
>>> print(df)
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]
property index

Returns the index of the DataFrame

isna(self)

Identify missing values in a DataFrame. Alias for isnull.

isnull(self)

Identify missing values in a DataFrame.

iteritems(self)

Iterate over column names and series pairs

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False, type='', method='hash')

Join columns with other DataFrame on index or on a key column.

Parameters
other DataFrame
how str

Only accepts “left”, “right”, “inner”, “outer”

lsuffix, rsuffix str

The suffixes to add to the left (lsuffix) and right (rsuffix) column names when avoiding conflicts.

sort bool

Set to True to ensure sorted ordering.

Returns
joined DataFrame

Notes

Difference from pandas:

  • other must be a single DataFrame for now.

  • on is not supported yet due to lack of multi-index support.
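
Examples

A minimal sketch of an index join (output omitted); suffixes are only needed when column names overlap:

>>> import cudf
>>> left = cudf.DataFrame({'a': [1.0, 2.0, 3.0]})
>>> right = cudf.DataFrame({'b': [10.0, 20.0]})
>>> joined = left.join(right, how='left')  # join on the default range indexes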

label_encoding(self, column, prefix, cats, prefix_sep='_', dtype=None, na_sentinel=-1)

Encode labels in a column with label encoding.

Parameters
column str

the source column with binary encoding for the data.

prefix str

the new column name prefix.

cats sequence of ints

the sequence of categories as integers.

prefix_sep str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; see Series.label_encoding

na_sentinel number

Value to indicate missing category.

Returns
A new dataframe with a new column appended for the coded values.
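
Examples

A minimal sketch (output omitted), assuming integer category codes in the source column:

>>> import cudf
>>> df = cudf.DataFrame({'species': [0, 1, 1, 2]})
>>> encoded = df.label_encoding(column='species', prefix='enc', cats=[0, 1, 2])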
property loc

Selecting rows and columns by label or boolean mask.

Examples

DataFrame with string index.

>>> print(df)
   a  b
a  0  5
b  1  6
c  2  7
d  3  8
e  4  9

Select a single row by label.

>>> print(df.loc['a'])
a    0
b    5
Name: a, dtype: int64

Select multiple rows and a single column.

>>> print(df.loc[['a', 'c', 'e'], 'b'])
a    5
c    7
e    9
Name: b, dtype: int64

Selection by boolean mask.

>>> print(df.loc[df.a > 2])
   a  b
d  3  8
e  4  9

Setting values using loc.

>>> df.loc[['a', 'c', 'e'], 'a'] = 0
>>> print(df)
   a  b
a  0  5
b  1  6
c  0  7
d  3  8
e  0  9
mean(self, numeric_only=None, **kwargs)

Return the mean of the values for the requested axis.

Parameters
axis {index (0), columns (1)}

Axis for the function to be applied on.

skipna bool, default True

Exclude NA/null values when computing the result.

level int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns
mean Series or DataFrame (if level specified)
melt(self, **kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame DataFrame
id_vars tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars .

var_name scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name str

Name to use for the value column. default: ‘value’

Returns
out DataFrame

Melted result

merge(self, right, on=None, how='inner', left_on=None, right_on=None, left_index=False, right_index=False, sort=False, lsuffix=None, rsuffix=None, type='', method='hash', indicator=False, suffixes=('_x', '_y'))

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

Parameters
right DataFrame
on label or list; defaults to None

Column or index level names to join on. These must be found in both DataFrames.

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

how {‘left’, ‘outer’, ‘inner’}, default ‘inner’

Type of merge to be performed.

  • left : use only keys from left frame, similar to a SQL left outer join; preserve key order.

  • right : not supported.

  • outer : use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

left_on label or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_on label or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_index bool, default False

Use the index from the left DataFrame as the join key(s).

right_index bool, default False

Use the index from the right DataFrame as the join key.

sort bool, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (see the how keyword).

suffixes: Tuple[str, str], defaults to (‘_x’, ‘_y’)

Suffixes applied to overlapping column names on the left and right sides

method {‘hash’, ‘sort’}, default ‘hash’

The implementation method to be used for the operation.

Returns
merged DataFrame

Examples

>>> import cudf
>>> df_a = cudf.DataFrame()
>>> df_a['key'] = [0, 1, 2, 3, 4]
>>> df_a['vals_a'] = [float(i + 10) for i in range(5)]
>>> df_b = cudf.DataFrame()
>>> df_b['key'] = [1, 2, 4]
>>> df_b['vals_b'] = [float(i+10) for i in range(3)]
>>> df_merged = df_a.merge(df_b, on=['key'], how='left')
>>> df_merged.sort_values('key')  
   key  vals_a  vals_b
3    0    10.0
0    1    11.0    10.0
1    2    12.0    11.0
4    3    13.0
2    4    14.0    12.0
nans_to_nulls(self)

Convert nans (if any) to nulls.

property ndim

Dimension of the data. DataFrame ndim is always 2.

nlargest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n largest values of columns

Notes

Difference from pandas:
  • Only a single column is supported in columns
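
Examples

A minimal sketch (output omitted), assuming the single sort column is passed by name:

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 5, 3], 'b': [10.0, 20.0, 30.0]})
>>> top2 = df.nlargest(2, 'a')  # the two rows with the largest values in 'a'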

notna(self)

Identify non-missing values in a DataFrame.

notnull(self)

Identify non-missing values in a DataFrame. Alias for notna.

nsmallest(self, n, columns, keep='first')

Get the rows of the DataFrame sorted by the n smallest values of columns

Notes

Difference from pandas:
  • Only a single column is supported in columns

one_hot_encoding(self, column, prefix, cats, prefix_sep='_', dtype='float64')

Expand a column with one-hot-encoding.

Parameters
column str

the source column with binary encoding for the data.

prefix str

the new column name prefix.

cats sequence of ints

the sequence of categories as integers.

prefix_sep str

the separator between the prefix and the category.

dtype :

the dtype for the outputs; defaults to float64.

Returns
a new dataframe with new columns appended for each category.

Examples

>>> import pandas as pd
>>> import cudf
>>> pet_owner = [1, 2, 3, 4, 5]
>>> pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
>>> df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
>>> df.pet_type = df.pet_type.astype('category')

Create a column with numerically encoded category values

>>> df['pet_codes'] = df.pet_type.cat.codes
>>> gdf = cudf.from_pandas(df)

Create the list of category codes to use in the encoding

>>> codes = gdf.pet_codes.unique()
>>> gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes).head()
  pet_owner  pet_type  pet_codes  pet_dummy_0  pet_dummy_1  pet_dummy_2
0         1      fish          2          0.0          0.0          1.0
1         2       dog          1          0.0          1.0          0.0
2         3      fish          2          0.0          0.0          1.0
3         4      bird          0          1.0          0.0          0.0
4         5      fish          2          0.0          0.0          1.0
partition_by_hash(self, columns, nparts)

Partition the dataframe by the hashed value of data in columns .

Parameters
columns sequence of str

The names of the columns to be hashed. Must have at least one name.

nparts int

Number of output partitions

Returns
partitioned: list of DataFrame
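
Examples

A minimal sketch (output omitted):

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 1, 2, 3, 4], 'val': [5, 6, 7, 8, 9]})
>>> parts = df.partition_by_hash(['key'], nparts=2)  # a list of 2 DataFrames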
pop(self, item)

Return a column and drop it from the DataFrame.

quantile(self, q=0.5, axis=0, numeric_only=True, interpolation='linear', columns=None, exact=True)

Return values at the given quantile.

Parameters
q float or array-like

0 <= q <= 1, the quantile(s) to compute

axis int

axis is a NON-FUNCTIONAL parameter

numeric_only boolean

numeric_only is a NON-FUNCTIONAL parameter

interpolation {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j. Default ‘linear’.

columns list of str

List of column names to include.

exact boolean

Whether to use the exact quantile algorithm (True) or an approximate one (False).

Returns
DataFrame
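
Examples

A minimal sketch (output omitted):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [10.0, 20.0, 30.0, 40.0]})
>>> med = df.quantile(0.5)               # median of each column
>>> quartiles = df.quantile([0.25, 0.75])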
query(self, expr, local_dict={})

Query with a boolean expression using Numba to compile a GPU kernel.

See pandas.DataFrame.query.

Parameters
expr str

A boolean expression. Names in expression refer to columns.

Names starting with @ refer to Python variables.

An output value will be null if any of the input values are null regardless of expression.

local_dict dict

A dictionary containing the local variables to be used in the query.

Returns
filtered DataFrame

Examples

>>> import cudf
>>> a = ('a', [1, 2, 2])
>>> b = ('b', [3, 4, 5])
>>> df = cudf.DataFrame([a, b])
>>> expr = "(a == 2 and b == 4) or (b == 3)"
>>> print(df.query(expr))
   a  b
0  1  3
1  2  4

DateTime conditionals:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date'))
                datetimes
1 2018-10-08T00:00:00.000

Using local_dict:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date2 = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> print(df.query('datetimes==@search_date',
...         local_dict={'search_date':search_date2}))
                datetimes
1 2018-10-08T00:00:00.000
reindex(self, labels=None, axis=0, index=None, columns=None, copy=True)

Return a new DataFrame whose axes conform to a new index

DataFrame.reindex supports two calling conventions:
  • (index=index_labels, columns=column_names)
  • (labels, axis={0 or 'index', 1 or 'columns'})

Parameters
labels Index, Series-convertible, optional, default None
axis {0 or ‘index’, 1 or ‘columns’}, optional, default 0
index Index, Series-convertible, optional, default None

Shorthand for df.reindex(labels=index_labels, axis=0)

columns array-like, optional, default None

Shorthand for df.reindex(labels=column_names, axis=1)

copy boolean, optional, default True
Returns
A DataFrame whose axes conform to the new index(es)

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]
>>> df_new = df.reindex(index=[0, 3, 4, 5],
...                     columns=['key', 'val', 'sum'])
>>> print(df)
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0
>>> print(df_new)
   key   val  sum
0    0  10.0  NaN
3    3  13.0  NaN
4    4  14.0  NaN
5   -1   NaN  NaN
rename(self, mapper=None, columns=None, copy=True, inplace=False)

Alter column labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Parameters
mapper, columns dict-like or function, optional

Dict-like or function transformations to apply to the column axis’ values.

copy boolean, default True

Also copy underlying data

inplace: boolean, default False

If False (default), return a new DataFrame. If True, assign columns without copying.

Returns
DataFrame

Notes

Difference from pandas:
  • Support axis=’columns’ only.

  • Not supporting: index, level

Rename will not overwrite column names. If a list with duplicates is passed, column names will be postfixed.
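
Examples

A minimal sketch (output omitted):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> renamed = df.rename(columns={'a': 'x', 'b': 'y'})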

replace(self, to_replace, replacement)

Replace values given in to_replace with replacement .

Parameters
to_replace numeric, str, list-like or dict

Value(s) to replace.

  • numeric or str:

    • values equal to to_replace will be replaced with replacement

  • list of numeric or str:

    • If replacement is also list-like, to_replace and replacement must be of same length.

  • dict:

    • Dicts can be used to replace different values in different columns. For example, {‘a’: 1, ‘z’: 2} specifies that the value 1 in column a and the value 2 in column z should be replaced with replacement.

replacement numeric, str, list-like, or dict

Value(s) to replace to_replace with. If a dict is provided, then its keys must match the keys in to_replace , and corresponding values must be compatible (e.g., if they are lists, then they must match in length).

Returns
result DataFrame

DataFrame after replacement.
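
Examples

A minimal sketch (output omitted):

>>> import cudf
>>> df = cudf.DataFrame({'a': [0, 1, 2], 'b': [0, 1, 2]})
>>> scalar_replaced = df.replace(0, 5)            # replace every 0 with 5
>>> list_replaced = df.replace([0, 1], [10, 11])  # element-wise list replacement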

rolling(self, window, min_periods=None, center=False, axis=0, win_type=None)

Rolling window calculations.

Parameters
window int or offset

Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset.

min_periods int, optional

The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None , min_periods is equal to the window size.

center bool, optional

If True , the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns
Rolling object.

Examples

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> import numpy as np
>>> import pandas as pd
>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
scatter_by_map(self, map_index, map_size=None)

Scatter to a list of dataframes.

Uses map_index to determine the destination of each row of the original DataFrame.

Parameters
map_index Series, str or list-like

Scatter assignment for each row

map_size int

Length of the output list. Must be >= the number of unique values in map_index.

Returns
A list of cudf.DataFrame objects.
select_dtypes(self, include=None, exclude=None)

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters
include str or list

which columns to include based on dtypes

exclude str or list

which columns to exclude based on dtypes
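
Examples

A minimal sketch (output omitted):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2], 'b': [1.0, 2.0]})
>>> floats = df.select_dtypes(include=['float64'])  # only column 'b'
>>> others = df.select_dtypes(exclude=['float64'])  # only column 'a'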

set_index(self, index, drop=True)

Return a new DataFrame with a new index

Parameters
index Index, Series-convertible, or str

Index: the new index. Series-convertible: values for the new index. str: name of the column to be used as the index.

drop boolean

whether to drop corresponding column for str index argument
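
Examples

A minimal sketch (output omitted), passing a column name as the new index:

>>> import cudf
>>> df = cudf.DataFrame({'key': [10, 20, 30], 'val': [1.0, 2.0, 3.0]})
>>> indexed = df.set_index('key')  # 'key' becomes the index and is dropped as a column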

property shape

Returns a tuple representing the dimensionality of the DataFrame.

sort_index(self, ascending=True)

Sort by the index

sort_values(self, by, ascending=True, na_position='last')

Sort by the values row-wise.

Parameters
by str or list of str

Name or list of names to sort by.

ascending bool or list of bool, default True

Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

na_position {‘first’, ‘last’}, default ‘last’

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end

Returns
sorted_obj cuDF DataFrame

Notes

Difference from pandas:
  • Support axis=’index’ only.

  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> print(df.sort_values('b'))
   a  b
0  0 -3
2  2  0
1  1  2
tail(self, n=5)

Returns the last n rows as a new DataFrame

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> print(df.tail(2))
   key   val
3    3  13.0
4    4  14.0
to_arrow(self, preserve_index=True)

Convert to a PyArrow Table.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> df.to_arrow()
pyarrow.Table
None: int64
a: int64
b: int64
to_csv(self, path=None, sep=',', na_rep='', columns=None, header=True, index=True, line_terminator='\n', chunksize=None)

Write a dataframe to csv file format.

Parameters
df DataFrame

DataFrame object to be written to csv

path str, default None

Path of file where DataFrame will be written

sep char, default ‘,’

Delimiter to be used.

na_rep str, default ‘’

String to use for null entries

columns list of str, optional

Columns to write

header bool, default True

Write out the column names

index bool, default True

Write out the index as a column

line_terminator char, default ‘\n’
chunksize int or None, default None

Rows to write at a time

Notes

  • Follows the standard of Pandas csv.QUOTE_NONNUMERIC for all output.

  • If to_csv leads to memory errors consider setting the chunksize argument.

Examples

Write a dataframe to csv.

>>> import cudf
>>> filename = 'foo.csv'
>>> df = cudf.DataFrame({'x': [0, 1, 2, 3],
...                      'y': [1.0, 3.3, 2.2, 4.4],
...                      'z': ['a', 'b', 'c', 'd']})
>>> df = df.set_index([3, 2, 1, 0])
>>> df.to_csv(filename)
to_dlpack(self)

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack .

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj DataFrame, Series, Index, or Column
Returns
pycapsule_obj PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_feather(self, path, *args, **kwargs)

Write a DataFrame to the feather format.

Parameters
path str

File path

to_gpu_matrix(self)

Convert to a numba gpu ndarray

Returns
numba gpu ndarray
to_hdf(self, path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the user guide .

Parameters
path_or_buf str or pandas.HDFStore

File path or HDFStore object.

key str

Identifier for the group in the store.

mode {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append bool, default False

For Table formats, append the input data to the existing.

data_columns list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns . Applicable only to format=’table’.

complevel {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 bool, default False

If applying compression use the fletcher32 checksum.

dropna bool, default False

If true, ALL nan rows will not be written to store.

errors str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.

to_json(self, path_or_buf=None, *args, **kwargs)

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient string

Indication of expected JSON string format.

  • Series
    • default is ‘index’

    • allowed values are: {‘split’,’records’,’index’,’table’}

  • DataFrame
    • default is ‘columns’

    • allowed values are: {‘split’,’records’,’index’,’columns’,’values’,’table’}

  • The format of the JSON string
    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records' .

date_format {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient . For orient='table' , the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii bool, default True

Force encoded string to be ASCII.

date_unit string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines bool, default False

If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like.

compression {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index bool, default True

Whether to include the index values in the JSON string. Not including the index ( index=False ) is only supported when orient is ‘split’ or ‘table’.

to_orc(self, fname, compression=None, *args, **kwargs)

Write a DataFrame to the ORC format.

Parameters
fname str

File path or object where the ORC dataset will be stored.

compression {‘snappy’, None}, default None

Name of the compression to use. Use None for no compression.

to_pandas(self)

Convert to a Pandas DataFrame.

Examples

>>> import cudf
>>> a = ('a', [0, 1, 2])
>>> b = ('b', [-3, 2, 0])
>>> df = cudf.DataFrame([a, b])
>>> type(df.to_pandas())
<class 'pandas.core.frame.DataFrame'>
to_parquet(self, path, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

index bool, default None

If True , include the dataframe’s index(es) in the file output. If False , they will not be written to the file. If None , the engine’s default behavior will be used.

partition_cols list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.

to_records(self, index=True)

Convert to a numpy recarray

Parameters
index bool

Whether to include the index in the output.

Returns
numpy recarray
to_string(self)

Convert to string

cuDF uses Pandas internals for efficient string formatting. Set formatting options using pandas string formatting options and cuDF objects will print identically to Pandas objects.

cuDF supports null/None as a value in any column type, which is transparently supported during this output process.

Examples

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
'   key   val\n0    0  10.0\n1    1  11.0\n2    2  12.0'
transpose(self)

Transpose index and columns.

Returns
a new (ncol x nrow) dataframe. self is (nrow x ncol)

Notes

Difference from pandas: copy is not supported because the default and only behaviour is copy=True.

cudf.core.reshape.concat(objs, axis=0, ignore_index=False, sort=None)

Concatenate DataFrames, Series, or Indices row-wise.

Parameters
objs list of DataFrame, Series, or Index
axis {0/’index’, 1/’columns’}, default 0

The axis to concatenate along.

ignore_index bool, default False

Set True to ignore the index of the objs and provide a default range index instead.

Returns
A new object of like type with rows from each object in objs .
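
Examples

A minimal sketch (output omitted), calling the function through the top-level cudf namespace:

>>> import cudf
>>> a = cudf.DataFrame({'x': [0, 1]})
>>> b = cudf.DataFrame({'x': [2, 3]})
>>> combined = cudf.concat([a, b], ignore_index=True)  # four rows, default range index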
cudf.core.reshape.get_dummies(df, prefix=None, prefix_sep='_', dummy_na=False, columns=None, cats={}, sparse=False, drop_first=False, dtype='int8')

Returns a dataframe whose columns are the one hot encodings of all columns in df

Parameters
df cudf.DataFrame

dataframe to encode

prefix str, dict, or sequence, optional

prefix to append. Either a str (to apply a constant prefix), dict mapping column names to prefixes, or sequence of prefixes to apply with the same length as the number of columns. If not supplied, defaults to the empty string

prefix_sep str, dict, or sequence, optional, default ‘_’

separator to use when appending prefixes

dummy_na boolean, optional

Right now this is a NON-FUNCTIONAL argument in RAPIDS.

cats dict, optional

dictionary mapping column names to sequences of integers representing that column’s category. See cudf.DataFrame.one_hot_encoding for more information. If not supplied, it will be computed.

sparse boolean, optional

Right now this is a NON-FUNCTIONAL argument in RAPIDS.

drop_first boolean, optional

Right now this is a NON-FUNCTIONAL argument in RAPIDS.

columns sequence of str, optional

Names of columns to encode. If not provided, will attempt to encode all columns. Note this is different from pandas default behavior, which encodes all columns with dtype object or categorical

dtype str, optional

output dtype, default ‘int8’
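
Examples

A minimal sketch (output omitted), assuming the function is also exposed as cudf.get_dummies:

>>> import cudf
>>> df = cudf.DataFrame({'code': [0, 1, 2, 1]})
>>> encoded = cudf.get_dummies(df, prefix='code', columns=['code'])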

cudf.core.reshape.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters
frame DataFrame
id_vars tuple, list, or ndarray, optional

Column(s) to use as identifier variables. default: None

value_vars tuple, list, or ndarray, optional

Column(s) to unpivot. default: all columns that are not set as id_vars .

var_name scalar

Name to use for the variable column. default: frame.columns.name or ‘variable’

value_name str

Name to use for the value column. default: ‘value’

Returns
out DataFrame

Melted result

Difference from pandas:
  • Does not support ‘col_level’ because cuDF does not have multi-index

Examples

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({'A': {0: 1, 1: 1, 2: 5},
...                      'B': {0: 1, 1: 3, 2: 6},
...                      'C': {0: 1.0, 1: np.nan, 2: 4.0},
...                      'D': {0: 2.0, 1: 5.0, 2: 6.0}})
>>> cudf.melt(frame=df, id_vars=['A', 'B'], value_vars=['C', 'D'])
     A    B variable value
0    1    1        C   1.0
1    1    3        C
2    5    6        C   4.0
3    1    1        D   2.0
4    1    3        D   5.0
5    5    6        D   6.0

Series

class cudf.core.series.Series(data=None, index=None, name=None, nan_as_null=True, dtype=None)

Data and null-masks.

Series objects are used as columns of DataFrame .

Attributes
cat
data

The gpu buffer for the data

dt
dtype

dtype of the Series

empty
has_null_mask

A boolean indicating whether a null-mask is needed

iloc

Select values by position.

index

The index object

is_monotonic
is_monotonic_decreasing
is_monotonic_increasing
is_unique
loc

Select values by label.

name

Returns name of the Series.

ndim

Dimension of the data.

null_count

Number of null values

nullmask

The gpu buffer for the null-mask

shape

Returns a tuple representing the dimensionality of the Series.

str
valid_count

Number of non-null values

values
values_host

Methods

abs (self)

Absolute value of each element of the series.

add (self, other[, fill_value])

Addition of series and other, element-wise (binary operator add).

append (self, other[, ignore_index])

Append values from another Series or array-like object.

applymap (self, udf[, out_dtype])

Apply an elementwise function to transform the values in the Column.

argsort (self[, ascending, na_position])

Returns a Series of int64 index that will sort the series.

as_mask (self)

Convert booleans to bitmask

astype (self, dtype, **kwargs)

Cast the Series to the given dtype

ceil (self)

Rounds each value upward to the smallest integral value not less than the original.

corr (self, other[, method, min_periods])

Calculates the sample correlation between two Series, excluding missing values.

count (self[, axis, skipna])

The number of non-null values

cov (self, other[, min_periods])

Calculates the sample covariance between two Series, excluding missing values.

cummax (self[, axis, skipna])

Compute the cumulative maximum of the series

cummin (self[, axis, skipna])

Compute the cumulative minimum of the series

cumprod (self[, axis, skipna])

Compute the cumulative product of the series

cumsum (self[, axis, skipna])

Compute the cumulative sum of the series

describe (self[, percentiles, include, exclude])

Compute summary statistics of a Series.

diff (self[, periods])

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

digitize (self, bins[, right])

Return the indices of the bins to which each value in series belongs.

drop_duplicates (self[, keep, inplace])

Return Series with duplicate values removed

dropna (self)

Return a Series with null values removed.

eq (self, other[, fill_value])

Equal to of series and other, element-wise (binary operator eq).

factorize (self[, na_sentinel])

Encode the input values as integer labels

fillna (self, value[, method, axis, inplace, …])

Fill null values with value .

find_first_value (self, value)

Returns offset of first value that matches

find_last_value (self, value)

Returns offset of last value that matches

floor (self)

Rounds each value downward to the largest integral value not greater than the original.

floordiv (self, other[, fill_value])

Integer division of series and other, element-wise (binary operator floordiv).

from_categorical (categorical[, codes])

Creates from a pandas.Categorical

from_masked_array (data, mask[, null_count])

Create a Series with null-mask.

ge (self, other[, fill_value])

Greater than or equal to of series and other, element-wise (binary operator ge).

gt (self, other[, fill_value])

Greater than of series and other, element-wise (binary operator gt).

hash_encode (self, stop[, use_name])

Encode column values as ints in [0, stop) using hash function.

hash_values (self)

Compute the hash of values in this column.

isna (self)

Identify missing values in a Series.

isnull (self)

Identify missing values in a Series.

kurtosis (self[, axis, skipna, level, …])

Calculates Fisher’s unbiased kurtosis of a sample.

label_encoding (self, cats[, dtype, na_sentinel])

Perform label encoding

le (self, other[, fill_value])

Less than or equal to of series and other, element-wise (binary operator le).

lt (self, other[, fill_value])

Less than of series and other, element-wise (binary operator lt).

max (self[, axis, skipna, dtype])

Compute the max of the series

mean (self[, axis, skipna])

Compute the mean of the series

min (self[, axis, skipna, dtype])

Compute the min of the series

mod (self, other[, fill_value])

Modulo of series and other, element-wise (binary operator mod).

mul (self, other[, fill_value])

Multiplication of series and other, element-wise (binary operator mul).

nans_to_nulls (self)

Convert nans (if any) to nulls

ne (self, other[, fill_value])

Not equal to of series and other, element-wise (binary operator ne).

nlargest (self[, n, keep])

Returns a new Series of the n largest elements.

notna (self)

Identify non-missing values in a Series.

notnull (self)

Identify non-missing values in a Series.

nsmallest (self[, n, keep])

Returns a new Series of the n smallest elements.

nunique (self[, method, dropna])

Returns the number of unique values of the Series (approximate version; an exact version is to be moved to libgdf).

one_hot_encoding (self, cats[, dtype])

Perform one-hot-encoding

pow (self, other[, fill_value])

Exponential power of series and other, element-wise (binary operator pow).

product (self[, axis, skipna, dtype])

Compute the product of the series

quantile (self[, q, interpolation, exact, …])

Return values at the given quantile.

radd (self, other[, fill_value])

Addition of series and other, element-wise (binary operator radd).

reindex (self[, index, copy])

Return a Series that conforms to a new index

rename (self[, index, copy])

Alter Series name.

replace (self, to_replace, replacement)

Replace values given in to_replace with replacement .

reset_index (self[, drop])

Reset index to RangeIndex

reverse (self)

Reverse the Series

rfloordiv (self, other[, fill_value])

Integer division of series and other, element-wise (binary operator rfloordiv).

rmod (self, other[, fill_value])

Modulo of series and other, element-wise (binary operator rmod).

rmul (self, other[, fill_value])

Multiplication of series and other, element-wise (binary operator rmul).

rolling (self, window[, min_periods, center, …])

Rolling window calculations.

round (self[, decimals])

Round a Series to a configurable number of decimal places.

rpow (self, other[, fill_value])

Exponential power of series and other, element-wise (binary operator rpow).

rsub (self, other[, fill_value])

Subtraction of series and other, element-wise (binary operator rsub).

rtruediv (self, other[, fill_value])

Floating division of series and other, element-wise (binary operator rtruediv).

scale (self)

Scale values to [0, 1] in float64

searchsorted (self, value[, side])

Find indices where elements should be inserted to maintain order

set_index (self, index)

Returns a new Series with a different index.

set_mask (self, mask[, null_count])

Create new Series by setting a mask array.

shift (self[, periods, freq, axis, fill_value])

Shift values of an input array by periods positions and store the output in a new array.

skew (self[, axis, skipna, level, numeric_only])

Calculates the unbiased Fisher-Pearson skew of a sample.

sort_index (self[, ascending])

Sort by the index.

sort_values (self[, ascending, na_position])

Sort by the values.

std (self[, ddof, axis, skipna])

Compute the standard deviation of the series

sub (self, other[, fill_value])

Subtraction of series and other, element-wise (binary operator sub).

sum (self[, axis, skipna, dtype])

Compute the sum of the series

tail (self[, n])

Returns the last n rows as a new Series

take (self, indices[, ignore_index])

Return Series by taking values from the corresponding indices .

to_array (self[, fillna])

Get a dense numpy array for the data.

to_dlpack (self)

Converts a cuDF object into a DLPack tensor.

to_frame (self[, name])

Convert Series into a DataFrame

to_gpu_array (self[, fillna])

Get a dense numba device array for the data.

to_hdf (self, path_or_buf, key, \*args, …)

Write the contained data to an HDF5 file using HDFStore.

to_json (self[, path_or_buf])

Convert the cuDF object to a JSON string.

to_string (self)

Convert to string

tolist (self)

Return a list type from series data.

truediv (self, other[, fill_value])

Floating division of series and other, element-wise (binary operator truediv).

unique (self[, method, sort])

Returns unique values of this Series.

value_counts (self[, sort])

Returns a Series containing counts of unique values.

values_to_string (self[, nrows])

Returns a list of string for each element.

var (self[, ddof, axis, skipna])

Compute the variance of the series

where (self, cond[, other, axis])

Replace values with other where the condition is False.

acos

all

any

as_index

asin

atan

copy

cos

deserialize

equals

exp

from_arrow

from_pandas

groupby

head

isin

log

logical_and

logical_not

logical_or

repeat

serialize

sin

sqrt

sum_of_squares

tan

to_arrow

to_pandas

unique_k

abs ( self )

Absolute value of each element of the series.

Returns a new Series.

add ( self , other , fill_value=None )

Addition of series and other, element-wise (binary operator add).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

all ( self , axis=0 , skipna=True , level=None )
any ( self , axis=0 , skipna=True , level=None )
append ( self , other , ignore_index=False )

Append values from another Series or array-like object. If ignore_index=True , the index is reset.

Parameters
other Series or array-like object
ignore_index boolean, default False. If true, the index is reset.
Returns
A new Series equivalent to self concatenated with other
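
As a minimal sketch (the values here are arbitrary):

>>> import cudf
>>> a = cudf.Series([1, 2])
>>> b = cudf.Series([3, 4])
>>> combined = a.append(b)  # keeps the original indices 0, 1, 0, 1
>>> reset = a.append(b, ignore_index=True)  # index becomes 0, 1, 2, 3
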
applymap ( self , udf , out_dtype=None )

Apply an elementwise function to transform the values in the Column.

The user function is expected to take one argument and return the result, which will be stored to the output Series. The function cannot reference globals except for other simple scalar objects.

Parameters
udf Either a callable python function or a python function already
decorated by ``numba.cuda.jit`` for call on the GPU as a device function
out_dtype numpy.dtype; optional

The dtype for use in the output. Only used for numba.cuda.jit decorated udf. By default, the result will have the same dtype as the source.

Returns
result Series

The mask and index are preserved.

Notes

The supported Python features are listed in the numba CUDA documentation, with these exceptions:

  • Math functions in cmath are not supported since libcudf does not have complex number support and output of cmath functions are most likely complex numbers.

  • These five functions in math are not supported since numba generates multiple PTX functions from them

    • math.sin()

    • math.cos()

    • math.tan()

    • math.gamma()

    • math.lgamma()
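
As an illustrative sketch (the transformation chosen here is arbitrary), a plain Python function can be passed directly:

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4])
>>> def square(x):
...     return x ** 2
...
>>> squared = s.applymap(square)  # same index, same dtype by default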

argsort ( self , ascending=True , na_position='last' )

Returns a Series of int64 index that will sort the series.

Uses Thrust sort.

Returns
result: Series
as_mask ( self )

Convert booleans to bitmask

Returns
device array
astype ( self , dtype , **kwargs )

Cast the Series to the given dtype

Parameters
dtype data type
**kwargs extra arguments to pass on to the constructor
Returns
out Series

Copy of self cast to the given dtype. Returns self if dtype is the same as self.dtype .
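
For example (a small sketch; the dtypes are chosen arbitrarily):

>>> import cudf
>>> s = cudf.Series([1, 2, 3])
>>> s_float = s.astype('float64')  # copy cast to float64
>>> s_same = s.astype(s.dtype)     # same dtype, so self is returned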

ceil ( self )

Rounds each value upward to the smallest integral value not less than the original.

Returns a new Series.

corr ( self , other , method='pearson' , min_periods=None )

Calculates the sample correlation between two Series, excluding missing values.

count ( self , axis=None , skipna=True )

The number of non-null values

cov ( self , other , min_periods=None )

Calculates the sample covariance between two Series, excluding missing values.

cummax ( self , axis=0 , skipna=True )

Compute the cumulative maximum of the series

cummin ( self , axis=0 , skipna=True )

Compute the cumulative minimum of the series

cumprod ( self , axis=0 , skipna=True )

Compute the cumulative product of the series

cumsum ( self , axis=0 , skipna=True )

Compute the cumulative sum of the series

property data

The gpu buffer for the data

describe ( self , percentiles=None , include=None , exclude=None )

Compute summary statistics of a Series. For numeric data, the output includes the minimum, maximum, mean, median, standard deviation, and various quantiles. For object data, the output includes the count, number of unique values, the most common value, and the number of occurrences of the most common value.

Parameters
percentiles list-like, optional

The percentiles used to generate the output summary statistics. If None, the default percentiles used are the 25th, 50th and 75th. Values should be within the interval [0, 1].

Returns
A DataFrame containing summary statistics of relevant columns from
the input DataFrame.

Examples

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print(s.describe())
   stats   values
0  count     10.0
1   mean      5.5
2    std  3.02765
3    min      1.0
4    25%      2.5
5    50%      5.5
6    75%      7.5
7    max     10.0
diff ( self , periods=1 )

Calculate the difference between values at positions i and i - N in an array and store the output in a new array.

Notes

Diff currently only supports float and integer dtype columns with no null values.

digitize ( self , bins , right=False )

Return the indices of the bins to which each value in series belongs.

Parameters
bins np.array

1-D monotonically increasing array with the same type as this series.

right bool

Indicates whether interval contains the right or left bin edge.

Returns
A new Series containing the indices.

Notes

Monotonicity of bins is assumed and not checked.
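
A minimal sketch (bins and values are arbitrary; bins must be monotonically increasing):

>>> import cudf
>>> import numpy as np
>>> s = cudf.Series([0.2, 6.4, 3.0, 1.6])
>>> bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0])
>>> indices = s.digitize(bins)  # index of the bin each value falls into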

drop_duplicates ( self , keep='first' , inplace=False )

Return Series with duplicate values removed

dropna ( self )

Return a Series with null values removed.

property dtype

dtype of the Series

eq ( self , other , fill_value=None )

Equal to of series and other, element-wise (binary operator eq).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

factorize ( self , na_sentinel=-1 )

Encode the input values as integer labels

Parameters
na_sentinel number

Value to indicate missing category.

Returns
(labels, cats) (Series, Series)
  • labels contains the encoded values

  • cats contains the categories, ordered so that the N-th item corresponds to code N-1.
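
A small sketch (the values are arbitrary):

>>> import cudf
>>> s = cudf.Series([1, 2, 1, 3])
>>> labels, cats = s.factorize()  # labels holds the integer codes, cats the unique values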

fillna ( self , value , method=None , axis=None , inplace=False , limit=None )

Fill null values with value .

Parameters
value scalar or Series-like

Value to use to fill nulls. If Series-like, null values are filled with the values in corresponding indices of the given Series.

Returns
result Series

Copy with nulls filled.
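
For example (a minimal sketch):

>>> import cudf
>>> s = cudf.Series([1, None, 3])
>>> filled = s.fillna(0)  # the null at position 1 becomes 0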

find_first_value ( self , value )

Returns offset of first value that matches

find_last_value ( self , value )

Returns offset of last value that matches

floor ( self )

Rounds each value downward to the largest integral value not greater than the original.

Returns a new Series.

floordiv ( self , other , fill_value=None )

Integer division of series and other, element-wise (binary operator floordiv).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

classmethod from_categorical ( categorical , codes=None )

Creates from a pandas.Categorical

If codes is defined, use it instead of categorical.codes

classmethod from_masked_array ( data , mask , null_count=None )

Create a Series with null-mask. This is equivalent to:

Series(data).set_mask(mask, null_count=null_count)

Parameters
data 1D array-like

The values. Null values must not be skipped. They can appear as garbage values.

mask 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1 ; otherwise 0 . The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1
null_count int, optional

The number of null values. If None, it is calculated automatically.
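
To make the bit layout concrete, here is the mask-bit formula above evaluated with plain NumPy on a hypothetical three-row mask:

>>> import numpy as np
>>> mask = np.array([0b00000101], dtype=np.uint8)  # rows 0 and 2 valid, row 1 null
>>> [(int(mask[i // 8]) >> (i % 8)) & 1 for i in range(3)]
[1, 0, 1]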

ge ( self , other , fill_value=None )

Greater than or equal to of series and other, element-wise (binary operator ge).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

gt ( self , other , fill_value=None )

Greater than of series and other, element-wise (binary operator gt).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

property has_null_mask

A boolean indicating whether a null-mask is needed

hash_encode ( self , stop , use_name=False )

Encode column values as ints in [0, stop) using hash function.

Parameters
stop int

The upper bound on the encoding range.

use_name bool

If True then combine hashed column values with hashed column name. This is useful for when the same values in different columns should be encoded with different hashed values.

Returns
result: Series

The encoded Series.

hash_values ( self )

Compute the hash of values in this column.

property iloc

Select values by position.

See DataFrame.iloc

property index

The index object

isna ( self )

Identify missing values in a Series. Alias for isnull.

isnull ( self )

Identify missing values in a Series.

kurtosis ( self , axis=None , skipna=None , level=None , numeric_only=None )

Calculates Fisher’s unbiased kurtosis of a sample.

label_encoding ( self , cats , dtype=None , na_sentinel=-1 )

Perform label encoding

Parameters
values sequence of input values
dtype: numpy.dtype; optional

Specifies the output dtype. If None is given, the smallest possible integer dtype (starting with np.int32) is used.

na_sentinel number

Value to indicate missing category.

Returns
A sequence of encoded labels with values between 0 and n-1, where n is the number of classes ( cats ).
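
A small sketch, assuming that values absent from cats receive na_sentinel , as the parameter description above suggests:

>>> import cudf
>>> s = cudf.Series([10, 20, 10, 30])
>>> codes = s.label_encoding(cats=[10, 20])  # 30 is not in cats, so it maps to na_sentinel
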
le ( self , other , fill_value=None )

Less than or equal to of series and other, element-wise (binary operator le).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

property loc

Select values by label.

See DataFrame.loc

lt ( self , other , fill_value=None )

Less than of series and other, element-wise (binary operator lt).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

max ( self , axis=None , skipna=True , dtype=None )

Compute the max of the series

mean ( self , axis=None , skipna=True )

Compute the mean of the series

min ( self , axis=None , skipna=True , dtype=None )

Compute the min of the series

mod ( self , other , fill_value=None )

Modulo of series and other, element-wise (binary operator mod).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

mul ( self , other , fill_value=None )

Multiplication of series and other, element-wise (binary operator mul).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

property name

Returns name of the Series.

nans_to_nulls ( self )

Convert nans (if any) to nulls

property ndim

Dimension of the data. Series ndim is always 1.

ne ( self , other , fill_value=None )

Not equal to of series and other, element-wise (binary operator ne).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

nlargest ( self , n=5 , keep='first' )

Returns a new Series of the n largest elements.

notna ( self )

Identify non-missing values in a Series.

notnull ( self )

Identify non-missing values in a Series. Alias for notna.

nsmallest ( self , n=5 , keep='first' )

Returns a new Series of the n smallest elements.

property null_count

Number of null values

property nullmask

The gpu buffer for the null-mask

nunique ( self , method='sort' , dropna=True )

Returns the number of unique values of the Series (approximate version; an exact version is to be moved to libgdf).

one_hot_encoding ( self , cats , dtype='float64' )

Perform one-hot-encoding

Parameters
cats sequence of values

values representing each category.

dtype numpy.dtype

specifies the output dtype.

Returns
A sequence of new series for each category. Its length is determined
by the length of cats .
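
A minimal sketch (the categories are chosen arbitrarily):

>>> import cudf
>>> s = cudf.Series([1, 2, 1, 3])
>>> encoded = s.one_hot_encoding(cats=[1, 2, 3])  # one Series per category, in the order of cats
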
pow ( self , other , fill_value=None )

Exponential power of series and other, element-wise (binary operator pow).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

product ( self , axis=None , skipna=True , dtype=None )

Compute the product of the series

quantile ( self , q=0.5 , interpolation='linear' , exact=True , quant_index=True )

Return values at the given quantile.

Parameters
q float or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute

interpolation {’linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

columns list of str

List of column names to include.

exact boolean

Whether to use approximate or exact quantile algorithm.

quant_index boolean

Whether to use the list of quantiles as index.

Returns
DataFrame
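
For example (a small sketch):

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4])
>>> median = s.quantile(0.5)                   # a single quantile
>>> quartiles = s.quantile([0.25, 0.5, 0.75])  # several quantiles at once
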
radd ( self , other , fill_value=None )

Addition of series and other, element-wise (binary operator radd).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

reindex ( self , index=None , copy=True )

Return a Series that conforms to a new index

Parameters
index Index, Series-convertible, default None
copy boolean, default True
Returns
A new Series that conforms to the supplied index
rename ( self , index=None , copy=True )

Alter Series name.

Change Series.name with a scalar value.

Parameters
index Scalar, optional

Scalar to alter the Series.name attribute

copy boolean, default True

Also copy underlying data

Returns
Series
Difference from pandas:
  • Supports scalar values only for changing name attribute

  • Not supporting: inplace, level

replace ( self , to_replace , replacement )

Replace values given in to_replace with replacement .

Parameters
to_replace numeric, str or list-like

Value(s) to replace.

  • numeric or str:
    • values equal to to_replace will be replaced with value

  • list of numeric or str:
    • If replacement is also list-like, to_replace and replacement must be of same length.

replacement numeric, str, list-like, or dict

Value(s) to replace to_replace with.

Returns
result Series

Series after replacement. The mask and index are preserved.
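
For example (a minimal sketch):

>>> import cudf
>>> s = cudf.Series([0, 1, 2, 1])
>>> r1 = s.replace(1, 10)               # scalar replacement
>>> r2 = s.replace([0, 1], [100, 200])  # list-like; both lists have the same length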

reset_index ( self , drop=False )

Reset index to RangeIndex

reverse ( self )

Reverse the Series

rfloordiv ( self , other , fill_value=None )

Integer division of series and other, element-wise (binary operator rfloordiv).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

rmod ( self , other , fill_value=None )

Modulo of series and other, element-wise (binary operator rmod).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

rmul ( self , other , fill_value=None )

Multiplication of series and other, element-wise (binary operator rmul).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

rolling ( self , window , min_periods=None , center=False , axis=0 , win_type=None )

Rolling window calculations.

Parameters
window int or offset

Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset.

min_periods int, optional

The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None , min_periods is equal to the window size.

center bool, optional

If True , the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns
Rolling object.

Examples

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
round ( self , decimals=0 )

Round a Series to a configurable number of decimal places.

rpow ( self , other , fill_value=None )

Exponential power of series and other, element-wise (binary operator rpow).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

rsub ( self , other , fill_value=None )

Subtraction of series and other, element-wise (binary operator rsub).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

rtruediv ( self , other , fill_value=None )

Floating division of series and other, element-wise (binary operator rtruediv).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

scale ( self )

Scale values to [0, 1] in float64

searchsorted ( self , value , side='left' )

Find indices where elements should be inserted to maintain order

Parameters
value array_like

Column of values to search for

side str {‘left’, ‘right’} optional

If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index

Returns
A Column of insertion points with the same shape as value
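
A small sketch (the Series is assumed to be already sorted):

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 5, 8])
>>> points = s.searchsorted(cudf.Series([0, 4, 8]))  # insertion points that keep s sorted
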
set_index ( self , index )

Returns a new Series with a different index.

Parameters
index Index, Series-convertible

the new index or values for the new index

set_mask ( self , mask , null_count=None )

Create new Series by setting a mask array.

This will override the existing mask. The returned Series will reference the same data buffer as this Series.

Parameters
mask 1D array-like of numpy.uint8

The null-mask. Valid values are marked as 1 ; otherwise 0 . The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1
null_count int, optional

The number of null values. If None, it is calculated automatically.

property shape

Returns a tuple representing the dimensionality of the Series.

shift ( self , periods=1 , freq=None , axis=0 , fill_value=None )

Shift values of an input array by periods positions and store the output in a new array.

Notes

Shift currently only supports float and integer dtype columns with no null values.

skew ( self , axis=None , skipna=None , level=None , numeric_only=None )

Calculates the unbiased Fisher-Pearson skew of a sample.

sort_index ( self , ascending=True )

Sort by the index.

sort_values ( self , ascending=True , na_position='last' )

Sort by the values.

Sort a Series in ascending or descending order by some criterion.

Parameters
ascending bool, default True

If True, sort values in ascending order, otherwise descending.

na_position {‘first’, ‘last’}, default ‘last’

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end.

Returns
sorted_obj cuDF Series
Difference from pandas:
  • Not supporting: inplace, kind

Examples

>>> import cudf
>>> s = cudf.Series([1, 5, 2, 4, 3])
>>> s.sort_values()
0    1
2    2
4    3
3    4
1    5
std ( self , ddof=1 , axis=None , skipna=True )

Compute the standard deviation of the series

sub ( self , other , fill_value=None )

Subtraction of series and other, element-wise (binary operator sub).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

sum ( self , axis=None , skipna=True , dtype=None )

Compute the sum of the series

tail ( self , n=5 )

Returns the last n rows as a new Series

Examples

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> print(ser.tail(2))
3    1
4    0
take ( self , indices , ignore_index=False )

Return Series by taking values from the corresponding indices .

to_array ( self , fillna=None )

Get a dense numpy array for the data.

Parameters
fillna str or None

Defaults to None, which will skip null values. If it equals “pandas”, null values are filled with NaNs. Non-integral dtypes are promoted to np.float64.

Notes

if fillna is None , null values are skipped. Therefore, the output size could be smaller.
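
For example (a minimal sketch):

>>> import cudf
>>> s = cudf.Series([1, None, 3])
>>> arr = s.to_array()                     # null skipped, so the array has 2 elements
>>> arr_nan = s.to_array(fillna='pandas')  # null becomes NaN (pandas-style), length 3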

to_dlpack ( self )

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack .

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj DataFrame, Series, Index, or Column
Returns
pycapsule_obj PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_frame ( self , name=None )

Convert Series into a DataFrame

Parameters
name str, default None

Name to be used for the column

Returns
DataFrame

cudf DataFrame
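
For example (a minimal sketch):

>>> import cudf
>>> s = cudf.Series([1, 2, 3], name='a')
>>> df = s.to_frame()              # one-column DataFrame named 'a'
>>> df2 = s.to_frame(name='vals')  # name overrides the Series name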

to_gpu_array ( self , fillna=None )

Get a dense numba device array for the data.

Parameters
fillna str or None

See fillna in .to_array .

Notes

if fillna is None , null values are skipped. Therefore, the output size could be smaller.

to_hdf ( self , path_or_buf , key , *args , **kwargs )

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.

For more information see the user guide .

Parameters
path_or_buf str or pandas.HDFStore

File path or HDFStore object.

key str

Identifier for the group in the store.

mode {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append bool, default False

For Table formats, append the input data to the existing.

data_columns list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns . Applicable only to format=’table’.

complevel {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 bool, default False

If applying compression use the fletcher32 checksum.

dropna bool, default False

If true, ALL nan rows will not be written to store.

errors str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.

to_json ( self , path_or_buf=None , *args , **kwargs )

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient string

Indication of expected JSON string format.

  • Series
    • default is ‘index’

    • allowed values are: {‘split’,’records’,’index’,’table’}

  • DataFrame
    • default is ‘columns’

    • allowed values are: {‘split’,’records’,’index’,’columns’,’values’,’table’}

  • The format of the JSON string
    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records' .

date_format {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient . For orient='table' , the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii bool, default True

Force encoded string to be ASCII.

date_unit string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines bool, default False

If ‘orient’ is ‘records’ write out line delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list like.

compression {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index bool, default True

Whether to include the index values in the JSON string. Not including the index ( index=False ) is only supported when orient is ‘split’ or ‘table’.

to_string ( self )

Convert to string

Uses Pandas formatting internals to produce output identical to Pandas. Use the Pandas formatting settings directly in Pandas to control cuDF output.

tolist ( self )

Return a list type from series data.

Returns
list
truediv ( self , other , fill_value=None )

Floating division of series and other, element-wise (binary operator truediv).

Parameters
other: Series or scalar value
fill_value None or value

Value to fill nulls with before computation. If data in both corresponding Series locations is null the result will be null

unique ( self , method='sort' , sort=True )

Returns unique values of this Series. The default method=’sort’ will be changed to ‘hash’ when implemented.

property valid_count

Number of non-null values

value_counts ( self , sort=True )

Returns a Series containing counts of unique values.

values_to_string ( self , nrows=None )

Returns a list of string for each element.

var ( self , ddof=1 , axis=None , skipna=True )

Compute the variance of the series

where ( self , cond , other=None , axis=None )

Replace values with other where the condition is False.

Parameters
cond boolean

Where cond is True, keep the original value. Where False, replace with corresponding value from other.

other: scalar, default None

Entries where cond is False are replaced with corresponding value from other.

Returns
result Series

Examples

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> print(ser.where(ser > 2, 10))
0     4
1     3
2    10
3    10
4    10
>>> print(ser.where(ser > 2))
0    4
1    3
2
3
4

Groupby

DataFrameGroupBy. agg ( self , func )
DataFrameGroupBy. count ( self )
DataFrameGroupBy. max ( self )
DataFrameGroupBy. mean ( self )
DataFrameGroupBy. min ( self )
DataFrameGroupBy. quantile ( self , q=0.5 , interpolation='linear' )
DataFrameGroupBy. size ( self )
DataFrameGroupBy. sum ( self )

Legacy Groupby

class cudf.core.groupby.legacy_groupby. Groupby ( df , by )

Groupby object returned by cudf.DataFrame.groupby(method=”cudf”). method=cudf uses numba kernels to compute aggregations and allows custom UDFs via the apply and apply_grouped methods.

Notes

  • method=cudf may be deprecated in the future.

  • Grouping and aggregating over columns with null values will return incorrect results.

  • Grouping by or aggregating over string columns is currently not supported.

Methods

agg (self, args)

Invoke aggregation functions on the groups.

apply (self, function)

Apply a python transformation function over the grouped chunk.

apply_grouped (self, function, \*\*kwargs)

Apply a transformation function over the grouped chunk.

as_df (self)

Get the intermediate dataframe after shuffling the rows into groups.

count (self)

Compute the count of each group

max (self)

Compute the max of each group

mean (self)

Compute the mean of each group

min (self)

Compute the min of each group

std (self)

Compute the std of each group

sum (self)

Compute the sum of each group

sum_of_squares (self)

Compute the sum_of_squares of each group

var (self)

Compute the var of each group

agg ( self , args )

Invoke aggregation functions on the groups.

Parameters
args: dict, list, str, callable
  • str

    The aggregate function name.

  • callable

    The aggregate function.

  • list

    List of str or callable of the aggregate function.

  • dict

    key-value pairs of source column name and list of aggregate functions as str or callable .

Returns
result DataFrame
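
As an illustrative sketch of the different argument forms (the column names are hypothetical):

from cudf import DataFrame

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2]
df['val'] = [0, 1, 2, 3, 4]
groups = df.groupby(['key'], method='cudf')

# a single aggregate by name
means = groups.agg('mean')

# several aggregates per source column via a dict
stats = groups.agg({'val': ['min', 'max', 'mean']})
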
apply ( self , function )

Apply a python transformation function over the grouped chunk.

Parameters
func function

The python transformation function that will be applied on the grouped chunk.

Examples

from cudf import DataFrame
df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each row in a group
def mult(df):
  df['out'] = df['key'] * df['val']
  return df

result = groups.apply(mult)
print(result)

Output:

   key  val  out
0    0    0    0
1    0    1    0
2    1    2    2
3    1    3    3
4    2    4    8
5    2    5   10
6    2    6   12
apply_grouped ( self , function , **kwargs )

Apply a transformation function over the grouped chunk.

This uses numba’s CUDA JIT compiler to convert the Python transformation function into a CUDA kernel, thus will have a compilation overhead during the first run.

Parameters
func function

The transformation function that will be executed on the CUDA GPU.

incols: list

A list of names of input columns.

outcols: list

A dictionary of output column names and their dtype.

kwargs dict

name-value of extra arguments. These values are passed directly into the function.

Examples

from cudf import DataFrame
from numba import cuda
import numpy as np

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

# Define a function to apply to each group
def mult_add(key, val, out1, out2):
    for i in range(cuda.threadIdx.x, len(key), cuda.blockDim.x):
        out1[i] = key[i] * val[i]
        out2[i] = key[i] + val[i]

result = groups.apply_grouped(mult_add,
                              incols=['key', 'val'],
                              outcols={'out1': np.int32,
                                       'out2': np.int32},
                              # threads per block
                              tpb=8)

print(result)

Output:

   key  val out1 out2
0    0    0    0    0
1    0    1    0    1
2    1    2    2    3
3    1    3    3    4
4    2    4    8    6
5    2    5   10    7
6    2    6   12    8
import cudf
import numpy as np
from numba import cuda
import pandas as pd
from random import randint

# Create a random 15 row dataframe with one categorical
# feature and one random integer valued feature
df = cudf.DataFrame(
        {
            "cat": [1] * 5 + [2] * 5 + [3] * 5,
            "val": [randint(0, 100) for _ in range(15)],
        }
     )

# Group the dataframe by its categorical feature
groups = df.groupby("cat", method="cudf")

# Define a kernel which takes the moving average of a
# sliding window
def rolling_avg(val, avg):
    win_size = 3
    for row, i in enumerate(range(cuda.threadIdx.x,
                                  len(val), cuda.blockDim.x)):
        if row < win_size - 1:
            # If there is not enough data to fill the window,
            # take the average to be NaN
            avg[i] = np.nan
        else:
            total = 0
            for j in range(i - win_size + 1, i + 1):
                total += val[j]
            avg[i] = total / win_size

# Compute moving avgs on all groups
results = groups.apply_grouped(rolling_avg,
                               incols=['val'],
                               outcols=dict(avg=np.float64))
print("Results:", results)

# Note this gives the same result as its pandas equivalent
pdf = df.to_pandas()
pd_results = pdf.groupby('cat')['val'].rolling(3).mean()

Output:

Results:
     cat  val                 avg
0    1   16
1    1   45
2    1   62                41.0
3    1   45  50.666666666666664
4    1   26  44.333333333333336
5    2    5
6    2   51
7    2   77  44.333333333333336
8    2    1                43.0
9    2   46  41.333333333333336
[5 more rows]

This is functionally equivalent to pandas.DataFrame.Rolling

as_df ( self )

Get the intermediate dataframe after shuffling the rows into groups.

Returns
(df, segs) namedtuple
  • df : DataFrame

  • segs Series

    Beginning offsets of each group.

Examples

from cudf import DataFrame

df = DataFrame()
df['key'] = [0, 0, 1, 1, 2, 2, 2]
df['val'] = [0, 1, 2, 3, 4, 5, 6]
groups = df.groupby(['key'], method='cudf')

df_groups = groups.as_df()

# DataFrame indexes of group starts
print(df_groups[1])

# DataFrame itself
print(df_groups[0])

Output:

# DataFrame indexes of group starts
0    0
1    2
2    4

# DataFrame itself
   key  val
0    0    0
1    0    1
2    1    2
3    1    3
4    2    4
5    2    5
6    2    6
count ( self )

Compute the count of each group

Returns
result DataFrame
max ( self )

Compute the max of each group

Returns
result DataFrame
mean ( self )

Compute the mean of each group

Returns
result DataFrame
min ( self )

Compute the min of each group

Returns
result DataFrame
std ( self )

Compute the std of each group

Returns
result DataFrame
sum ( self )

Compute the sum of each group

Returns
result DataFrame
sum_of_squares ( self )

Compute the sum_of_squares of each group

Returns
result DataFrame
var ( self )

Compute the var of each group

Returns
result DataFrame

IO

cudf.io.csv. read_csv ( filepath_or_buffer , lineterminator='\n' , quotechar='"' , quoting=0 , doublequote=True , header='infer' , mangle_dupe_cols=True , usecols=None , sep=',' , delimiter=None , delim_whitespace=False , skipinitialspace=False , names=None , dtype=None , skipfooter=0 , skiprows=0 , dayfirst=False , compression='infer' , thousands=None , decimal='.' , true_values=None , false_values=None , nrows=None , byte_range=None , skip_blank_lines=True , parse_dates=None , comment=None , na_values=None , keep_default_na=True , na_filter=True , prefix=None , index_col=None , **kwargs )

Load a comma-separated values (CSV) dataset into a DataFrame

Parameters
filepath_or_buffer str, path object, or file-like object

Either a path to a file (a str , pathlib.Path , or py._path.local.LocalPath ), URL (including http, ftp, and S3 locations), or any object with a read() method (such as builtin open() file handler function or StringIO ).

sep char, default ‘,’

Delimiter to be used.

delimiter char, default None

Alternative argument name for sep.

delim_whitespace bool, default False

Determines whether to use whitespace as delimiter.

lineterminator char, default ‘\n’

Character to indicate end of line.

skipinitialspace bool, default False

Skip spaces after delimiter.

names list of str, default None

List of column names to be used.

dtype type, list of types, or dict of column -> type, default None

Data type(s) for data or columns. If list, types are applied in the same order as the column names. If dict, types are mapped to the column names. E.g. {‘a’: np.float64, ‘b’: int32, ‘c’: ‘float’} If None , dtypes are inferred from the dataset. Use str to preserve data and not infer or interpret to dtype.

quotechar char, default ‘"’

Character to indicate start and end of quote item.

quoting str or int, default 0

Controls quoting behavior. Set to one of 0 (csv.QUOTE_MINIMAL), 1 (csv.QUOTE_ALL), 2 (csv.QUOTE_NONNUMERIC) or 3 (csv.QUOTE_NONE). Quoting is enabled with all values except 3.

doublequote bool, default True

When quoting is enabled, indicates whether to interpret two consecutive quotechar inside fields as single quotechar

header int, default ‘infer’

Row number to use as the column names. Default behavior is to infer the column names: if no names are passed, header=0; if column names are passed explicitly, header=None.

usecols list of int or str, default None

Returns subset of the columns given in the list. All elements must be either integer indices (column number) or strings that correspond to column names

mangle_dupe_cols boolean, default True

Duplicate columns will be specified as ‘X’,’X.1’,…’X.N’.

skiprows int, default 0

Number of rows to be skipped from the start of file.

skipfooter int, default 0

Number of rows to be skipped at the bottom of file.

compression {‘infer’, ‘gzip’, ‘zip’, None}, default ‘infer’

For on-the-fly decompression of on-disk data. If ‘infer’, then detect compression from the following extensions: ‘.gz’,‘.zip’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in, otherwise the first non-zero-sized file will be used. Set to None for no decompression.

decimal char, default ‘.’

Character used as a decimal point.

thousands char, default None

Character used as a thousands delimiter.

true_values list, default None

Values to consider as boolean True

false_values list, default None

Values to consider as boolean False

nrows int, default None

If specified, maximum number of rows to read

byte_range list or tuple, default None

Byte range within the input file to be read. The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.

skip_blank_lines bool, default True

If True, discard and do not parse empty lines If False, interpret empty lines as NaN values

parse_dates list of int or names, default None

If list of columns, then attempt to parse each entry as a date. Columns may not always be recognized as dates, for instance due to unusual or non-standard formats. To guarantee a date and increase parsing speed, explicitly specify dtype=’date’ for the desired columns.

comment char, default None

Character used as a comments indicator. If found at the beginning of a line, the line will be ignored altogether.

na_values list, default None

Values to consider as invalid

keep_default_na bool, default True

Whether or not to include the default NA values when parsing the data.

na_filter bool, default True

Detect missing values (empty strings and the values in na_values). Passing False can improve performance.

prefix str, default None

Prefix to add to column numbers when parsing without a header row

index_col int, string or False, default None

Column to use as the row labels of the DataFrame. Passing index_col=False explicitly disables index column inference and discards the last column.

Returns
GPU DataFrame object.

Notes

  • cuDF supports local and remote data stores. See configuration details for available sources here .

Examples

Create a test csv file

>>> import cudf
>>> filename = 'foo.csv'
>>> lines = [
...   "num1,datetime,text",
...   "123,2018-11-13T12:00:00,abc",
...   "456,2018-11-14T12:35:01,def",
...   "789,2018-11-15T18:02:59,ghi"
... ]
>>> with open(filename, 'w') as fp:
...     fp.write('\n'.join(lines)+'\n')

Read the file with cudf.read_csv

>>> cudf.read_csv(filename)
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.csv. to_csv ( df , path=None , sep=',' , na_rep='' , columns=None , header=True , index=True , line_terminator='\n' , chunksize=None )

Write a dataframe to csv file format.

Parameters
df DataFrame

DataFrame object to be written to csv

path str, default None

Path of file where DataFrame will be written

sep char, default ‘,’

Delimiter to be used.

na_rep str, default ‘’

String to use for null entries

columns list of str, optional

Columns to write

header bool, default True

Write out the column names

index bool, default True

Write out the index as a column

line_terminator char, default ‘\n’
chunksize int or None, default None

Rows to write at a time

Notes

  • Follows the standard of Pandas csv.QUOTE_NONNUMERIC for all output.

  • If to_csv leads to memory errors consider setting the chunksize argument.

Examples

Write a dataframe to csv.

>>> import cudf
>>> filename = 'foo.csv'
>>> df = cudf.DataFrame({'x': [0, 1, 2, 3],
...                      'y': [1.0, 3.3, 2.2, 4.4],
...                      'z': ['a', 'b', 'c', 'd']})
>>> df = df.set_index([3, 2, 1, 0])
>>> df.to_csv(filename)
cudf.io.parquet. read_parquet ( filepath_or_buffer , engine='cudf' , columns=None , row_group=None , skip_rows=None , num_rows=None , strings_to_categorical=False , use_pandas_metadata=True , *args , **kwargs )

Load a Parquet dataset into a DataFrame

Parameters
filepath_or_buffer str, path object, bytes, or file-like object

Either a path to a file (a str , pathlib.Path , or py._path.local.LocalPath ), URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as builtin open() file handler function or BytesIO ).

engine { ‘cudf’, ‘pyarrow’ }, default ‘cudf’

Parser engine to use.

columns list, default None

If not None, only these columns will be read.

row_group int, default None

If not None, only the row group with the specified index will be read.

skip_rows int, default None

If not None, the number of rows to skip from the start of the file.

num_rows int, default None

If not None, the total number of rows to read.

strings_to_categorical boolean, default False

If True, return string columns as GDF_CATEGORY dtype; if False, return them as GDF_STRING dtype.

use_pandas_metadata boolean, default True

If True and dataset has custom PANDAS schema metadata, ensure that index columns are also loaded.

Returns
DataFrame

Notes

  • cuDF supports local and remote data stores. See configuration details for available sources here .

Examples

>>> import cudf
>>> df = cudf.read_parquet(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.parquet. read_parquet_metadata ( path )

Read a Parquet file’s metadata and schema

Parameters
path string or path object

Path of file to be read

Returns
Total number of rows
Number of row groups
List of column names

Examples

>>> import cudf
>>> num_rows, num_row_groups, names = cudf.io.read_parquet_metadata(filename)
>>> df = [cudf.read_parquet(filename, row_group=i) for i in range(num_row_groups)]
>>> df = cudf.concat(df)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.parquet. to_parquet ( df , path , *args , **kwargs )

Write a DataFrame to the parquet format.

Parameters
path str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset.

compression {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

index bool, default None

If True , include the dataframe’s index(es) in the file output. If False , they will not be written to the file. If None , the engine’s default behavior will be used.

partition_cols list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.
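
A minimal usage sketch (the file name is hypothetical; options mirror the parameters above):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
>>> cudf.io.parquet.to_parquet(df, 'foo.parquet')  # written with the default snappy compression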

cudf.io.orc. read_orc ( filepath_or_buffer , engine='cudf' , columns=None , stripe=None , skip_rows=None , num_rows=None , use_index=True , **kwargs )

Load an ORC dataset into a DataFrame

Parameters
filepath_or_buffer str, path object, bytes, or file-like object

Either a path to a file (a str , pathlib.Path , or py._path.local.LocalPath ), URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as builtin open() file handler function or BytesIO ).

engine { ‘cudf’, ‘pyarrow’ }, default ‘cudf’

Parser engine to use.

columns list, default None

If not None, only these columns will be read from the file.

stripe: int, default None

If not None, only the stripe with the specified index will be read.

skip_rows int, default None

If not None, the number of rows to skip from the start of the file.

num_rows int, default None

If not None, the total number of rows to read.

use_index bool, default True

If True, use row index if available for faster seeking.

kwargs are passed to the engine
Returns
DataFrame

Notes

  • cuDF supports local and remote data stores. See configuration details for available sources here .

Examples

>>> import cudf
>>> df = cudf.read_orc(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.orc. read_orc_metadata ( path )

Read an ORC file’s metadata and schema

Parameters
path string or path object

Path of file to be read

Returns
Total number of rows
Number of stripes
List of column names

Examples

>>> import cudf
>>> num_rows, stripes, names = cudf.io.read_orc_metadata(filename)
>>> df = [cudf.read_orc(filename, stripe=i) for i in range(stripes)]
>>> df = cudf.concat(df)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.orc. to_orc ( df , fname , compression=None , *args , **kwargs )

Write a DataFrame to the ORC format.

Parameters
fname str

File path or object where the ORC dataset will be stored.

compression {‘snappy’, None}, default None

Name of the compression to use. Use None for no compression.
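
A minimal usage sketch (the file name is hypothetical):

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3]})
>>> cudf.io.orc.to_orc(df, 'foo.orc')  # no compression by default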

cudf.io.json. read_json ( path_or_buf , engine='auto' , dtype=True , lines=False , compression='infer' , byte_range=None , *args , **kwargs )

Load a JSON dataset into a DataFrame

Parameters
path_or_buf str, path object, or file-like object

Either JSON data in a str , path to a file (a str , pathlib.Path , or py._path.local.LocalPath ), URL (including http, ftp, and S3 locations), or any object with a read() method (such as builtin open() file handler function or StringIO ).

engine {‘auto’, ‘cudf’, ‘pandas’}, default ‘auto’

Parser engine to use. If ‘auto’ is passed, the engine will be automatically selected based on the other parameters.

orient string,

Indication of expected JSON string format (pandas engine only). Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

  • 'records' : list like [{column -> value}, ... , {column -> value}]

  • 'index' : dict like {index -> {column -> value}}

  • 'columns' : dict like {column -> {index -> value}}

  • 'values' : just the values array

The allowed and default values depend on the value of the typ parameter.

  • when typ == 'series' ,

    • allowed orients are {'split','records','index'}

    • default is 'index'

    • The Series index must be unique for orient 'index' .

  • when typ == 'frame' ,

    • allowed orients are {'split','records','index', 'columns','values', 'table'}

    • default is 'columns'

    • The DataFrame index must be unique for orients 'index' and 'columns' .

    • The DataFrame columns must be unique for orients 'index' , 'columns' , and 'records' .

typ type of object to recover (series or frame), default ‘frame’

With cudf engine, only frame output is supported.

dtype boolean or dict, default True

If True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer dtypes at all, applies only to the data.

convert_axes boolean, default True

Try to convert the axes to the proper dtypes (pandas engine only).

convert_dates boolean, default True

List of columns to parse for dates (pandas engine only). If True, then try to parse datelike columns; the default is True. A column label is datelike if

  • it ends with '_at' ,

  • it ends with '_time' ,

  • it begins with 'timestamp' ,

  • it is 'modified' , or

  • it is 'date'

keep_default_dates boolean, default True

If parsing dates, parse the default datelike columns (pandas engine only)

numpy boolean, default False

Direct decoding to numpy arrays (pandas engine only). Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float boolean, default False

Set to enable usage of higher precision (strtod) function when decoding string to double values (pandas engine only). Default (False) is to use fast but less precise builtin functionality

date_unit string, default None

The timestamp unit to detect if converting dates (pandas engine only). The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds.

encoding str, default is ‘utf-8’

The encoding to use to decode py3 bytes. With cudf engine, only utf-8 is supported.

lines boolean, default False

Read the file as a json object per line.

chunksize integer, default None

Return a JsonReader object for iteration (pandas engine only). See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

compression {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip or xz if path_or_buf is a string ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

byte_range list or tuple, default None

Byte range within the input file to be read (cudf engine only). The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.

Returns
result Series or DataFrame, depending on the value of typ .
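
Examples

A minimal sketch of reading line-delimited JSON with the cudf engine; the file name is hypothetical:

>>> import cudf
>>> df = cudf.read_json('data.jsonl', engine='cudf', lines=True)
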
cudf.io.json. to_json ( cudf_val , path_or_buf=None , *args , **kwargs )

Convert the cuDF object to a JSON string. Note that nulls and NaNs will be converted to null, and datetime objects will be converted to UNIX timestamps.

Parameters
path_or_buf string or file handle, optional

File path or object. If not specified, the result is returned as a string.

orient string

Indication of expected JSON string format.

  • Series
    • default is ‘index’

    • allowed values are: {‘split’,’records’,’index’,’table’}

  • DataFrame
    • default is ‘columns’

    • allowed values are: {‘split’,’records’,’index’,’columns’,’values’,’table’}

  • The format of the JSON string
    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records' .

date_format {None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient . For orient='table' , the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precision int, default 10

The number of decimal places to use when encoding floating point values.

force_ascii bool, default True

Force encoded string to be ASCII.

date_unit string, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handler callable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.

lines bool, default False

If orient is ‘records’, write out line-delimited JSON. Throws a ValueError if orient is anything else, since the other formats are not list-like.

compression {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

index bool, default True

Whether to include the index values in the JSON string. Not including the index ( index=False ) is only supported when orient is ‘split’ or ‘table’.
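
Examples

A minimal sketch; when path_or_buf is not given the JSON is returned as a string, and the orient and lines arguments are assumed to be forwarded to the underlying pandas writer:

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
>>> json_str = cudf.io.json.to_json(df, orient='records', lines=True)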

cudf.io.avro. read_avro ( filepath_or_buffer , engine='cudf' , columns=None , skip_rows=None , num_rows=None , **kwargs )

Load an Avro dataset into a DataFrame

Parameters
filepath_or_buffer str, path object, bytes, or file-like object

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as a file handle from the builtin open() function, or a BytesIO object).

engine { ‘cudf’, ‘fastavro’ }, default ‘cudf’

Parser engine to use.

columns list, default None

If not None, only these columns will be read.

skip_rows int, default None

If not None, the number of rows to skip from the start of the file.

num_rows int, default None

If not None, the total number of rows to read.

Returns
DataFrame

Notes

  • cuDF supports local and remote data stores. See configuration details for available sources here .

Examples

>>> import cudf
>>> df = cudf.read_avro(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
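
The columns, skip_rows, and num_rows parameters can be combined to read a subset of the file; the column name here is an assumption about the file's contents:

>>> df = cudf.read_avro(filename, columns=['num1'], skip_rows=1, num_rows=2)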
cudf.io.dlpack. from_dlpack ( pycapsule_obj )

Converts from a DLPack tensor to a cuDF object.

DLPack is an open-source memory tensor structure: dmlc/dlpack .

This function takes a PyCapsule object which contains a pointer to a DLPack tensor as input, and returns a cuDF object. This function deep copies the data in the DLPack tensor into a cuDF object.

Parameters
pycapsule_obj PyCapsule

Input DLPack tensor pointer which is encapsulated in a PyCapsule object.

Returns
A cuDF DataFrame or Series, depending on whether the input DLPack tensor is 1D or 2D.
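
Examples

A minimal sketch; the producing library is assumed to be CuPy, whose ndarray.toDlpack() returns a PyCapsule:

>>> import cupy
>>> import cudf
>>> capsule = cupy.arange(5).toDlpack()
>>> ser = cudf.io.dlpack.from_dlpack(capsule)  # 1D tensor -> Series
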
cudf.io.dlpack. to_dlpack ( cudf_obj )

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack .

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters
cudf_obj DataFrame, Series, Index, or Column
Returns
pycapsule_obj PyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.
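
Examples

A minimal sketch; the consuming library is assumed to be CuPy, whose fromDlpack() accepts the PyCapsule (a capsule can only be consumed once):

>>> import cupy
>>> import cudf
>>> ser = cudf.Series([1, 2, 3])
>>> capsule = cudf.io.dlpack.to_dlpack(ser)
>>> arr = cupy.fromDlpack(capsule)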

cudf.io.feather. read_feather ( path , *args , **kwargs )

Load a feather object from the file path, returning a DataFrame.

Parameters
path string

File path

columns list, default None

If not None, only these columns will be read from the file.

Returns
DataFrame

Examples

>>> import cudf
>>> df = cudf.read_feather(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117
cudf.io.feather. to_feather ( df , path , *args , **kwargs )

Write a DataFrame to the feather format.

Parameters
path str

File path
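
Examples

A minimal usage sketch; the output path is hypothetical:

>>> import cudf
>>> df = cudf.DataFrame({'key': [0, 1, 2], 'val': [10.0, 11.0, 12.0]})
>>> cudf.io.feather.to_feather(df, 'output.feather')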

cudf.io.hdf. read_hdf ( path_or_buf , *args , **kwargs )

Read from the store, close it if we opened it.

Retrieve pandas object stored in file, optionally based on where criteria

Parameters
path_or_buf string, buffer or path object

Path to the file to open, or an open HDFStore object. Supports any object implementing the __fspath__ protocol. This includes pathlib.Path and py._path.local.LocalPath objects.

key object, optional

The group identifier in the store. Can be omitted if the HDF file contains a single pandas object.

mode {‘r’, ‘r+’, ‘a’}, optional

Mode to use when opening the file. Ignored if path_or_buf is a pandas.HDFStore. Default is ‘r’.

where list, optional

A list of Term (or convertible) objects.

start int, optional

Row number to start selection.

stop int, optional

Row number to stop selection.

columns list, optional

A list of columns names to return.

iterator bool, optional

Return an iterator object.

chunksize int, optional

Number of rows to include in an iteration when using an iterator.

errors str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

**kwargs

Additional keyword arguments passed to HDFStore.

Returns
item object

The selected object. Return type depends on the object stored.

See also

cudf.io.hdf.to_hdf

Write a HDF file from a DataFrame.
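
Examples

A minimal sketch; the store path and key are hypothetical, and the extra keyword arguments are assumed to be forwarded to pandas.read_hdf:

>>> import cudf
>>> df = cudf.io.hdf.read_hdf('store.h5', key='df')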

cudf.io.hdf. to_hdf ( path_or_buf , key , value , *args , **kwargs )

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.

For more information see the user guide .

Parameters
path_or_buf str or pandas.HDFStore

File path or HDFStore object.

key str

Identifier for the group in the store.

mode {‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format {‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

append bool, default False

For Table formats, append the input data to the existing.

data_columns list of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns . Applicable only to format=’table’.

complevel {0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32 bool, default False

If applying compression use the fletcher32 checksum.

dropna bool, default False

If True, rows where all values are NaN will not be written to the store.

errors str, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See also

cudf.io.hdf.read_hdf

Read from HDF file.

cudf.io.parquet.to_parquet

Write a DataFrame to the binary parquet format.

cudf.io.feather.to_feather

Write out feather-format for DataFrames.
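
Examples

A minimal sketch; the store path and key are hypothetical, and the keyword arguments are assumed to be forwarded to the pandas HDF writer:

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3]})
>>> cudf.io.hdf.to_hdf('store.h5', 'df', df, mode='w', format='table')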

GpuArrowReader

class cudf.comm.gpuarrow. GpuArrowReader ( schema , dev_ary )

Methods

to_dict (self)

Return a dictionary of Series objects

schema

to_dict ( self )

Return a dictionary of Series objects