cudf.DataFrame

class cudf.DataFrame(data=None, index=None, columns=None, dtype=None, nan_as_null=True)

A GPU DataFrame object.

Parameters:
data : array-like, Iterable, dict, or DataFrame

Dict can contain Series, arrays, constants, or list-like objects.

index : Index or array-like

Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.

columns : Index or array-like

Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer.

nan_as_null : bool, default True

If None/True, converts np.nan values to null values. If False, leaves np.nan values as is.
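The nan_as_null rule above can be pictured with a small plain-Python model. This is only a sketch of the described semantics, not cuDF code: the helper name normalize_missing is hypothetical, and None stands in for a cuDF null.

```python
import math


def normalize_missing(values, nan_as_null=True):
    """Hypothetical model of the nan_as_null rule; None stands in for a null."""
    # Per the parameter description, None behaves the same as True.
    convert = nan_as_null is None or bool(nan_as_null)
    result = []
    for value in values:
        is_nan = isinstance(value, float) and math.isnan(value)
        result.append(None if (convert and is_nan) else value)
    return result


column = [0.1, float("nan"), 0.3]
as_null = normalize_missing(column)                  # NaN is replaced by None
kept = normalize_missing(column, nan_as_null=False)  # NaN is left untouched
```

The distinction matters because a null marks a missing entry, while NaN is an ordinary floating-point value that happens to be "not a number".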

Examples

Build dataframe with __setitem__:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build DataFrame via dict of columns:

>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
>>> n = 5
>>> df = cudf.DataFrame({
...     'id': np.arange(n),
...     'datetimes': np.array(
...         [(t0 + timedelta(seconds=x)) for x in range(n)])
... })
>>> df
    id            datetimes
0    0  2018-10-07 12:00:00
1    1  2018-10-07 12:00:01
2    2  2018-10-07 12:00:02
3    3  2018-10-07 12:00:03
4    4  2018-10-07 12:00:04

Build DataFrame via list of rows as tuples:

>>> df = cudf.DataFrame([
...     (5, "cats", "jump", np.nan),
...     (2, "dogs", "dig", 7.5),
...     (3, "cows", "moo", -2.1, "occasionally"),
... ])
>>> df
   0     1     2     3             4
0  5  cats  jump  <NA>          <NA>
1  2  dogs   dig   7.5          <NA>
2  3  cows   moo  -2.1  occasionally
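In the output above, the rows that are shorter than the longest tuple show <NA> in the trailing columns. A plain-Python picture of that padding, with None standing in for <NA>, might look like the sketch below. The helper pad_rows is hypothetical and is not a cuDF function; it only models the visible result and says nothing about how cuDF builds the frame internally.

```python
def pad_rows(rows, fill=None):
    """Hypothetical helper: pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [tuple(row) + (fill,) * (width - len(row)) for row in rows]


rows = [
    (5, "cats", "jump", float("nan")),
    (2, "dogs", "dig", 7.5),
    (3, "cows", "moo", -2.1, "occasionally"),
]
padded = pad_rows(rows)
# Every row now has five fields; the two shorter rows end in None.
```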

Convert from a Pandas DataFrame:

>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [0, 1, 2, 3], 'b': [0.1, 0.2, None, 0.3]})
>>> pdf
   a    b
0  0  0.1
1  1  0.2
2  2  NaN
3  3  0.3
>>> df = cudf.from_pandas(pdf)
>>> df
   a     b
0  0   0.1
1  1   0.2
2  2  <NA>
3  3   0.3

Attributes

T

Transpose index and columns.

at

Alias for DataFrame.loc; provided for compatibility with Pandas.

axes

Return a list representing the axes of the DataFrame.

columns

Returns a tuple of columns.

dtypes

Return the dtypes in this object.

empty

Indicator whether DataFrame or Series is empty.

iat

Alias for DataFrame.iloc; provided for compatibility with Pandas.

index

Get the labels for the rows.

ndim

Dimension of the data.

shape

Returns a tuple representing the dimensionality of the DataFrame.

size

Return the number of elements in the underlying data.

values

Return a CuPy representation of the DataFrame.

values_host

Return a NumPy representation of the data.

iloc

Select values by position.

Examples

Series

>>> import cudf
>>> s = cudf.Series([10, 20, 30])
>>> s
0    10
1    20
2    30
dtype: int64
>>> s.iloc[2]
30

DataFrame

Selecting rows and column by position.

>>> df = cudf.DataFrame({'a': range(20),
...                      'b': range(20),
...                      'c': range(20)})

Select a single row using an integer index.

>>> df.iloc[1]
a    1
b    1
c    1
Name: 1, dtype: int64

Select multiple rows using a list of integers.

>>> df.iloc[[0, 2, 9, 18]]
     a   b   c
0    0   0   0
2    2   2   2
9    9   9   9
18  18  18  18

Select rows using a slice.

>>> df.iloc[3:10:2]
   a  b  c
3  3  3  3
5  5  5  5
7  7  7  7
9  9  9  9

Select both rows and columns.

>>> df.iloc[[1, 3, 5, 7], 2]
1    1
3    3
5    5
7    7
Name: c, dtype: int64

Setting values in a column using iloc.

>>> df.iloc[:4] = 0
>>> df
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]

loc

Select rows and columns by label or boolean mask.

Examples

Series

>>> import cudf
>>> series = cudf.Series([10, 11, 12], index=['a', 'b', 'c'])
>>> series
a    10
b    11
c    12
dtype: int64
>>> series.loc['b']
11

DataFrame

DataFrame with string index.

>>> df
   a  b
a  0  5
b  1  6
c  2  7
d  3  8
e  4  9

Select a single row by label.

>>> df.loc['a']
a    0
b    5
Name: a, dtype: int64

Select multiple rows and a single column.

>>> df.loc[['a', 'c', 'e'], 'b']
a    5
c    7
e    9
Name: b, dtype: int64

Selection by boolean mask.

>>> df.loc[df.a > 2]
   a  b
d  3  8
e  4  9

Setting values using loc.

>>> df.loc[['a', 'c', 'e'], 'a'] = 0
>>> df
   a  b
a  0  5
b  1  6
c  0  7
d  3  8
e  0  9
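The difference between position-based selection (iloc), label-based selection (loc), and boolean-mask selection can be modelled with plain Python lists. The sketch below reuses the values of the string-indexed frame from the loc description; the helper names are hypothetical and this is not how cuDF implements indexing.

```python
labels = ["a", "b", "c", "d", "e"]  # row labels (the index)
col_a = [0, 1, 2, 3, 4]             # column 'a'
col_b = [5, 6, 7, 8, 9]             # column 'b'


def select_by_position(values, positions):
    """Model of iloc with a list of integer positions."""
    return [values[pos] for pos in positions]


def select_by_label(index, values, wanted):
    """Model of loc with a list of row labels."""
    lookup = {label: pos for pos, label in enumerate(index)}
    return [values[lookup[label]] for label in wanted]


def select_by_mask(values, mask):
    """Model of loc with a boolean mask: keep entries whose flag is True."""
    return [value for value, keep in zip(values, mask) if keep]


by_position = select_by_position(col_b, [0, 2, 4])
by_label = select_by_label(labels, col_b, ["a", "c", "e"])
by_mask = select_by_mask(col_b, [a > 2 for a in col_a])
```

With the default-like ordering used here, positions 0, 2, 4 and labels 'a', 'c', 'e' pick the same rows; the two approaches diverge as soon as the index is reordered or is not a simple range.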

Methods

abs()

Return a Series/DataFrame with absolute numeric value of each element.

add(other[, axis, level, fill_value])

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

add_prefix(prefix)

Prefix labels with string prefix.

add_suffix(suffix)

Suffix labels with string suffix.

agg(aggs[, axis])

Aggregate using one or more operations over the specified axis.

all([axis, bool_only, skipna])

Return whether all elements are True in DataFrame.

any([axis, bool_only, skipna])

Return whether any element is True in DataFrame.

apply(func[, axis, raw, result_type, args])

Apply a function along an axis of the DataFrame.

apply_chunks(func, incols, outcols[, ...])

Transform user-specified chunks using the user-provided function.

apply_rows(func, incols, outcols, kwargs[, ...])

Apply a row-wise user defined function.

applymap(func[, na_action])

Apply a function to a DataFrame elementwise.

argsort([by, axis, kind, order, ascending, ...])

Return the integer indices that would sort the Series values.

assign(**kwargs)

Assign columns to DataFrame from keyword arguments.

astype(dtype[, copy, errors])

Cast the object to the given dtype.

backfill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='bfill'.

bfill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='bfill'.

clip([lower, upper, inplace, axis])

Trim values at input threshold(s).

convert_dtypes([infer_objects, ...])

Convert columns to the best possible nullable dtypes.

copy([deep])

Make a copy of this object's indices and data.

corr([method, min_periods])

Compute the correlation matrix of a DataFrame.

count([axis, numeric_only])

Count non-NA cells for each column or row.

cov(**kwargs)

Compute the covariance matrix of a DataFrame.

cummax([axis])

Return cumulative max of the IndexedFrame.

cummin([axis])

Return cumulative min of the IndexedFrame.

cumprod([axis])

Return cumulative product of the IndexedFrame.

cumsum([axis])

Return cumulative sum of the IndexedFrame.

describe([percentiles, include, exclude])

Generate descriptive statistics.

deserialize(header, frames)

Generate an object from a serialized representation.

device_deserialize(header, frames)

Perform device-side deserialization tasks.

device_serialize()

Serialize data and metadata associated with device memory.

diff([periods, axis])

First discrete difference of element.

div(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

divide(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

dot(other[, reflect])

Get dot product of frame and other, (binary operator dot).

drop([labels, axis, index, columns, level, ...])

Drop specified labels from rows or columns.

drop_duplicates([subset, keep, inplace, ...])

Return DataFrame with duplicate rows removed.

dropna([axis, how, thresh, subset, inplace])

Drop rows (or columns) containing nulls from a Column.

duplicated([subset, keep])

Return boolean Series denoting duplicate rows.

eq(other[, axis, level, fill_value])

Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).

equals(other)

Test whether two objects contain the same elements.

eval(expr[, inplace])

Evaluate a string describing operations on DataFrame columns.

explode(column[, ignore_index])

Transform each element of a list-like to a row, replicating index values.

ffill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='ffill'.

fillna([value, method, axis, inplace, limit])

Fill null values with value or specified method.

first(offset)

Select initial periods of time series data based on a date offset.

floordiv(other[, axis, level, fill_value])

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

from_arrow(table)

Convert from PyArrow Table to DataFrame.

from_dict(data[, orient, dtype, columns])

Construct DataFrame from dict of array-like or dicts.

from_pandas(dataframe[, nan_as_null])

Convert from a Pandas DataFrame.

from_records(data[, index, columns, nan_as_null])

Convert structured or record ndarray to DataFrame.

ge(other[, axis, level, fill_value])

Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).

groupby([by, axis, level, as_index, sort, ...])

Group using a mapper or by a Series of columns.

gt(other[, axis, level, fill_value])

Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).

hash_values([method, seed])

Compute the hash of values in this column.

head([n])

Return the first n rows.

host_deserialize(header, frames)

Perform host-side deserialization tasks.

host_serialize()

Serialize data and metadata associated with host memory.

info([verbose, buf, max_cols, memory_usage, ...])

Print a concise summary of a DataFrame.

insert(loc, name, value[, nan_as_null])

Add a column to DataFrame at the index specified by loc.

interleave_columns()

Interleave Series columns of a table into a single column.

interpolate([method, axis, limit, inplace, ...])

Interpolate data values between some points.

isin(values)

Whether each element in the DataFrame is contained in values.

isna()

Identify missing values.

isnull()

Identify missing values.

items()

Iterate over (column name, Series) pairs.

iterrows()

Iteration is unsupported.

itertuples([index, name])

Iteration is unsupported.

join(other[, on, how, lsuffix, rsuffix, sort])

Join columns with other DataFrame on index or on a key column.

keys()

Get the columns.

kurt([axis, skipna, numeric_only])

Return Fisher's unbiased kurtosis of a sample.

kurtosis([axis, skipna, numeric_only])

Return Fisher's unbiased kurtosis of a sample.

last(offset)

Select final periods of time series data based on a date offset.

le(other[, axis, level, fill_value])

Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).

lt(other[, axis, level, fill_value])

Get Less than of DataFrame or Series and other, element-wise (binary operator lt).

map(func[, na_action])

Apply a function to a DataFrame elementwise.

mask(cond[, other, inplace])

Replace values where the condition is True.

max([axis, skipna, numeric_only])

Return the maximum of the values in the DataFrame.

mean([axis, skipna, numeric_only])

Return the mean of the values for the requested axis.

median([axis, skipna, level, numeric_only])

Return the median of the values for the requested axis.

melt(**kwargs)

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

memory_usage([index, deep])

Return the memory usage of an object.

merge(right[, on, left_on, right_on, ...])

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

min([axis, skipna, numeric_only])

Return the minimum of the values in the DataFrame.

mod(other[, axis, level, fill_value])

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

mode([axis, numeric_only, dropna])

Get the mode(s) of each element along the selected axis.

mul(other[, axis, level, fill_value])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

multiply(other[, axis, level, fill_value])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

nans_to_nulls()

Convert NaNs (if any) to nulls.

ne(other[, axis, level, fill_value])

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

nlargest(n, columns[, keep])

Return the first n rows ordered by columns in descending order.

notna()

Identify non-missing values.

notnull()

Identify non-missing values.

nsmallest(n, columns[, keep])

Return the first n rows ordered by columns in ascending order.

nunique([axis, dropna])

Count number of distinct elements in specified axis.

pad([value, axis, inplace, limit])

Synonym for Series.fillna() with method='ffill'.

partition_by_hash(columns, nparts[, keep_index])

Partition the dataframe by the hashed value of data in columns.

pct_change([periods, fill_method, limit, freq])

Calculates the percent change between sequential elements in the DataFrame.

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).

pivot(*, columns[, index, values])

Return reshaped DataFrame organized by the given index and column values.

pivot_table([values, index, columns, ...])

Create a spreadsheet-style pivot table as a DataFrame.

pop(item)

Return a column and drop it from the DataFrame.

pow(other[, axis, level, fill_value])

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

prod([axis, skipna, dtype, numeric_only, ...])

Return product of the values in the DataFrame.

product([axis, skipna, dtype, numeric_only, ...])

Return product of the values in the DataFrame.

quantile([q, axis, numeric_only, ...])

Return values at the given quantile.

query(expr[, local_dict])

Query with a boolean expression using Numba to compile a GPU kernel.

radd(other[, axis, level, fill_value])

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

rank([axis, method, numeric_only, ...])

Compute numerical data ranks (1 through n) along axis.

rdiv(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

reindex([labels, index, columns, axis, ...])

Conform DataFrame to new index.

rename([mapper, index, columns, axis, copy, ...])

Alter column and index labels.

repeat(repeats[, axis])

Repeats elements consecutively.

replace([to_replace, value, inplace, limit, ...])

Replace values given in to_replace with value.

resample(rule[, axis, closed, label, ...])

Convert the frequency of ("resample") the given time series data.

reset_index([level, drop, inplace, ...])

Reset the index of the DataFrame, or a level of it.

rfloordiv(other[, axis, level, fill_value])

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

rmod(other[, axis, level, fill_value])

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

rmul(other[, axis, level, fill_value])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

rolling(window[, min_periods, center, axis, ...])

Rolling window calculations.

round([decimals, how])

Round to a variable number of decimal places.

rpow(other[, axis, level, fill_value])

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

rsub(other[, axis, level, fill_value])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

rtruediv(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

sample([n, frac, replace, weights, ...])

Return a random sample of items from an axis of object.

scale()

Scale values to [0, 1] in float64.

scatter_by_map(map_index[, map_size, ...])

Scatter to a list of dataframes.

searchsorted(values[, side, ascending, ...])

Find indices where elements should be inserted to maintain order.

select_dtypes([include, exclude])

Return a subset of the DataFrame's columns based on the column dtypes.

serialize()

Generate an equivalent serializable representation of an object.

set_index(keys[, drop, append, inplace, ...])

Return a new DataFrame with a new index.

shift([periods, freq, axis, fill_value])

Shift values by periods positions.

skew([axis, skipna, numeric_only])

Return unbiased Fisher-Pearson skew of a sample.

sort_index([axis, level, ascending, ...])

Sort object by labels (along an axis).

sort_values(by[, axis, ascending, inplace, ...])

Sort by the values along either axis.

squeeze([axis])

Squeeze 1 dimensional axis objects into scalars.

stack([level, dropna, future_stack])

Stack the prescribed level(s) from columns to index.

std([axis, skipna, ddof, numeric_only])

Return sample standard deviation of the DataFrame.

sub(other[, axis, level, fill_value])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

subtract(other[, axis, level, fill_value])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

sum([axis, skipna, dtype, numeric_only, ...])

Return sum of the values in the DataFrame.

swaplevel([i, j, axis])

Swap level i with level j.

tail([n])

Return the last n rows as a new DataFrame or Series.

take(indices[, axis])

Return a new frame containing the rows specified by indices.

tile(count)

Repeats the rows count times to form a new Frame.

to_arrow([preserve_index])

Convert to a PyArrow Table.

to_csv([path_or_buf, sep, na_rep, columns, ...])

Write a dataframe to csv file format.

to_cupy([dtype, copy, na_value])

Convert the Frame to a CuPy array.

to_dict([orient, into])

Convert the DataFrame to a dictionary.

to_dlpack()

Converts a cuDF object into a DLPack tensor.

to_feather(path, *args, **kwargs)

Write a DataFrame to the feather format.

to_hdf(path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

to_json([path_or_buf])

Convert the cuDF object to a JSON string.

to_numpy([dtype, copy, na_value])

Convert the Frame to a NumPy array.

to_orc(fname[, compression, statistics, ...])

Write a DataFrame to the ORC format.

to_pandas(*[, nullable, arrow_type])

Convert to a Pandas DataFrame.

to_parquet(path[, engine, compression, ...])

Write a DataFrame to the parquet format.

to_records([index])

Convert to a NumPy recarray.

to_string()

Convert to a string.

to_struct([name])

Return a struct Series composed of the columns of the DataFrame.

transpose()

Transpose index and columns.

truediv(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

truncate([before, after, axis, copy])

Truncate a Series or DataFrame before and after some index value.

unstack([level, fill_value])

Pivot one or more levels of the (necessarily hierarchical) index labels.

update(other[, join, overwrite, ...])

Modify a DataFrame in place using non-NA values from another DataFrame.

value_counts([subset, normalize, sort, ...])

Return a Series containing counts of unique rows in the DataFrame.

var([axis, skipna, ddof, numeric_only])

Return unbiased variance of the DataFrame.

where(cond[, other, inplace])

Replace values where the condition is False.
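The where and mask summaries in the list above are mirror images of each other: where replaces values where the condition is False, and mask replaces values where it is True. A plain-Python model of the two rules is sketched below. The function names where_model and mask_model are hypothetical, None stands in for a null default, and the sketch ignores index alignment and the inplace option.

```python
def where_model(values, cond, other=None):
    """Keep entries whose condition is True; replace the rest with other."""
    return [value if flag else other for value, flag in zip(values, cond)]


def mask_model(values, cond, other=None):
    """Replace entries whose condition is True with other; keep the rest."""
    return [other if flag else value for value, flag in zip(values, cond)]


values = [1, -2, 3, -4]
is_positive = [value > 0 for value in values]

kept_positive = where_model(values, is_positive, other=0)
kept_negative = mask_model(values, is_positive, other=0)
```

Under this model, mask with a given condition gives the same result as where with the negated condition.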