Supported Data Types#
cuDF supports many data types supported by NumPy and Pandas, including numeric, datetime, timedelta, categorical and string data types. We also provide special data types for working with decimals, list-like, and dictionary-like data.
All data types in cuDF are nullable.
Kind of data |
Data type(s) |
---|---|
Signed integer |
|
Unsigned integer |
|
Floating-point |
|
Datetime |
|
Timedelta (duration) |
|
Category |
|
String |
|
Decimal |
|
List |
|
Struct |
NumPy data types#
We use NumPy data types for integer, floating, datetime, timedelta,
and string data types. Thus, just like in NumPy,
np.dtype("float32")
, np.float32
, and "float32"
are all acceptable
ways to specify the float32
data type:
>>> import cudf
>>> s = cudf.Series([1, 2, 3], dtype="float32")
>>> s
0 1.0
1 2.0
2 3.0
dtype: float32
A note on object
#
The data type associated with string data in cuDF is "np.object"
.
>>> import cudf
>>> s = cudf.Series(["abc", "def", "ghi"])
>>> s.dtype
dtype("object")
This is for compatibility with Pandas, but it can be misleading. In
both NumPy and Pandas, "object"
is the data type associated data
composed of arbitrary Python objects (not just strings). However,
cuDF does not support storing arbitrary Python objects.
Decimal data types#
We provide special data types for working with decimal data, namely
Decimal32Dtype
,
Decimal64Dtype
, and
Decimal128Dtype
. Use these data types when you
need to store values with greater precision than allowed by floating-point
representation.
Decimal data types in cuDF are based on fixed-point representation. A
decimal data type is composed of a precision and a scale. The
precision represents the total number of digits in each value of this
dtype. For example, the precision associated with the decimal value
1.023
is 4
. The scale is the total number of digits to the right
of the decimal point. The scale associated with the value 1.023
is
3.
Each decimal data type is associated with a maximum precision:
>>> cudf.Decimal32Dtype.MAX_PRECISION
9.0
>>> cudf.Decimal64Dtype.MAX_PRECISION
18.0
>>> cudf.Decimal128Dtype.MAX_PRECISION
38
One way to create a decimal Series is from values of type decimal.Decimal.
>>> from decimal import Decimal
>>> s = cudf.Series([Decimal("1.01"), Decimal("4.23"), Decimal("0.5")])
>>> s
0 1.01
1 4.23
2 0.50
dtype: decimal128
>>> s.dtype
Decimal128Dtype(precision=3, scale=2)
Notice the data type of the result: 1.01
, 4.23
, 0.50
can all be
represented with a precision of at least 3 and a scale of at least 2.
However, the value 1.234
needs a precision of at least 4, and a
scale of at least 3, and cannot be fully represented using this data
type:
>>> s[1] = Decimal("1.234") # raises an error
Nested data types (List
and Struct
)#
ListDtype
and
StructDtype
are special data types in cuDF for
working with list-like and dictionary-like data. These are referred to as
“nested” data types, because they enable you to store a list of lists, or a
struct of lists, or a struct of list of lists, etc.,
You can create lists and struct Series from existing Pandas Series of lists and dictionaries respectively:
>>> psr = pd.Series([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
>>> psr
0 {'a': 1, 'b': 2}
1 {'a': 3, 'b': 4}
dtype: object
>>> gsr = cudf.from_pandas(psr)
>>> gsr
0 {'a': 1, 'b': 2}
1 {'a': 3, 'b': 4}
dtype: struct
>>> gsr.dtype
StructDtype({'a': dtype('int64'), 'b': dtype('int64')})
Or by reading them from disk, using a file format that supports nested data.
>>> pdf = pd.DataFrame({"a": [[1, 2], [3, 4, 5], [6, 7, 8]]})
>>> pdf.to_parquet("lists.pq")
>>> gdf = cudf.read_parquet("lists.pq")
>>> gdf
a
0 [1, 2]
1 [3, 4, 5]
2 [6, 7, 8]
>>> gdf["a"].dtype
ListDtype(int64)