Working with missing data

In this section, we will discuss missing values (also referred to as NA, or "null values") in cudf. cudf supports missing values in all dtypes; these missing values are represented by <NA>.
How to detect missing values

To detect missing values, use the isna() and notna() functions.
import numpy as np
import cudf
df = cudf.DataFrame({"a": [1, 2, None, 4], "b": [0.1, None, 2.3, 17.17]})
df
|   | a    | b     |
|---|------|-------|
| 0 | 1    | 0.1   |
| 1 | 2    | <NA>  |
| 2 | <NA> | 2.3   |
| 3 | 4    | 17.17 |
df.isna()
|   | a     | b     |
|---|-------|-------|
| 0 | False | False |
| 1 | False | True  |
| 2 | True  | False |
| 3 | False | False |
df["a"].notna()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
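The boolean masks returned by isna() and notna() can be used directly to filter rows. A minimal sketch using the df created above (boolean masking works as in pandas):

df[df["a"].notna()]  # keeps rows 0, 1 and 3, where "a" is not missing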
Be mindful that in Python (and NumPy), NaNs do not compare equal, but Nones do. Note that cudf/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.
None == None
True
np.nan == np.nan
False
As a result, a scalar equality comparison against None or np.nan does not provide useful information for detecting missing values.
df["b"] == np.nan
0 False
1 <NA>
2 False
3 False
Name: b, dtype: bool
s = cudf.Series([None, 1, 2])
s
0 <NA>
1 1
2 2
dtype: int64
s == None
0 <NA>
1 <NA>
2 <NA>
dtype: bool
s = cudf.Series([1, 2, np.nan], nan_as_null=False)
s
0 1.0
1 2.0
2 NaN
dtype: float64
s == np.nan
0 False
1 False
2 False
dtype: bool
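Instead of comparing against None or np.nan, use isna() or notna() to detect missing entries; in cudf these also report NaN values in float columns as missing. A short sketch using the series defined above:

s.isna()   # expected: False, False, True -- the NaN is reported as missing
s.notna()  # expected: True, True, False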
Float dtypes and missing data

Because NaN is a float, a column of integers with even one missing value would ordinarily be cast to a floating-point dtype. This does not happen by default in cudf: if a NaN value is passed to the Series constructor, it is treated as a <NA> value and the integer dtype is preserved.
cudf.Series([1, 2, np.nan])
0 1
1 2
2 <NA>
dtype: int64
Hence, to keep a NaN as a NaN, you have to pass the nan_as_null=False parameter to the Series constructor.
cudf.Series([1, 2, np.nan], nan_as_null=False)
0 1.0
1 2.0
2 NaN
dtype: float64
Datetimes

For datetime64 types, cudf doesn't support NaT values. Instead, these values (which are specific to numpy and pandas) are treated as null values (<NA>) in cudf. The actual underlying value of NaT is min(int64), and cudf retains this underlying value when converting a cudf object to a pandas object.
import pandas as pd
datetime_series = cudf.Series(
    [pd.Timestamp("20120101"), pd.NaT, pd.Timestamp("20120101")]
)
datetime_series
0 2012-01-01 00:00:00.000000000
1 NaT
2 2012-01-01 00:00:00.000000000
dtype: datetime64[ns]
datetime_series.to_pandas()
0 2012-01-01
1 NaT
2 2012-01-01
dtype: datetime64[ns]
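The min(int64) sentinel can be inspected with NumPy alone (this is a NumPy illustration, not a cudf API):

np.datetime64("NaT").astype("int64")  # -9223372036854775808
np.iinfo(np.int64).min                # -9223372036854775808, i.e. min(int64)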
Any operation on rows having <NA> values in a datetime column will result in a <NA> value at the same location in the resulting column:
datetime_series - datetime_series
0 0 days 00:00:00
1 NaT
2 0 days 00:00:00
dtype: timedelta64[ns]
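You can confirm that the NaT entry is treated as an ordinary null by cudf's detection functions:

datetime_series.isna()  # expected: False, True, False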
Calculations with missing data

Null values propagate naturally through arithmetic operations between cudf objects.
df1 = cudf.DataFrame(
    {
        "a": [1, None, 2, 3, None],
        "b": cudf.Series([np.nan, 2, 3.2, 0.1, 1], nan_as_null=False),
    }
)
df2 = cudf.DataFrame(
    {"a": [1, 11, 2, 34, 10], "b": cudf.Series([0.23, 22, 3.2, None, 1])}
)
df1
|   | a    | b   |
|---|------|-----|
| 0 | 1    | NaN |
| 1 | <NA> | 2.0 |
| 2 | 2    | 3.2 |
| 3 | 3    | 0.1 |
| 4 | <NA> | 1.0 |
df2
|   | a  | b    |
|---|----|------|
| 0 | 1  | 0.23 |
| 1 | 11 | 22.0 |
| 2 | 2  | 3.2  |
| 3 | 34 | <NA> |
| 4 | 10 | 1.0  |
df1 + df2
|   | a    | b    |
|---|------|------|
| 0 | 2    | NaN  |
| 1 | <NA> | 24.0 |
| 2 | 4    | 6.4  |
| 3 | 37   | <NA> |
| 4 | <NA> | 2.0  |
When summing data along a Series, NA values are treated as 0.
df1["a"]
0 1
1 <NA>
2 2
3 3
4 <NA>
Name: a, dtype: int64
df1["a"].sum()
6
For the mean, NA values are skipped entirely (they are not counted), so the result here is (1 + 2 + 3) / 3 = 2.
df1["a"].mean()
2.0
To preserve NA values in the above calculations, sum and mean support a skipna parameter. Its value is set to True by default; we can change it to False to preserve NA values.
df1["a"].sum(skipna=False)
nan
df1["a"].mean(skipna=False)
nan
Cumulative methods like cumsum and cumprod ignore NA values by default.
df1["a"].cumsum()
0 1
1 <NA>
2 3
3 6
4 <NA>
Name: a, dtype: int64
To preserve NA values in cumulative methods, provide skipna=False.
df1["a"].cumsum(skipna=False)
0 1
1 <NA>
2 <NA>
3 <NA>
4 <NA>
Name: a, dtype: int64
Sum/product of nulls/NaNs

The sum of an empty or all-NA Series, or of an all-NA column of a DataFrame, is 0.
cudf.Series([np.nan], nan_as_null=False).sum()
0.0
cudf.Series([np.nan], nan_as_null=False).sum(skipna=False)
nan
cudf.Series([], dtype="float64").sum()
0.0
The product of an empty or all-NA Series, or of an all-NA column of a DataFrame, is 1.
cudf.Series([np.nan], nan_as_null=False).prod()
1.0
cudf.Series([np.nan], nan_as_null=False).prod(skipna=False)
nan
cudf.Series([], dtype="float64").prod()
1.0
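If you would rather obtain a missing result than 0 or 1 for an empty or all-NA reduction, sum and prod accept a min_count parameter mirroring pandas. A sketch, assuming min_count behaves as it does in pandas:

cudf.Series([], dtype="float64").sum(min_count=1)  # expected: a missing result instead of 0.0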
NA values in GroupBy

NA groups in GroupBy are automatically excluded. For example:
df1
|   | a    | b   |
|---|------|-----|
| 0 | 1    | NaN |
| 1 | <NA> | 2.0 |
| 2 | 2    | 3.2 |
| 3 | 3    | 0.1 |
| 4 | <NA> | 1.0 |
df1.groupby("a").mean()
| a | b   |
|---|-----|
| 3 | 0.1 |
| 1 | NaN |
| 2 | 3.2 |
It is also possible to include NA in groups by passing dropna=False:
df1.groupby("a", dropna=False).mean()
| a    | b   |
|------|-----|
| 3    | 0.1 |
| 1    | NaN |
| 2    | 3.2 |
| <NA> | 1.5 |
Inserting missing data

All dtypes support insertion of a missing value by assignment. Any specific location in a Series can be made null by assigning None to it.
series = cudf.Series([1, 2, 3, 4])
series
0 1
1 2
2 3
3 4
dtype: int64
series[2] = None
series
0 1
1 2
2 <NA>
3 4
dtype: int64
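This works for non-numeric dtypes as well. A brief sketch with a string Series, assuming string-column assignment behaves the same way:

str_series = cudf.Series(["a", "b", "c"])
str_series[1] = None  # the middle element becomes <NA>
str_series            # expected: a, <NA>, c (dtype: object)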
Filling missing values: fillna

fillna() can fill in NA and NaN values with non-NA data.
df1
|   | a    | b   |
|---|------|-----|
| 0 | 1    | NaN |
| 1 | <NA> | 2.0 |
| 2 | 2    | 3.2 |
| 3 | 3    | 0.1 |
| 4 | <NA> | 1.0 |
df1["b"].fillna(10)
0 10.0
1 2.0
2 3.2
3 0.1
4 1.0
Name: b, dtype: float64
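Missing values can also be filled from neighbouring rows with ffill() and bfill(). A hedged sketch, assuming these pandas-style methods are available in your cudf version:

df1["a"].ffill()  # propagate the last valid value forward: expected 1, 1, 2, 3, 3
df1["a"].bfill()  # propagate the next valid value backward: expected 1, 2, 2, 3, <NA>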
Filling with a cudf object

You can also fill missing values using a dict or a Series that is alignable. The labels of the dict or the index of the Series must match the columns of the frame you wish to fill. A common use case is to fill a DataFrame with the mean of each column.
import cupy as cp
dff = cudf.DataFrame(cp.random.randn(10, 3), columns=list("ABC"))
dff.iloc[3:5, 0] = np.nan
dff.iloc[4:6, 1] = np.nan
dff.iloc[5:8, 2] = np.nan
dff
|   | A         | B         | C         |
|---|-----------|-----------|-----------|
| 0 | -0.000980 | 0.551560  | -1.443735 |
| 1 | -1.832599 | -0.513116 | -0.681301 |
| 2 | 0.359273  | 0.166505  | -0.269154 |
| 3 | NaN       | -0.149245 | -1.083291 |
| 4 | NaN       | NaN       | 0.588775  |
| 5 | -0.382155 | NaN       | NaN       |
| 6 | 2.042128  | 1.788889  | NaN       |
| 7 | 0.269661  | 1.122631  | NaN       |
| 8 | 0.624938  | -1.074668 | -0.098387 |
| 9 | 0.064479  | 1.647954  | 0.556011  |
dff.fillna(dff.mean())
|   | A         | B         | C         |
|---|-----------|-----------|-----------|
| 0 | -0.000980 | 0.551560  | -1.443735 |
| 1 | -1.832599 | -0.513116 | -0.681301 |
| 2 | 0.359273  | 0.166505  | -0.269154 |
| 3 | 0.143093  | -0.149245 | -1.083291 |
| 4 | 0.143093  | 0.442564  | 0.588775  |
| 5 | -0.382155 | 0.442564  | -0.347297 |
| 6 | 2.042128  | 1.788889  | -0.347297 |
| 7 | 0.269661  | 1.122631  | -0.347297 |
| 8 | 0.624938  | -1.074668 | -0.098387 |
| 9 | 0.064479  | 1.647954  | 0.556011  |
dff.fillna(dff.mean()[1:3])
|   | A         | B         | C         |
|---|-----------|-----------|-----------|
| 0 | -0.000980 | 0.551560  | -1.443735 |
| 1 | -1.832599 | -0.513116 | -0.681301 |
| 2 | 0.359273  | 0.166505  | -0.269154 |
| 3 | NaN       | -0.149245 | -1.083291 |
| 4 | NaN       | 0.442564  | 0.588775  |
| 5 | -0.382155 | 0.442564  | -0.347297 |
| 6 | 2.042128  | 1.788889  | -0.347297 |
| 7 | 0.269661  | 1.122631  | -0.347297 |
| 8 | 0.624938  | -1.074668 | -0.098387 |
| 9 | 0.064479  | 1.647954  | 0.556011  |
Dropping axis labels with missing data: dropna

Missing data can be excluded using dropna():
df1
|   | a    | b   |
|---|------|-----|
| 0 | 1    | NaN |
| 1 | <NA> | 2.0 |
| 2 | 2    | 3.2 |
| 3 | 3    | 0.1 |
| 4 | <NA> | 1.0 |
df1.dropna(axis=0)
|   | a | b   |
|---|---|-----|
| 2 | 2 | 3.2 |
| 3 | 3 | 0.1 |
df1.dropna(axis=1)
|   |
|---|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |

Both columns are dropped because each contains a missing value, so only the index remains.
An equivalent dropna() is available for Series.
df1["a"].dropna()
0 1
2 2
3 3
Name: a, dtype: int64
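DataFrame.dropna() also accepts how, thresh and subset parameters to control which rows are dropped. A sketch, assuming these pandas-style parameters are supported by your cudf version:

df1.dropna(how="all")     # drop a row only if all of its values are missing
df1.dropna(thresh=2)      # keep rows that have at least 2 non-missing values
df1.dropna(subset=["a"])  # only consider column "a" when deciding what to drop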
Replacing generic values

Oftentimes we want to replace arbitrary values with other values. replace() in Series and replace() in DataFrame provide an efficient yet flexible way to perform such replacements.
series = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])
series
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
series.replace(0, 5)
0 5.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
We can also replace any value with a <NA> value.
series.replace(0, None)
0 <NA>
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
You can replace a list of values by a list of other values:
series.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
0 4.0
1 3.0
2 2.0
3 1.0
4 0.0
dtype: float64
You can also specify a mapping dict:
series.replace({0: 10, 1: 100})
0 10.0
1 100.0
2 2.0
3 3.0
4 4.0
dtype: float64
For a DataFrame, you can specify individual values by column:
df = cudf.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})
df
|   | a | b |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 1 | 6 |
| 2 | 2 | 7 |
| 3 | 3 | 8 |
| 4 | 4 | 9 |
df.replace({"a": 0, "b": 5}, 100)
|   | a   | b   |
|---|-----|-----|
| 0 | 100 | 100 |
| 1 | 1   | 6   |
| 2 | 2   | 7   |
| 3 | 3   | 8   |
| 4 | 4   | 9   |
String/regular expression replacement

cudf supports replacing string values using the replace() API:
d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", None, "d"]}
df = cudf.DataFrame(d)
df
|   | a | b | c    |
|---|---|---|------|
| 0 | 0 | a | a    |
| 1 | 1 | b | b    |
| 2 | 2 | . | <NA> |
| 3 | 3 | . | d    |
df.replace(".", "A Dot")
|   | a | b     | c    |
|---|---|-------|------|
| 0 | 0 | a     | a    |
| 1 | 1 | b     | b    |
| 2 | 2 | A Dot | <NA> |
| 3 | 3 | A Dot | d    |
df.replace([".", "b"], ["A Dot", None])
|   | a | b     | c    |
|---|---|-------|------|
| 0 | 0 | a     | a    |
| 1 | 1 | <NA>  | <NA> |
| 2 | 2 | A Dot | <NA> |
| 3 | 3 | A Dot | d    |
Replace a few different values (list -> list):
df.replace(["a", "."], ["b", "--"])
|   | a | b  | c    |
|---|---|----|------|
| 0 | 0 | b  | b    |
| 1 | 1 | b  | b    |
| 2 | 2 | -- | <NA> |
| 3 | 3 | -- | d    |
Only search in column ‘b’ (dict -> dict):
df.replace({"b": "."}, {"b": "replacement value"})
|   | a | b                 | c    |
|---|---|-------------------|------|
| 0 | 0 | a                 | a    |
| 1 | 1 | b                 | b    |
| 2 | 2 | replacement value | <NA> |
| 3 | 3 | replacement value | d    |
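For pattern-based (regular expression) replacement on string columns, the str.replace() string method can be used. A small sketch, assuming str.replace() supports the regex flag as in pandas:

df["b"].str.replace(r"\.+", "A Dot", regex=True)  # expected: a, b, A Dot, A Dot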
Numeric replacement

replace() can also be used in a way similar to fillna().
df = cudf.DataFrame(cp.random.randn(10, 2))
df[np.random.rand(df.shape[0]) > 0.5] = 1.5
df.replace(1.5, None)
|   | 0            | 1            |
|---|--------------|--------------|
| 0 | <NA>         | <NA>         |
| 1 | <NA>         | <NA>         |
| 2 | 1.579393975  | 1.824094745  |
| 3 | -1.818202592 | -1.076327404 |
| 4 | -1.005274936 | -0.993429099 |
| 5 | 1.916061967  | 0.093062522  |
| 6 | 0.023991821  | 1.07013605   |
| 7 | -0.599455766 | -0.566379641 |
| 8 | 0.029506187  | 0.847134773  |
| 9 | -0.750757995 | -0.868897574 |
Replacing more than one value is possible by passing a list.
df00 = df.iloc[0, 0]
df.replace([1.5, df00], [5, 10])
|   | 0         | 1         |
|---|-----------|-----------|
| 0 | 10.000000 | 10.000000 |
| 1 | 10.000000 | 10.000000 |
| 2 | 1.579394  | 1.824095  |
| 3 | -1.818203 | -1.076327 |
| 4 | -1.005275 | -0.993429 |
| 5 | 1.916062  | 0.093063  |
| 6 | 0.023992  | 1.070136  |
| 7 | -0.599456 | -0.566380 |
| 8 | 0.029506  | 0.847135  |
| 9 | -0.750758 | -0.868898 |
You can also operate on the DataFrame in place:
df.replace(1.5, None, inplace=True)
df
|   | 0            | 1            |
|---|--------------|--------------|
| 0 | <NA>         | <NA>         |
| 1 | <NA>         | <NA>         |
| 2 | 1.579393975  | 1.824094745  |
| 3 | -1.818202592 | -1.076327404 |
| 4 | -1.005274936 | -0.993429099 |
| 5 | 1.916061967  | 0.093062522  |
| 6 | 0.023991821  | 1.07013605   |
| 7 | -0.599455766 | -0.566379641 |
| 8 | 0.029506187  | 0.847134773  |
| 9 | -0.750757995 | -0.868897574 |