cudf.core.groupby.groupby.GroupBy.cov

GroupBy.cov(min_periods=0, ddof=1)

Compute the pairwise covariance among the columns of a DataFrame, excluding NA/null values.

The returned DataFrame holds one covariance matrix of the columns per group. Its index combines the group key with the column name.

Both NA and null values are automatically excluded from the calculation. See the note below about bias from missing values.

A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NA.
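
The null handling and the min_periods threshold described above can be sketched in plain Python. This is an illustration only, not cuDF's implementation: the helper name pairwise_cov is hypothetical, None stands in for NA/null, and returning None when too few observations remain is an assumption of the sketch.

```python
def pairwise_cov(x, y, min_periods=0, ddof=1):
    """Covariance of x and y over rows where both values are present.

    Hypothetical helper for illustration; None plays the role of NA/null.
    """
    # Drop every row in which either value is missing.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    # Too few complete observations: report a missing result.
    # (Treating n - ddof <= 0 the same way is an assumption of this sketch.)
    if n < min_periods or n - ddof <= 0:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - ddof)


x = [5, 4, None, 6]
y = [4, 5, 7, 6]

# Three complete rows remain: (5, 4), (4, 5), (6, 6).
print(pairwise_cov(x, y))                 # 0.5
# Requiring four observations leaves the pair without a valid result.
print(pairwise_cov(x, y, min_periods=4))  # None
```

Filtering per pair of columns, rather than dropping whole rows up front, means each entry of the matrix can be based on a different number of observations.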

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:

min_periods: int, optional: Minimum number of observations required per pair of columns to have a valid result, default is 0.
ddof: int, optional: Delta degrees of freedom, default is 1.

Returns:

DataFrame: Covariance matrix.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
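
As a rough illustration of the N-ddof normalization, the following plain-Python sketch computes the covariance of two sequences. The function name sample_cov is hypothetical and the code is not taken from cuDF; returning NaN when N-ddof is not positive is an assumption of the sketch. The inputs are the val1 and val2 values of group "a" from the Examples section.

```python
def sample_cov(x, y, ddof=1):
    """Covariance of two equal-length sequences, normalized by N - ddof.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n - ddof <= 0:
        # Assumption of this sketch: no valid normalization, so return NaN.
        return float("nan")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, divided by N - ddof.
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - ddof)


val1 = [5, 4, 6]  # group "a", column val1
val2 = [4, 5, 6]  # group "a", column val2

print(sample_cov(val1, val2))          # 0.5, the off-diagonal entry for group "a"
print(sample_cov(val1, val1))          # 1.0, the variance of val1 in group "a"
print(sample_cov(val1, val2, ddof=0))  # population covariance, divides by N = 3
```

With ddof=1 the divisor is N-1, which gives the usual sample covariance; ddof=0 divides by N instead.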

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices <https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices> for more details.
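
A small constructed case shows how this can happen when each matrix entry is computed from a different subset of rows. The sketch below uses plain Python with None standing in for null. The helper cov_complete_pairs is hypothetical and the data are made up for illustration, not output from cuDF.

```python
import math


def cov_complete_pairs(x, y, ddof=1):
    """Covariance over rows where both x and y are present (None = null).

    Hypothetical helper for illustration only.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - ddof)


# y is observed only in the two rows where x takes its extreme values.
x = [-2, 2, 0, 0, 0]
y = [-2, 2, None, None, None]

var_x = cov_complete_pairs(x, x)   # uses all 5 rows of x
var_y = cov_complete_pairs(y, y)   # uses the 2 non-null rows of y
cov_xy = cov_complete_pairs(x, y)  # uses the 2 rows where both are present

corr = cov_xy / math.sqrt(var_x * var_y)
det = var_x * var_y - cov_xy ** 2

print(var_x, var_y, cov_xy)  # 2.0 8.0 8.0
print(corr)                  # 2.0, an implied correlation above 1
print(det)                   # -48.0, so the 2x2 matrix is not positive semi-definite
```

The variance of x is diluted by the three extra zero rows, while the covariance only sees the two extreme rows, so the entries no longer describe a single consistent sample.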

Examples

>>> import cudf
>>> gdf = cudf.DataFrame({
...     "id": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
...     "val1": [5, 4, 6, 4, 8, 7, 4, 5, 2],
...     "val2": [4, 5, 6, 1, 2, 9, 8, 5, 1],
...     "val3": [4, 5, 6, 1, 2, 9, 8, 5, 1],
... })
>>> gdf
  id  val1  val2  val3
0  a     5     4     4
1  a     4     5     5
2  a     6     6     6
3  b     4     1     1
4  b     8     2     2
5  b     7     9     9
6  c     4     8     8
7  c     5     5     5
8  c     2     1     1
>>> gdf.groupby("id").cov()
            val1       val2       val3
id
a  val1  1.000000   0.500000   0.500000
   val2  0.500000   1.000000   1.000000
   val3  0.500000   1.000000   1.000000
b  val1  4.333333   3.500000   3.500000
   val2  3.500000  19.000000  19.000000
   val3  3.500000  19.000000  19.000000
c  val1  2.333333   3.833333   3.833333
   val2  3.833333  12.333333  12.333333
   val3  3.833333  12.333333  12.333333