{ "cells": [ { "cell_type": "markdown", "id": "4c6c548b", "metadata": {}, "source": [ "# 10 Minutes to cuDF and Dask-cuDF\n", "\n", "Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly towards new users.\n", "\n", "## What are these Libraries?\n", "\n", "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame style API in the style of [pandas](https://pandas.pydata.org).\n", "\n", "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.\n", "\n", "[Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) by calling [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n", "\n", "\n", "## When to use cuDF and Dask-cuDF\n", "\n", "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 1, "id": "36307e42", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import cupy as cp\n", "import pandas as pd\n", "\n", "import cudf\n", "import dask_cudf\n", "\n", "cp.random.seed(12)\n", "\n", "#### Portions of this were borrowed and adapted from the\n", "#### cuDF cheatsheet, existing cuDF documentation,\n", "#### and 10 Minutes to Pandas." ] }, { "cell_type": "markdown", "id": "eff5fc19", "metadata": {}, "source": [ "## Object Creation" ] }, { "cell_type": "markdown", "id": "0a747886", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." ] }, { "cell_type": "code", "execution_count": 2, "id": "f5e303df", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 4\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1, 2, 3, None, 4])\n", "s" ] }, { "cell_type": "code", "execution_count": 3, "id": "9a893956", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "# Note the call to head here to show the first few entries, unlike\n", "# cuDF objects, dask-cuDF objects do not have a printing\n", "# representation that shows values since they may not be in local\n", "# memory.\n", "ds.head(n=3)" ] }, { "cell_type": "markdown", "id": "d934af4e", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." ] }, { "cell_type": "code", "execution_count": 4, "id": "3f53fb8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
22172
33163
44154
55145
66136
77127
88118
99109
1010910
1111811
1212712
1313613
1414514
1515415
1616316
1717217
1818118
1919019
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4\n", "5 5 14 5\n", "6 6 13 6\n", "7 7 12 7\n", "8 8 11 8\n", "9 9 10 9\n", "10 10 9 10\n", "11 11 8 11\n", "12 12 7 12\n", "13 13 6 13\n", "14 14 5 14\n", "15 15 4 15\n", "16 16 3 16\n", "17 17 2 17\n", "18 18 1 18\n", "19 19 0 19" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.DataFrame(\n", " {\n", " \"a\": list(range(20)),\n", " \"b\": list(reversed(range(20))),\n", " \"c\": list(range(20)),\n", " }\n", ")\n", "df" ] }, { "cell_type": "markdown", "id": "b17db919", "metadata": {}, "source": [ "Now we will convert our cuDF dataframe into a dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here `.head()` to look at the first few values). In the general case (see the end of this notebook), the data in `ddf` will be distributed across multiple GPUs.\n", "\n", "In this small case, we could call `ddf.compute()` to obtain a cuDF object from the dask-cuDF object. In general, we should avoid calling `.compute()` on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. Hence, throughout this notebook we will generally call `.head()` to inspect the first few values of a dask-cuDF dataframe, occasionally calling out places where we use `.compute()` and why.\n", "\n", "*To understand more of the differences between how cuDF and dask-cuDF behave here, visit the [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after this one.*" ] }, { "cell_type": "code", "execution_count": 5, "id": "8904b5ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
22172
33163
44154
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3\n", "4 4 15 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.from_cudf(df, npartitions=2)\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "0573b7bb", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", "\n", "*Note that best practice for using dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with `read_csv` or other builtin I/O routines (discussed below).*" ] }, { "cell_type": "code", "execution_count": 6, "id": "06a42f3a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
000.1
110.2
22<NA>
330.3
\n", "
" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2\n", "2 2 \n", "3 3 0.3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pdf = pd.DataFrame({\"a\": [0, 1, 2, 3], \"b\": [0.1, 0.2, None, 0.3]})\n", "gdf = cudf.DataFrame.from_pandas(pdf)\n", "gdf" ] }, { "cell_type": "code", "execution_count": 7, "id": "c67de344", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
000.1
110.2
\n", "
" ], "text/plain": [ " a b\n", "0 0 0.1\n", "1 1 0.2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dask_gdf.head(n=2)" ] }, { "cell_type": "markdown", "id": "5820795f", "metadata": {}, "source": [ "## Viewing Data" ] }, { "cell_type": "markdown", "id": "b3008757", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." ] }, { "cell_type": "code", "execution_count": 8, "id": "0c329914", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "code", "execution_count": 9, "id": "b989e208", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head(2)" ] }, { "cell_type": "markdown", "id": "16798a32", "metadata": {}, "source": [ "Sorting by values." ] }, { "cell_type": "code", "execution_count": 10, "id": "2190856d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1919019
1818118
1717217
1616316
1515415
1414514
1313613
1212712
1111811
1010910
99109
88118
77127
66136
55145
44154
33163
22172
11181
00190
\n", "
" ], "text/plain": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15\n", "14 14 5 14\n", "13 13 6 13\n", "12 12 7 12\n", "11 11 8 11\n", "10 10 9 10\n", "9 9 10 9\n", "8 8 11 8\n", "7 7 12 7\n", "6 6 13 6\n", "5 5 14 5\n", "4 4 15 4\n", "3 3 16 3\n", "2 2 17 2\n", "1 1 18 1\n", "0 0 19 0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by=\"b\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "6594bd6f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1919019
1818118
1717217
1616316
1515415
\n", "
" ], "text/plain": [ " a b c\n", "19 19 0 19\n", "18 18 1 18\n", "17 17 2 17\n", "16 16 3 16\n", "15 15 4 15" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.sort_values(by=\"b\").head()" ] }, { "cell_type": "markdown", "id": "3302a647", "metadata": {}, "source": [ "## Selecting a Column" ] }, { "cell_type": "markdown", "id": "aebc72ab", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." ] }, { "cell_type": "code", "execution_count": 12, "id": "4dafb4ed", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "10 10\n", "11 11\n", "12 12\n", "13 13\n", "14 14\n", "15 15\n", "16 16\n", "17 17\n", "18 18\n", "19 19\n", "Name: a, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"a\"]" ] }, { "cell_type": "code", "execution_count": 13, "id": "b38f05fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "Name: a, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].head()" ] }, { "cell_type": "markdown", "id": "a5160dd1", "metadata": {}, "source": [ "## Selecting Rows by Label" ] }, { "cell_type": "markdown", "id": "51ff2093", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." ] }, { "cell_type": "code", "execution_count": 14, "id": "e8870657", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
2217
3316
4415
5514
\n", "
" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2:5, [\"a\", \"b\"]]" ] }, { "cell_type": "code", "execution_count": 15, "id": "f041e661", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
2217
3316
4415
5514
\n", "
" ], "text/plain": [ " a b\n", "2 2 17\n", "3 3 16\n", "4 4 15\n", "5 5 14" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.loc[2:5, [\"a\", \"b\"]].head()" ] }, { "cell_type": "markdown", "id": "d8e07162", "metadata": {}, "source": [ "## Selecting Rows by Position" ] }, { "cell_type": "markdown", "id": "435eb76a", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." ] }, { "cell_type": "code", "execution_count": 16, "id": "bc337d5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 19\n", "c 0\n", "Name: 0, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0]" ] }, { "cell_type": "code", "execution_count": 17, "id": "03671456", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
0019
1118
2217
\n", "
" ], "text/plain": [ " a b\n", "0 0 19\n", "1 1 18\n", "2 2 17" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0:3, 0:2]" ] }, { "cell_type": "markdown", "id": "aa935d5b", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." ] }, { "cell_type": "code", "execution_count": 18, "id": "79883c37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
33163
44154
\n", "
" ], "text/plain": [ " a b c\n", "3 3 16 3\n", "4 4 15 4" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[3:5]" ] }, { "cell_type": "code", "execution_count": 19, "id": "2f761695", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 \n", "4 4\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[3:5]" ] }, { "cell_type": "markdown", "id": "9bf9c0b0", "metadata": {}, "source": [ "## Boolean Indexing" ] }, { "cell_type": "markdown", "id": "9b08b2b9", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." ] }, { "cell_type": "code", "execution_count": 20, "id": "1eb08f0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
22172
33163
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2\n", "3 3 16 3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.b > 15]" ] }, { "cell_type": "code", "execution_count": 21, "id": "324dd036", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
11181
22172
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", "2 2 17 2" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[ddf.b > 15].head(n=3)" ] }, { "cell_type": "markdown", "id": "f95c9ab7", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." ] }, { "cell_type": "code", "execution_count": 22, "id": "fa643410", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1616316
\n", "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query(\"b == 3\")" ] }, { "cell_type": "markdown", "id": "7aa0089f", "metadata": {}, "source": [ "Note here we call `compute()` rather than `head()` on the dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back)." ] }, { "cell_type": "code", "execution_count": 23, "id": "e2706a02", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1616316
\n", "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.query(\"b == 3\").compute()" ] }, { "cell_type": "markdown", "id": "694dcbad", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." ] }, { "cell_type": "code", "execution_count": 24, "id": "353b0250", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1616316
\n", "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cudf_comparator = 3\n", "df.query(\"b == @cudf_comparator\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "a35c8a5a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
1616316
\n", "
" ], "text/plain": [ " a b c\n", "16 16 3 16" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dask_cudf_comparator = 3\n", "ddf.query(\"b == @val\", local_dict={\"val\": dask_cudf_comparator}).compute()" ] }, { "cell_type": "markdown", "id": "8e004749", "metadata": {}, "source": [ "Using the `isin` method for filtering." ] }, { "cell_type": "code", "execution_count": 26, "id": "20936418", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
00190
55145
\n", "
" ], "text/plain": [ " a b c\n", "0 0 19 0\n", "5 5 14 5" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.a.isin([0, 5])]" ] }, { "cell_type": "markdown", "id": "8e456f03", "metadata": {}, "source": [ "## MultiIndex" ] }, { "cell_type": "markdown", "id": "e494bd0b", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." ] }, { "cell_type": "code", "execution_count": 27, "id": "4ae70724", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MultiIndex([('a', 1),\n", " ('a', 2),\n", " ('b', 3),\n", " ('b', 4)],\n", " )" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arrays = [[\"a\", \"a\", \"b\", \"b\"], [1, 2, 3, 4]]\n", "tuples = list(zip(*arrays))\n", "idx = cudf.MultiIndex.from_tuples(tuples)\n", "idx" ] }, { "cell_type": "markdown", "id": "ab232727", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." ] }, { "cell_type": "code", "execution_count": 28, "id": "cb1d1633", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstsecond
a10.0826540.967955
20.3994170.441425
b30.7842970.793582
40.0703030.271711
\n", "
" ], "text/plain": [ " first second\n", "a 1 0.082654 0.967955\n", " 2 0.399417 0.441425\n", "b 3 0.784297 0.793582\n", " 4 0.070303 0.271711" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1 = cudf.DataFrame(\n", " {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n", ")\n", "gdf1.index = idx\n", "gdf1" ] }, { "cell_type": "code", "execution_count": 29, "id": "73ba31af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
1234
first0.3433820.0037000.200430.581614
second0.9078120.1015120.241790.224180
\n", "
" ], "text/plain": [ " a b \n", " 1 2 3 4\n", "first 0.343382 0.003700 0.20043 0.581614\n", "second 0.907812 0.101512 0.24179 0.224180" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf2 = cudf.DataFrame(\n", " {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n", ").T\n", "gdf2.columns = idx\n", "gdf2" ] }, { "cell_type": "markdown", "id": "c5f33160", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex, both with `.loc`" ] }, { "cell_type": "code", "execution_count": 30, "id": "1048b7cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "first 0.784297\n", "second 0.793582\n", "Name: ('b', 3), dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1.loc[(\"b\", 3)]" ] }, { "cell_type": "markdown", "id": "5123f759", "metadata": {}, "source": [ "And `.iloc`" ] }, { "cell_type": "code", "execution_count": 31, "id": "369d164d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstsecond
a10.0826540.967955
20.3994170.441425
\n", "
" ], "text/plain": [ " first second\n", "a 1 0.082654 0.967955\n", " 2 0.399417 0.441425" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf1.iloc[0:2]" ] }, { "cell_type": "markdown", "id": "8b3b96e9", "metadata": {}, "source": [ "Missing Data\n", "------------" ] }, { "cell_type": "markdown", "id": "d12141eb", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." ] }, { "cell_type": "code", "execution_count": 32, "id": "913a7b5f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 999\n", "4 4\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.fillna(999)" ] }, { "cell_type": "code", "execution_count": 33, "id": "14479a42", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.fillna(999).head(n=3)" ] }, { "cell_type": "markdown", "id": "d97605e6", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "markdown", "id": "f29a5de0", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." ] }, { "cell_type": "code", "execution_count": 34, "id": "b1a1666e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.666666666666666)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.mean(), s.var()" ] }, { "cell_type": "markdown", "id": "f4879742", "metadata": {}, "source": [ "This serves as a prototypical example of when we might want to call `.compute()`. The result of computing the mean and variance is a single number in each case, so it is definitely reasonable to look at the entire result!" ] }, { "cell_type": "code", "execution_count": 35, "id": "0cb7a207", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2.5, 1.6666666666666667)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.mean().compute(), ds.var().compute()" ] }, { "cell_type": "markdown", "id": "af792a18", "metadata": {}, "source": [ "## Applymap" ] }, { "cell_type": "markdown", "id": "f6094cbe", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." ] }, { "cell_type": "code", "execution_count": 36, "id": "5b154619", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "5 15\n", "6 16\n", "7 17\n", "8 18\n", "9 19\n", "10 20\n", "11 21\n", "12 22\n", "13 23\n", "14 24\n", "15 25\n", "16 26\n", "17 27\n", "18 28\n", "19 29\n", "Name: a, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_ten(num):\n", " return num + 10\n", "\n", "\n", "df[\"a\"].apply(add_ten)" ] }, { "cell_type": "code", "execution_count": 37, "id": "8da5c3cb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 11\n", "2 12\n", "3 13\n", "4 14\n", "Name: a, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].map_partitions(add_ten).head()" ] }, { "cell_type": "markdown", "id": "4d4fdde1", "metadata": {}, "source": [ "## Histogramming" ] }, { "cell_type": "markdown", "id": "b98a7daf", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." ] }, { "cell_type": "code", "execution_count": 38, "id": "c7b8ea5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 1\n", "6 1\n", "1 1\n", "14 1\n", "2 1\n", "5 1\n", "11 1\n", "7 1\n", "17 1\n", "13 1\n", "8 1\n", "16 1\n", "0 1\n", "10 1\n", "4 1\n", "9 1\n", "19 1\n", "18 1\n", "3 1\n", "12 1\n", "Name: a, dtype: int32" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a.value_counts()" ] }, { "cell_type": "code", "execution_count": 39, "id": "cc9d34f6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15 1\n", "6 1\n", "1 1\n", "14 1\n", "2 1\n", "Name: a, dtype: int64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.a.value_counts().head()" ] }, { "cell_type": "markdown", "id": "437172ba", "metadata": {}, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "id": "fd3fc4f3", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html#string-handling) for more information." ] }, { "cell_type": "code", "execution_count": 40, "id": "86974041", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "4 baca\n", "5 \n", "6 caba\n", "7 dog\n", "8 cat\n", "dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([\"A\", \"B\", \"C\", \"Aaba\", \"Baca\", None, \"CABA\", \"dog\", \"cat\"])\n", "s.str.lower()" ] }, { "cell_type": "code", "execution_count": 41, "id": "c6a61a08", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 aaba\n", "dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", "ds.str.lower().head(n=4)" ] }, { "cell_type": "markdown", "id": "44fe1243", "metadata": {}, "source": [ "As well as simple manipulation, We can also match strings using [regular expressions](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.match.html)." ] }, { "cell_type": "code", "execution_count": 42, "id": "51158a24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "5 \n", "6 False\n", "7 False\n", "8 True\n", "dtype: bool" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.str.match(\"^[aAc].+\")" ] }, { "cell_type": "code", "execution_count": 43, "id": "4f3e36f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 True\n", "4 False\n", "dtype: bool" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.str.match(\"^[aAc].+\").head()" ] }, { "cell_type": "markdown", "id": "5528afa8", "metadata": {}, "source": [ "## Concat" ] }, { "cell_type": "markdown", "id": "e05b1078", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." ] }, { "cell_type": "code", "execution_count": 44, "id": "6c6d10bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "0 1\n", "1 2\n", "2 3\n", "3 \n", "4 5\n", "dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = cudf.Series([1, 2, 3, None, 5])\n", "cudf.concat([s, s])" ] }, { "cell_type": "code", "execution_count": 45, "id": "d3e5cf87", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n", "dask_cudf.concat([ds2, ds2]).head(n=3)" ] }, { "cell_type": "markdown", "id": "df087d2f", "metadata": {}, "source": [ "## Join" ] }, { "cell_type": "markdown", "id": "89cf5022", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is **not maintained**, but may be restored post-merge by sorting by the index." ] }, { "cell_type": "code", "execution_count": 46, "id": "075c97a7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyvals_avals_b
0a10.0100.0
1c12.0101.0
2e14.0102.0
3b11.0<NA>
4d13.0<NA>
\n", "
" ], "text/plain": [ " key vals_a vals_b\n", "0 a 10.0 100.0\n", "1 c 12.0 101.0\n", "2 e 14.0 102.0\n", "3 b 11.0 \n", "4 d 13.0 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_a = cudf.DataFrame()\n", "df_a[\"key\"] = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n", "df_a[\"vals_a\"] = [float(i + 10) for i in range(5)]\n", "\n", "df_b = cudf.DataFrame()\n", "df_b[\"key\"] = [\"a\", \"c\", \"e\"]\n", "df_b[\"vals_b\"] = [float(i + 100) for i in range(3)]\n", "\n", "merged = df_a.merge(df_b, on=[\"key\"], how=\"left\")\n", "merged" ] }, { "cell_type": "code", "execution_count": 47, "id": "b28fc57b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyvals_avals_b
0c12.0101.0
1e14.0102.0
2b11.0<NA>
3d13.0<NA>
\n", "
" ], "text/plain": [ " key vals_a vals_b\n", "0 c 12.0 101.0\n", "1 e 14.0 102.0\n", "2 b 11.0 \n", "3 d 13.0 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n", "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n", "\n", "merged = ddf_a.merge(ddf_b, on=[\"key\"], how=\"left\").head(n=4)\n", "merged" ] }, { "cell_type": "markdown", "id": "e4695c6e", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "id": "ecb27b06", "metadata": {}, "source": [ "Like [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html), cuDF and Dask-cuDF support the [Split-Apply-Combine groupby paradigm](https://doi.org/10.18637/jss.v040.i01)." ] }, { "cell_type": "code", "execution_count": 48, "id": "d8db18d9", "metadata": {}, "outputs": [], "source": [ "df[\"agg_col1\"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n", "df[\"agg_col2\"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n", "\n", "ddf = dask_cudf.from_cudf(df, npartitions=2)" ] }, { "cell_type": "markdown", "id": "8e2f0961", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." ] }, { "cell_type": "code", "execution_count": 49, "id": "e8a7f1f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col2
agg_col1
190100904
0100901003
\n", "
" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "1 90 100 90 4\n", "0 100 90 100 3" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"agg_col1\").sum()" ] }, { "cell_type": "code", "execution_count": 50, "id": "4dd090a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col2
agg_col1
190100904
0100901003
\n", "
" ], "text/plain": [ " a b c agg_col2\n", "agg_col1 \n", "1 90 100 90 4\n", "0 100 90 100 3" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(\"agg_col1\").sum().compute()" ] }, { "cell_type": "markdown", "id": "5ff1e8bf", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." ] }, { "cell_type": "code", "execution_count": 51, "id": "4738f0ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1agg_col2
10546054
00736073
11364036
01273027
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "1 0 54 60 54\n", "0 0 73 60 73\n", "1 1 36 40 36\n", "0 1 27 30 27" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby([\"agg_col1\", \"agg_col2\"]).sum()" ] }, { "cell_type": "code", "execution_count": 52, "id": "9b07feb1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1agg_col2
11364036
00736073
10546054
01273027
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 agg_col2 \n", "1 1 36 40 36\n", "0 0 73 60 73\n", "1 0 54 60 54\n", "0 1 27 30 27" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby([\"agg_col1\", \"agg_col2\"]).sum().compute()" ] }, { "cell_type": "markdown", "id": "443e179b", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." ] }, { "cell_type": "code", "execution_count": 53, "id": "f196ad8b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1
11810.090
0199.0100
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 \n", "1 18 10.0 90\n", "0 19 9.0 100" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"})" ] }, { "cell_type": "code", "execution_count": 54, "id": "3853483f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
agg_col1
11810.090
0199.0100
\n", "
" ], "text/plain": [ " a b c\n", "agg_col1 \n", "1 18 10.0 90\n", "0 19 9.0 100" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"}).compute()" ] }, { "cell_type": "markdown", "id": "a5bf30e1", "metadata": {}, "source": [ "## Transpose" ] }, { "cell_type": "markdown", "id": "5ac3b004", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 55, "id": "c5fbdb50", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
014
125
236
\n", "
" ], "text/plain": [ " a b\n", "0 1 4\n", "1 2 5\n", "2 3 6" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample = cudf.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6]})\n", "sample" ] }, { "cell_type": "code", "execution_count": 56, "id": "733ed90c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
a123
b456
\n", "
" ], "text/plain": [ " 0 1 2\n", "a 1 2 3\n", "b 4 5 6" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample.transpose()" ] }, { "cell_type": "markdown", "id": "e0915c46", "metadata": {}, "source": [ "## Time Series" ] }, { "cell_type": "markdown", "id": "ec7f0b81", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." ] }, { "cell_type": "code", "execution_count": 57, "id": "a6d45607", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevalue
02018-11-200.986051
12018-11-210.232034
22018-11-220.397617
32018-11-230.103839
\n", "
" ], "text/plain": [ " date value\n", "0 2018-11-20 0.986051\n", "1 2018-11-21 0.232034\n", "2 2018-11-22 0.397617\n", "3 2018-11-23 0.103839" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import datetime as dt\n", "\n", "date_df = cudf.DataFrame()\n", "date_df[\"date\"] = pd.date_range(\"11/20/2018\", periods=72, freq=\"D\")\n", "date_df[\"value\"] = cp.random.sample(len(date_df))\n", "\n", "search_date = dt.datetime.strptime(\"2018-11-23\", \"%Y-%m-%d\")\n", "date_df.query(\"date <= @search_date\")" ] }, { "cell_type": "code", "execution_count": 58, "id": "fbacaae1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datevalue
02018-11-200.986051
12018-11-210.232034
22018-11-220.397617
32018-11-230.103839
\n", "
" ], "text/plain": [ " date value\n", "0 2018-11-20 0.986051\n", "1 2018-11-21 0.232034\n", "2 2018-11-22 0.397617\n", "3 2018-11-23 0.103839" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n", "date_ddf.query(\n", " \"date <= @search_date\", local_dict={\"search_date\": search_date}\n", ").compute()" ] }, { "cell_type": "markdown", "id": "45f9408b", "metadata": {}, "source": [ "## Categoricals" ] }, { "cell_type": "markdown", "id": "5eb96f98", "metadata": {}, "source": [ "`DataFrames` support categorical columns." ] }, { "cell_type": "code", "execution_count": 59, "id": "d735b5cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgrade
01a
12b
23b
34a
45a
56e
\n", "
" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b\n", "3 4 a\n", "4 5 a\n", "5 6 e" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf = cudf.DataFrame(\n", " {\"id\": [1, 2, 3, 4, 5, 6], \"grade\": [\"a\", \"b\", \"b\", \"a\", \"a\", \"e\"]}\n", ")\n", "gdf[\"grade\"] = gdf[\"grade\"].astype(\"category\")\n", "gdf" ] }, { "cell_type": "code", "execution_count": 60, "id": "9d1ff798", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idgrade
01a
12b
23b
\n", "
" ], "text/plain": [ " id grade\n", "0 1 a\n", "1 2 b\n", "2 3 b" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", "dgdf.head(n=3)" ] }, { "cell_type": "markdown", "id": "a9c2bcac", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." ] }, { "cell_type": "code", "execution_count": 61, "id": "a7135eda", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StringIndex(['a' 'b' 'e'], dtype='object')" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.categories" ] }, { "cell_type": "markdown", "id": "466e1ed2", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." ] }, { "cell_type": "code", "execution_count": 62, "id": "f00c615a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "dtype: uint8" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gdf.grade.cat.codes" ] }, { "cell_type": "code", "execution_count": 63, "id": "d209512f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 0\n", "4 0\n", "5 2\n", "dtype: uint8" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf.grade.cat.codes.compute()" ] }, { "cell_type": "markdown", "id": "1b391a0d", "metadata": {}, "source": [ "## Converting to Pandas" ] }, { "cell_type": "markdown", "id": "cfdd172b", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." ] }, { "cell_type": "code", "execution_count": 64, "id": "1fcd9c7f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head().to_pandas()" ] }, { "cell_type": "markdown", "id": "aa8a445b", "metadata": {}, "source": [ "To convert the first few entries to pandas, we similarly call `.head()` on the dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert." ] }, { "cell_type": "code", "execution_count": 65, "id": "786d39d2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head().to_pandas()" ] }, { "cell_type": "markdown", "id": "584c4594", "metadata": {}, "source": [ "In contrast, if we want to convert the entire frame, we need to call `.compute()` on `ddf` to get a local cuDF dataframe, and then call `to_pandas()`, followed by subsequent processing. This workflow is less recommended, since it both puts high memory pressure on a single GPU (the `.compute()` call) and does not take advantage of GPU acceleration for processing (the computation happens on in pandas)." ] }, { "cell_type": "code", "execution_count": 66, "id": "93f06cdc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().to_pandas().head()" ] }, { "cell_type": "markdown", "id": "a104294a", "metadata": {}, "source": [ "## Converting to Numpy" ] }, { "cell_type": "markdown", "id": "c5d3e508", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 67, "id": "2948b577", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_numpy()" ] }, { "cell_type": "code", "execution_count": 68, "id": "1cff6352", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 19, 0, 1, 1],\n", " [ 1, 18, 1, 0, 0],\n", " [ 2, 17, 2, 1, 0],\n", " [ 3, 16, 3, 0, 1],\n", " [ 4, 15, 4, 1, 0],\n", " [ 5, 14, 5, 0, 0],\n", " [ 6, 13, 6, 1, 1],\n", " [ 7, 12, 7, 0, 0],\n", " [ 8, 11, 8, 1, 0],\n", " [ 9, 10, 9, 0, 1],\n", " [10, 9, 10, 1, 0],\n", " [11, 8, 11, 0, 0],\n", " [12, 7, 12, 1, 1],\n", " [13, 6, 13, 0, 0],\n", " [14, 5, 14, 1, 0],\n", " [15, 4, 15, 0, 1],\n", " [16, 3, 16, 1, 0],\n", " [17, 2, 17, 0, 0],\n", " [18, 1, 18, 1, 1],\n", " [19, 0, 19, 0, 0]])" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute().to_numpy()" ] }, { "cell_type": "markdown", "id": "c1f09303", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." ] }, { "cell_type": "code", "execution_count": 69, "id": "997c89ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"a\"].to_numpy()" ] }, { "cell_type": "code", "execution_count": 70, "id": "243df512", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"a\"].compute().to_numpy()" ] }, { "cell_type": "markdown", "id": "b520acf7", "metadata": {}, "source": [ "## Converting to Arrow" ] }, { "cell_type": "markdown", "id": "050e67e5", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." ] }, { "cell_type": "code", "execution_count": 71, "id": "0ac9e740", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "----\n", "a: [[0,1,2,3,4,...,15,16,17,18,19]]\n", "b: [[19,18,17,16,15,...,4,3,2,1,0]]\n", "c: [[0,1,2,3,4,...,15,16,17,18,19]]\n", "agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]\n", "agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_arrow()" ] }, { "cell_type": "code", "execution_count": 72, "id": "f3170fc3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pyarrow.Table\n", "a: int64\n", "b: int64\n", "c: int64\n", "agg_col1: int64\n", "agg_col2: int64\n", "----\n", "a: [[0,1,2,3,4]]\n", "b: [[19,18,17,16,15]]\n", "c: [[0,1,2,3,4]]\n", "agg_col1: [[1,0,1,0,1]]\n", "agg_col2: [[1,0,0,1,0]]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.head().to_arrow()" ] }, { "cell_type": "markdown", "id": "6f0251c6", "metadata": {}, "source": [ "## Reading/Writing CSV Files" ] }, { "cell_type": "markdown", "id": "2d1935c6", "metadata": {}, "source": [ "Writing to a CSV file." ] }, { "cell_type": "code", "execution_count": 73, "id": "36f5039f", "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(\"example_output\"):\n", " os.mkdir(\"example_output\")\n", "\n", "df.to_csv(\"example_output/foo.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 74, "id": "22c1eb6a", "metadata": {}, "outputs": [], "source": [ "ddf.compute().to_csv(\"example_output/foo_dask.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "320c3968", "metadata": {}, "source": [ "Reading from a csv file." ] }, { "cell_type": "code", "execution_count": 75, "id": "c110a80f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.read_csv(\"example_output/foo.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "787eae14", "metadata": {}, "source": [ "Note that for the dask-cuDF case, we use `dask_cudf.read_csv` in preference to `dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across multiple GPUs and handle larger CSV files that would fit in memory on a single GPU." ] }, { "cell_type": "code", "execution_count": 76, "id": "a699dfef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv(\"example_output/foo_dask.csv\")\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "72857b2c", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." ] }, { "cell_type": "code", "execution_count": 77, "id": "825a0c03", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv(\"example_output/*.csv\")\n", "ddf.head()" ] }, { "cell_type": "markdown", "id": "763c555b", "metadata": {}, "source": [ "## Reading/Writing Parquet Files" ] }, { "cell_type": "markdown", "id": "8766d4ac", "metadata": {}, "source": [ "Writing to parquet files with cuDF's GPU-accelerated parquet writer" ] }, { "cell_type": "code", "execution_count": 78, "id": "5038b284", "metadata": {}, "outputs": [], "source": [ "df.to_parquet(\"example_output/temp_parquet\")" ] }, { "cell_type": "markdown", "id": "b4b49824", "metadata": {}, "source": [ "Reading parquet files with cuDF's GPU-accelerated parquet reader." ] }, { "cell_type": "code", "execution_count": 79, "id": "bb657a69", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = cudf.read_parquet(\"example_output/temp_parquet\")\n", "df" ] }, { "cell_type": "markdown", "id": "e9a29874", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using cuDF's parquet writer under the hood." ] }, { "cell_type": "code", "execution_count": 80, "id": "0c3db7b0", "metadata": {}, "outputs": [], "source": [ "ddf.to_parquet(\"example_output/ddf_parquet_files\")" ] }, { "cell_type": "markdown", "id": "90a49967", "metadata": {}, "source": [ "## Reading/Writing ORC Files" ] }, { "cell_type": "markdown", "id": "de9d03fa", "metadata": {}, "source": [ "Writing ORC files." ] }, { "cell_type": "code", "execution_count": 81, "id": "c387f8f2", "metadata": {}, "outputs": [], "source": [ "df.to_orc(\"example_output/temp_orc\")" ] }, { "cell_type": "markdown", "id": "242c32a2", "metadata": {}, "source": [ "And reading" ] }, { "cell_type": "code", "execution_count": 82, "id": "d4bab6da", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
5514500
6613611
7712700
8811810
9910901
101091010
111181100
121271211
131361300
141451410
151541501
161631610
171721700
181811811
191901900
\n", "
" ], "text/plain": [ " a b c agg_col1 agg_col2\n", "0 0 19 0 1 1\n", "1 1 18 1 0 0\n", "2 2 17 2 1 0\n", "3 3 16 3 0 1\n", "4 4 15 4 1 0\n", "5 5 14 5 0 0\n", "6 6 13 6 1 1\n", "7 7 12 7 0 0\n", "8 8 11 8 1 0\n", "9 9 10 9 0 1\n", "10 10 9 10 1 0\n", "11 11 8 11 0 0\n", "12 12 7 12 1 1\n", "13 13 6 13 0 0\n", "14 14 5 14 1 0\n", "15 15 4 15 0 1\n", "16 16 3 16 1 0\n", "17 17 2 17 0 0\n", "18 18 1 18 1 1\n", "19 19 0 19 0 0" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = cudf.read_orc(\"example_output/temp_orc\")\n", "df2" ] }, { "cell_type": "markdown", "id": "c988553d", "metadata": {}, "source": [ "## Dask Performance Tips\n", "\n", "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed immediately, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n", "\n", "Sometimes, though, we want to force the execution of operations. Calling `persist` on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we're using distributed systems, we may want to wait until `persist` is finished before beginning any downstream operations. We can enforce this contract by using `wait`. Wrapping an operation with `wait` will ensure it doesn't begin executing until all necessary upstream operations have finished.\n", "\n", "The snippets below provide basic examples, using `LocalCUDACluster` to create one dask-worker per GPU on the local machine. For more detailed information about `persist` and `wait`, please see the Dask documentation for [persist](https://docs.dask.org/en/latest/api.html#dask.persist) and [wait](https://docs.dask.org/en/latest/futures.html#distributed.wait). Wait relies on the concept of Futures, which is beyond the scope of this tutorial. For more information on Futures, see the Dask [Futures](https://docs.dask.org/en/latest/futures.html) documentation. For more information about multi-GPU clusters, please see the [dask-cuda](https://github.com/rapidsai/dask-cuda) library (documentation is in progress)." ] }, { "cell_type": "markdown", "id": "976a8dca", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." ] }, { "cell_type": "code", "execution_count": 83, "id": "39c82511", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "from dask.distributed import Client, wait\n", "from dask_cuda import LocalCUDACluster\n", "\n", "cluster = LocalCUDACluster()\n", "client = Client(cluster)" ] }, { "cell_type": "markdown", "id": "819c2d92", "metadata": {}, "source": [ "### Persisting Data\n", "\n", "Next, we create our Dask-cuDF DataFrame and apply a transformation, storing the result as a new column." ] }, { "cell_type": "code", "execution_count": 84, "id": "f5c0ca87", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
npartitions=16
0int64int64int64
625000.........
............
9375000.........
9999999.........
\n", "
\n", "
Dask Name: assign, 4 graph layers
" ], "text/plain": [ "" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nrows = 10000000\n", "\n", "df2 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n", "ddf2 = dask_cudf.from_cudf(df2, npartitions=16)\n", "ddf2[\"c\"] = ddf2[\"a\"] + 5\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 85, "id": "eec23c4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mon Nov 14 03:05:08 2022 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 4538MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", "| N/A 32C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |\n", "| N/A 33C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |\n", "| N/A 31C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", "| N/A 33C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |\n", "| N/A 35C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| 0 N/A N/A 57132 C .../python 333MiB |\n", "| 1 N/A N/A 57131 C .../python 333MiB |\n", "| 2 N/A N/A 57143 C .../python 333MiB |\n", "| 3 N/A N/A 57124 C .../python 333MiB |\n", "| 4 N/A N/A 57135 C .../python 333MiB |\n", "| 5 N/A N/A 57144 C .../python 333MiB |\n", "| 6 N/A N/A 57126 C .../python 333MiB |\n", "| 7 N/A N/A 57139 C .../python 333MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "id": "578a1698", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are sixty-four tasks in the task graph and we're using about 330 MB of device memory on each GPU. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)." ] }, { "cell_type": "code", "execution_count": 86, "id": "3de4c0cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
npartitions=16
0int64int64int64
625000.........
............
9375000.........
9999999.........
\n", "
\n", "
Dask Name: assign, 1 graph layer
" ], "text/plain": [ "" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf2 = ddf2.persist()\n", "ddf2" ] }, { "cell_type": "code", "execution_count": 87, "id": "64c9f96c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mon Nov 14 03:05:15 2022 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 4900MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", "| N/A 32C P0 56W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |\n", "| N/A 33C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", "| N/A 32C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", "| N/A 33C P0 56W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |\n", "| N/A 35C P0 55W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", "| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |\n", "| N/A 32C P0 54W / 300W | 698MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: |\n", "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", "| 0 N/A N/A 57132 C .../python 695MiB |\n", "| 1 N/A N/A 57131 C .../python 695MiB |\n", "| 2 N/A N/A 57143 C .../python 695MiB |\n", "| 3 N/A N/A 57124 C .../python 695MiB |\n", "| 4 N/A N/A 57135 C .../python 695MiB |\n", "| 5 N/A N/A 57144 C .../python 695MiB |\n", "| 6 N/A N/A 57126 C .../python 695MiB |\n", "| 7 N/A N/A 57139 C .../python 695MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "# Sleep to ensure the persist finishes and shows in the memory usage\n", "!sleep 5; nvidia-smi" ] }, { "cell_type": "markdown", "id": "154d699f", "metadata": {}, "source": [ "Because we forced computation, we now have a larger object in distributed GPU memory. Note that actual numbers will differ between systems (for example depending on how many devices are available)." ] }, { "cell_type": "markdown", "id": "f45064d7", "metadata": {}, "source": [ "### Wait\n", "Depending on our workflow or distributed computing setup, we may want to `wait` until all upstream tasks have finished before proceeding with a specific function. This section shows an example of this behavior, adapted from the Dask documentation.\n", "\n", "First, we create a new Dask DataFrame and define a function that we'll map to every partition in the dataframe." ] }, { "cell_type": "code", "execution_count": 88, "id": "a021a726", "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "nrows = 10000000\n", "\n", "df1 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n", "ddf1 = dask_cudf.from_cudf(df1, npartitions=100)\n", "\n", "\n", "def func(df):\n", " time.sleep(random.randint(1, 10))\n", " return (df + 5) * 3 - 11" ] }, { "cell_type": "markdown", "id": "93a3ee73", "metadata": {}, "source": [ "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-10 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution." ] }, { "cell_type": "code", "execution_count": 89, "id": "8f091ada", "metadata": {}, "outputs": [], "source": [ "results_ddf = ddf2.map_partitions(func)\n", "results_ddf = results_ddf.persist()" ] }, { "cell_type": "markdown", "id": "3c22a1e8", "metadata": {}, "source": [ "However, some partitions will be done **much** sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`." ] }, { "cell_type": "code", "execution_count": 90, "id": "fea52d0f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DoneAndNotDoneFutures(done={, , , , , , , , , , , , , , , }, not_done=set())" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wait(results_ddf)" ] }, { "cell_type": "markdown", "id": "db619bec", "metadata": {}, "source": [ "With `wait` completed, we can safely proceed on in our workflow." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "vscode": { "interpreter": { "hash": "8056d08c5310318d9ca4fe60778daf853f02695d9fa19fd0f51ce5f8b089487a" } } }, "nbformat": 4, "nbformat_minor": 5 }