{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4c6c548b",
   "metadata": {},
   "source": [
    "# 10 Minutes to cuDF and Dask cuDF\n",
    "\n",
    "Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask cuDF, geared mainly towards new users.\n",
    "\n",
    "## What are these Libraries?\n",
    "\n",
    "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame style API in the style of [pandas](https://pandas.pydata.org).\n",
    "\n",
    "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.\n",
    "\n",
    "[Dask cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed using cuDF GPU DataFrames instead of Pandas DataFrames. For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) by calling [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n",
    "\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Note:</b> This notebook uses the explicit Dask cuDF API (dask_cudf) for clarity. However, we strongly recommend that you use Dask's <a href=\"https://docs.dask.org/en/latest/configuration.html\">configuration infrastructure</a> to set the \"dataframe.backend\" option to \"cudf\", and work with the Dask DataFrame API directly. Please see the <a href=\"https://github.com/rapidsai/cudf/tree/main/python/dask_cudf\">Dask cuDF documentation</a> for more information.\n",
    "</div>\n",
    "\n",
    "\n",
    "## When to use cuDF and Dask cuDF\n",
    "\n",
    "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "36307e42",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import cupy as cp\n",
    "import pandas as pd\n",
    "\n",
    "import cudf\n",
    "import dask_cudf\n",
    "\n",
    "cp.random.seed(12)\n",
    "\n",
    "#### Portions of this were borrowed and adapted from the\n",
    "#### cuDF cheatsheet, existing cuDF documentation,\n",
    "#### and 10 Minutes to Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eff5fc19",
   "metadata": {},
   "source": [
    "## Object Creation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a747886",
   "metadata": {},
   "source": [
    "Creating a `cudf.Series` and `dask_cudf.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f5e303df",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       1\n",
       "1       2\n",
       "2       3\n",
       "3    <NA>\n",
       "4       4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = cudf.Series([1, 2, 3, None, 4])\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9a893956",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    2\n",
       "2    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2)\n",
    "# Note the call to head here to show the first few entries, unlike\n",
    "# cuDF objects, Dask-cuDF objects do not have a printing\n",
    "# representation that shows values since they may not be in local\n",
    "# memory.\n",
    "ds.head(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d934af4e",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3f53fb8b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   b   c\n",
       "0    0  19   0\n",
       "1    1  18   1\n",
       "2    2  17   2\n",
       "3    3  16   3\n",
       "4    4  15   4\n",
       "5    5  14   5\n",
       "6    6  13   6\n",
       "7    7  12   7\n",
       "8    8  11   8\n",
       "9    9  10   9\n",
       "10  10   9  10\n",
       "11  11   8  11\n",
       "12  12   7  12\n",
       "13  13   6  13\n",
       "14  14   5  14\n",
       "15  15   4  15\n",
       "16  16   3  16\n",
       "17  17   2  17\n",
       "18  18   1  18\n",
       "19  19   0  19"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = cudf.DataFrame(\n",
    "    {\n",
    "        \"a\": list(range(20)),\n",
    "        \"b\": list(reversed(range(20))),\n",
    "        \"c\": list(range(20)),\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b17db919",
   "metadata": {},
   "source": [
    "Now we will convert our cuDF dataframe into a Dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here `.head()` to look at the first few values). In the general case (see the end of this notebook), the data in `ddf` will be distributed across multiple GPUs.\n",
    "\n",
    "In this small case, we could call `ddf.compute()` to obtain a cuDF object from the Dask-cuDF object. In general, we should avoid calling `.compute()` on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. Hence, throughout this notebook we will generally call `.head()` to inspect the first few values of a Dask-cuDF dataframe, occasionally calling out places where we use `.compute()` and why.\n",
    "\n",
    "*To understand more of the differences between how cuDF and Dask cuDF behave here, visit the [10 Minutes to Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after this one.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8904b5ad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "1  1  18  1\n",
       "2  2  17  2\n",
       "3  3  16  3\n",
       "4  4  15  4"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf = dask_cudf.from_cudf(df, npartitions=2)\n",
    "ddf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0573b7bb",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n",
    "\n",
    "*Note that best practice for using dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with `read_csv` or other builtin I/O routines (discussed below).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "06a42f3a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a     b\n",
       "0  0   0.1\n",
       "1  1   0.2\n",
       "2  2  <NA>\n",
       "3  3   0.3"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pdf = pd.DataFrame({\"a\": [0, 1, 2, 3], \"b\": [0.1, 0.2, None, 0.3]})\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "gdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c67de344",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a    b\n",
       "0  0  0.1\n",
       "1  1  0.2"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
    "dask_gdf.head(n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5820795f",
   "metadata": {},
   "source": [
    "## Viewing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3008757",
   "metadata": {},
   "source": [
    "Viewing the top rows of a GPU dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0c329914",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "1  1  18  1"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b989e208",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "1  1  18  1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16798a32",
   "metadata": {},
   "source": [
    "Sorting by values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2190856d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   b   c\n",
       "19  19   0  19\n",
       "18  18   1  18\n",
       "17  17   2  17\n",
       "16  16   3  16\n",
       "15  15   4  15\n",
       "14  14   5  14\n",
       "13  13   6  13\n",
       "12  12   7  12\n",
       "11  11   8  11\n",
       "10  10   9  10\n",
       "9    9  10   9\n",
       "8    8  11   8\n",
       "7    7  12   7\n",
       "6    6  13   6\n",
       "5    5  14   5\n",
       "4    4  15   4\n",
       "3    3  16   3\n",
       "2    2  17   2\n",
       "1    1  18   1\n",
       "0    0  19   0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values(by=\"b\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6594bd6f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a  b   c\n",
       "19  19  0  19\n",
       "18  18  1  18\n",
       "17  17  2  17\n",
       "16  16  3  16\n",
       "15  15  4  15"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.sort_values(by=\"b\").head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3302a647",
   "metadata": {},
   "source": [
    "## Selecting a Column"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aebc72ab",
   "metadata": {},
   "source": [
    "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "4dafb4ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      0\n",
       "1      1\n",
       "2      2\n",
       "3      3\n",
       "4      4\n",
       "5      5\n",
       "6      6\n",
       "7      7\n",
       "8      8\n",
       "9      9\n",
       "10    10\n",
       "11    11\n",
       "12    12\n",
       "13    13\n",
       "14    14\n",
       "15    15\n",
       "16    16\n",
       "17    17\n",
       "18    18\n",
       "19    19\n",
       "Name: a, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"a\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "b38f05fc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    1\n",
       "2    2\n",
       "3    3\n",
       "4    4\n",
       "Name: a, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf[\"a\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5160dd1",
   "metadata": {},
   "source": [
    "## Selecting Rows by Label"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51ff2093",
   "metadata": {},
   "source": [
    "Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e8870657",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b\n",
       "2  2  17\n",
       "3  3  16\n",
       "4  4  15\n",
       "5  5  14"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[2:5, [\"a\", \"b\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "f041e661",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b\n",
       "2  2  17\n",
       "3  3  16\n",
       "4  4  15\n",
       "5  5  14"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.loc[2:5, [\"a\", \"b\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8e07162",
   "metadata": {},
   "source": [
    "## Selecting Rows by Position"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "435eb76a",
   "metadata": {},
   "source": [
    "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "bc337d5d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a     0\n",
       "b    19\n",
       "c     0\n",
       "Name: 0, dtype: int64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "03671456",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b\n",
       "0  0  19\n",
       "1  1  18\n",
       "2  2  17"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[0:3, 0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa935d5b",
   "metadata": {},
   "source": [
    "You can also select elements of a `DataFrame` or `Series` with direct index access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "79883c37",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "3  3  16  3\n",
       "4  4  15  4"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[3:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "2f761695",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3    <NA>\n",
       "4       4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s[3:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bf9c0b0",
   "metadata": {},
   "source": [
    "## Boolean Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b08b2b9",
   "metadata": {},
   "source": [
    "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "1eb08f0d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "1  1  18  1\n",
       "2  2  17  2\n",
       "3  3  16  3"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.b > 15]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "324dd036",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "1  1  18  1\n",
       "2  2  17  2"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf[ddf.b > 15].head(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f95c9ab7",
   "metadata": {},
   "source": [
    "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "fa643410",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a  b   c\n",
       "16  16  3  16"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.query(\"b == 3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7aa0089f",
   "metadata": {},
   "source": [
    "Note here we call `compute()` rather than `head()` on the Dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "e2706a02",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a  b   c\n",
       "16  16  3  16"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.query(\"b == 3\").compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "694dcbad",
   "metadata": {},
   "source": [
    "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "353b0250",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a  b   c\n",
       "16  16  3  16"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cudf_comparator = 3\n",
    "df.query(\"b == @cudf_comparator\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "a35c8a5a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a  b   c\n",
       "16  16  3  16"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dask_cudf_comparator = 3\n",
    "ddf.query(\"b == @val\", local_dict={\"val\": dask_cudf_comparator}).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e004749",
   "metadata": {},
   "source": [
    "Using the `isin` method for filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "20936418",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c\n",
       "0  0  19  0\n",
       "5  5  14  5"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.a.isin([0, 5])]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e456f03",
   "metadata": {},
   "source": [
    "## MultiIndex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e494bd0b",
   "metadata": {},
   "source": [
    "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "4ae70724",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex([('a', 1),\n",
       "            ('a', 2),\n",
       "            ('b', 3),\n",
       "            ('b', 4)],\n",
       "           )"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "arrays = [[\"a\", \"a\", \"b\", \"b\"], [1, 2, 3, 4]]\n",
    "tuples = list(zip(*arrays))\n",
    "idx = cudf.MultiIndex.from_tuples(tuples)\n",
    "idx"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab232727",
   "metadata": {},
   "source": [
    "This index can back either axis of a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "cb1d1633",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>first</th>\n",
       "      <th>second</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">a</th>\n",
       "      <th>1</th>\n",
       "      <td>0.082654</td>\n",
       "      <td>0.967955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.399417</td>\n",
       "      <td>0.441425</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">b</th>\n",
       "      <th>3</th>\n",
       "      <td>0.784297</td>\n",
       "      <td>0.793582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.070303</td>\n",
       "      <td>0.271711</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        first    second\n",
       "a 1  0.082654  0.967955\n",
       "  2  0.399417  0.441425\n",
       "b 3  0.784297  0.793582\n",
       "  4  0.070303  0.271711"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf1 = cudf.DataFrame(\n",
    "    {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n",
    ")\n",
    "gdf1.index = idx\n",
    "gdf1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "73ba31af",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"2\" halign=\"left\">a</th>\n",
       "      <th colspan=\"2\" halign=\"left\">b</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>first</th>\n",
       "      <td>0.343382</td>\n",
       "      <td>0.003700</td>\n",
       "      <td>0.20043</td>\n",
       "      <td>0.581614</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>second</th>\n",
       "      <td>0.907812</td>\n",
       "      <td>0.101512</td>\n",
       "      <td>0.24179</td>\n",
       "      <td>0.224180</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               a                  b          \n",
       "               1         2        3         4\n",
       "first   0.343382  0.003700  0.20043  0.581614\n",
       "second  0.907812  0.101512  0.24179  0.224180"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf2 = cudf.DataFrame(\n",
    "    {\"first\": cp.random.rand(4), \"second\": cp.random.rand(4)}\n",
    ").T\n",
    "gdf2.columns = idx\n",
    "gdf2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5f33160",
   "metadata": {},
   "source": [
    "Accessing values of a DataFrame with a MultiIndex, both with `.loc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "1048b7cf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "first     0.784297\n",
       "second    0.793582\n",
       "Name: ('b', 3), dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf1.loc[(\"b\", 3)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5123f759",
   "metadata": {},
   "source": [
    "And `.iloc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "369d164d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>first</th>\n",
       "      <th>second</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">a</th>\n",
       "      <th>1</th>\n",
       "      <td>0.082654</td>\n",
       "      <td>0.967955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.399417</td>\n",
       "      <td>0.441425</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        first    second\n",
       "a 1  0.082654  0.967955\n",
       "  2  0.399417  0.441425"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf1.iloc[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b3b96e9",
   "metadata": {},
   "source": [
    "Missing Data\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d12141eb",
   "metadata": {},
   "source": [
    "Missing data can be replaced by using the `fillna` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "913a7b5f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      1\n",
       "1      2\n",
       "2      3\n",
       "3    999\n",
       "4      4\n",
       "dtype: int64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.fillna(999)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "14479a42",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    2\n",
       "2    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.fillna(999).head(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d97605e6",
   "metadata": {},
   "source": [
    "## Stats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f29a5de0",
   "metadata": {},
   "source": [
    "Calculating descriptive statistics for a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "b1a1666e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2.5, 1.666666666666666)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.mean(), s.var()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4879742",
   "metadata": {},
   "source": [
    "This serves as a prototypical example of when we might want to call `.compute()`. The result of computing the mean and variance is a single number in each case, so it is definitely reasonable to look at the entire result!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "0cb7a207",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2.5, 1.6666666666666667)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.mean().compute(), ds.var().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af792a18",
   "metadata": {},
   "source": [
    "## Applymap"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6094cbe",
   "metadata": {},
   "source": [
    "Applying functions to a `Series`. Note that applying user defined functions directly with Dask cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "5b154619",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     10\n",
       "1     11\n",
       "2     12\n",
       "3     13\n",
       "4     14\n",
       "5     15\n",
       "6     16\n",
       "7     17\n",
       "8     18\n",
       "9     19\n",
       "10    20\n",
       "11    21\n",
       "12    22\n",
       "13    23\n",
       "14    24\n",
       "15    25\n",
       "16    26\n",
       "17    27\n",
       "18    28\n",
       "19    29\n",
       "Name: a, dtype: int64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def add_ten(num):\n",
    "    return num + 10\n",
    "\n",
    "\n",
    "df[\"a\"].apply(add_ten)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "8da5c3cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    10\n",
       "1    11\n",
       "2    12\n",
       "3    13\n",
       "4    14\n",
       "Name: a, dtype: int64"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf[\"a\"].map_partitions(add_ten).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d4fdde1",
   "metadata": {},
   "source": [
    "## Histogramming"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b98a7daf",
   "metadata": {},
   "source": [
    "Counting the number of occurrences of each unique value of variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "c7b8ea5d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "15    1\n",
       "6     1\n",
       "1     1\n",
       "14    1\n",
       "2     1\n",
       "5     1\n",
       "11    1\n",
       "7     1\n",
       "17    1\n",
       "13    1\n",
       "8     1\n",
       "16    1\n",
       "0     1\n",
       "10    1\n",
       "4     1\n",
       "9     1\n",
       "19    1\n",
       "18    1\n",
       "3     1\n",
       "12    1\n",
       "Name: a, dtype: int32"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.a.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "cc9d34f6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "15    1\n",
       "6     1\n",
       "1     1\n",
       "14    1\n",
       "2     1\n",
       "Name: a, dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.a.value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "437172ba",
   "metadata": {},
   "source": [
    "## String Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd3fc4f3",
   "metadata": {},
   "source": [
    "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html#string-handling) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "86974041",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       a\n",
       "1       b\n",
       "2       c\n",
       "3    aaba\n",
       "4    baca\n",
       "5    <NA>\n",
       "6    caba\n",
       "7     dog\n",
       "8     cat\n",
       "dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = cudf.Series([\"A\", \"B\", \"C\", \"Aaba\", \"Baca\", None, \"CABA\", \"dog\", \"cat\"])\n",
    "s.str.lower()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "c6a61a08",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       a\n",
       "1       b\n",
       "2       c\n",
       "3    aaba\n",
       "dtype: object"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2)\n",
    "ds.str.lower().head(n=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44fe1243",
   "metadata": {},
   "source": [
    "As well as simple manipulation, We can also match strings using [regular expressions](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.match.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "51158a24",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3     True\n",
       "4    False\n",
       "5     <NA>\n",
       "6    False\n",
       "7    False\n",
       "8     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.str.match(\"^[aAc].+\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "4f3e36f5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3     True\n",
       "4    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.str.match(\"^[aAc].+\").head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5528afa8",
   "metadata": {},
   "source": [
    "## Concat"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e05b1078",
   "metadata": {},
   "source": [
    "Concatenating `Series` and `DataFrames` row-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "6c6d10bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       1\n",
       "1       2\n",
       "2       3\n",
       "3    <NA>\n",
       "4       5\n",
       "0       1\n",
       "1       2\n",
       "2       3\n",
       "3    <NA>\n",
       "4       5\n",
       "dtype: int64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = cudf.Series([1, 2, 3, None, 5])\n",
    "cudf.concat([s, s])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "d3e5cf87",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    2\n",
       "2    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n",
    "dask_cudf.concat([ds2, ds2]).head(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df087d2f",
   "metadata": {},
   "source": [
    "## Join"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89cf5022",
   "metadata": {},
   "source": [
    "Performing SQL style merges. Note that the dataframe order is **not maintained**, but may be restored post-merge by sorting by the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "075c97a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>key</th>\n",
       "      <th>vals_a</th>\n",
       "      <th>vals_b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>10.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>c</td>\n",
       "      <td>12.0</td>\n",
       "      <td>101.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>e</td>\n",
       "      <td>14.0</td>\n",
       "      <td>102.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>b</td>\n",
       "      <td>11.0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>d</td>\n",
       "      <td>13.0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  key  vals_a vals_b\n",
       "0   a    10.0  100.0\n",
       "1   c    12.0  101.0\n",
       "2   e    14.0  102.0\n",
       "3   b    11.0   <NA>\n",
       "4   d    13.0   <NA>"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_a = cudf.DataFrame()\n",
    "df_a[\"key\"] = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n",
    "df_a[\"vals_a\"] = [float(i + 10) for i in range(5)]\n",
    "\n",
    "df_b = cudf.DataFrame()\n",
    "df_b[\"key\"] = [\"a\", \"c\", \"e\"]\n",
    "df_b[\"vals_b\"] = [float(i + 100) for i in range(3)]\n",
    "\n",
    "merged = df_a.merge(df_b, on=[\"key\"], how=\"left\")\n",
    "merged"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "b28fc57b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>key</th>\n",
       "      <th>vals_a</th>\n",
       "      <th>vals_b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>c</td>\n",
       "      <td>12.0</td>\n",
       "      <td>101.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>e</td>\n",
       "      <td>14.0</td>\n",
       "      <td>102.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b</td>\n",
       "      <td>11.0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>d</td>\n",
       "      <td>13.0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  key  vals_a vals_b\n",
       "0   c    12.0  101.0\n",
       "1   e    14.0  102.0\n",
       "2   b    11.0   <NA>\n",
       "3   d    13.0   <NA>"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n",
    "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n",
    "\n",
    "merged = ddf_a.merge(ddf_b, on=[\"key\"], how=\"left\").head(n=4)\n",
    "merged"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4695c6e",
   "metadata": {},
   "source": [
    "## Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecb27b06",
   "metadata": {},
   "source": [
    "Like [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html), cuDF and Dask-cuDF support the [Split-Apply-Combine groupby paradigm](https://doi.org/10.18637/jss.v040.i01)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "d8db18d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"agg_col1\"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
    "df[\"agg_col2\"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n",
    "\n",
    "ddf = dask_cudf.from_cudf(df, npartitions=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e2f0961",
   "metadata": {},
   "source": [
    "Grouping and then applying the `sum` function to the grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "e8a7f1f9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>90</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100</td>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            a    b    c  agg_col2\n",
       "agg_col1                         \n",
       "1          90  100   90         4\n",
       "0         100   90  100         3"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(\"agg_col1\").sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "4dd090a1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>90</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>100</td>\n",
       "      <td>90</td>\n",
       "      <td>100</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            a    b    c  agg_col2\n",
       "agg_col1                         \n",
       "1          90  100   90         4\n",
       "0         100   90  100         3"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.groupby(\"agg_col1\").sum().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ff1e8bf",
   "metadata": {},
   "source": [
    "Grouping hierarchically then applying the `sum` function to grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "4738f0ef",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>0</th>\n",
       "      <td>54</td>\n",
       "      <td>60</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>60</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>1</th>\n",
       "      <td>36</td>\n",
       "      <td>40</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <td>27</td>\n",
       "      <td>30</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    a   b   c\n",
       "agg_col1 agg_col2            \n",
       "1        0         54  60  54\n",
       "0        0         73  60  73\n",
       "1        1         36  40  36\n",
       "0        1         27  30  27"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby([\"agg_col1\", \"agg_col2\"]).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "9b07feb1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>1</th>\n",
       "      <td>36</td>\n",
       "      <td>40</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>60</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>0</th>\n",
       "      <td>54</td>\n",
       "      <td>60</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <td>27</td>\n",
       "      <td>30</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    a   b   c\n",
       "agg_col1 agg_col2            \n",
       "1        1         36  40  36\n",
       "0        0         73  60  73\n",
       "1        0         54  60  54\n",
       "0        1         27  30  27"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.groupby([\"agg_col1\", \"agg_col2\"]).sum().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "443e179b",
   "metadata": {},
   "source": [
    "Grouping and applying statistical functions to specific columns, using `agg`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "f196ad8b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18</td>\n",
       "      <td>10.0</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>9.0</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           a     b    c\n",
       "agg_col1               \n",
       "1         18  10.0   90\n",
       "0         19   9.0  100"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "3853483f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>18</td>\n",
       "      <td>10.0</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>9.0</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           a     b    c\n",
       "agg_col1               \n",
       "1         18  10.0   90\n",
       "0         19   9.0  100"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.groupby(\"agg_col1\").agg({\"a\": \"max\", \"b\": \"mean\", \"c\": \"sum\"}).compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5bf30e1",
   "metadata": {},
   "source": [
    "## Transpose"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ac3b004",
   "metadata": {},
   "source": [
    "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "c5fbdb50",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b\n",
       "0  1  4\n",
       "1  2  5\n",
       "2  3  6"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample = cudf.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6]})\n",
    "sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "733ed90c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0  1  2\n",
       "a  1  2  3\n",
       "b  4  5  6"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample.transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0915c46",
   "metadata": {},
   "source": [
    "## Time Series"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec7f0b81",
   "metadata": {},
   "source": [
    "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "a6d45607",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2018-11-20</td>\n",
       "      <td>0.986051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2018-11-21</td>\n",
       "      <td>0.232034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2018-11-22</td>\n",
       "      <td>0.397617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2018-11-23</td>\n",
       "      <td>0.103839</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        date     value\n",
       "0 2018-11-20  0.986051\n",
       "1 2018-11-21  0.232034\n",
       "2 2018-11-22  0.397617\n",
       "3 2018-11-23  0.103839"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import datetime as dt\n",
    "\n",
    "date_df = cudf.DataFrame()\n",
    "date_df[\"date\"] = pd.date_range(\"11/20/2018\", periods=72, freq=\"D\")\n",
    "date_df[\"value\"] = cp.random.sample(len(date_df))\n",
    "\n",
    "search_date = dt.datetime.strptime(\"2018-11-23\", \"%Y-%m-%d\")\n",
    "date_df.query(\"date <= @search_date\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "fbacaae1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2018-11-20</td>\n",
       "      <td>0.986051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2018-11-21</td>\n",
       "      <td>0.232034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2018-11-22</td>\n",
       "      <td>0.397617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2018-11-23</td>\n",
       "      <td>0.103839</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        date     value\n",
       "0 2018-11-20  0.986051\n",
       "1 2018-11-21  0.232034\n",
       "2 2018-11-22  0.397617\n",
       "3 2018-11-23  0.103839"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n",
    "date_ddf.query(\n",
    "    \"date <= @search_date\", local_dict={\"search_date\": search_date}\n",
    ").compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45f9408b",
   "metadata": {},
   "source": [
    "## Categoricals"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eb96f98",
   "metadata": {},
   "source": [
    "`DataFrames` support categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "d735b5cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>grade</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>e</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id grade\n",
       "0   1     a\n",
       "1   2     b\n",
       "2   3     b\n",
       "3   4     a\n",
       "4   5     a\n",
       "5   6     e"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf = cudf.DataFrame(\n",
    "    {\"id\": [1, 2, 3, 4, 5, 6], \"grade\": [\"a\", \"b\", \"b\", \"a\", \"a\", \"e\"]}\n",
    ")\n",
    "gdf[\"grade\"] = gdf[\"grade\"].astype(\"category\")\n",
    "gdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "9d1ff798",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>grade</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id grade\n",
       "0   1     a\n",
       "1   2     b\n",
       "2   3     b"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
    "dgdf.head(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9c2bcac",
   "metadata": {},
   "source": [
    "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "a7135eda",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StringIndex(['a' 'b' 'e'], dtype='object')"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf.grade.cat.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "466e1ed2",
   "metadata": {},
   "source": [
    "Accessing the underlying code values of each categorical observation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "f00c615a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    1\n",
       "2    1\n",
       "3    0\n",
       "4    0\n",
       "5    2\n",
       "dtype: uint8"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf.grade.cat.codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "d209512f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    1\n",
       "2    1\n",
       "3    0\n",
       "4    0\n",
       "5    2\n",
       "dtype: uint8"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dgdf.grade.cat.codes.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b391a0d",
   "metadata": {},
   "source": [
    "## Converting to Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfdd172b",
   "metadata": {},
   "source": [
    "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "1fcd9c7f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c  agg_col1  agg_col2\n",
       "0  0  19  0         1         1\n",
       "1  1  18  1         0         0\n",
       "2  2  17  2         1         0\n",
       "3  3  16  3         0         1\n",
       "4  4  15  4         1         0"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa8a445b",
   "metadata": {},
   "source": [
    "To convert the first few entries to pandas, we similarly call `.head()` on the Dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "786d39d2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c  agg_col1  agg_col2\n",
       "0  0  19  0         1         1\n",
       "1  1  18  1         0         0\n",
       "2  2  17  2         1         0\n",
       "3  3  16  3         0         1\n",
       "4  4  15  4         1         0"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.head().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "584c4594",
   "metadata": {},
   "source": [
    "In contrast, if we want to convert the entire frame, we need to call `.compute()` on `ddf` to get a local cuDF dataframe, and then call `to_pandas()`, followed by subsequent processing. This workflow is less recommended, since it both puts high memory pressure on a single GPU (the `.compute()` call) and does not take advantage of GPU acceleration for processing (the computation happens on in pandas)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "93f06cdc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c  agg_col1  agg_col2\n",
       "0  0  19  0         1         1\n",
       "1  1  18  1         0         0\n",
       "2  2  17  2         1         0\n",
       "3  3  16  3         0         1\n",
       "4  4  15  4         1         0"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.compute().to_pandas().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a104294a",
   "metadata": {},
   "source": [
    "## Converting to Numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5d3e508",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "2948b577",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0, 19,  0,  1,  1],\n",
       "       [ 1, 18,  1,  0,  0],\n",
       "       [ 2, 17,  2,  1,  0],\n",
       "       [ 3, 16,  3,  0,  1],\n",
       "       [ 4, 15,  4,  1,  0],\n",
       "       [ 5, 14,  5,  0,  0],\n",
       "       [ 6, 13,  6,  1,  1],\n",
       "       [ 7, 12,  7,  0,  0],\n",
       "       [ 8, 11,  8,  1,  0],\n",
       "       [ 9, 10,  9,  0,  1],\n",
       "       [10,  9, 10,  1,  0],\n",
       "       [11,  8, 11,  0,  0],\n",
       "       [12,  7, 12,  1,  1],\n",
       "       [13,  6, 13,  0,  0],\n",
       "       [14,  5, 14,  1,  0],\n",
       "       [15,  4, 15,  0,  1],\n",
       "       [16,  3, 16,  1,  0],\n",
       "       [17,  2, 17,  0,  0],\n",
       "       [18,  1, 18,  1,  1],\n",
       "       [19,  0, 19,  0,  0]])"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.to_numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "1cff6352",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0, 19,  0,  1,  1],\n",
       "       [ 1, 18,  1,  0,  0],\n",
       "       [ 2, 17,  2,  1,  0],\n",
       "       [ 3, 16,  3,  0,  1],\n",
       "       [ 4, 15,  4,  1,  0],\n",
       "       [ 5, 14,  5,  0,  0],\n",
       "       [ 6, 13,  6,  1,  1],\n",
       "       [ 7, 12,  7,  0,  0],\n",
       "       [ 8, 11,  8,  1,  0],\n",
       "       [ 9, 10,  9,  0,  1],\n",
       "       [10,  9, 10,  1,  0],\n",
       "       [11,  8, 11,  0,  0],\n",
       "       [12,  7, 12,  1,  1],\n",
       "       [13,  6, 13,  0,  0],\n",
       "       [14,  5, 14,  1,  0],\n",
       "       [15,  4, 15,  0,  1],\n",
       "       [16,  3, 16,  1,  0],\n",
       "       [17,  2, 17,  0,  0],\n",
       "       [18,  1, 18,  1,  1],\n",
       "       [19,  0, 19,  0,  0]])"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.compute().to_numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1f09303",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "997c89ba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,\n",
       "       17, 18, 19])"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"a\"].to_numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "243df512",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,\n",
       "       17, 18, 19])"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf[\"a\"].compute().to_numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b520acf7",
   "metadata": {},
   "source": [
    "## Converting to Arrow"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "050e67e5",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "0ac9e740",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pyarrow.Table\n",
       "a: int64\n",
       "b: int64\n",
       "c: int64\n",
       "agg_col1: int64\n",
       "agg_col2: int64\n",
       "----\n",
       "a: [[0,1,2,3,4,...,15,16,17,18,19]]\n",
       "b: [[19,18,17,16,15,...,4,3,2,1,0]]\n",
       "c: [[0,1,2,3,4,...,15,16,17,18,19]]\n",
       "agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]\n",
       "agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.to_arrow()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "f3170fc3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pyarrow.Table\n",
       "a: int64\n",
       "b: int64\n",
       "c: int64\n",
       "agg_col1: int64\n",
       "agg_col2: int64\n",
       "----\n",
       "a: [[0,1,2,3,4]]\n",
       "b: [[19,18,17,16,15]]\n",
       "c: [[0,1,2,3,4]]\n",
       "agg_col1: [[1,0,1,0,1]]\n",
       "agg_col2: [[1,0,0,1,0]]"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.head().to_arrow()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f0251c6",
   "metadata": {},
   "source": [
    "## Reading/Writing CSV Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d1935c6",
   "metadata": {},
   "source": [
    "Writing to a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "36f5039f",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(\"example_output\"):\n",
    "    os.mkdir(\"example_output\")\n",
    "\n",
    "df.to_csv(\"example_output/foo.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "22c1eb6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "ddf.compute().to_csv(\"example_output/foo_dask.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "320c3968",
   "metadata": {},
   "source": [
    "Reading from a csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "c110a80f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   b   c  agg_col1  agg_col2\n",
       "0    0  19   0         1         1\n",
       "1    1  18   1         0         0\n",
       "2    2  17   2         1         0\n",
       "3    3  16   3         0         1\n",
       "4    4  15   4         1         0\n",
       "5    5  14   5         0         0\n",
       "6    6  13   6         1         1\n",
       "7    7  12   7         0         0\n",
       "8    8  11   8         1         0\n",
       "9    9  10   9         0         1\n",
       "10  10   9  10         1         0\n",
       "11  11   8  11         0         0\n",
       "12  12   7  12         1         1\n",
       "13  13   6  13         0         0\n",
       "14  14   5  14         1         0\n",
       "15  15   4  15         0         1\n",
       "16  16   3  16         1         0\n",
       "17  17   2  17         0         0\n",
       "18  18   1  18         1         1\n",
       "19  19   0  19         0         0"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = cudf.read_csv(\"example_output/foo.csv\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "787eae14",
   "metadata": {},
   "source": [
    "Note that for the Dask-cuDF case, we use `dask_cudf.read_csv` in preference to `dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across multiple GPUs and handle larger CSV files that would fit in memory on a single GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "a699dfef",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c  agg_col1  agg_col2\n",
       "0  0  19  0         1         1\n",
       "1  1  18  1         0         0\n",
       "2  2  17  2         1         0\n",
       "3  3  16  3         0         1\n",
       "4  4  15  4         1         0"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf = dask_cudf.read_csv(\"example_output/foo_dask.csv\")\n",
    "ddf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72857b2c",
   "metadata": {},
   "source": [
    "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "825a0c03",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a   b  c  agg_col1  agg_col2\n",
       "0  0  19  0         1         1\n",
       "1  1  18  1         0         0\n",
       "2  2  17  2         1         0\n",
       "3  3  16  3         0         1\n",
       "4  4  15  4         1         0"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf = dask_cudf.read_csv(\"example_output/*.csv\")\n",
    "ddf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "763c555b",
   "metadata": {},
   "source": [
    "## Reading/Writing Parquet Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8766d4ac",
   "metadata": {},
   "source": [
    "Writing to parquet files with cuDF's GPU-accelerated parquet writer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "5038b284",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_parquet(\"example_output/temp_parquet\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4b49824",
   "metadata": {},
   "source": [
    "Reading parquet files with cuDF's GPU-accelerated parquet reader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "bb657a69",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   b   c  agg_col1  agg_col2\n",
       "0    0  19   0         1         1\n",
       "1    1  18   1         0         0\n",
       "2    2  17   2         1         0\n",
       "3    3  16   3         0         1\n",
       "4    4  15   4         1         0\n",
       "5    5  14   5         0         0\n",
       "6    6  13   6         1         1\n",
       "7    7  12   7         0         0\n",
       "8    8  11   8         1         0\n",
       "9    9  10   9         0         1\n",
       "10  10   9  10         1         0\n",
       "11  11   8  11         0         0\n",
       "12  12   7  12         1         1\n",
       "13  13   6  13         0         0\n",
       "14  14   5  14         1         0\n",
       "15  15   4  15         0         1\n",
       "16  16   3  16         1         0\n",
       "17  17   2  17         0         0\n",
       "18  18   1  18         1         1\n",
       "19  19   0  19         0         0"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = cudf.read_parquet(\"example_output/temp_parquet\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9a29874",
   "metadata": {},
   "source": [
    "Writing to parquet files from a `dask_cudf.DataFrame` using cuDF's parquet writer under the hood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "0c3db7b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "ddf.to_parquet(\"example_output/ddf_parquet_files\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90a49967",
   "metadata": {},
   "source": [
    "## Reading/Writing ORC Files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de9d03fa",
   "metadata": {},
   "source": [
    "Writing ORC files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "c387f8f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_orc(\"example_output/temp_orc\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "242c32a2",
   "metadata": {},
   "source": [
    "And reading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "d4bab6da",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>8</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>6</td>\n",
       "      <td>13</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>14</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>3</td>\n",
       "      <td>16</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   b   c  agg_col1  agg_col2\n",
       "0    0  19   0         1         1\n",
       "1    1  18   1         0         0\n",
       "2    2  17   2         1         0\n",
       "3    3  16   3         0         1\n",
       "4    4  15   4         1         0\n",
       "5    5  14   5         0         0\n",
       "6    6  13   6         1         1\n",
       "7    7  12   7         0         0\n",
       "8    8  11   8         1         0\n",
       "9    9  10   9         0         1\n",
       "10  10   9  10         1         0\n",
       "11  11   8  11         0         0\n",
       "12  12   7  12         1         1\n",
       "13  13   6  13         0         0\n",
       "14  14   5  14         1         0\n",
       "15  15   4  15         0         1\n",
       "16  16   3  16         1         0\n",
       "17  17   2  17         0         0\n",
       "18  18   1  18         1         1\n",
       "19  19   0  19         0         0"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = cudf.read_orc(\"example_output/temp_orc\")\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c988553d",
   "metadata": {},
   "source": [
    "## Dask Performance Tips\n",
    "\n",
    "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed immediately, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n",
    "\n",
    "Sometimes, though, we want to force the execution of operations. Calling `persist` on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we're using distributed systems, we may want to wait until `persist` is finished before beginning any downstream operations. We can enforce this contract by using `wait`. Wrapping an operation with `wait` will ensure it doesn't begin executing until all necessary upstream operations have finished.\n",
    "\n",
    "The snippets below provide basic examples, using `LocalCUDACluster` to create one dask-worker per GPU on the local machine. For more detailed information about `persist` and `wait`, please see the Dask documentation for [persist](https://docs.dask.org/en/latest/api.html#dask.persist) and [wait](https://docs.dask.org/en/latest/futures.html#distributed.wait). Wait relies on the concept of Futures, which is beyond the scope of this tutorial. For more information on Futures, see the Dask [Futures](https://docs.dask.org/en/latest/futures.html) documentation. For more information about multi-GPU clusters, please see the [dask-cuda](https://github.com/rapidsai/dask-cuda) library (documentation is in progress)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "976a8dca",
   "metadata": {},
   "source": [
    "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "39c82511",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "from dask.distributed import Client, wait\n",
    "from dask_cuda import LocalCUDACluster\n",
    "\n",
    "cluster = LocalCUDACluster()\n",
    "client = Client(cluster)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "819c2d92",
   "metadata": {},
   "source": [
    "### Persisting Data\n",
    "\n",
    "Next, we create our Dask-cuDF DataFrame and apply a transformation, storing the result as a new column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "f5c0ca87",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><strong>Dask DataFrame Structure:</strong></div>\n",
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>npartitions=16</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>int64</td>\n",
       "      <td>int64</td>\n",
       "      <td>int64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>625000</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9375000</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9999999</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "<div>Dask Name: assign, 4 graph layers</div>"
      ],
      "text/plain": [
       "<dask_cudf.DataFrame | 64 tasks | 16 npartitions>"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nrows = 10000000\n",
    "\n",
    "df2 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n",
    "ddf2 = dask_cudf.from_cudf(df2, npartitions=16)\n",
    "ddf2[\"c\"] = ddf2[\"a\"] + 5\n",
    "ddf2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "eec23c4d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mon Nov 14 03:05:08 2022       \n",
      "+-----------------------------------------------------------------------------+\n",
      "| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |\n",
      "|-------------------------------+----------------------+----------------------+\n",
      "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
      "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
      "|                               |                      |               MIG M. |\n",
      "|===============================+======================+======================|\n",
      "|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    55W / 300W |   4538MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    56W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |\n",
      "| N/A   33C    P0    55W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |\n",
      "| N/A   31C    P0    55W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    54W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |\n",
      "| N/A   33C    P0    56W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |\n",
      "| N/A   35C    P0    55W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    54W / 300W |    336MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "                                                                               \n",
      "+-----------------------------------------------------------------------------+\n",
      "| Processes:                                                                  |\n",
      "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
      "|        ID   ID                                                   Usage      |\n",
      "|=============================================================================|\n",
      "|    0   N/A  N/A     57132      C   .../python                        333MiB |\n",
      "|    1   N/A  N/A     57131      C   .../python                        333MiB |\n",
      "|    2   N/A  N/A     57143      C   .../python                        333MiB |\n",
      "|    3   N/A  N/A     57124      C   .../python                        333MiB |\n",
      "|    4   N/A  N/A     57135      C   .../python                        333MiB |\n",
      "|    5   N/A  N/A     57144      C   .../python                        333MiB |\n",
      "|    6   N/A  N/A     57126      C   .../python                        333MiB |\n",
      "|    7   N/A  N/A     57139      C   .../python                        333MiB |\n",
      "+-----------------------------------------------------------------------------+\n"
     ]
    }
   ],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "578a1698",
   "metadata": {},
   "source": [
    "Because Dask is lazy, the computation has not yet occurred. We can see that there are sixty-four tasks in the task graph and we're using about 330 MB of device memory on each GPU. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "3de4c0cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><strong>Dask DataFrame Structure:</strong></div>\n",
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>npartitions=16</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>int64</td>\n",
       "      <td>int64</td>\n",
       "      <td>int64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>625000</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9375000</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9999999</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "<div>Dask Name: assign, 1 graph layer</div>"
      ],
      "text/plain": [
       "<dask_cudf.DataFrame | 16 tasks | 16 npartitions>"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf2 = ddf2.persist()\n",
    "ddf2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "64c9f96c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mon Nov 14 03:05:15 2022       \n",
      "+-----------------------------------------------------------------------------+\n",
      "| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |\n",
      "|-------------------------------+----------------------+----------------------+\n",
      "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
      "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
      "|                               |                      |               MIG M. |\n",
      "|===============================+======================+======================|\n",
      "|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    55W / 300W |   4900MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    56W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |\n",
      "| N/A   33C    P0    55W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    55W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    55W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |\n",
      "| N/A   33C    P0    56W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |\n",
      "| N/A   35C    P0    55W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |\n",
      "| N/A   32C    P0    54W / 300W |    698MiB / 32768MiB |      0%      Default |\n",
      "|                               |                      |                  N/A |\n",
      "+-------------------------------+----------------------+----------------------+\n",
      "                                                                               \n",
      "+-----------------------------------------------------------------------------+\n",
      "| Processes:                                                                  |\n",
      "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
      "|        ID   ID                                                   Usage      |\n",
      "|=============================================================================|\n",
      "|    0   N/A  N/A     57132      C   .../python                        695MiB |\n",
      "|    1   N/A  N/A     57131      C   .../python                        695MiB |\n",
      "|    2   N/A  N/A     57143      C   .../python                        695MiB |\n",
      "|    3   N/A  N/A     57124      C   .../python                        695MiB |\n",
      "|    4   N/A  N/A     57135      C   .../python                        695MiB |\n",
      "|    5   N/A  N/A     57144      C   .../python                        695MiB |\n",
      "|    6   N/A  N/A     57126      C   .../python                        695MiB |\n",
      "|    7   N/A  N/A     57139      C   .../python                        695MiB |\n",
      "+-----------------------------------------------------------------------------+\n"
     ]
    }
   ],
   "source": [
    "# Sleep to ensure the persist finishes and shows in the memory usage\n",
    "!sleep 5; nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "154d699f",
   "metadata": {},
   "source": [
    "Because we forced computation, we now have a larger object in distributed GPU memory. Note that actual numbers will differ between systems (for example depending on how many devices are available)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f45064d7",
   "metadata": {},
   "source": [
    "### Wait\n",
    "Depending on our workflow or distributed computing setup, we may want to `wait` until all upstream tasks have finished before proceeding with a specific function. This section shows an example of this behavior, adapted from the Dask documentation.\n",
    "\n",
    "First, we create a new Dask DataFrame and define a function that we'll map to every partition in the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "a021a726",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "nrows = 10000000\n",
    "\n",
    "df1 = cudf.DataFrame({\"a\": cp.arange(nrows), \"b\": cp.arange(nrows)})\n",
    "ddf1 = dask_cudf.from_cudf(df1, npartitions=100)\n",
    "\n",
    "\n",
    "def func(df):\n",
    "    time.sleep(random.randint(1, 10))\n",
    "    return (df + 5) * 3 - 11"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93a3ee73",
   "metadata": {},
   "source": [
    "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-10 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "8f091ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "results_ddf = ddf2.map_partitions(func)\n",
    "results_ddf = results_ddf.persist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c22a1e8",
   "metadata": {},
   "source": [
    "However, some partitions will be done **much** sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "fea52d0f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DoneAndNotDoneFutures(done={<Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 13)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 1)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 3)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 8)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 10)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 7)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 5)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 6)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 14)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 12)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 9)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 11)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 2)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 4)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 15)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-4a955f5e5fda88923d28b45196632826', 0)>}, not_done=set())"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wait(results_ddf)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db619bec",
   "metadata": {},
   "source": [
    "With `wait` completed, we can safely proceed on in our workflow."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "8056d08c5310318d9ca4fe60778daf853f02695d9fa19fd0f51ce5f8b089487a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}