Working with JSON data#

This page contains a tutorial about reading and manipulating JSON data in cuDF.

Reading JSON data#

By default, the cuDF JSON reader expects input data using the “records” orientation. Records-oriented JSON data comprises an array of objects at the root level, and each object in the array corresponds to a row. Records-oriented JSON data begins with [, ends with ] and ignores unquoted whitespace. Another common variant for JSON data is “JSON Lines”, where JSON objects are separated by new line characters (\n), and each object corresponds to a row.

>>> j = '''[
    {"a": "v1", "b": 12},
    {"a": "v2", "b": 7},
    {"a": "v3", "b": 5}
]'''
>>> df_records = cudf.read_json(j, engine='cudf')

>>> j = '\n'.join([
...     '{"a": "v1", "b": 12}',
...     '{"a": "v2", "b": 7}',
...     '{"a": "v3", "b": 5}'
... ])
>>> df_lines = cudf.read_json(j, lines=True)

>>> df_lines
    a   b
0  v1  12
1  v2   7
2  v3   5
>>> df_records.equals(df_lines)
True

The cuDF JSON reader also supports arbitrarily-nested combinations of JSON objects and arrays, which map to struct and list data types. The following examples demonstrate the inputs and outputs for reading nested JSON data.

# Example with columns types:
# list<int> and struct<k:string>
>>> j = '''[
    {"list": [0,1,2], "struct": {"k":"v1"}},
    {"list": [3,4,5], "struct": {"k":"v2"}}
]'''
>>> df = cudf.read_json(j, engine='cudf')
>>> df
        list       struct
0  [0, 1, 2]  {'k': 'v1'}
1  [3, 4, 5]  {'k': 'v2'}

# Example with columns types:
# list<struct<k:int>> and struct<k:list<int>, m:int>
>>> j = '\n'.join([
...     '{"a": [{"k": 0}], "b": {"k": [0, 1], "m": 5}}',
...     '{"a": [{"k": 1}, {"k": 2}], "b": {"k": [2, 3], "m": 6}}',
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df
                      a                      b
0            [{'k': 0}]  {'k': [0, 1], 'm': 5}
1  [{'k': 1}, {'k': 2}]  {'k': [2, 3], 'm': 6}

Handling large and small JSON Lines files#

For workloads based on JSON Lines data, cuDF includes reader options to assist with data processing: byte range support for large files, and multi-source support for small files.

Some workflows require processing large JSON Lines files that may exceed GPU memory capacity. The JSON reader in cuDF supports a byte range argument that specifies a starting byte offset and size in bytes. The reader parses each record that begins within the byte range, and for this reason byte ranges do not need to align with record boundaries. To avoid skipping rows or reading duplicate rows, byte ranges should be adjacent, as shown in the following example.

>>> num_rows = 10
>>> j = '\n'.join([
...     '{"id":%s, "distance": %s, "unit": "m/s"}' % x \
...     for x in zip(range(num_rows), cupy.random.rand(num_rows))
... ])

>>> chunk_count = 4
>>> chunk_size = len(j) // chunk_count + 1
>>> data = []
>>> for x in range(chunk_count):
...    d = cudf.read_json(
...         j,
...         lines=True,
...         byte_range=(chunk_size * x, chunk_size),
...     )
...     data.append(d)
>>> df = cudf.concat(data)

By contrast, some workflows require processing many small JSON Lines files. Rather than looping through the sources and concatenating the resulting dataframes, the JSON reader in cuDF accepts an iterable of data sources. Then the raw inputs are concatenated and processed as a single source. Please note that the JSON reader in cuDF accepts sources as file paths, raw strings, or file-like objects, as well as iterables of these sources.

>>> j1 = '{"id":0}\n{"id":1}\n'
>>> j2 = '{"id":2}\n{"id":3}\n'

>>> df = cudf.read_json([j1, j2], lines=True)

Unpacking list and struct data#

After reading JSON data into a cuDF dataframe with list/struct column types, the next step in many workflows extracts or flattens the data into simple types. For struct columns, one solution is extracting the data with the struct.explode accessor and joining the result to the parent dataframe. The following example demonstrates how to extract data from a struct column.

>>> j = '\n'.join([
...    '{"x": "Tokyo", "y": {"country": "Japan", "iso2": "JP"}}',
...    '{"x": "Jakarta", "y": {"country": "Indonesia", "iso2": "ID"}}',
...    '{"x": "Shanghai", "y": {"country": "China", "iso2": "CN"}}'
... ])
>>> df = cudf.read_json(j, lines=True)
>>> df = df.drop(columns='y').join(df['y'].struct.explode())
>>> df
          x    country iso2
0     Tokyo      Japan   JP
1   Jakarta  Indonesia   ID
2  Shanghai      China   CN

For list columns where the order of the elements is meaningful, the list.get accessor extracts the elements from specific positions. The resulting cudf.Series object can then be assigned to a new column in the dataframe. The following example demonstrates how to extract the first and second elements from a list column.

>>> j = '\n'.join([
...    '{"name": "Peabody, MA", "coord": [42.53, -70.98]}',
...    '{"name": "Northampton, MA", "coord": [42.32, -72.66]}',
...    '{"name": "New Bedford, MA", "coord": [41.63, -70.93]}'
... ])

>>> df = cudf.read_json(j, lines=True)
>>> df['latitude'] = df['coord'].list.get(0)
>>> df['longitude'] = df['coord'].list.get(1)
>>> df = df.drop(columns='coord')
>>> df
              name  latitude  longitude
0      Peabody, MA     42.53     -70.98
1  Northampton, MA     42.32     -72.66
2  New Bedford, MA     41.63     -70.93

Finally, for list columns with variable length, the explode method creates a new dataframe with each element as a row. Joining the exploded dataframe on the parent dataframe yields an output with all simple types. The following example flattens a list column and joins it to the index and additional data from the parent dataframe.

>>> j = '\n'.join([
...    '{"product": "socks", "ratings": [2, 3, 4]}',
...    '{"product": "shoes", "ratings": [5, 4, 5, 3]}',
...    '{"product": "shirts", "ratings": [3, 4]}'
... ])

>>> df = cudf.read_json(j, lines=True)
>>> df = df.drop(columns='ratings').join(df['ratings'].explode())
>>> df
  product  ratings
0   socks        2
0   socks        4
0   socks        3
1   shoes        5
1   shoes        5
1   shoes        4
1   shoes        3
2  shirts        3
2  shirts        4

Building JSON data solutions#

Sometimes a workflow needs to process JSON data with an object root and cuDF provides tools to build solutions for this kind of data. If you need to process JSON data with an object root, we recommend reading the data as a single JSON Line and then unpacking the resulting dataframe. The following example reads a JSON object as a single line and then extracts the “results” field into a new dataframe.

>>> j = '''{
    "metadata" : {"vehicle":"car"},
    "results": [
        {"id": 0, "distance": 1.2},
        {"id": 1, "distance": 2.4},
        {"id": 2, "distance": 1.7}
    ]
}'''

# first read the JSON object with line=True
>>> df = cudf.read_json(j, lines=True)
>>> df
             metadata                                            results
0  {'vehicle': 'car'}  [{'id': 0, 'distance': 1.2}, {'id': 1, 'distan...

# then explode the 'results' column
>>> df = df['results'].explode().struct.explode()
>>> df
   id  distance
0   0       1.2
1   1       2.4
2   2       1.7