cudf.io.parquet.ParquetDatasetWriter

class cudf.io.parquet.ParquetDatasetWriter(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None)

Write a parquet file or dataset incrementally

Parameters:
path : str

A local directory path or S3 URL. Will be used as root directory path while writing a partitioned dataset.

partition_cols : list

Column names by which to partition the dataset. Columns are partitioned in the order they are given.

index : bool, default None

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, index(es) other than RangeIndex will be saved as columns.

compression : {‘snappy’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’

Level at which column statistics should be included in the file.

max_file_size : int or str, default None

The maximum file size that the writer will not exceed. If given as an int, it is interpreted as a size in bytes. The size can also be given as a str in a form such as “10 MB” or “1 GB”. If this parameter is used, file_name_prefix must also be passed (see the sketch after this parameter list).

file_name_prefix : str

Prefix for the generated file names; used only when max_file_size is specified.

storage_options : dict, optional, default None

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. those starting with “s3://” or “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
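A minimal sketch combining these options, assuming an S3 target; the bucket URL, the “100 MB” limit, and the storage_options key shown (s3fs’s anon) are illustrative placeholders, and the accepted keys depend on the fsspec backend in use:

>>> import cudf
>>> from cudf.io.parquet import ParquetDatasetWriter
>>> df = cudf.DataFrame({"a": [1, 2, 1, 2], "b": [10, 20, 30, 40]})
>>> cw = ParquetDatasetWriter(
...     "s3://my-bucket/dataset",         # hypothetical bucket
...     partition_cols=["a"],
...     max_file_size="100 MB",           # cap each output file at ~100 MB
...     file_name_prefix="part",          # required when max_file_size is set
...     storage_options={"anon": False},  # backend-specific (s3fs) option
... )
>>> cw.write_table(df)
>>> cw.close()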

Examples

Using a context manager

>>> import cudf
>>> from cudf.io.parquet import ParquetDatasetWriter
>>> df1 = cudf.DataFrame({"a": [1, 1, 2, 2, 1], "b": [9, 8, 7, 6, 5]})
>>> df2 = cudf.DataFrame({"a": [1, 3, 3, 1, 3], "b": [4, 3, 2, 1, 0]})
>>> with ParquetDatasetWriter("./dataset", partition_cols=["a"]) as cw:
...     cw.write_table(df1)
...     cw.write_table(df2)

By manually calling close()

>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> cw.close()

Both methods will generate the same directory structure:

dataset/
    a=1
        <filename>.parquet
    a=2
        <filename>.parquet
    a=3
        <filename>.parquet
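As a hedged follow-up, the partitioned output can typically be read back into a single DataFrame; this assumes cudf.read_parquet’s directory and Hive-style partition discovery for the “a=…” subdirectories:

>>> # Read the partitioned dataset back; the partition column "a" is
>>> # reconstructed from the directory names.
>>> full = cudf.read_parquet("./dataset")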

Methods

close([return_metadata])

Close all open files and optionally return footer metadata as a binary blob (see the sketch below).

write_table(df)

Write a dataframe to the file/dataset
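A short sketch of the close(return_metadata=True) path, reusing df1 from the examples above; what to do with the returned footer blob (e.g. persisting it alongside the dataset) is left to the caller and is not prescribed here:

>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> metadata_blob = cw.close(return_metadata=True)  # binary footer metadata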