cudf.io.parquet.ParquetDatasetWriter
- class cudf.io.parquet.ParquetDatasetWriter(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None)
Write a parquet file or dataset incrementally
- Parameters:
- path : str
A local directory path or S3 URL. Will be used as the root directory path while writing a partitioned dataset.
- partition_cols : list
Column names by which to partition the dataset. Columns are partitioned in the order they are given.
- index : bool, default None
If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, index(es) other than RangeIndex will be saved as columns.
- compression : {‘snappy’, None}, default ‘snappy’
Name of the compression to use. Use None for no compression.
- statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’
Level at which column statistics should be included in the file.
- max_file_size : int or str, default None
A file size that cannot be exceeded by the writer. If the input is an int, it is interpreted as bytes. The size can also be a str of the form “10 MB”, “1 GB”, etc. If this parameter is used, it is mandatory to pass file_name_prefix (see the sketch after this parameter list).
- file_name_prefix : str
A prefix for the file names generated; only used when max_file_size is specified.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
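The pairing of max_file_size and file_name_prefix is easiest to see in code. The following is a minimal sketch; the path, size limit, prefix, and sample data are illustrative placeholders, not part of the API.
>>> import cudf
>>> from cudf.io.parquet import ParquetDatasetWriter
>>> df = cudf.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})
>>> cw = ParquetDatasetWriter(
...     "./dataset",
...     partition_cols=["a"],
...     max_file_size="100 MB",    # int (bytes) or a str such as "10 MB" or "1 GB"
...     file_name_prefix="chunk",  # mandatory whenever max_file_size is given
... )
>>> cw.write_table(df)
>>> cw.close()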
Examples
Using a context manager
>>> import cudf
>>> from cudf.io.parquet import ParquetDatasetWriter
>>> df1 = cudf.DataFrame({"a": [1, 1, 2, 2, 1], "b": [9, 8, 7, 6, 5]})
>>> df2 = cudf.DataFrame({"a": [1, 3, 3, 1, 3], "b": [4, 3, 2, 1, 0]})
>>> with ParquetDatasetWriter("./dataset", partition_cols=["a"]) as cw:
...     cw.write_table(df1)
...     cw.write_table(df2)
By manually calling close()
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> cw.close()
Both methods will generate the same directory structure:
dataset/
    a=1
        <filename>.parquet
    a=2
        <filename>.parquet
    a=3
        <filename>.parquet
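Writing to a remote store follows the same pattern. The sketch below is illustrative: the S3 bucket is a placeholder, and the keys accepted in storage_options depend on the underlying fsspec filesystem (s3fs for “s3://” URLs).
>>> cw = ParquetDatasetWriter(
...     "s3://my-bucket/dataset",         # placeholder bucket/prefix
...     partition_cols=["a"],
...     storage_options={"anon": False},  # forwarded to fsspec.open
... )
>>> cw.write_table(df1)
>>> cw.close()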
Methods
close([return_metadata])
Close all open files and optionally return footer metadata as a binary blob.
write_table(df)
Write a dataframe to the file/dataset.
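A minimal sketch of the return_metadata option listed above; the variable name is illustrative. Passing return_metadata=True to close() returns the footer metadata as a binary blob.
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> footer_metadata = cw.close(return_metadata=True)  # footer metadata as a binary blob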