cudf.DataFrame.to_parquet
- DataFrame.to_parquet(path, engine='cudf', compression='snappy', index=None, partition_cols=None, partition_file_name=None, partition_offsets=None, statistics='ROWGROUP', metadata_file_path=None, int96_timestamps=False, row_group_size_bytes=134217728, row_group_size_rows=None, max_page_size_bytes=None, max_page_size_rows=None, storage_options=None, return_metadata=False, use_dictionary=True, header_version='1.0', skip_compression=None, column_encoding=None, column_type_length=None, output_as_binary=None, *args, **kwargs)
Write a DataFrame to the parquet format.
- Parameters:
- path: str or list of str
File path or root directory path. When writing a partitioned dataset, this is used as the root directory path. Pass a list of str together with partition_offsets to write parts of the dataframe to different files.
- compression: {‘snappy’, ‘ZSTD’, ‘LZ4’, None}, default ‘snappy’
Name of the compression to use; case insensitive. Use None for no compression.
- index: bool, default None
If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the dataframe’s index(es) will be saved as with True, except that any RangeIndex is stored as a range in the metadata rather than as values, which takes less space and is faster. Other indexes will be included as columns in the file output.
- partition_cols: list, optional, default None
Column names by which to partition the dataset. Columns are partitioned in the order they are given.
- partition_file_name: str, optional, default None
File name to use for partitioned datasets. Different partitions will be written to different directories, but all files will have this name. If nothing is specified, a random uuid4 hex string will be used for each file. This parameter is only supported by the ‘cudf’ engine and is ignored by other engines.
- partition_offsets: list, optional, default None
Offsets to partition the dataframe by. Should be used when path is a list of str. Should be a list of integers of size len(path) + 1.
- statistics: {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’
Level at which column statistics should be included in the file.
- metadata_file_path: str, optional, default None
If specified, this function will return a binary blob containing the footer metadata of the written parquet file. The returned blob will have the chunk.file_path field set to the metadata_file_path for each chunk. When used with partition_offsets, it should be the same size as len(path).
- int96_timestamps: bool, default False
If True, write timestamps in int96 format. This will convert timestamps from timestamp[ns], timestamp[ms], timestamp[s], and timestamp[us] to the int96 format, which stores the Julian day number together with the number of nanoseconds since midnight of that day. If False, timestamps will not be altered.
- row_group_size_bytes: integer, default 134217728
Maximum size of each row group of the output. If None, 134217728 (128 MB) will be used.
- row_group_size_rows: integer or None, default None
Maximum number of rows of each row group of the output. If None, 1000000 will be used.
- max_page_size_bytes: integer or None, default None
Maximum uncompressed size of each page of the output. If None, 524288 (512 KB) will be used.
- max_page_size_rows: integer or None, default None
Maximum number of rows of each page of the output. If None, 20000 will be used.
- max_dictionary_size: integer or None, default None
Maximum size of the dictionary page for each output column chunk. Dictionary encoding for column chunks that exceed this limit will be disabled. If None, 1048576 (1 MB) will be used.
- storage_options: dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
- return_metadata: bool, default False
Return parquet metadata for written data. Returned metadata will include the file path metadata (relative to root_path). To request the metadata binary blob when using partition_cols, pass return_metadata=True instead of specifying metadata_file_path.
- use_dictionary: bool, default True
When False, prevents the use of dictionary encoding for Parquet page data. When True, dictionary encoding is preferred subject to max_dictionary_size constraints.
- header_version: {‘1.0’, ‘2.0’}, default ‘1.0’
Controls whether to use version 1.0 or version 2.0 page headers when encoding. Version 1.0 is more portable, but version 2.0 enables the use of newer encoding schemes.
- force_nullable_schema: bool, default False
If True, writes all columns as nullable in the schema. If False, a column is written as nullable only if it contains null values, and as non-nullable otherwise.
- skip_compression: set, optional, default None
If a column name is present in the set, that column will not be compressed, regardless of the compression setting.
- column_encoding: dict, optional, default None
Sets the page encoding to use on a per-column basis. The key is a column name, and the value is one of: ‘PLAIN’, ‘DICTIONARY’, ‘DELTA_BINARY_PACKED’, ‘DELTA_LENGTH_BYTE_ARRAY’, ‘DELTA_BYTE_ARRAY’, ‘BYTE_STREAM_SPLIT’, or ‘USE_DEFAULT’.
- column_type_length: dict, optional, default None
Specifies the width in bytes of FIXED_LEN_BYTE_ARRAY column elements. The key is a column name and the value is an integer. The named column will be output as unannotated binary (i.e. the column will behave as if output_as_binary was set).
- output_as_binary: set, optional, default None
If a column name is present in the set, that column will be output as unannotated binary, rather than the default ‘UTF-8’.
- **kwargs
Additional parameters will be passed to execution engines other than cudf.
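Two of the parameters above involve arithmetic that is easy to get wrong: partition_offsets needs exactly len(path) + 1 row boundaries, and int96_timestamps splits each timestamp into a day number and a time of day. The sketch below works through both in plain Python, so it runs without a GPU. The helper names even_partition_offsets and int96_parts and the file names are hypothetical and are not part of cuDF. The INT96 arithmetic assumes the commonly used layout (Julian day number plus nanoseconds within that day) and is not taken from cuDF's implementation.

```python
# Illustrative helpers only; none of these names exist in cuDF.

def even_partition_offsets(num_rows, num_files):
    """Return num_files + 1 row offsets splitting num_rows rows as evenly as possible."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    base, extra = divmod(num_rows, num_files)
    offsets = [0]
    for i in range(num_files):
        # The first `extra` files each take one additional row.
        offsets.append(offsets[-1] + base + (1 if i < extra else 0))
    return offsets


# Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def int96_parts(unix_nanos):
    """Split nanoseconds since the Unix epoch into (julian_day, nanos_of_day)."""
    # divmod floors, so timestamps before 1970 still yield a non-negative time of day.
    days, nanos_of_day = divmod(unix_nanos, NANOS_PER_DAY)
    return JULIAN_DAY_OF_UNIX_EPOCH + days, nanos_of_day


paths = ["part0.parquet", "part1.parquet", "part2.parquet"]  # hypothetical file names
offsets = even_partition_offsets(10, len(paths))
print(offsets)  # [0, 4, 7, 10]: file i receives rows offsets[i] up to offsets[i + 1]

# With a 10-row cudf DataFrame `df` (assumed, not created here), the call would be:
#     df.to_parquet(paths, partition_offsets=offsets)

print(int96_parts(0))                  # (2440588, 0): midnight on 1970-01-01
print(int96_parts(NANOS_PER_DAY + 1))  # (2440589, 1): one nanosecond into 1970-01-02
```

The offsets list always starts at 0 and ends at the row count, which is why it is one element longer than the list of paths.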
See also