Parquet#

class pylibcudf.io.parquet.ChunkedParquetReader( ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, size_t chunk_read_limit=0, size_t pass_read_limit=1024000000, parquet_metadatas=None, )#

Reads chunks of a Parquet file into a TableWithMetadata.

For details, see chunked_parquet_reader.

Parameters:

options (ParquetReaderOptions): Settings for controlling reading behavior.
stream (Stream | None): CUDA stream used for device memory operations and kernel launches.
mr (DeviceMemoryResource, optional): Device memory resource used to allocate the returned table’s device memory.
chunk_read_limit (size_t, default 0): Limit on the total number of bytes returned per read, or 0 if there is no limit.
pass_read_limit (size_t, default 1024000000): Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit.
parquet_metadatas (list[FileMetaData], optional): Pre-materialized Parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.

Methods

`has_next`(self)	Returns True if there is another chunk in the Parquet file to be read.
`read_chunk`(self, DeviceMemoryResource mr=None)	Read the next chunk into a `TableWithMetadata`.

has_next(self) → bool#

Returns True if there is another chunk in the Parquet file to be read.

Returns:

True if we have not finished reading the file.

read_chunk( self, DeviceMemoryResource mr=None, ) → TableWithMetadata#

Read the next chunk into a TableWithMetadata.

Parameters:

mr (DeviceMemoryResource, optional): Device memory resource used to allocate the returned table’s device memory.

Returns:

TableWithMetadata: The Table and its corresponding metadata (column names) that were read in.

class pylibcudf.io.parquet.ParquetReaderOptions#

The settings to use for read_parquet.

For details, see cudf::io::parquet_reader_options.

Methods

`builder`(SourceInfo source)	Create a ParquetReaderOptionsBuilder object
`enable_case_sensitive_names`(self, bool val)	Sets whether column names are matched case-sensitively.
`is_enabled_case_sensitive_names`(self)	Returns whether column name matching is case sensitive.
`is_enabled_use_jit_filter`(self)	Returns whether to use JIT compilation for filtering.
`set_column_indices`(self, list col_indices)	Sets indices of the top-level columns to be read.
`set_column_names`(self, list col_names)	Sets names of the columns to be read.
`set_columns`(self, list col_names)	Sets names of the columns to be read.
`set_filter`(self, Expression filter)	Sets AST based filter for predicate pushdown.
`set_num_rows`(self, int64_t nrows)	Sets number of rows to read.
`set_row_groups`(self, list row_groups)	Sets list of individual row groups to read.
`set_skip_rows`(self, int64_t skip_rows)	Sets number of rows to skip.
`set_source`(self, SourceInfo src)	Set a new source info location.

static builder(SourceInfo source)#

Create a ParquetReaderOptionsBuilder object

For details, see cudf::io::parquet_reader_options::builder()

Parameters:

source (SourceInfo): The source to read the Parquet file from.

Returns:

ParquetReaderOptionsBuilder: Builder to build ParquetReaderOptions

enable_case_sensitive_names( self, bool val, ) → void#

Sets whether column names are matched case-sensitively.

Parameters:

val (bool): Enables case-sensitive matching

Returns:

None

is_enabled_case_sensitive_names(self) → bool#

Returns whether column name matching is case sensitive.

Returns:

bool: Whether column names are matched case-sensitively

is_enabled_use_jit_filter(self) → bool#

Returns whether to use JIT compilation for filtering.

set_column_indices( self, list col_indices, ) → void#

Sets indices of the top-level columns to be read.

Parameters:

col_indices (list): List of top-level column indices

Returns:

None

set_column_names(self, list col_names) → void#

Sets names of the columns to be read.

Parameters:

col_names (list): List of column names

Returns:

None

set_columns(self, list col_names) → void#

Sets names of the columns to be read. Deprecated and will be removed in a future version. Use set_column_names instead.

Parameters:

col_names (list): List of column names

Returns:

None

set_filter(self, Expression filter) → void#

Sets AST based filter for predicate pushdown.

Parameters:

filter (Expression): AST expression to use as filter

Returns:

None

set_num_rows(self, int64_t nrows) → void#

Sets number of rows to read.

Parameters:

nrows (int64_t): Number of rows to read after skip

Returns:

None

Notes

Although more than size_type::max() rows can be requested, an error is thrown if any single read would produce a table with more rows than that limit.

set_row_groups(self, list row_groups) → void#

Sets list of individual row groups to read.

Parameters:

row_groups (list[list[int]]): Row groups to read, one inner list per input source.

Returns:

None

Notes

Rows are emitted in input-source order; all rows selected from source 0 are emitted before rows selected from source 1, and so on. Within each source, row groups are read in the order provided; indices are not sorted or deduplicated, and repeated indices are emitted multiple times. Empty inner lists contribute no rows. When unset, all row groups are read in source order, then in on-disk order within each source. Predicate pushdown drops row groups in place; remaining row groups keep their relative order.
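
The ordering rules in the note above can be modeled in a few lines of plain Python. This is only a sketch of the documented emission order, not pylibcudf's implementation; `emission_order` is a hypothetical helper name.

```python
from typing import List, Tuple


def emission_order(row_groups: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (source_index, row_group_index) pairs in the documented order."""
    order = []
    # Sources are visited in input order.
    for source_index, groups in enumerate(row_groups):
        # Row groups are taken as given: no sorting, no deduplication.
        for group_index in groups:
            order.append((source_index, group_index))
    return order


# Source 0 repeats row group 2, source 1 selects nothing, source 2 selects one.
order = emission_order([[2, 0, 2], [], [1]])
print(order)  # [(0, 2), (0, 0), (0, 2), (2, 1)]
```

Note how the repeated index appears twice and the empty inner list contributes nothing, matching the note.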

set_skip_rows(self, int64_t skip_rows) → void#

Sets number of rows to skip.

Parameters:

skip_rows (int64_t): Number of rows to skip from start

Returns:

None
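
Taken together, set_skip_rows and set_num_rows describe a window over the rows of the input: skip the first skip_rows rows, then read up to nrows rows. The plain-Python model below illustrates that reading of the two settings. `select_rows` is a hypothetical helper, and the clamping at the end of the data is an assumption of this model rather than documented behavior.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_rows(
    rows: Sequence[T], skip_rows: int = 0, num_rows: Optional[int] = None
) -> List[T]:
    """Model skip_rows / num_rows as a half-open window over the rows."""
    if skip_rows < 0 or (num_rows is not None and num_rows < 0):
        raise ValueError("skip_rows and num_rows must be non-negative")
    # num_rows counts rows *after* the skip, so the window ends at skip + num.
    end = None if num_rows is None else skip_rows + num_rows
    return list(rows[skip_rows:end])


rows = list(range(10))
window = select_rows(rows, skip_rows=3, num_rows=4)
print(window)  # [3, 4, 5, 6]
```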

set_source(self, SourceInfo src) → void#

Set a new source info location.

Parameters:

src (SourceInfo): New source information, replacing existing information.

Returns:

None

pylibcudf.io.parquet.is_supported_read_parquet(compression_type compression) → bool#

Check if the compression type is supported for reading Parquet files.

For details, see is_supported_read_parquet().

Parameters:

compression (CompressionType): The compression type to check

Returns:

bool: True if the compression type is supported for reading Parquet files

pylibcudf.io.parquet.is_supported_write_parquet(compression_type compression) → bool#

Check if the compression type is supported for writing Parquet files.

For details, see is_supported_write_parquet().

Parameters:

compression (CompressionType): The compression type to check

Returns:

bool: True if the compression type is supported for writing Parquet files

pylibcudf.io.parquet.merge_row_group_metadata(list metdata_list) → memoryview#

Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob.

For details, see merge_row_group_metadata().

Parameters:

metdata_list (list): List of input file metadata

Returns:

memoryview: A parquet-compatible blob that contains the data for all row groups in the list

pylibcudf.io.parquet.read_parquet( ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, parquet_metadatas=None, )#

Read from Parquet format.

The source to read from and options are encapsulated by the options object.

For details, see read_parquet().

Parameters:

options (ParquetReaderOptions): Settings for controlling reading behavior.
stream (Stream | None): CUDA stream used for device memory operations and kernel launches.
mr (DeviceMemoryResource, optional): Device memory resource used to allocate the returned table’s device memory.
parquet_metadatas (list[FileMetaData], optional): Pre-materialized Parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.
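
A minimal end-to-end sketch of a read might look like the following. It assumes two things from the wider pylibcudf API that this section does not show: that `pylibcudf.io.SourceInfo` accepts a list of sources, and that the builder returned by `ParquetReaderOptions.builder` exposes a `build()` method. `read_parquet_columns` is a hypothetical helper name. The import is deferred into the function so the definition can be loaded on a machine without a GPU.

```python
from typing import Any, Optional, Sequence


def read_parquet_columns(
    path: str, column_names: Optional[Sequence[str]] = None
) -> Any:
    """Read a Parquet file, optionally restricted to the named columns."""
    # Imported lazily so this module can be loaded without pylibcudf installed.
    import pylibcudf as plc

    # Assumed from the wider pylibcudf API, not shown in this section:
    # SourceInfo takes a list of sources, and the builder exposes build().
    source = plc.io.SourceInfo([path])
    options = plc.io.parquet.ParquetReaderOptions.builder(source).build()

    # Setters return None, so they are applied as separate statements.
    if column_names is not None:
        options.set_column_names(list(column_names))

    return plc.io.parquet.read_parquet(options)
```

Calling `read_parquet_columns("data.parquet", ["a", "b"])` on a machine with pylibcudf and a supported GPU would then read only those two columns; the file name here is illustrative.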

pylibcudf.io.parquet.write_parquet(ParquetWriterOptions options, stream=None) → memoryview#

Writes a set of columns to parquet format.

Parameters:

options (ParquetWriterOptions): Settings for controlling writing behavior.
stream (Stream | None): CUDA stream used for device memory operations and kernel launches.

Returns:

memoryview: A blob that contains the file metadata (parquet FileMetadata thrift message) if requested in parquet_writer_options (empty blob otherwise).