Parquet

class pylibcudf.io.parquet.ChunkedParquetReader(ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, size_t chunk_read_limit=0, size_t pass_read_limit=1024000000, parquet_metadatas=None)

Reads chunks of a Parquet file into a TableWithMetadata.

For details, see chunked_parquet_reader.

Parameters:
options: ParquetReaderOptions

Settings for controlling reading behavior

stream: Stream | None

CUDA stream used for device memory operations and kernel launches

mr: DeviceMemoryResource, optional

Device memory resource used to allocate the returned table’s device memory.

chunk_read_limit: size_t, default 0

Limit on total number of bytes to be returned per read, or 0 if there is no limit.

pass_read_limit: size_t, default 1024000000

Limit on the amount of memory used for reading and decompressing data or 0 if there is no limit.

parquet_metadatas: list[FileMetaData], optional

Pre-materialized parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.

Methods

has_next(self)

Returns True if there is another chunk in the Parquet file to be read.

read_chunk(self, DeviceMemoryResource mr=None)

Read the next chunk into a TableWithMetadata

has_next(self) -> bool

Returns True if there is another chunk in the Parquet file to be read.

Returns:
True if we have not finished reading the file.

read_chunk(self, DeviceMemoryResource mr=None) -> TableWithMetadata

Read the next chunk into a TableWithMetadata

Parameters:
mr: DeviceMemoryResource, optional

Device memory resource used to allocate the returned table’s device memory.

Returns:
TableWithMetadata

The Table and its corresponding metadata (column names) that were read in.
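The has_next and read_chunk methods are meant to be driven together in a loop. The following is a minimal sketch of that loop, assuming pylibcudf is installed on a machine with a CUDA-capable GPU. The helper names iter_chunks and open_chunked_reader, the file path, and the 64 MiB limit are illustrative choices and not part of the library.

```python
def iter_chunks(reader):
    """Yield chunks until the reader reports that nothing is left.

    Works with any object exposing has_next() and read_chunk(), which is the
    interface documented for ChunkedParquetReader.
    """
    while reader.has_next():
        yield reader.read_chunk()


def open_chunked_reader(path, chunk_read_limit=64 * 1024 * 1024):
    """Build a ChunkedParquetReader for a single file (path is a placeholder)."""
    # Imported lazily so iter_chunks stays usable without a GPU install.
    import pylibcudf as plc

    source = plc.io.SourceInfo([path])
    options = plc.io.parquet.ParquetReaderOptions.builder(source).build()
    return plc.io.parquet.ChunkedParquetReader(
        options, chunk_read_limit=chunk_read_limit
    )
```

A caller would typically write `for chunk in iter_chunks(open_chunked_reader("data.parquet")): ...` and process each TableWithMetadata before requesting the next one, so that only one chunk's worth of output is held at a time.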

class pylibcudf.io.parquet.ParquetReaderOptions

The settings to use for read_parquet. For details, see cudf::io::parquet_reader_options.

Methods

builder(SourceInfo source)

Create a ParquetReaderOptionsBuilder object

enable_case_sensitive_names(self, bool val)

Sets whether column names are matched case-sensitively.

is_enabled_case_sensitive_names(self)

Returns whether column name matching is case sensitive.

is_enabled_use_jit_filter(self)

Returns whether to use JIT compilation for filtering.

set_column_indices(self, list col_indices)

Sets indices of the top-level columns to be read.

set_column_names(self, list col_names)

Sets names of the columns to be read.

set_columns(self, list col_names)

Sets names of the columns to be read.

set_filter(self, Expression filter)

Sets AST based filter for predicate pushdown.

set_num_rows(self, int64_t nrows)

Sets number of rows to read.

set_row_groups(self, list row_groups)

Sets list of individual row groups to read.

set_skip_rows(self, int64_t skip_rows)

Sets number of rows to skip.

set_source(self, SourceInfo src)

Set a new source info location.

static builder(SourceInfo source)

Create a ParquetReaderOptionsBuilder object

For details, see cudf::io::parquet_reader_options::builder()

Parameters:
source: SourceInfo

The source to read the Parquet file from.

Returns:
ParquetReaderOptionsBuilder

Builder to build ParquetReaderOptions
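As a rough illustration of how the builder and the setter methods fit together, the sketch below creates an options object and then applies a few settings to it. It assumes the builder exposes a build() method and that SourceInfo accepts a list of paths. make_reader_options and apply_settings are hypothetical helper names, not library functions.

```python
def make_reader_options(path):
    """Create ParquetReaderOptions for one file (path is a placeholder)."""
    # Lazy import: requires a working pylibcudf installation and a GPU.
    import pylibcudf as plc

    source = plc.io.SourceInfo([path])
    # Assumption: the builder returned here is finalized with build().
    return plc.io.parquet.ParquetReaderOptions.builder(source).build()


def apply_settings(options, column_names=None, skip_rows=None, num_rows=None):
    """Apply optional settings through the documented setter methods."""
    if column_names is not None:
        options.set_column_names(list(column_names))
    if skip_rows is not None:
        options.set_skip_rows(skip_rows)
    if num_rows is not None:
        options.set_num_rows(num_rows)
    return options
```

For example, `apply_settings(make_reader_options("data.parquet"), column_names=["a"], num_rows=1000)` would request the first 1000 rows of column "a".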

enable_case_sensitive_names(self, bool val) -> void

Sets whether column names are matched case-sensitively.

Parameters:
val: bool

Whether to match column names case-sensitively.

Returns:
None

is_enabled_case_sensitive_names(self) -> bool

Returns whether column name matching is case sensitive.

Returns:
bool

Whether column names are matched case-sensitively

is_enabled_use_jit_filter(self) -> bool

Returns whether to use JIT compilation for filtering.

set_column_indices(self, list col_indices) -> void

Sets indices of the top-level columns to be read.

Parameters:
col_indices: list

List of top-level column indices

Returns:
None

set_column_names(self, list col_names) -> void

Sets names of the columns to be read.

Parameters:
col_names: list

List of column names

Returns:
None

set_columns(self, list col_names) -> void

Sets names of the columns to be read. Deprecated and will be removed in a future version. Use set_column_names instead.

Parameters:
col_names: list

List of column names

Returns:
None

set_filter(self, Expression filter) -> void

Sets AST based filter for predicate pushdown.

Parameters:
filter: Expression

AST expression to use as filter

Returns:
None

set_num_rows(self, int64_t nrows) -> void

Sets number of rows to read.

Parameters:
nrows: int64_t

Number of rows to read after skip

Returns:
None

Notes

Although this allows one to request more than size_type::max() rows, if any single read would produce a table larger than this row limit, an error is thrown.
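The interaction between set_skip_rows, set_num_rows, and the size_type limit can be modelled in plain Python. The sketch below illustrates the documented behaviour and is not the library's implementation. row_window, rows_in_single_read, and SIZE_TYPE_MAX are hypothetical names, and the sketch assumes that cudf::size_type is a signed 32-bit integer and that requests past the end of the file are clamped.

```python
SIZE_TYPE_MAX = 2**31 - 1  # assumption: cudf::size_type is a signed 32-bit int


def row_window(total_rows, skip_rows=0, num_rows=None):
    """Return the half-open [start, stop) row range a read would cover.

    skip_rows rows are dropped from the start; num_rows then limits how many
    rows follow. Requests past the end of the data are clamped (assumption).
    """
    if skip_rows < 0 or (num_rows is not None and num_rows < 0):
        raise ValueError("skip_rows and num_rows must be non-negative")
    start = min(skip_rows, total_rows)
    if num_rows is None:
        stop = total_rows
    else:
        stop = min(start + num_rows, total_rows)
    return start, stop


def rows_in_single_read(start, stop):
    """Return the row count of one read, raising if it exceeds the limit."""
    count = stop - start
    if count > SIZE_TYPE_MAX:
        raise OverflowError("a single read cannot exceed size_type::max() rows")
    return count
```

In this model, a large nrows value is only a problem when one read would return more than SIZE_TYPE_MAX rows; splitting the work into several reads keeps each table under the limit.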

set_row_groups(self, list row_groups) -> void

Sets list of individual row groups to read.

Parameters:
row_groups: list[list[int]]

Row groups to read, one inner list per input source.

Returns:
None

Notes

Rows are emitted in input-source order; all rows selected from source 0 are emitted before rows selected from source 1, and so on. Within each source, row groups are read in the order provided; indices are not sorted or deduplicated, and repeated indices are emitted multiple times. Empty inner lists contribute no rows. When unset, all row groups are read in source order, then in on-disk order within each source. Predicate pushdown drops row groups in place; remaining row groups keep their relative order.
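The ordering rules in the note above can be expressed as a small pure-Python model. This is only a sketch of the described semantics, not code from the library; emission_order and drop_filtered are hypothetical names, and the pairs it returns are (source index, row group index).

```python
def emission_order(row_groups, groups_per_source=None):
    """Return (source, row_group) pairs in the order their rows are emitted.

    row_groups: one inner list per source, or None when set_row_groups was
        never called.
    groups_per_source: number of row groups in each source; only needed when
        row_groups is None.
    """
    if row_groups is None:
        if groups_per_source is None:
            raise ValueError("groups_per_source is required when row_groups is None")
        # Unset: every row group, source by source, in on-disk order.
        return [(s, g) for s, n in enumerate(groups_per_source) for g in range(n)]
    # Set: keep the caller's order; no sorting, no deduplication.
    return [(s, g) for s, groups in enumerate(row_groups) for g in groups]


def drop_filtered(order, keep):
    """Model predicate pushdown: drop pairs, preserving relative order."""
    return [pair for pair in order if keep(pair)]
```

For instance, `emission_order([[2, 0, 2], [], [1]])` gives `[(0, 2), (0, 0), (0, 2), (2, 1)]`: the repeated index 2 appears twice, and the empty inner list for source 1 contributes nothing.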

set_skip_rows(self, int64_t skip_rows) -> void

Sets number of rows to skip.

Parameters:
skip_rows: int64_t

Number of rows to skip from start

Returns:
None

set_source(self, SourceInfo src) -> void

Set a new source info location.

Parameters:
src: SourceInfo

New source information, replacing existing information.

Returns:
None

pylibcudf.io.parquet.is_supported_read_parquet(compression_type compression) -> bool

Check if the compression type is supported for reading Parquet files.

For details, see is_supported_read_parquet().

Parameters:
compression: CompressionType

The compression type to check

Returns:
bool

True if the compression type is supported for reading Parquet files

pylibcudf.io.parquet.is_supported_write_parquet(compression_type compression) -> bool

Check if the compression type is supported for writing Parquet files.

For details, see is_supported_write_parquet().

Parameters:
compression: CompressionType

The compression type to check

Returns:
bool

True if the compression type is supported for writing Parquet files
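One plausible use of the two support checks is to filter a list of candidate codecs before choosing one. The sketch below assumes that CompressionType is reachable as pylibcudf.io.types.CompressionType and has SNAPPY, ZSTD, and LZ4 members. supported_codecs and probe_parquet_codecs are hypothetical helper names.

```python
def supported_codecs(is_supported, candidates):
    """Return the candidates for which is_supported(candidate) is true."""
    return [codec for codec in candidates if is_supported(codec)]


def probe_parquet_codecs():
    """Return (readable, writable) codec lists for a few common codecs."""
    # Lazy import: requires a working pylibcudf installation.
    import pylibcudf as plc

    # Assumption: these enum members exist under this path.
    ctype = plc.io.types.CompressionType
    candidates = [ctype.SNAPPY, ctype.ZSTD, ctype.LZ4]
    readable = supported_codecs(
        plc.io.parquet.is_supported_read_parquet, candidates
    )
    writable = supported_codecs(
        plc.io.parquet.is_supported_write_parquet, candidates
    )
    return readable, writable
```

Keeping the filter separate from the pylibcudf calls means the same helper can be used with either check function.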

pylibcudf.io.parquet.merge_row_group_metadata(list metdata_list) -> memoryview

Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob.

For details, see merge_row_group_metadata().

Parameters:
metdata_list: list

List of input file metadata

Returns:
memoryview

A parquet-compatible blob that contains the data for all row groups in the list

pylibcudf.io.parquet.read_parquet(ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, parquet_metadatas=None)

Read from Parquet format.

The source to read from and options are encapsulated by the options object.

For details, see read_parquet().

Parameters:
options: ParquetReaderOptions

Settings for controlling reading behavior

stream: Stream | None

CUDA stream used for device memory operations and kernel launches

mr: DeviceMemoryResource, optional

Device memory resource used to allocate the returned table’s device memory.

parquet_metadatas: list[FileMetaData], optional

Pre-materialized parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.
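A whole-file read could look like the sketch below, assuming pylibcudf is installed with GPU support. It also assumes that the result is a TableWithMetadata with a tbl attribute and a column_names() method, which this section does not describe. read_parquet_file and summarize are hypothetical helper names, and the path is a placeholder.

```python
def read_parquet_file(path, column_names=None):
    """Read one Parquet file, optionally restricted to some columns."""
    # Lazy import: requires a working pylibcudf installation and a GPU.
    import pylibcudf as plc

    source = plc.io.SourceInfo([path])
    options = plc.io.parquet.ParquetReaderOptions.builder(source).build()
    if column_names is not None:
        options.set_column_names(list(column_names))
    return plc.io.parquet.read_parquet(options)


def summarize(result):
    """Return (column names, row count) for a TableWithMetadata-like object.

    Assumption: the object exposes column_names() and a tbl with num_rows().
    """
    return list(result.column_names()), result.tbl.num_rows()
```

Calling `summarize(read_parquet_file("data.parquet"))` would then report which columns were read and how many rows came back.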

pylibcudf.io.parquet.write_parquet(ParquetWriterOptions options, stream=None) -> memoryview

Writes a set of columns to parquet format.

Parameters:
options: ParquetWriterOptions

Settings for controlling writing behavior

stream: Stream | None

CUDA stream used for device memory operations and kernel launches

Returns:
memoryview

A blob that contains the file metadata (parquet FileMetadata thrift message) if requested in parquet_writer_options (empty blob otherwise).
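Because the returned blob is empty unless metadata was requested, callers may want to check it before passing it on, for example to merge_row_group_metadata. The sketch below assumes that ParquetWriterOptions has a builder(sink, table) static method finalized with build(), and that SinkInfo accepts a list of paths; neither is described in this section. write_table and has_footer_blob are hypothetical helper names.

```python
def write_table(table, path):
    """Write a pylibcudf Table to a Parquet file and return the blob."""
    # Lazy import: requires a working pylibcudf installation and a GPU.
    import pylibcudf as plc

    sink = plc.io.SinkInfo([path])
    # Assumption: builder(sink, table).build() yields ParquetWriterOptions.
    options = plc.io.parquet.ParquetWriterOptions.builder(sink, table).build()
    return plc.io.parquet.write_parquet(options)


def has_footer_blob(blob):
    """Return True if the writer returned a non-empty metadata blob."""
    return memoryview(blob).nbytes > 0
```

With default options, as in this sketch, has_footer_blob would be expected to return False, since no metadata was requested.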