Parquet
- class pylibcudf.io.parquet.ChunkedParquetReader(ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, size_t chunk_read_limit=0, size_t pass_read_limit=1024000000, parquet_metadatas=None)
Reads chunks of a Parquet file into a TableWithMetadata.
For details, see chunked_parquet_reader.
- Parameters:
- options: ParquetReaderOptions
Settings for controlling reading behavior.
- stream: Stream | None
CUDA stream used for device memory operations and kernel launches.
- mr: DeviceMemoryResource, optional
Device memory resource used to allocate the returned table’s device memory.
- chunk_read_limit: size_t, default 0
Limit on the total number of bytes returned per read, or 0 for no limit.
- pass_read_limit: size_t, default 1024000000
Limit on the amount of memory used for reading and decompressing data, or 0 for no limit.
- parquet_metadatas: list[FileMetaData], optional
Pre-materialized Parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.
Methods
has_next(self): Returns True if there is another chunk in the Parquet file to be read.
read_chunk(self, DeviceMemoryResource mr=None): Reads the next chunk into a TableWithMetadata.
- has_next(self) -> bool
Returns True if there is another chunk in the Parquet file to be read.
- Returns:
- True if we have not finished reading the file.
- read_chunk(self, DeviceMemoryResource mr=None) -> TableWithMetadata
Reads the next chunk into a TableWithMetadata.
- Parameters:
- mr: DeviceMemoryResource, optional
Device memory resource used to allocate the returned table’s device memory.
- Returns:
- TableWithMetadata
The Table and its corresponding metadata (column names) that were read in.
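The has_next and read_chunk methods suggest a simple read loop, sketched below under stated assumptions rather than as a definitive implementation. The helper names drain_chunks and open_chunked_reader are hypothetical. That SourceInfo accepts a list of paths, and that the builder exposes a build() method, are assumptions not stated in this section.

```python
def drain_chunks(reader):
    """Collect every chunk from an object exposing has_next() and read_chunk()."""
    chunks = []
    # Keep reading until the reader reports that no chunk is left.
    while reader.has_next():
        chunks.append(reader.read_chunk())
    return chunks


def open_chunked_reader(path, chunk_read_limit=0, pass_read_limit=1024000000):
    """Hypothetical helper: build a ChunkedParquetReader for a single file."""
    # Imported lazily so drain_chunks can be exercised without a GPU.
    import pylibcudf as plc

    # Assumption: SourceInfo takes a list of file paths.
    source = plc.io.SourceInfo([path])
    # Assumption: the builder returned by builder() finalizes with build().
    options = plc.io.parquet.ParquetReaderOptions.builder(source).build()
    return plc.io.parquet.ChunkedParquetReader(
        options,
        chunk_read_limit=chunk_read_limit,
        pass_read_limit=pass_read_limit,
    )
```

With a GPU available, `drain_chunks(open_chunked_reader("data.parquet", chunk_read_limit=256_000_000))` would return a list of TableWithMetadata objects, each bounded by the chunk limit; the file name and limit here are illustrative only.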
- class pylibcudf.io.parquet.ParquetReaderOptions
The settings to use for read_parquet.
For details, see cudf::io::parquet_reader_options.
Methods
builder(SourceInfo source): Create a ParquetReaderOptionsBuilder object.
enable_case_sensitive_names(self, bool val): Sets whether column names are matched case-sensitively.
is_enabled_case_sensitive_names(self): Returns whether column name matching is case sensitive.
Returns whether to use JIT compilation for filtering.
set_column_indices(self, list col_indices): Sets indices of the top-level columns to be read.
set_column_names(self, list col_names): Sets names of the columns to be read.
set_columns(self, list col_names): Sets names of the columns to be read (deprecated; use set_column_names).
set_filter(self, Expression filter): Sets AST-based filter for predicate pushdown.
set_num_rows(self, int64_t nrows): Sets number of rows to read.
set_row_groups(self, list row_groups): Sets list of individual row groups to read.
set_skip_rows(self, int64_t skip_rows): Sets number of rows to skip.
set_source(self, SourceInfo src): Set a new source info location.
- static builder(SourceInfo source)
Create a ParquetReaderOptionsBuilder object.
For details, see cudf::io::parquet_reader_options::builder().
- Parameters:
- source: SourceInfo
The source to read the Parquet file from.
- Returns:
- ParquetReaderOptionsBuilder
Builder to build ParquetReaderOptions
- enable_case_sensitive_names(self, bool val) -> void
Sets whether column names are matched case-sensitively.
- Parameters:
- val: bool
If True, column names are matched case-sensitively.
- Returns:
- None
- is_enabled_case_sensitive_names(self) -> bool
Returns whether column name matching is case sensitive.
- Returns:
- bool
Whether column names are matched case-sensitively
- set_column_indices(self, list col_indices) -> void
Sets indices of the top-level columns to be read.
- Parameters:
- col_indices: list
List of top-level column indices.
- Returns:
- None
- set_column_names(self, list col_names) -> void
Sets names of the columns to be read.
- Parameters:
- col_names: list
List of column names.
- Returns:
- None
- set_columns(self, list col_names) -> void
Sets names of the columns to be read. Deprecated and will be removed in a future version; use set_column_names instead.
- Parameters:
- col_names: list
List of column names.
- Returns:
- None
- set_filter(self, Expression filter) -> void
Sets AST-based filter for predicate pushdown.
- Parameters:
- filter: Expression
AST expression to use as the filter.
- Returns:
- None
- set_num_rows(self, int64_t nrows) -> void
Sets number of rows to read.
- Parameters:
- nrows: int64_t
Number of rows to read, counted after any skipped rows.
- Returns:
- None
Notes
Although this allows one to request more than size_type::max() rows, if any single read would produce a table larger than this row limit, an error is thrown.
- set_row_groups(self, list row_groups) -> void
Sets list of individual row groups to read.
- Parameters:
- row_groups: list[list[int]]
Row groups to read, one inner list per input source.
- Returns:
- None
Notes
Rows are emitted in input-source order; all rows selected from source 0 are emitted before rows selected from source 1, and so on. Within each source, row groups are read in the order provided; indices are not sorted or deduplicated, and repeated indices are emitted multiple times. Empty inner lists contribute no rows. When unset, all row groups are read in source order, then in on-disk order within each source. Predicate pushdown drops row groups in place; remaining row groups keep their relative order.
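The ordering rules in the note above can be modeled in plain Python. This is an illustrative sketch only: emission_order is a hypothetical helper that mirrors the documented ordering for an explicit row_groups argument; it does not call pylibcudf and does not model predicate pushdown.

```python
def emission_order(row_groups):
    """Return (source_index, row_group_index) pairs in emission order.

    row_groups has one inner list per input source, as passed to
    set_row_groups.
    """
    order = []
    # Sources are visited in input order: all of source 0, then source 1, ...
    for source_index, groups in enumerate(row_groups):
        # Within a source, indices are used exactly as given:
        # not sorted, not deduplicated. An empty list adds nothing.
        for group_index in groups:
            order.append((source_index, group_index))
    return order


# Source 0 repeats row group 2, source 1 is empty, source 2 reads group 1.
print(emission_order([[2, 0, 2], [], [1]]))
# [(0, 2), (0, 0), (0, 2), (2, 1)]
```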
- set_skip_rows(self, int64_t skip_rows) -> void
Sets number of rows to skip.
- Parameters:
- skip_rows: int64_t
Number of rows to skip from the start.
- Returns:
- None
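Taken together, set_skip_rows and set_num_rows select a contiguous window of rows. The sketch below models that window in plain Python without calling pylibcudf. row_window is a hypothetical helper, and clamping the window to the end of the data is an assumption, not behavior documented in this section.

```python
def row_window(total_rows, skip_rows=0, num_rows=None):
    """Return the half-open (start, stop) row range selected by the options."""
    # Skipping past the end leaves an empty window (assumed behavior).
    start = min(skip_rows, total_rows)
    if num_rows is None:
        # No row limit: read everything after the skipped rows.
        stop = total_rows
    else:
        # num_rows counts rows after the skip; clamp to the available rows.
        stop = min(start + num_rows, total_rows)
    return start, stop


print(row_window(100, skip_rows=10, num_rows=20))  # (10, 30)
```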
- set_source(self, SourceInfo src) -> void
Set a new source info location.
- Parameters:
- src: SourceInfo
New source information, replacing existing information.
- Returns:
- None
- pylibcudf.io.parquet.is_supported_read_parquet(compression_type compression) -> bool
Check if the compression type is supported for reading Parquet files.
For details, see is_supported_read_parquet().
- Parameters:
- compression: CompressionType
The compression type to check
- Returns:
- bool
True if the compression type is supported for reading Parquet files
- pylibcudf.io.parquet.is_supported_write_parquet(compression_type compression) -> bool
Check if the compression type is supported for writing Parquet files.
For details, see is_supported_write_parquet().
- Parameters:
- compression: CompressionType
The compression type to check
- Returns:
- bool
True if the compression type is supported for writing Parquet files
- pylibcudf.io.parquet.merge_row_group_metadata(list metdata_list) -> memoryview
Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob.
For details, see merge_row_group_metadata().
- Parameters:
- metdata_list: list
List of input file metadata
- Returns:
- memoryview
A parquet-compatible blob that contains the data for all row groups in the list
- pylibcudf.io.parquet.read_parquet(ParquetReaderOptions options, stream=None, DeviceMemoryResource mr=None, parquet_metadatas=None)
Read from Parquet format.
The source to read from and options are encapsulated by the options object.
For details, see read_parquet().
- Parameters:
- options: ParquetReaderOptions
Settings for controlling reading behavior.
- stream: Stream | None
CUDA stream used for device memory operations and kernel launches.
- mr: DeviceMemoryResource, optional
Device memory resource used to allocate the returned table’s device memory.
- parquet_metadatas: list[FileMetaData], optional
Pre-materialized parquet footer metadata, one for each source. If not provided, footers are read from the sources internally.
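A hedged sketch of how the options object and read_parquet might fit together, using only the setters documented above. The helper name read_selected_rows is hypothetical. That SourceInfo accepts a list of paths, and that the builder exposes a build() method, are assumptions not confirmed by this section.

```python
def read_selected_rows(path, column_names, skip_rows=0, num_rows=None):
    """Hypothetical helper: read some columns and rows of one Parquet file."""
    # Imported inside the function so that defining it does not need a GPU.
    import pylibcudf as plc

    # Assumption: SourceInfo takes a list of file paths.
    source = plc.io.SourceInfo([path])
    # Assumption: the builder returned by builder() finalizes with build().
    options = plc.io.parquet.ParquetReaderOptions.builder(source).build()

    # Restrict the read to the requested columns and row window.
    options.set_column_names(list(column_names))
    options.set_skip_rows(skip_rows)
    if num_rows is not None:
        options.set_num_rows(num_rows)

    # The source and all settings travel inside the options object.
    return plc.io.parquet.read_parquet(options)
```

Configuring everything on the options object before the call keeps read_parquet itself to a single argument; stream, mr, and parquet_metadatas are left at their defaults here.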
- pylibcudf.io.parquet.write_parquet(ParquetWriterOptions options, stream=None) -> memoryview
Writes a set of columns to Parquet format.
- Parameters:
- options: ParquetWriterOptions
Settings for controlling writing behavior.
- stream: Stream | None
CUDA stream used for device memory operations and kernel launches
- Returns:
- memoryview
A blob that contains the file metadata (parquet FileMetadata thrift message) if requested in parquet_writer_options (empty blob otherwise).