cudf::io::parquet::experimental::hybrid_scan_multifile Class Reference

Multi-file variant of the experimental Hybrid Scan Parquet reader.

#include <hybrid_scan_multifile.hpp>

Public Member Functions

 hybrid_scan_multifile (cudf::host_span< cudf::host_span< uint8_t const > const > footer_bytes, parquet_reader_options const &options)
 Constructor for the multi-file experimental Parquet reader.
 
 hybrid_scan_multifile (cudf::host_span< FileMetaData const > parquet_metadata, parquet_reader_options const &options)
 Constructor for the multi-file experimental Parquet reader.
 
 ~hybrid_scan_multifile ()
 Destructor for the multi-file experimental Parquet reader.
 
std::vector< FileMetaData > parquet_metadatas () const
 Get the Parquet metadata for all sources.
 
std::vector< byte_range_info > page_index_byte_ranges () const
 Get byte ranges of the page index for all sources.
 
void setup_page_indexes (cudf::host_span< cudf::host_span< uint8_t const > const > page_index_bytes) const
 Set up the per-source page index within each Parquet file metadata.
 
std::vector< std::vector< size_type > > all_row_groups (parquet_reader_options const &options) const
 Get all available per-source row group indices from the Parquet files.
 
size_type total_rows_in_row_groups (cudf::host_span< std::vector< size_type > const > row_group_indices) const
 Get the total number of top-level rows in the per-source row groups.
 
void reset_column_selection () const
 Resets the current column selection.
 
std::vector< std::vector< size_type > > filter_row_groups_with_byte_range (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options) const
 Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read).
 
std::vector< std::vector< size_type > > filter_row_groups_with_stats (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
 Filter the input row groups using column chunk statistics.
 
std::pair< std::vector< byte_range_info >, std::vector< byte_range_info > > secondary_filters_byte_ranges (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options) const
 Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.
 

Detailed Description

Multi-file variant of the experimental Hybrid Scan Parquet reader.

Vectorizes the hybrid_scan_reader APIs to support multiple Parquet sources. Inputs and outputs are indexed by source order. The one exception is the row mask, which is a single BOOL8 column spanning all rows from all sources, concatenated in source order and then in row-group order within each source.
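The row mask layout described above can be illustrated with a small standalone sketch. It does not use cudf: the aliases row_counts and row_group_selection and the function row_mask_offsets are hypothetical names introduced only for this illustration, and the sketch assumes that "row-group order" means the order in which row groups appear in the per-source selection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins, not cudf types:
// rows[s][g] is the row count of row group g in source s,
// selected[s] lists the row groups chosen from source s.
using row_counts          = std::vector<std::vector<std::int64_t>>;
using row_group_selection = std::vector<std::vector<std::int32_t>>;

// For every selected row group, compute the index of its first row in a
// mask laid out source by source, then row group by row group.
std::vector<std::vector<std::int64_t>> row_mask_offsets(
  row_counts const& rows, row_group_selection const& selected)
{
  assert(rows.size() == selected.size());  // one entry per source
  std::vector<std::vector<std::int64_t>> offsets(selected.size());
  std::int64_t next = 0;  // running row count across all sources
  for (std::size_t s = 0; s < selected.size(); ++s) {
    for (auto rg : selected[s]) {
      offsets[s].push_back(next);
      next += rows[s].at(static_cast<std::size_t>(rg));
    }
  }
  return offsets;
}
```

With three sources holding row groups of {10, 20}, {5} and {7, 8, 9} rows, selecting {0, 1}, {0} and {1, 2} places the second source's rows at mask offset 30 and the third source's selected row groups at offsets 35 and 43.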

Note
Detailed usage documentation will be added once all APIs are in place. This reader will eventually move to hybrid_scan.hpp, and the existing single-file reader (hybrid_scan_reader) will become its subclass. It is kept separate for now only to reduce noise.

Definition at line 52 of file hybrid_scan_multifile.hpp.

Constructor & Destructor Documentation

◆ hybrid_scan_multifile() [1/2]

cudf::io::parquet::experimental::hybrid_scan_multifile::hybrid_scan_multifile ( cudf::host_span< cudf::host_span< uint8_t const > const >  footer_bytes,
parquet_reader_options const &  options 
)
explicit

Constructor for the multi-file experimental Parquet reader.

Parameters
footer_bytes    Host span of Parquet file footer byte spans, one per source
options         Parquet reader options
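As background for preparing footer_bytes, the sketch below locates the footer inside one in-memory Parquet file. It relies only on the Parquet file layout, where a plaintext-footer file ends with the footer, a 4-byte little-endian footer length, and the magic "PAR1". The names footer_range and locate_footer are hypothetical, and whether the constructor expects exactly this span (without the trailing length and magic) is an assumption, not something this page states.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical helper type: location of the footer within the file buffer.
struct footer_range {
  std::size_t offset;  // byte offset of the first footer byte
  std::size_t size;    // footer length in bytes
};

// Returns the footer location, or nullopt if the buffer does not end with
// a well-formed plaintext-footer tail.
std::optional<footer_range> locate_footer(std::vector<std::uint8_t> const& file)
{
  constexpr std::size_t magic_size = 4;
  constexpr std::size_t tail_size  = 8;  // footer length + trailing magic
  if (file.size() < magic_size + tail_size) { return std::nullopt; }

  auto const* tail = file.data() + file.size() - tail_size;
  if (std::memcmp(tail + 4, "PAR1", magic_size) != 0) { return std::nullopt; }

  // Decode the 4-byte little-endian footer length byte by byte.
  std::size_t length = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    length |= static_cast<std::size_t>(tail[i]) << (8 * i);
  }
  // The footer must fit between the leading magic and the tail.
  if (length > file.size() - magic_size - tail_size) { return std::nullopt; }
  return footer_range{file.size() - tail_size - length, length};
}
```

Running this once per source yields one byte span per file, which matches the "one per source" shape of footer_bytes. In practice only the file tail needs to be fetched; the whole-file buffer here just keeps the sketch short.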

◆ hybrid_scan_multifile() [2/2]

cudf::io::parquet::experimental::hybrid_scan_multifile::hybrid_scan_multifile ( cudf::host_span< FileMetaData const >  parquet_metadata,
parquet_reader_options const &  options 
)
explicit

Constructor for the multi-file experimental Parquet reader.

Parameters
parquet_metadata    Host span of pre-populated Parquet file metadata, one per source
options             Parquet reader options

Member Function Documentation

◆ all_row_groups()

std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::all_row_groups ( parquet_reader_options const &  options) const

Get all available per-source row group indices from the parquet files.

Parameters
options    Parquet reader options
Returns
Vector of row group indices, one inner vector per source

◆ filter_row_groups_with_byte_range()

std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::filter_row_groups_with_byte_range ( cudf::host_span< std::vector< size_type > const >  row_group_indices,
parquet_reader_options const &  options 
) const

Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read).

Filters the row groups such that only the row groups that start within the byte range are selected. Note that the last selected row group may end beyond the byte range.
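The selection rule can be sketched for a single source as follows. This is a standalone illustration rather than the library's implementation: row_group_extent and select_by_byte_range are hypothetical names, and how the reader determines a row group's start offset is not specified on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the byte extent of a row group within its file.
struct row_group_extent {
  std::int64_t start;  // byte offset where the row group begins
  std::int64_t size;   // total bytes occupied by the row group
};

// Keep the candidates whose start offset lies in
// [bytes_to_skip, bytes_to_skip + bytes_to_read).
std::vector<std::int32_t> select_by_byte_range(
  std::vector<row_group_extent> const& extents,
  std::vector<std::int32_t> const& candidates,
  std::int64_t bytes_to_skip,
  std::int64_t bytes_to_read)
{
  std::vector<std::int32_t> kept;
  for (auto rg : candidates) {
    auto const start = extents.at(static_cast<std::size_t>(rg)).start;
    if (start >= bytes_to_skip && start < bytes_to_skip + bytes_to_read) {
      kept.push_back(rg);
    }
  }
  return kept;
}
```

For row groups starting at bytes 4, 104 and 204, each 100 bytes long, the range [50, 150) keeps only the second one. That row group ends at byte 204, past the end of the range, which is the situation the note above describes. The multi-file API would apply the same rule to each source's inner vector.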

Parameters
row_group_indices    Input row group indices, one per source
options              Parquet reader options
Returns
Filtered per-source row group indices (one inner vector per source)

◆ filter_row_groups_with_stats()

std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::filter_row_groups_with_stats ( cudf::host_span< std::vector< size_type > const >  row_group_indices,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream 
) const

Filter the input row groups using column chunk statistics.

Parameters
row_group_indices    Input row group indices, one per source
options              Parquet reader options
stream               CUDA stream used for device memory operations and kernel launches
Returns
Filtered row group indices, one per source

◆ page_index_byte_ranges()

std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_multifile::page_index_byte_ranges ( ) const

Get byte ranges of the page index for all sources.

Returns
Vector of page index byte ranges, one per source

◆ parquet_metadatas()

std::vector<FileMetaData> cudf::io::parquet::experimental::hybrid_scan_multifile::parquet_metadatas ( ) const

Get the Parquet metadata for all sources.

Returns
Vector of parquet metadata, one per source

◆ reset_column_selection()

void cudf::io::parquet::experimental::hybrid_scan_multifile::reset_column_selection ( ) const

Resets the current column selection.

Resets the current column selection state, forcing column re-selection in subsequent filter, byte range, setup chunking, and materialization APIs. This is useful if the filter expression has been cascaded (AND-ed) to include new columns.

◆ secondary_filters_byte_ranges()

std::pair<std::vector<byte_range_info>, std::vector<byte_range_info> > cudf::io::parquet::experimental::hybrid_scan_multifile::secondary_filters_byte_ranges ( cudf::host_span< std::vector< size_type > const >  row_group_indices,
parquet_reader_options const &  options 
) const

Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.

Note
Device buffers for bloom filter byte ranges must be allocated using a 32-byte-aligned memory resource.
Parameters
row_group_indices    Input row group indices, one per source
options              Parquet reader options
Returns
Pair of byte range vectors, for the bloom filters and the dictionary pages, respectively, of the column chunks subject to the filter predicate

◆ setup_page_indexes()

void cudf::io::parquet::experimental::hybrid_scan_multifile::setup_page_indexes ( cudf::host_span< cudf::host_span< uint8_t const > const >  page_index_bytes) const

Set up the per-source page index within each Parquet file metadata.

Parameters
page_index_bytes    Host span of Parquet page index buffer bytes, one per source

◆ total_rows_in_row_groups()

size_type cudf::io::parquet::experimental::hybrid_scan_multifile::total_rows_in_row_groups ( cudf::host_span< std::vector< size_type > const >  row_group_indices) const

Get the total number of top-level rows in the per-source row groups.

Parameters
row_group_indices    Input per-source row group indices (one inner vector per source)
Returns
Total number of top-level rows across all sources
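Because the result is a size_type, which in cudf is a 32-bit signed integer, the combined row count of many sources has to fit in that range. The sketch below shows one way a caller could compute the same total defensively from per-row-group counts. It is standalone: total_rows and its rows_per_row_group argument are hypothetical, and whether the library itself performs such a range check is not stated on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

using size_type = std::int32_t;  // mirrors cudf::size_type

// Sum the rows of the selected row groups over all sources.
// rows_per_row_group[s][g] is the row count of row group g in source s
// (hypothetical input; the real reader derives this from file metadata).
size_type total_rows(
  std::vector<std::vector<std::int64_t>> const& rows_per_row_group,
  std::vector<std::vector<size_type>> const& row_group_indices)
{
  if (rows_per_row_group.size() != row_group_indices.size()) {
    throw std::invalid_argument("expected one entry per source");
  }
  std::int64_t total = 0;  // accumulate in 64 bits to detect overflow
  for (std::size_t s = 0; s < row_group_indices.size(); ++s) {
    for (auto rg : row_group_indices[s]) {
      total += rows_per_row_group[s].at(static_cast<std::size_t>(rg));
    }
  }
  if (total > std::numeric_limits<size_type>::max()) {
    throw std::overflow_error("total row count exceeds size_type range");
  }
  return static_cast<size_type>(total);
}
```

Accumulating in 64 bits keeps the intermediate sum exact, so the range check happens once at the end instead of after every addition.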

The documentation for this class was generated from the following file: