cudf::io::parquet::experimental::hybrid_scan_reader Class Reference

The experimental parquet reader class to optimally read parquet files subject to highly selective filters, called a Hybrid Scan operation. More...

#include <hybrid_scan.hpp>

Public Member Functions

 hybrid_scan_reader (cudf::host_span< uint8_t const > footer_bytes, parquet_reader_options const &options)
 Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters. More...
 
 hybrid_scan_reader (FileMetaData const &parquet_metadata, parquet_reader_options const &options)
 Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters. More...
 
 ~hybrid_scan_reader ()
 Destructor for the experimental parquet reader class.
 
FileMetaData parquet_metadata () const
 Get the Parquet file footer metadata. More...
 
byte_range_info page_index_byte_range () const
 Get the byte range of the page index in the Parquet file. More...
 
void setup_page_index (cudf::host_span< uint8_t const > page_index_bytes) const
 Setup the page index within the Parquet file metadata (FileMetaData) More...
 
std::vector< size_type > all_row_groups (parquet_reader_options const &options) const
 Get all available row groups from the parquet file. More...
 
size_type total_rows_in_row_groups (cudf::host_span< size_type const > row_group_indices) const
 Get the total number of top-level rows in the row groups. More...
 
void reset_column_selection () const
 Resets the current column selection. More...
 
std::vector< size_type > filter_row_groups_with_byte_range (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
 Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read) More...
 
std::vector< size_type > filter_row_groups_with_stats (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
 Filter the input row groups using column chunk statistics. More...
 
std::pair< std::vector< byte_range_info >, std::vector< byte_range_info > > secondary_filters_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
 Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning. More...
 
std::vector< size_type > filter_row_groups_with_dictionary_pages (cudf::host_span< cudf::device_span< uint8_t const > const > dictionary_page_data, cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
 Filter the row groups using column chunk dictionary pages. More...
 
std::vector< size_type > filter_row_groups_with_bloom_filters (cudf::host_span< cudf::device_span< uint8_t const > const > bloom_filter_data, cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const
 Filter the row groups using column chunk bloom filters. More...
 
std::unique_ptr< cudf::column > build_all_true_row_mask (cudf::host_span< size_type const > row_group_indices, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Builds a boolean (survival) column of size equal to the total number of rows in the row groups containing all true values. More...
 
std::unique_ptr< cudf::column > build_row_mask_with_page_index_stats (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Builds a boolean column indicating surviving rows using page-level statistics in the page index. More...
 
std::vector< byte_range_info > filter_column_chunks_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
 Get byte ranges of column chunks of filter columns. More...
 
table_with_metadata materialize_filter_columns (cudf::host_span< size_type const > row_group_indices, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, cudf::mutable_column_view &row_mask, use_data_page_mask mask_data_pages, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Materializes filter columns and updates the input row mask to only the rows that exist in the output table. More...
 
std::vector< byte_range_info > payload_column_chunks_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
 Get byte ranges of column chunks of payload columns. More...
 
table_with_metadata materialize_payload_columns (cudf::host_span< size_type const > row_group_indices, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Materializes payload columns and applies the row mask to the output table. More...
 
std::vector< byte_range_info > all_column_chunks_byte_ranges (cudf::host_span< size_type const > row_group_indices, parquet_reader_options const &options) const
 Get byte ranges of column chunks of all (or selected) columns. More...
 
table_with_metadata materialize_all_columns (cudf::host_span< size_type const > row_group_indices, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Materializes all (or selected) columns and returns the final output table. More...
 
void setup_chunking_for_filter_columns (std::size_t chunk_read_limit, std::size_t pass_read_limit, cudf::host_span< size_type const > row_group_indices, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Setup chunking information for filter columns and preprocess the input data pages. More...
 
table_with_metadata materialize_filter_columns_chunk (cudf::mutable_column_view &row_mask) const
 Materializes a chunk of filter columns and updates the corresponding range of input row mask to only the rows that exist in the output table. More...
 
void setup_chunking_for_payload_columns (std::size_t chunk_read_limit, std::size_t pass_read_limit, cudf::host_span< size_type const > row_group_indices, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Setup chunking information for payload columns and preprocess the input data pages. More...
 
table_with_metadata materialize_payload_columns_chunk (cudf::column_view const &row_mask) const
 Materializes a chunk of payload columns and applies the corresponding range of input row mask to the output table chunk. More...
 
void setup_chunking_for_all_columns (std::size_t chunk_read_limit, std::size_t pass_read_limit, cudf::host_span< size_type const > row_group_indices, cudf::host_span< cudf::device_span< uint8_t const > const > column_chunk_data, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const
 Setup chunking information for all (or selected) columns and preprocess the input data pages. More...
 
table_with_metadata materialize_all_columns_chunk () const
 Materializes a chunk of all (or selected) columns and returns the output table chunk. More...
 
bool has_next_table_chunk () const
 Check if there is any parquet data left to read for the current setup. More...
 

Detailed Description

The experimental parquet reader class to optimally read parquet files subject to highly selective filters, called a Hybrid Scan operation.

This class is designed to best exploit reductive optimization techniques to speed up reading Parquet files subject to highly selective filters. The parquet file contents are read in two passes. In the first pass, only the filter columns (i.e. columns that appear in the filter expression) are read allowing pruning of row groups and filter column data pages using the filter expression. In the second pass, only the payload columns (i.e. columns that do not appear in the filter expression) are optimally read by applying the surviving row mask from the first pass to prune payload column data pages.

The following code snippets demonstrate how to use the experimental parquet reader.

Start with an instance of the experimental reader with a span of parquet file footer bytes and parquet reader options.

// Example filter expression `A < 100`
auto filter_expression = operation(ast_operator::LESS,
                                   column_name_reference{"A"},
                                   literal{100});
using namespace cudf::io;
// Input datasource
auto const datasource_ptr = datasource::create(parquet_filepath);
auto datasource = std::ref(*datasource_ptr);
// Parquet reader options
auto options = parquet_reader_options::builder().filter(filter_expression).build();
// Fetch parquet file footer bytes from the file
auto const footer_buffer = parquet::fetch_footer_to_host(datasource);
// Create the reader
auto reader =
std::make_unique<parquet::experimental::hybrid_scan_reader>(*footer_buffer, options);

Metadata handling (OPTIONAL): Get a materialized parquet file footer metadata struct (FileMetaData) from the reader to get insights into the parquet data as needed. Optionally, set up the page index to materialize page level stats used for data page pruning.

// Get Parquet file metadata from the reader
auto metadata = reader->parquet_metadata();
// Example metadata use: Calculate the number of rows in the file
auto nrows = std::accumulate(metadata.row_groups.begin(),
                             metadata.row_groups.end(),
                             size_type{0},
                             [](auto sum, auto const& rg) { return sum + rg.num_rows; });
// Get the page index byte range from the reader
auto page_index_byte_range = reader->page_index_byte_range();
// Fetch the page index bytes from the parquet file
auto const page_index_buffer =
  parquet::fetch_page_index_to_host(datasource, page_index_byte_range);
// Set up the page index
reader->setup_page_index(*page_index_buffer);
// A new `FileMetaData` struct with populated page index structs may be obtained
// using `parquet_metadata()` at this point. Page index may be set up at any time.
auto metadata_with_page_index = reader->parquet_metadata();

Row group pruning (OPTIONAL): Start with either a custom list of row group indices or all row group indices in the parquet file, and optionally filter it using a byte range and/or the filter expression using column chunk statistics, dictionaries and bloom filters. Byte ranges for column chunk dictionary pages and bloom filters within the parquet file may be obtained via the secondary_filters_byte_ranges() function. These byte ranges may be read into device buffers, and their device spans may be passed to the row group filtration functions.

// Start with a list of all parquet row group indices from the file footer
auto all_row_group_indices = reader->all_row_groups(options);
// Span to track the indices of row groups currently at hand
auto current_row_group_indices = cudf::host_span<size_type>(all_row_group_indices);
// Optional: Prune row group indices to the ones that start within the byte range
auto byte_range_filtered_row_group_indices =
reader->filter_row_groups_with_byte_range(current_row_group_indices, options);
// Update current row group indices to byte range filtered row group indices
current_row_group_indices = byte_range_filtered_row_group_indices;
// Optional: Prune row group indices subject to filter expression using row group statistics
auto stats_filtered_row_group_indices =
reader->filter_row_groups_with_stats(current_row_group_indices, options, stream);
// Update current row group indices to now track the stats-filtered row group indices
current_row_group_indices = stats_filtered_row_group_indices;
// Get byte ranges of bloom filters and dictionaries for the current row groups
auto [bloom_filter_byte_ranges, dict_page_byte_ranges] =
reader->secondary_filters_byte_ranges(current_row_group_indices, options);
// Optional: Prune row groups if we have valid dictionary pages
auto dict_filtered_row_group_indices = std::vector<size_type>{};
if (dict_page_byte_ranges.size()) {
  // Fetch dictionary page byte ranges into device buffers and create spans
  auto [dict_page_buffers, dict_page_data, dict_page_tasks] =
    parquet::fetch_byte_ranges_to_device_async(datasource, dict_page_byte_ranges, stream, mr);
  dict_page_tasks.get();
  // Prune row groups using dictionaries
  dict_filtered_row_group_indices = reader->filter_row_groups_with_dictionary_pages(
    dict_page_data, current_row_group_indices, options, stream);
  // Update current row group indices to dictionary page filtered row group indices
  current_row_group_indices = dict_filtered_row_group_indices;
}
// Optional: Prune row groups if we have valid bloom filters
auto bloom_filtered_row_group_indices = std::vector<size_type>{};
if (bloom_filter_byte_ranges.size()) {
  // Fetch bloom filter byte ranges into device buffers and create spans
  auto [bloom_filter_buffers, bloom_filter_data, bloom_filter_tasks] =
    parquet::fetch_byte_ranges_to_device_async(datasource, bloom_filter_byte_ranges, stream, mr);
  bloom_filter_tasks.get();
  // Prune row groups using bloom filters
  bloom_filtered_row_group_indices = reader->filter_row_groups_with_bloom_filters(
    bloom_filter_data, current_row_group_indices, options, stream);
  // Update current row group indices to bloom filtered row group indices
  current_row_group_indices = bloom_filtered_row_group_indices;
}

Build an initial row mask: Once the row groups are filtered, the next step is to build an initial BOOL8 row mask column indicating which rows in the current span of row groups survive in the final table. This row mask column may contain all true values built using the build_all_true_row_mask() function or it may contain a true value for only the rows that survive the page-level statistics from the page index subject to the same filter as row groups (needs page index to be set up using the setup_page_index() function). The size of this row mask column must be equal to the total number of rows in the current span of row groups.

// If not already done, get the page index byte range
auto page_index_byte_range = reader->page_index_byte_range();
// If not already done, fetch the page index bytes from the parquet file
auto const page_index_buffer =
  parquet::fetch_page_index_to_host(datasource, page_index_byte_range);
// If not already done, set up the page index now
reader->setup_page_index(*page_index_buffer);
// Build a row mask column containing all `true` values
auto row_mask = reader->build_all_true_row_mask(current_row_group_indices, stream, mr);
// Alternatively, build a row mask column indicating only the rows that survive the page-level
// statistics in the page index
auto row_mask = reader->build_row_mask_with_page_index_stats(
current_row_group_indices, options, stream, mr);

Materialize filter columns: Once we are done with pruning row groups and constructing the row mask, the next step is to materialize filter columns into a table (first reader pass). This is done using the materialize_filter_columns() function. This function requires a span of device spans of column chunk data for the current list of row groups, and a mutable view of the current row mask. The function optionally builds a mask for the current data pages using the input row mask to skip decompression and decoding of the pruned pages based on the mask_data_pages argument. The filter columns are then read into a table and filtered based on the filter expression and the row mask is updated to only indicate the rows that survive in the read table. The final table is returned. The byte ranges for the required column chunk data may be obtained using the filter_column_chunks_byte_ranges() function and read into device buffers with corresponding device spans.

// Get column chunk byte ranges from the reader
auto const filter_col_byte_ranges =
reader->filter_column_chunks_byte_ranges(current_row_group_indices, options);
// Fetch column chunk data into device buffers and create spans
auto [filter_col_buffers, filter_col_data, filter_col_tasks] =
parquet::fetch_byte_ranges_to_device_async(datasource, filter_col_byte_ranges, stream, mr);
filter_col_tasks.get();
// Materialize the table with only the filter columns
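// Bind the mutable view to a named variable so that it can be updated in place
auto row_mask_view = row_mask->mutable_view();
auto [filter_table, filter_metadata] =
  reader->materialize_filter_columns(current_row_group_indices,
                                     filter_col_data,
                                     row_mask_view,
                                     use_data_page_mask::YES,
                                     options,
                                     stream,
                                     mr);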

Materialize payload columns: Once the filter columns are materialized, the final step is to materialize the payload columns into another table (second reader pass). This is done using the materialize_payload_columns() function which is identical to the materialize_filter_columns() in terms of functionality except that it accepts an immutable view of the row mask and uses it to filter the read output table before returning it. The byte ranges for the required column chunk data may be obtained using the payload_column_chunks_byte_ranges() function and read into device buffers with corresponding device spans.

// Get column chunk byte ranges from the reader
auto const payload_col_byte_ranges =
reader->payload_column_chunks_byte_ranges(current_row_group_indices, options);
// Fetch column chunk data into device buffers and create spans
auto [payload_col_buffers, payload_col_data, payload_col_tasks] =
parquet::fetch_byte_ranges_to_device_async(datasource, payload_col_byte_ranges, stream, mr);
payload_col_tasks.get();
// Materialize the table with only the payload columns
auto [payload_table, payload_metadata] =
  reader->materialize_payload_columns(current_row_group_indices,
                                      payload_col_data,
                                      row_mask->view(),
                                      use_data_page_mask::YES,
                                      options,
                                      stream,
                                      mr);

Once both reader passes are complete, the filter and payload column tables may be trivially combined by releasing the columns from both tables and moving them into a new cudf table.
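The combination step described above can be sketched without a GPU. The snippet below is a minimal illustration, not libcudf code: `column`, `column_ptr` and `combine_columns` are hypothetical stand-ins, and the sketch assumes each table has already given up ownership of its columns as a vector of unique pointers (in libcudf this is what `cudf::table::release()` returns).

```cpp
#include <cassert>
#include <iterator>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device column; only ownership semantics matter here.
struct column {
  int id;
};
using column_ptr = std::unique_ptr<column>;

// Move the released filter and payload columns into one vector, which could then
// be used to construct a single table. No column data is copied.
std::vector<column_ptr> combine_columns(std::vector<column_ptr> filter_columns,
                                        std::vector<column_ptr> payload_columns)
{
  std::vector<column_ptr> combined = std::move(filter_columns);
  combined.reserve(combined.size() + payload_columns.size());
  combined.insert(combined.end(),
                  std::make_move_iterator(payload_columns.begin()),
                  std::make_move_iterator(payload_columns.end()));
  // Every released column is expected to be non-null
  for (auto const& col : combined) { assert(col != nullptr); }
  return combined;
}
```

In this sketch the filter columns come first, followed by the payload columns, so the combined column order may differ from the column order in the file schema.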

Note
The performance advantage of this reader is most prominent when the filter expression is highly selective, i.e. when the data in filter columns are at least partially ordered and the number of rows that survive the filter is small compared to the total number of rows in the parquet file. Otherwise, the performance is identical to the cudf::io::read_parquet() function.

Definition at line 278 of file hybrid_scan.hpp.

Constructor & Destructor Documentation

◆ hybrid_scan_reader() [1/2]

cudf::io::parquet::experimental::hybrid_scan_reader::hybrid_scan_reader ( cudf::host_span< uint8_t const >  footer_bytes,
parquet_reader_options const &  options 
)
explicit

Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters.

Parameters
footer_bytes  Host span of parquet file footer bytes
options  Parquet reader options

◆ hybrid_scan_reader() [2/2]

cudf::io::parquet::experimental::hybrid_scan_reader::hybrid_scan_reader ( FileMetaData const &  parquet_metadata,
parquet_reader_options const &  options 
)
explicit

Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters.

Parameters
parquet_metadata  Pre-populated Parquet file metadata
options  Parquet reader options

Member Function Documentation

◆ all_column_chunks_byte_ranges()

std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_reader::all_column_chunks_byte_ranges ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options 
) const

Get byte ranges of column chunks of all (or selected) columns.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
Returns
Vector of byte ranges to column chunks of all (or selected) columns

◆ all_row_groups()

std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::all_row_groups ( parquet_reader_options const &  options) const

Get all available row groups from the parquet file.

Parameters
options  Parquet reader options
Returns
Vector of row group indices

◆ build_all_true_row_mask()

std::unique_ptr<cudf::column> cudf::io::parquet::experimental::hybrid_scan_reader::build_all_true_row_mask ( cudf::host_span< size_type const >  row_group_indices,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Builds a boolean (survival) column of size equal to the total number of rows in the row groups containing all true values.

Parameters
row_group_indices  Input row group indices
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
An all-true boolean (survival) column of size equal to the total number of rows in the row groups

◆ build_row_mask_with_page_index_stats()

std::unique_ptr<cudf::column> cudf::io::parquet::experimental::hybrid_scan_reader::build_row_mask_with_page_index_stats ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Builds a boolean column indicating surviving rows using page-level statistics in the page index.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
A boolean column indicating which filter column rows survive the statistics in the page index
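As a rough host-side illustration of page-level pruning, the sketch below builds a row mask for the predicate `column < literal` from per-page minimum values. It is a simplified model under stated assumptions, not the libcudf implementation: `page_stats` and `build_row_mask_less_than` are hypothetical names, a single int32 filter column is assumed, and the mask lives in a `std::vector<bool>` rather than a device BOOL8 column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics, as they might be read from a page index.
struct page_stats {
  int32_t min_value;
  int32_t max_value;
  int32_t num_rows;
};

// Build a row mask for `column < literal`: every row of a page is marked true
// when the page may contain a match, and false when the page can be skipped.
std::vector<bool> build_row_mask_less_than(std::vector<page_stats> const& pages,
                                           int32_t literal)
{
  std::vector<bool> mask;
  for (auto const& page : pages) {
    assert(page.num_rows >= 0);
    // A page can be pruned only if even its smallest value fails the predicate
    bool const may_match = page.min_value < literal;
    mask.insert(mask.end(), static_cast<std::size_t>(page.num_rows), may_match);
  }
  return mask;
}
```

The resulting mask has one entry per row across all pages, which matches the requirement that the row mask size equals the total number of rows in the current row groups. A true entry only means the row may survive; the exact answer is known after the filter columns are materialized.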

◆ filter_column_chunks_byte_ranges()

std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_reader::filter_column_chunks_byte_ranges ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options 
) const

Get byte ranges of column chunks of filter columns.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
Returns
Vector of byte ranges to column chunks of filter columns

◆ filter_row_groups_with_bloom_filters()

std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_bloom_filters ( cudf::host_span< cudf::device_span< uint8_t const > const >  bloom_filter_data,
cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream 
) const

Filter the row groups using column chunk bloom filters.

Note
The bloom_filter_data device spans must point to 32-byte aligned addresses
Parameters
bloom_filter_data  Device spans of bloom filter data of column chunks with an equality predicate
row_group_indices  Input row group indices
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
Returns
Filtered row group indices
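The 32-byte alignment requirement noted above can be checked on the address value alone. The helper below is a small sketch and not part of libcudf: `is_aligned` is a hypothetical name, and it only inspects the numeric address, so it could be applied to the `data()` pointer of a span before the span is passed to the reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `ptr` is aligned to `alignment` bytes.
// `alignment` must be a non-zero power of two; it defaults to 32.
inline bool is_aligned(void const* ptr, std::size_t alignment = 32)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  // For a power-of-two alignment, the low bits of an aligned address are zero
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```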

◆ filter_row_groups_with_byte_range()

std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_byte_range ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options 
) const

Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read)

Filters the row groups such that only the row groups that start within the byte range are selected. Note that the last selected row group may end beyond the byte range.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
Returns
Filtered row group indices
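The selection rule described above (keep a row group if it starts inside the byte range, even if it ends beyond it) can be sketched on the host as follows. This is an illustrative model only: `row_group_extent` and `select_row_groups_in_byte_range` are hypothetical names, and the sketch assumes the start offset and size of each row group are already known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the row group metadata this sketch needs.
struct row_group_extent {
  std::size_t start_offset;  // byte offset at which the row group starts in the file
  std::size_t size_bytes;    // total size of the row group in bytes
};

// Keep the input row groups whose start offset lies in
// [bytes_to_skip, bytes_to_skip + bytes_to_read).
std::vector<int32_t> select_row_groups_in_byte_range(
  std::vector<row_group_extent> const& extents,
  std::vector<int32_t> const& row_group_indices,
  std::size_t bytes_to_skip,
  std::size_t bytes_to_read)
{
  std::vector<int32_t> selected;
  for (auto const idx : row_group_indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < extents.size());
    auto const start = extents[static_cast<std::size_t>(idx)].start_offset;
    // Only the start offset is tested, so the last selected row group may
    // extend past the end of the byte range.
    if (start >= bytes_to_skip && start - bytes_to_skip < bytes_to_read) {
      selected.push_back(idx);
    }
  }
  return selected;
}
```

For example, with row groups starting at offsets 4, 104 and 204 (100 bytes each) and a byte range of [100, 200), only the second row group is selected, although it ends at offset 204.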

◆ filter_row_groups_with_dictionary_pages()

std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_dictionary_pages ( cudf::host_span< cudf::device_span< uint8_t const > const >  dictionary_page_data,
cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream 
) const

Filter the row groups using column chunk dictionary pages.

Parameters
dictionary_page_data  Device spans of dictionary page data of column chunks with an (in)equality predicate
row_group_indices  Input row group indices
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
Returns
Filtered row group indices
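The idea behind dictionary-based pruning can be modeled on the host for an equality predicate `column == literal`. The sketch below is not the libcudf implementation: `prune_with_dictionaries` is a hypothetical name, the dictionaries are assumed to be already decoded into int32 values, and each column chunk is assumed to be fully dictionary-encoded so that its dictionary lists every distinct value in the chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keep the row groups whose dictionary contains `literal`.
// `dictionaries[i]` is the decoded dictionary of the filter column chunk in
// row group `row_group_indices[i]`.
std::vector<int32_t> prune_with_dictionaries(
  std::vector<std::vector<int32_t>> const& dictionaries,
  std::vector<int32_t> const& row_group_indices,
  int32_t literal)
{
  assert(dictionaries.size() == row_group_indices.size());
  std::vector<int32_t> surviving;
  for (std::size_t i = 0; i < row_group_indices.size(); ++i) {
    auto const& dict = dictionaries[i];
    // If the literal is absent from the dictionary, no row in the chunk can match
    if (std::find(dict.begin(), dict.end(), literal) != dict.end()) {
      surviving.push_back(row_group_indices[i]);
    }
  }
  return surviving;
}
```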

◆ filter_row_groups_with_stats()

std::vector<size_type> cudf::io::parquet::experimental::hybrid_scan_reader::filter_row_groups_with_stats ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream 
) const

Filter the input row groups using column chunk statistics.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
Returns
Filtered row group indices
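Statistics-based pruning can be illustrated with the example filter `A < 100` used earlier on this page. The following is a simplified host-side model, not the libcudf implementation: `column_chunk_stats` and `prune_with_less_than` are hypothetical names, and a single int32 filter column with valid min/max statistics is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row-group statistics for a single int32 filter column.
struct column_chunk_stats {
  int32_t min_value;
  int32_t max_value;
};

// Keep the row groups that may contain a row satisfying `column < literal`.
// A row group can be skipped only when its minimum is already >= literal.
std::vector<int32_t> prune_with_less_than(std::vector<column_chunk_stats> const& stats,
                                          std::vector<int32_t> const& row_group_indices,
                                          int32_t literal)
{
  std::vector<int32_t> surviving;
  for (auto const idx : row_group_indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < stats.size());
    if (stats[static_cast<std::size_t>(idx)].min_value < literal) {
      surviving.push_back(idx);
    }
  }
  return surviving;
}
```

A surviving row group is only a candidate: it may still contain no matching rows, which is why the later page-level and row-level filtering steps remain useful.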

◆ has_next_table_chunk()

bool cudf::io::parquet::experimental::hybrid_scan_reader::has_next_table_chunk ( ) const

Check if there is any parquet data left to read for the current setup.

Returns
Boolean indicating if there is any data left to read
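The chunked functions are presumably meant to be driven by a loop that keeps materializing chunks while `has_next_table_chunk()` returns true. The sketch below shows that loop shape against a mock object. Everything in it is an assumption made for illustration: `mock_chunked_reader` and `read_all_chunks` are hypothetical, the mock only borrows two method names from this class, and its limit counts rows, whereas `chunk_read_limit` in this class is a limit in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical mock that hands out a vector of rows in chunks.
class mock_chunked_reader {
 public:
  mock_chunked_reader(std::vector<int32_t> rows, std::size_t chunk_row_limit)
    : rows_(std::move(rows)), chunk_row_limit_(chunk_row_limit)
  {
  }

  bool has_next_table_chunk() const { return pos_ < rows_.size(); }

  std::vector<int32_t> materialize_all_columns_chunk()
  {
    assert(has_next_table_chunk());
    auto const remaining = rows_.size() - pos_;
    // A limit of 0 means "no limit": return everything that is left
    auto const count =
      (chunk_row_limit_ == 0) ? remaining : std::min(remaining, chunk_row_limit_);
    auto const first = rows_.begin() + static_cast<std::ptrdiff_t>(pos_);
    std::vector<int32_t> chunk(first, first + static_cast<std::ptrdiff_t>(count));
    pos_ += count;
    return chunk;
  }

 private:
  std::vector<int32_t> rows_;
  std::size_t chunk_row_limit_;
  std::size_t pos_{0};
};

// Drain the reader: materialize chunks until no data is left for the current setup.
template <typename Reader>
std::vector<std::vector<int32_t>> read_all_chunks(Reader& reader)
{
  std::vector<std::vector<int32_t>> chunks;
  while (reader.has_next_table_chunk()) {
    chunks.push_back(reader.materialize_all_columns_chunk());
  }
  return chunks;
}
```

With five rows and a limit of two rows per chunk, the loop yields three chunks of sizes 2, 2 and 1; with a limit of 0 it yields a single chunk holding all five rows.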

◆ materialize_all_columns()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_all_columns ( cudf::host_span< size_type const >  row_group_indices,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Materializes all (or selected) columns and returns the final output table.

Parameters
row_group_indices  Input row group indices
column_chunk_data  Device spans of column chunk data of all columns
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table
Returns
Table of all materialized columns and metadata

◆ materialize_all_columns_chunk()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_all_columns_chunk ( ) const

Materializes a chunk of all (or selected) columns and returns the output table chunk.

Returns
Table chunk of all (or selected) materialized columns and metadata

◆ materialize_filter_columns()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_filter_columns ( cudf::host_span< size_type const >  row_group_indices,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
cudf::mutable_column_view row_mask,
use_data_page_mask  mask_data_pages,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Materializes filter columns and updates the input row mask to only the rows that exist in the output table.

Parameters
row_group_indices  Input row group indices
column_chunk_data  Device spans of column chunk data of filter columns
[in,out]  row_mask  Mutable boolean column indicating surviving rows from page pruning
mask_data_pages  Whether to build and use a data page mask using the row mask
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table
Returns
Table of materialized filter columns and metadata

◆ materialize_filter_columns_chunk()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_filter_columns_chunk ( cudf::mutable_column_view row_mask) const

Materializes a chunk of filter columns and updates the corresponding range of input row mask to only the rows that exist in the output table.

Parameters
[in,out]  row_mask  Mutable boolean column indicating surviving rows from page pruning
Returns
Table chunk of materialized filter columns and metadata

◆ materialize_payload_columns()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_payload_columns ( cudf::host_span< size_type const >  row_group_indices,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
cudf::column_view const &  row_mask,
use_data_page_mask  mask_data_pages,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Materializes payload columns and applies the row mask to the output table.

Parameters
row_group_indices  Input row group indices
column_chunk_data  Device spans of column chunk data of payload columns
row_mask  Boolean column indicating which rows need to be read. All rows read if empty
mask_data_pages  Whether to build and use a data page mask using the row mask
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table
Returns
Table of materialized payload columns and metadata
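Applying a row mask to a payload column amounts to keeping only the rows whose mask entry is true. The host-side sketch below models that step under stated assumptions: `apply_row_mask` is a hypothetical helper, the column and mask are plain `std::vector`s rather than device columns, and an empty mask is treated as "read all rows", following the `row_mask` parameter description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the rows of `column` whose corresponding `row_mask` entry is true.
// An empty mask selects every row.
template <typename T>
std::vector<T> apply_row_mask(std::vector<T> const& column, std::vector<bool> const& row_mask)
{
  if (row_mask.empty()) { return column; }
  assert(row_mask.size() == column.size());
  std::vector<T> surviving;
  for (std::size_t i = 0; i < column.size(); ++i) {
    if (row_mask[i]) { surviving.push_back(column[i]); }
  }
  return surviving;
}
```

Because the same mask is applied to every payload column, the surviving payload rows line up one-to-one with the rows of the filter column table.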

◆ materialize_payload_columns_chunk()

table_with_metadata cudf::io::parquet::experimental::hybrid_scan_reader::materialize_payload_columns_chunk ( cudf::column_view const &  row_mask) const

Materializes a chunk of payload columns and applies the corresponding range of input row mask to the output table chunk.

Parameters
row_mask  Boolean column indicating which rows need to be read. All rows read if empty
Returns
Table chunk of materialized payload columns and metadata

◆ page_index_byte_range()

byte_range_info cudf::io::parquet::experimental::hybrid_scan_reader::page_index_byte_range ( ) const

Get the byte range of the page index in the Parquet file.

Returns
Byte range of the page index

◆ parquet_metadata()

FileMetaData cudf::io::parquet::experimental::hybrid_scan_reader::parquet_metadata ( ) const

Get the Parquet file footer metadata.

Returns the materialized Parquet file footer metadata struct. The footer will contain the materialized page index if called after setup_page_index().

Returns
Parquet file footer metadata

◆ payload_column_chunks_byte_ranges()

std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_reader::payload_column_chunks_byte_ranges ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options 
) const

Get byte ranges of column chunks of payload columns.

Parameters
row_group_indices  Input row group indices
options  Parquet reader options
Returns
Vector of byte ranges to column chunks of payload columns
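
The returned byte ranges tell the caller which parts of the file to fetch. The sketch below shows one way to slice those ranges out of a file that is already resident in host memory. The `byte_range` struct and `fetch_byte_ranges` function are hypothetical stand-ins introduced for illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range: offset and size within the file.
struct byte_range {
  std::size_t offset;
  std::size_t size;
};

// Copies each requested byte range out of a host-resident file buffer,
// producing one buffer per column chunk in the same order as `ranges`.
std::vector<std::vector<std::uint8_t>> fetch_byte_ranges(
  std::vector<std::uint8_t> const& file_bytes, std::vector<byte_range> const& ranges)
{
  std::vector<std::vector<std::uint8_t>> buffers;
  buffers.reserve(ranges.size());

  for (auto const& r : ranges) {
    // Each range must lie entirely inside the file.
    assert(r.offset <= file_bytes.size() && r.size <= file_bytes.size() - r.offset);
    auto const first = file_bytes.begin() + static_cast<std::ptrdiff_t>(r.offset);
    buffers.emplace_back(first, first + static_cast<std::ptrdiff_t>(r.size));
  }
  return buffers;
}
```

In an actual workflow these buffers would still need to be copied to device memory, since the materialization and chunking APIs take device spans of column chunk data.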

◆ reset_column_selection()

void cudf::io::parquet::experimental::hybrid_scan_reader::reset_column_selection ( ) const

Resets the current column selection.

Resets the current column selection state, forcing column re-selection in subsequent filter, byte range, chunking setup, and materialization APIs. This is useful if the filter expression has been cascaded (and-ed) to include new columns.

◆ secondary_filters_byte_ranges()

std::pair<std::vector<byte_range_info>, std::vector<byte_range_info> > cudf::io::parquet::experimental::hybrid_scan_reader::secondary_filters_byte_ranges ( cudf::host_span< size_type const >  row_group_indices,
parquet_reader_options const &  options 
) const

Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.

Note
Device buffers for bloom filter byte ranges must be allocated using a 32-byte-aligned memory resource
Parameters
row_group_indices  Input row group indices
options  Parquet reader options
Returns
Pair of vectors of byte ranges of column chunks with bloom filters and dictionary pages subject to the filter predicate
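
The 32-byte alignment requirement in the note above can be checked with plain pointer arithmetic. The helpers below are a generic sketch: `bloom_filter_alignment`, `align_up`, and `is_aligned` are hypothetical names introduced here, not library functions, and the way an aligned memory resource is obtained is outside the scope of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Alignment required for bloom filter buffers, per the note above.
constexpr std::size_t bloom_filter_alignment = 32;

// Rounds `value` up to the next multiple of `alignment` (alignment must be non-zero).
constexpr std::size_t align_up(std::size_t value, std::size_t alignment)
{
  return (value + alignment - 1) / alignment * alignment;
}

// Returns true if `ptr` is a multiple of `alignment` bytes.
inline bool is_aligned(void const* ptr, std::size_t alignment)
{
  assert(alignment != 0);
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

A caller could use `is_aligned` as a debug check on each bloom filter buffer before passing it to the row group filtering APIs.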

◆ setup_chunking_for_all_columns()

void cudf::io::parquet::experimental::hybrid_scan_reader::setup_chunking_for_all_columns ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
cudf::host_span< size_type const >  row_group_indices,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Set up chunking information for all (or selected) columns and preprocess the input data pages.

Parameters
chunk_read_limit  Limit on total number of bytes to be returned per table chunk. 0 if there is no limit
pass_read_limit  Limit on the memory used for reading and decompressing data. 0 if there is no limit
row_group_indices  Input row group indices
column_chunk_data  Device spans of column chunk data of all columns
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table chunks
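
To illustrate what a per-chunk byte limit with "0 means no limit" semantics implies, the sketch below splits a sequence of rows into chunks by accumulated byte size. It is a simplified host-side model under stated assumptions: `split_rows_by_byte_limit` is a hypothetical function, row sizes are assumed to be known up front, and a row larger than the limit is placed in a chunk of its own. The reader's actual chunking logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of rows in each chunk such that the bytes in a chunk
// do not exceed `chunk_read_limit`, except when a single row is larger than
// the limit. A limit of 0 means no limit: all rows form one chunk.
std::vector<std::size_t> split_rows_by_byte_limit(std::vector<std::size_t> const& row_bytes,
                                                  std::size_t chunk_read_limit)
{
  std::vector<std::size_t> chunk_row_counts;
  if (row_bytes.empty()) { return chunk_row_counts; }

  if (chunk_read_limit == 0) {
    chunk_row_counts.push_back(row_bytes.size());
    return chunk_row_counts;
  }

  std::size_t rows  = 0;  // rows in the chunk being built
  std::size_t bytes = 0;  // bytes in the chunk being built
  for (auto const b : row_bytes) {
    // Close the current chunk if adding this row would exceed the limit.
    if (rows > 0 && bytes + b > chunk_read_limit) {
      chunk_row_counts.push_back(rows);
      rows  = 0;
      bytes = 0;
    }
    ++rows;
    bytes += b;
  }
  assert(rows > 0);
  chunk_row_counts.push_back(rows);
  return chunk_row_counts;
}
```

For five 4-byte rows, a limit of 8 bytes yields chunks of 2, 2, and 1 rows, while a limit of 0 yields a single chunk of 5 rows.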

◆ setup_chunking_for_filter_columns()

void cudf::io::parquet::experimental::hybrid_scan_reader::setup_chunking_for_filter_columns ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
cudf::host_span< size_type const >  row_group_indices,
cudf::column_view const &  row_mask,
use_data_page_mask  mask_data_pages,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Set up chunking information for filter columns and preprocess the input data pages.

Parameters
chunk_read_limit  Limit on total number of bytes to be returned per table chunk. 0 if there is no limit
pass_read_limit  Limit on the memory used for reading and decompressing data. 0 if there is no limit
row_group_indices  Input row group indices
row_mask  Boolean column indicating which rows need to be read. All rows are read if empty
mask_data_pages  Whether to build and use a data page mask using the row mask
column_chunk_data  Device spans of column chunk data of filter columns
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table chunks

◆ setup_chunking_for_payload_columns()

void cudf::io::parquet::experimental::hybrid_scan_reader::setup_chunking_for_payload_columns ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
cudf::host_span< size_type const >  row_group_indices,
cudf::column_view const &  row_mask,
use_data_page_mask  mask_data_pages,
cudf::host_span< cudf::device_span< uint8_t const > const >  column_chunk_data,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream,
rmm::device_async_resource_ref  mr 
) const

Set up chunking information for payload columns and preprocess the input data pages.

Parameters
chunk_read_limit  Limit on total number of bytes to be returned per table chunk. 0 if there is no limit
pass_read_limit  Limit on the memory used for reading and decompressing data. 0 if there is no limit
row_group_indices  Input row group indices
row_mask  Boolean column indicating which rows need to be read. All rows are read if empty
mask_data_pages  Whether to build and use a data page mask using the row mask
column_chunk_data  Device spans of column chunk data of payload columns
options  Parquet reader options
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the device memory for the output table chunks

◆ setup_page_index()

void cudf::io::parquet::experimental::hybrid_scan_reader::setup_page_index ( cudf::host_span< uint8_t const >  page_index_bytes) const

Set up the page index within the Parquet file metadata (FileMetaData)

Materializes the ColumnIndex and OffsetIndex structs (collectively called the page index) within the Parquet file metadata struct (returned by parquet_metadata()). The statistics contained in the page index can be used to prune data pages before decoding.

Parameters
page_index_bytes  Host span of Parquet page index buffer bytes
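
To make the pruning idea concrete, the sketch below uses per-page min/max statistics to decide which pages could contain rows satisfying a predicate of the form `value > threshold`. It is an illustrative model only: the `page_stats` struct and `pages_to_decode` function are hypothetical, and the predicate shape and integer value type are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics, as a page index might provide them.
struct page_stats {
  std::int64_t min_value;
  std::int64_t max_value;
};

// Returns the indices of pages that may contain a value strictly greater
// than `threshold`. Pages whose maximum is not above the threshold cannot
// satisfy the predicate and can be skipped before decoding.
std::vector<std::size_t> pages_to_decode(std::vector<page_stats> const& pages,
                                         std::int64_t threshold)
{
  std::vector<std::size_t> survivors;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    assert(pages[i].min_value <= pages[i].max_value);
    if (pages[i].max_value > threshold) { survivors.push_back(i); }
  }
  return survivors;
}
```

Statistics can only rule pages out. A surviving page may still contain no matching rows, so the filter must be evaluated on the decoded data as well.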

◆ total_rows_in_row_groups()

size_type cudf::io::parquet::experimental::hybrid_scan_reader::total_rows_in_row_groups ( cudf::host_span< size_type const >  row_group_indices) const

Get the total number of top-level rows in the row groups.

Parameters
row_group_indices  Input row group indices
Returns
Total number of top-level rows in the row groups
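
Conceptually, this is a sum of the per-row-group row counts over the selected indices. The sketch below models that with plain vectors. `total_rows` is a hypothetical helper, and the row counts are assumed to have been read from the footer beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums the row counts of the selected row groups.
// `rows_per_row_group[i]` is the number of top-level rows in row group i.
std::int64_t total_rows(std::vector<std::int64_t> const& rows_per_row_group,
                        std::vector<std::int32_t> const& row_group_indices)
{
  std::int64_t total = 0;
  for (auto const idx : row_group_indices) {
    // Every index must refer to an existing row group.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < rows_per_row_group.size());
    total += rows_per_row_group[static_cast<std::size_t>(idx)];
  }
  return total;
}
```

For row groups holding 100, 250, 50, and 75 rows, selecting indices 1 and 3 gives 325 rows in total.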

The documentation for this class was generated from the following file: