Multi-file variant of the experimental Hybrid Scan Parquet reader.
#include <hybrid_scan_multifile.hpp>
Public Member Functions | |
| hybrid_scan_multifile (cudf::host_span< cudf::host_span< uint8_t const > const > footer_bytes, parquet_reader_options const &options) | |
| Constructor for the multi-file experimental Parquet reader. | |
| hybrid_scan_multifile (cudf::host_span< FileMetaData const > parquet_metadata, parquet_reader_options const &options) | |
| Constructor for the multi-file experimental Parquet reader. | |
| ~hybrid_scan_multifile () | |
| Destructor for the multi-file experimental Parquet reader. | |
| std::vector< FileMetaData > | parquet_metadatas () const |
| Get the Parquet metadata for all sources. | |
| std::vector< byte_range_info > | page_index_byte_ranges () const |
| Get byte ranges of the page index for all sources. | |
| void | setup_page_indexes (cudf::host_span< cudf::host_span< uint8_t const > const > page_index_bytes) const |
| Set up the per-source page index within each Parquet file metadata. | |
| std::vector< std::vector< size_type > > | all_row_groups (parquet_reader_options const &options) const |
| Get all available per-source row group indices from the Parquet files. | |
| size_type | total_rows_in_row_groups (cudf::host_span< std::vector< size_type > const > row_group_indices) const |
| Get the total number of top-level rows in the per-source row groups. | |
| void | reset_column_selection () const |
| Resets the current column selection. | |
| std::vector< std::vector< size_type > > | filter_row_groups_with_byte_range (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options) const |
| Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read). | |
| std::vector< std::vector< size_type > > | filter_row_groups_with_stats (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const |
| Filter the input row groups using column chunk statistics. | |
| std::pair< std::vector< byte_range_info >, std::vector< byte_range_info > > | secondary_filters_byte_ranges (cudf::host_span< std::vector< size_type > const > row_group_indices, parquet_reader_options const &options) const |
| Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning. | |
Multi-file variant of the experimental Hybrid Scan Parquet reader.
Vectorizes the hybrid_scan_reader APIs to support multiple Parquet sources. Inputs and outputs are indexed in source order, except for the row mask, which is a single BOOL8 column spanning all rows from all sources, concatenated in source order and then in row-group order within each source.
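The row mask layout described above can be illustrated with a small standalone sketch. It does not call libcudf: `size_type` is a local stand-in for `cudf::size_type`, and `row_mask_source_offsets` is a hypothetical helper name. Given the row count of each selected row group per source, it computes where each source's rows would begin in a single concatenated mask, assuming the concatenation order stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper, not part of libcudf.
// rows_per_row_group[s][i] is the row count of the i-th selected row group of source s.
// Returns one starting offset per source into the concatenated row mask, followed by
// a final entry holding the total number of rows.
std::vector<size_type> row_mask_source_offsets(
  std::vector<std::vector<size_type>> const& rows_per_row_group)
{
  std::vector<size_type> offsets;
  offsets.reserve(rows_per_row_group.size() + 1);
  size_type running = 0;
  for (auto const& source : rows_per_row_group) {
    // Rows of this source start where the previous source ended.
    offsets.push_back(running);
    for (auto const rows : source) {
      assert(rows >= 0);
      // Row groups within a source follow each other in row-group order.
      running += rows;
    }
  }
  offsets.push_back(running);
  return offsets;
}
```

With three sources holding row groups of {100, 50}, {} and {25} rows, the helper yields the offsets 0, 150, 150 and a total of 175, so the third source's mask entries occupy positions 150 through 174.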
This class is intended to move into hybrid_scan.hpp eventually, and the existing single-file reader (hybrid_scan_reader) will become its subclass. It is kept separate for now to reduce noise.
Definition at line 52 of file hybrid_scan_multifile.hpp.
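| cudf::io::parquet::experimental::hybrid_scan_multifile::hybrid_scan_multifile | ( | cudf::host_span< cudf::host_span< uint8_t const > const > | footer_bytes, |
| parquet_reader_options const & | options | ||
| ) | |
explicit |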
Constructor for the multi-file experimental Parquet reader.
| footer_bytes | Host span of Parquet file footer byte spans, one per source |
| options | Parquet reader options |
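| cudf::io::parquet::experimental::hybrid_scan_multifile::hybrid_scan_multifile | ( | cudf::host_span< FileMetaData const > | parquet_metadata, |
| parquet_reader_options const & | options | ||
| ) | |
explicit |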
Constructor for the multi-file experimental Parquet reader.
| parquet_metadata | Host span of pre-populated Parquet file metadata, one per source |
| options | Parquet reader options |
| std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::all_row_groups | ( | parquet_reader_options const & | options | ) | const |
Get all available per-source row group indices from the Parquet files.
| options | Parquet reader options |
| std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::filter_row_groups_with_byte_range | ( | cudf::host_span< std::vector< size_type > const > | row_group_indices, |
| parquet_reader_options const & | options | ||
| ) | const |
Filter the row groups using the byte range specified by [bytes_to_skip, bytes_to_skip + bytes_to_read).
Only row groups that start within the byte range are selected. Note that the last selected row group may end beyond the byte range.
| row_group_indices | Input per-source row group indices (one inner vector per source) |
| options | Parquet reader options |
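The selection rule can be sketched without libcudf as follows. The function name `select_row_groups_starting_in_range` and the `start_offsets` table are hypothetical, and applying the same byte range to every source independently is an assumption of this sketch rather than documented behavior. In the real API the range comes from the reader options, not from explicit arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper, not part of libcudf.
// start_offsets[s][rg] is the file offset at which row group rg of source s begins.
// Keeps a row group only if it starts inside [bytes_to_skip, bytes_to_skip + bytes_to_read).
// Assumption: the same range is applied to each source independently.
std::vector<std::vector<size_type>> select_row_groups_starting_in_range(
  std::vector<std::vector<std::int64_t>> const& start_offsets,
  std::vector<std::vector<size_type>> const& row_group_indices,
  std::int64_t bytes_to_skip,
  std::int64_t bytes_to_read)
{
  assert(start_offsets.size() == row_group_indices.size());
  auto const range_end = bytes_to_skip + bytes_to_read;
  std::vector<std::vector<size_type>> selected(row_group_indices.size());
  for (std::size_t s = 0; s < row_group_indices.size(); ++s) {
    for (auto const rg : row_group_indices[s]) {
      auto const start = start_offsets[s].at(static_cast<std::size_t>(rg));
      // Only the start offset is tested, so a selected row group may extend past range_end.
      if (start >= bytes_to_skip && start < range_end) { selected[s].push_back(rg); }
    }
  }
  return selected;
}
```

Because only the start offset is checked, a row group that begins at byte 1000 and ends at byte 2000 is still selected for the range [500, 1500), which matches the note that the last selected row group may end beyond the byte range.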
| std::vector<std::vector<size_type> > cudf::io::parquet::experimental::hybrid_scan_multifile::filter_row_groups_with_stats | ( | cudf::host_span< std::vector< size_type > const > | row_group_indices, |
| parquet_reader_options const & | options, | ||
| rmm::cuda_stream_view | stream | ||
| ) | const |
Filter the input row groups using column chunk statistics.
| row_group_indices | Input per-source row group indices (one inner vector per source) |
| options | Parquet reader options |
| stream | CUDA stream used for device memory operations and kernel launches |
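As a rough illustration of statistics-based pruning, the sketch below keeps only row groups whose column chunk min/max range could satisfy a predicate of the form `value > threshold`. Everything here is hypothetical: `column_chunk_stats`, `prune_greater_than`, and the single-column integer predicate are stand-ins chosen for brevity, and the sketch runs on the host. The actual method evaluates whatever filter is configured in the reader options and uses the supplied CUDA stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical per-row-group statistics for one integer column.
struct column_chunk_stats {
  std::int64_t min_value;
  std::int64_t max_value;
};

// Hypothetical helper, not part of libcudf.
// stats[s][rg] holds the statistics of row group rg in source s.
// A row group survives only if some value in [min_value, max_value] can exceed threshold.
std::vector<std::vector<size_type>> prune_greater_than(
  std::vector<std::vector<column_chunk_stats>> const& stats,
  std::vector<std::vector<size_type>> const& row_group_indices,
  std::int64_t threshold)
{
  assert(stats.size() == row_group_indices.size());
  std::vector<std::vector<size_type>> kept(row_group_indices.size());
  for (std::size_t s = 0; s < row_group_indices.size(); ++s) {
    for (auto const rg : row_group_indices[s]) {
      auto const& st = stats[s].at(static_cast<std::size_t>(rg));
      // If even the maximum fails the predicate, no row in this row group can pass.
      if (st.max_value > threshold) { kept[s].push_back(rg); }
    }
  }
  return kept;
}
```

Pruning of this kind is conservative: a surviving row group may still contain no matching rows, so the result narrows the candidate set rather than producing the final answer.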
| std::vector<byte_range_info> cudf::io::parquet::experimental::hybrid_scan_multifile::page_index_byte_ranges | ( | ) | const |
Get byte ranges of the page index for all sources.
| std::vector<FileMetaData> cudf::io::parquet::experimental::hybrid_scan_multifile::parquet_metadatas | ( | ) | const |
Get the Parquet metadata for all sources.
| void cudf::io::parquet::experimental::hybrid_scan_multifile::reset_column_selection | ( | ) | const |
Resets the current column selection.
Resets the current column selection state, forcing column re-selection in subsequent filter, byte-range, chunking-setup, and materialization APIs. This is useful if the filter expression has been cascaded (AND-ed) to include new columns.
| std::pair<std::vector<byte_range_info>, std::vector<byte_range_info> > cudf::io::parquet::experimental::hybrid_scan_multifile::secondary_filters_byte_ranges | ( | cudf::host_span< std::vector< size_type > const > | row_group_indices, |
| parquet_reader_options const & | options | ||
| ) | const |
Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.
| row_group_indices | Input per-source row group indices (one inner vector per source) |
| options | Parquet reader options |
| void cudf::io::parquet::experimental::hybrid_scan_multifile::setup_page_indexes | ( | cudf::host_span< cudf::host_span< uint8_t const > const > | page_index_bytes | ) | const |
Set up the per-source page index within each Parquet file metadata.
| page_index_bytes | Host span of Parquet page index buffer bytes, one per source |
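A caller typically has to obtain the page index bytes for each source before passing them in. The sketch below shows one way to slice those bytes out of in-memory file buffers, given one byte range per source. It is standalone: `byte_range` is a hypothetical stand-in for the library's byte range descriptor (assumed to carry an offset and a size), and `slice_page_index_bytes` is an illustrative helper, not a libcudf function.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a byte range descriptor (offset and size in bytes).
struct byte_range {
  std::int64_t offset;
  std::int64_t size;
};

// Hypothetical helper, not part of libcudf.
// files[s] is the full content of source s; ranges[s] is the page index range for source s.
// Returns one buffer per source, in source order, holding a copy of that range.
std::vector<std::vector<std::uint8_t>> slice_page_index_bytes(
  std::vector<std::vector<std::uint8_t>> const& files,
  std::vector<byte_range> const& ranges)
{
  if (files.size() != ranges.size()) {
    throw std::invalid_argument("expected exactly one byte range per source");
  }
  std::vector<std::vector<std::uint8_t>> out;
  out.reserve(files.size());
  for (std::size_t s = 0; s < files.size(); ++s) {
    auto const& r        = ranges[s];
    auto const file_size = static_cast<std::int64_t>(files[s].size());
    // Reject ranges that fall outside the file buffer.
    if (r.offset < 0 || r.size < 0 || r.offset > file_size || r.size > file_size - r.offset) {
      throw std::out_of_range("byte range lies outside the source buffer");
    }
    auto const first = files[s].begin() + static_cast<std::ptrdiff_t>(r.offset);
    out.emplace_back(first, first + static_cast<std::ptrdiff_t>(r.size));
  }
  return out;
}
```

The sketch copies the bytes for simplicity. Since the method accepts spans, a real caller could presumably point spans at the original buffers instead, provided those buffers stay alive for the duration of the call.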
| size_type cudf::io::parquet::experimental::hybrid_scan_multifile::total_rows_in_row_groups | ( | cudf::host_span< std::vector< size_type > const > | row_group_indices | ) | const |
Get the total number of top-level rows in the per-source row groups.
| row_group_indices | Input per-source row group indices (one inner vector per source) |
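Since the return type is a single `size_type`, the combined row count across all sources has to fit in that type. The following standalone sketch sums the rows of the selected row groups and guards against exceeding the limit. The names `total_rows_checked` and `num_rows` are hypothetical, `size_type` is a local stand-in assumed to be 32-bit, and throwing on overflow is a choice made for this sketch, not documented behavior of the library.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Local stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper, not part of libcudf.
// num_rows[s][rg] is the top-level row count of row group rg in source s.
// Sums the rows of the selected row groups across all sources.
size_type total_rows_checked(std::vector<std::vector<std::int64_t>> const& num_rows,
                             std::vector<std::vector<size_type>> const& row_group_indices)
{
  if (num_rows.size() != row_group_indices.size()) {
    throw std::invalid_argument("expected one index list per source");
  }
  std::int64_t total = 0;  // accumulate in 64 bits so the check below is meaningful
  for (std::size_t s = 0; s < row_group_indices.size(); ++s) {
    for (auto const rg : row_group_indices[s]) {
      total += num_rows[s].at(static_cast<std::size_t>(rg));
    }
  }
  // Assumption of this sketch: report an error instead of silently truncating.
  if (total > std::numeric_limits<size_type>::max()) {
    throw std::overflow_error("total row count does not fit in size_type");
  }
  return static_cast<size_type>(total);
}
```

For example, selecting row groups 0 and 2 from a source with 100, 200 and 300 rows, plus row group 0 from a second source with 50 rows, gives 450.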