A chunked Parquet reader class that reads a Parquet source iteratively as a series of tables, chunk by chunk. Each chunk is prepended with a row-index column built from the specified row group offsets and row counts, then filtered using the supplied serialized roaring64 bitmap deletion vector before being returned. More...
#include <deletion_vectors.hpp>
Public Member Functions

chunked_parquet_reader (std::size_t chunk_read_limit, parquet_reader_options const &options, cudf::host_span< cuda::std::byte const > serialized_roaring64, cudf::host_span< size_t const > row_group_offsets, cudf::host_span< size_type const > row_group_num_rows, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
    Constructor for the chunked reader. More...

chunked_parquet_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, cudf::host_span< cuda::std::byte const > serialized_roaring64, cudf::host_span< size_t const > row_group_offsets, cudf::host_span< size_type const > row_group_num_rows, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
    Constructor for the chunked reader. More...

~chunked_parquet_reader ()
    Destructor; destroys the internal reader instance and the roaring bitmap deletion vector.

bool has_next () const
    Check whether any data in the given source has not yet been read. More...

table_with_metadata read_chunk ()
    Read a chunk of the table from the Parquet source, prepend an index column to it, and filter the resulting table chunk using the 64-bit roaring bitmap deletion vector, if provided. More...
A chunked Parquet reader class that reads a Parquet source iteratively as a series of tables, chunk by chunk. Each chunk is prepended with a row-index column built from the specified row group offsets and row counts, then filtered using the supplied serialized roaring64 bitmap deletion vector before being returned.
This class addresses the problem of reading very large Parquet sources whose row count exceeds the cudf column size limit, or where device memory is constrained. By reading the source content in chunks with this class, each chunk is guaranteed to stay within the given size limits. Note that the given memory limits do not account for the device memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.
Definition at line 37 of file deletion_vectors.hpp.
cudf::io::parquet::experimental::chunked_parquet_reader::chunked_parquet_reader (
    std::size_t chunk_read_limit,
    parquet_reader_options const & options,
    cudf::host_span< cuda::std::byte const > serialized_roaring64,
    cudf::host_span< size_t const > row_group_offsets,
    cudf::host_span< size_type const > row_group_num_rows,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()
)
Constructor for the chunked reader.
Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), plus an additional parameter specifying the byte size limit of each output table chunk.
| chunk_read_limit | Byte limit on the returned table chunk size, 0 if there is no limit |
| options | Parquet reader options |
| serialized_roaring64 | Host span of portable serialized 64-bit roaring bitmap |
| row_group_offsets | Host span of row offsets of each row group |
| row_group_num_rows | Host span of number of rows in each row group |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource to use for device memory allocation |
cudf::io::parquet::experimental::chunked_parquet_reader::chunked_parquet_reader (
    std::size_t chunk_read_limit,
    std::size_t pass_read_limit,
    parquet_reader_options const & options,
    cudf::host_span< cuda::std::byte const > serialized_roaring64,
    cudf::host_span< size_t const > row_group_offsets,
    cudf::host_span< size_type const > row_group_num_rows,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()
)
Constructor for the chunked reader.
Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), plus additional parameters specifying the byte size limit of each output table chunk and a byte limit on the amount of temporary memory used when reading. The pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. It is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded. Also note that pass_read_limit does not include the memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.
| chunk_read_limit | Byte limit on the returned table chunk size, 0 if there is no limit |
| pass_read_limit | Byte limit on the amount of memory used for decompressing and decoding data, 0 if there is no limit |
| options | Parquet reader options |
| serialized_roaring64 | Host span of portable serialized 64-bit roaring bitmap |
| row_group_offsets | Host span of row offsets of each row group |
| row_group_num_rows | Host span of number of rows in each row group |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource to use for device memory allocation |
bool cudf::io::parquet::experimental::chunked_parquet_reader::has_next ( ) const
Check if there is any data in the given source that has not yet been read.
table_with_metadata cudf::io::parquet::experimental::chunked_parquet_reader::read_chunk ( )
Read a chunk of the table from the Parquet source, prepend an index column to it, and filter the resulting table chunk using the 64-bit roaring bitmap deletion vector, if provided.
The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire source at once.
An empty table is returned if the given source is empty, or if all the data in the source has already been read and returned by previous calls.
Returns
    A cudf::table along with its metadata