A chunked Parquet reader class that reads a Parquet source iteratively as a series of tables, chunk by chunk. Each chunk is prepended with a row-index column built from the specified row group offsets and row counts, then filtered using the supplied serialized roaring64 bitmap deletion vector before being returned. More...
#include <deletion_vectors.hpp>
Public Member Functions

chunked_parquet_reader (std::size_t chunk_read_limit, parquet_reader_options const &options, cudf::host_span< cuda::std::byte const > serialized_roaring64, cudf::host_span< size_t const > row_group_offsets, cudf::host_span< size_type const > row_group_num_rows, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
    Constructor for the chunked reader. More...

chunked_parquet_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, cudf::host_span< cuda::std::byte const > serialized_roaring64, cudf::host_span< size_t const > row_group_offsets, cudf::host_span< size_type const > row_group_num_rows, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
    Constructor for the chunked reader. More...

~chunked_parquet_reader ()
    Destructor; destroys the internal reader instance and the roaring bitmap deletion vector.

bool has_next () const
    Check whether any data in the given source has not yet been read. More...

table_with_metadata read_chunk ()
    Read a chunk of the table from the Parquet source, prepend an index column to it, and filter the resulting table chunk using the 64-bit roaring bitmap deletion vector, if provided. More...
A chunked Parquet reader class that reads a Parquet source iteratively as a series of tables, chunk by chunk. Each chunk is prepended with a row-index column built from the specified row group offsets and row counts, then filtered using the supplied serialized roaring64 bitmap deletion vector before being returned.
This class addresses the problem of reading very large Parquet sources whose row count exceeds the cudf column size limit, or where device memory is constrained. By reading the source content in chunks with this class, each chunk is guaranteed to stay within the given size limits. Note that the given memory limits do not account for the device memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.
Definition at line 37 of file deletion_vectors.hpp.
cudf::io::parquet::experimental::chunked_parquet_reader::chunked_parquet_reader (
    std::size_t chunk_read_limit,
    parquet_reader_options const & options,
    cudf::host_span< cuda::std::byte const > serialized_roaring64,
    cudf::host_span< size_t const > row_group_offsets,
    cudf::host_span< size_type const > row_group_num_rows,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()
)
Constructor for the chunked reader.
Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), plus an additional parameter specifying the byte size limit of each output table chunk.
| chunk_read_limit | Byte limit on the returned table chunk size, 0 if there is no limit |
| options | Parquet reader options |
| serialized_roaring64 | Host span of portable serialized 64-bit roaring bitmap |
| row_group_offsets | Host span of row offsets of each row group |
| row_group_num_rows | Host span of number of rows in each row group |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource to use for device memory allocation |
cudf::io::parquet::experimental::chunked_parquet_reader::chunked_parquet_reader (
    std::size_t chunk_read_limit,
    std::size_t pass_read_limit,
    parquet_reader_options const & options,
    cudf::host_span< cuda::std::byte const > serialized_roaring64,
    cudf::host_span< size_t const > row_group_offsets,
    cudf::host_span< size_type const > row_group_num_rows,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()
)
Constructor for the chunked reader.
Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), plus additional parameters specifying the byte size limit of each output table chunk and a byte limit on the amount of temporary memory used when reading. The pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. It is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded. Also note that pass_read_limit does not include the memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.
| chunk_read_limit | Byte limit on the returned table chunk size, 0 if there is no limit |
| pass_read_limit | Byte limit on the amount of memory used for decompressing and decoding data, 0 if there is no limit |
| options | Parquet reader options |
| serialized_roaring64 | Host span of portable serialized 64-bit roaring bitmap |
| row_group_offsets | Host span of row offsets of each row group |
| row_group_num_rows | Host span of number of rows in each row group |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource to use for device memory allocation |
bool cudf::io::parquet::experimental::chunked_parquet_reader::has_next ( ) const
Check if there is any data in the given source that has not yet been read.
table_with_metadata cudf::io::parquet::experimental::chunked_parquet_reader::read_chunk ( )
Read a chunk of the table from the Parquet source, prepend an index column to it, and filter the resulting table chunk using the 64-bit roaring bitmap deletion vector, if provided.
The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire source at once.
An empty table is returned if the given source is empty, or if all the data in the source has already been read and returned by previous calls.
Returns
    A cudf::table along with its metadata