A chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk.

#include <parquet.hpp>

Public Member Functions
	chunked_parquet_reader ()
	Default constructor; this should never be used.

	chunked_parquet_reader (std::size_t chunk_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Constructor for chunked reader.

	chunked_parquet_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Constructor for chunked reader.

	~chunked_parquet_reader ()
	Destructor, destroying the internal reader instance.

bool	has_next () const
	Check whether the given file has any data that has not yet been read.

table_with_metadata	read_chunk () const
	Read a chunk of rows in the given Parquet file.

Detailed Description

A chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large Parquet files whose columns exceed the size limit of what can be stored in a cudf column. When the file content is read in chunks using this class, the size of each chunk is guaranteed to stay within the given limit.

Definition at line 646 of file parquet.hpp.

Constructor & Destructor Documentation

◆ chunked_parquet_reader() [1/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader ( )

Default constructor; this should never be used.

This constructor exists only to satisfy Cython and to avoid leaking the detail API.

◆ chunked_parquet_reader() [2/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader	(	std::size_t	chunk_read_limit,
		parquet_reader_options const &	options,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Constructor for chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), plus an additional parameter that specifies the byte size limit of the output table for each read.

Parameters

chunk_read_limit	Limit on total number of bytes to be returned per read, or `0` if there is no limit
options	The options used to read the Parquet file
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource to use for device memory allocation
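The effect of chunk_read_limit can be pictured with a toy model. The sketch below is not cudf code and does not claim to reproduce the reader's internal algorithm: `split_by_byte_limit` is a hypothetical helper that groups per-row byte sizes into chunks whose totals stay within a limit, treating `0` as "no limit" in the same way as the documented parameter. The choice to give a single oversized row its own chunk is an assumption of this model only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of cudf: groups row sizes (in bytes) into
// chunks whose total size stays within `chunk_read_limit`.
// A limit of 0 means "no limit", mirroring the documented parameter.
std::vector<std::vector<std::size_t>> split_by_byte_limit(
  std::vector<std::size_t> const& row_bytes, std::size_t chunk_read_limit)
{
  std::vector<std::vector<std::size_t>> chunks;
  if (row_bytes.empty()) { return chunks; }

  // No limit: everything is returned as a single chunk.
  if (chunk_read_limit == 0) {
    chunks.push_back(row_bytes);
    return chunks;
  }

  std::vector<std::size_t> current;
  std::size_t current_bytes = 0;
  for (auto const bytes : row_bytes) {
    // Close the current chunk when adding this row would exceed the limit.
    // Assumption of this toy model: a row larger than the limit still forms
    // a chunk of its own rather than being rejected.
    if (!current.empty() && current_bytes + bytes > chunk_read_limit) {
      chunks.push_back(current);
      current.clear();
      current_bytes = 0;
    }
    current.push_back(bytes);
    current_bytes += bytes;
  }
  chunks.push_back(current);
  return chunks;
}
```

In this model, five rows of 4 bytes each with a limit of 8 bytes yield three chunks holding 2, 2, and 1 rows.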

◆ chunked_parquet_reader() [3/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader	(	std::size_t	chunk_read_limit,
		std::size_t	pass_read_limit,
		parquet_reader_options const &	options,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Constructor for chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), with additional parameters that specify the byte size limit of the output table for each read and a byte limit on the amount of temporary memory to use when reading. pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. pass_read_limit is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded.

Parameters

chunk_read_limit	Limit on total number of bytes to be returned per read, or `0` if there is no limit
pass_read_limit	Limit on the amount of memory used for reading and decompressing data, or `0` if there is no limit
options	The options used to read the Parquet file
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource to use for device memory allocation
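The "hint, not an absolute limit" behavior of pass_read_limit can be illustrated with a small model. This is a sketch under stated assumptions, not the reader's actual planning logic: `plan_passes` is a hypothetical helper, and the per-row-group byte counts stand in for whatever temporary memory the real reader accounts for. It packs row groups into passes under a budget, but a row group that alone exceeds the budget is still placed in a pass by itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of cudf: assigns consecutive row groups to
// passes so that the assumed temporary memory of each pass stays within
// `pass_read_limit`. Returns the number of row groups in each pass.
// A limit of 0 means "no limit".
std::vector<std::size_t> plan_passes(std::vector<std::size_t> const& row_group_bytes,
                                     std::size_t pass_read_limit)
{
  std::vector<std::size_t> passes;
  std::size_t groups_in_pass = 0;
  std::size_t bytes_in_pass  = 0;

  for (auto const bytes : row_group_bytes) {
    bool const over_budget =
      pass_read_limit != 0 && bytes_in_pass + bytes > pass_read_limit;

    // Start a new pass only if the current one already holds something.
    if (over_budget && groups_in_pass > 0) {
      passes.push_back(groups_in_pass);
      groups_in_pass = 0;
      bytes_in_pass  = 0;
    }

    // Soft limit: a row group larger than the budget is still loaded,
    // ending up alone in its pass.
    ++groups_in_pass;
    bytes_in_pass += bytes;
  }

  if (groups_in_pass > 0) { passes.push_back(groups_in_pass); }
  return passes;
}
```

With row groups of 100, 100, 500, and 100 bytes and a limit of 250, this model produces passes of 2, 1, and 1 row groups; the 500-byte group exceeds the limit but is loaded anyway.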

◆ ~chunked_parquet_reader()

cudf::io::chunked_parquet_reader::~chunked_parquet_reader ( )

Destructor, destroying the internal reader instance.

Since the declaration of the internal reader object does not exist in this header, this destructor must be defined in a separate source file that has access to that object's declaration.
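The constraint described here is the usual one for a class that owns a forward-declared implementation type. The following is a minimal single-file sketch of that pattern, assuming the internal object is held through a std::unique_ptr; `reader_sketch` and `detail_sketch::reader_impl` are hypothetical names used only for illustration, and the two halves of the listing stand in for the header and the separate source file.

```cpp
#include <memory>

// ---- What would live in the public header ----
namespace detail_sketch {
class reader_impl;  // forward declaration only: the type is incomplete here
}  // namespace detail_sketch

class reader_sketch {
 public:
  reader_sketch();
  // Declared here but defined below, where reader_impl is a complete type.
  // An inline or defaulted-in-class destructor would need to delete an
  // incomplete type and fail to compile.
  ~reader_sketch();

  bool is_open() const;

 private:
  std::unique_ptr<detail_sketch::reader_impl> impl_;
};

// ---- What would live in the separate source file ----
namespace detail_sketch {
class reader_impl {
 public:
  bool open = true;
};
}  // namespace detail_sketch

reader_sketch::reader_sketch() : impl_{std::make_unique<detail_sketch::reader_impl>()} {}

// reader_impl is complete at this point, so unique_ptr can destroy it.
reader_sketch::~reader_sketch() = default;

bool reader_sketch::is_open() const { return impl_->open; }
```

Keeping the implementation type out of the header means users of the public class never see the detail API, at the cost of defining the special member functions out of line.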

Member Function Documentation

◆ has_next()

bool cudf::io::chunked_parquet_reader::has_next ( ) const

Check whether the given file has any data that has not yet been read.

Returns: A boolean value indicating if there is any data left to read

◆ read_chunk()

table_with_metadata cudf::io::chunked_parquet_reader::read_chunk ( ) const

Read a chunk of rows in the given Parquet file.

The returned tables, when concatenated in order, are guaranteed to form the same complete dataset as reading the entire file at once.

An empty table will be returned if the given file is empty, or if all the data in the file has already been read and returned by previous calls.

Returns: An output cudf::table along with its metadata
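A typical way to consume the reader is to call read_chunk() while has_next() reports remaining data. The sketch below writes that loop generically so it can be exercised without a GPU: `for_each_chunk` and `fake_chunked_reader` are hypothetical names, and the fake reader only imitates the documented has_next()/read_chunk() contract using a vector of integers. With cudf, the reader argument would be a cudf::io::chunked_parquet_reader and each chunk a table_with_metadata.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Generic read loop: calls consume(chunk) for every chunk and returns the
// number of chunks read. Reader only needs const has_next() and read_chunk(),
// matching the documented member functions.
template <typename Reader, typename Consume>
std::size_t for_each_chunk(Reader const& reader, Consume&& consume)
{
  std::size_t num_chunks = 0;
  while (reader.has_next()) {
    consume(reader.read_chunk());
    ++num_chunks;
  }
  return num_chunks;
}

// Hypothetical stand-in for the real reader: returns at most
// `rows_per_chunk` values per call (0 means "no limit").
class fake_chunked_reader {
 public:
  fake_chunked_reader(std::vector<int> rows, std::size_t rows_per_chunk)
    : rows_{std::move(rows)}, rows_per_chunk_{rows_per_chunk}
  {
  }

  bool has_next() const { return pos_ < rows_.size(); }

  // const like the documented read_chunk(), hence the mutable position.
  std::vector<int> read_chunk() const
  {
    auto const step = rows_per_chunk_ == 0 ? rows_.size() : rows_per_chunk_;
    auto const end  = std::min(rows_.size(), pos_ + step);
    std::vector<int> chunk(rows_.begin() + static_cast<std::ptrdiff_t>(pos_),
                           rows_.begin() + static_cast<std::ptrdiff_t>(end));
    pos_ = end;
    return chunk;  // empty once all rows have been handed out
  }

 private:
  std::vector<int> rows_;
  std::size_t rows_per_chunk_;
  mutable std::size_t pos_ = 0;
};
```

Appending each chunk to an output vector inside `consume` reproduces the original sequence in order, which mirrors the concatenation guarantee stated above; once the data is exhausted, a further read_chunk() call on the fake reader returns an empty result.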

The documentation for this class was generated from the following file:

parquet.hpp
