cudf::io::chunked_parquet_reader Class Reference

A chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk.

#include <parquet.hpp>

Public Member Functions

 chunked_parquet_reader ()
 Default constructor; this should never be used.
 
 chunked_parquet_reader (std::size_t chunk_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Constructor for chunked reader.
 
 chunked_parquet_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Constructor for chunked reader.
 
 ~chunked_parquet_reader ()
 Destructor, destroying the internal reader instance.
 
bool has_next () const
 Check whether any data in the given file has not yet been read.
 
table_with_metadata read_chunk () const
 Read a chunk of rows in the given Parquet file.
 

Detailed Description

A chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large Parquet files whose columns exceed the size limit of a cudf column. When the file content is read in chunks with this class, each chunk is guaranteed to stay within the given size limit.

Definition at line 479 of file parquet.hpp.

Constructor & Destructor Documentation

◆ chunked_parquet_reader() [1/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader ( )

Default constructor; this should never be used.

It exists only to satisfy Cython and to avoid leaking the detail API.

◆ chunked_parquet_reader() [2/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader ( std::size_t  chunk_read_limit,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Constructor for chunked reader.

This constructor takes the same parquet_reader_options parameter as cudf::read_parquet(), plus an additional parameter that specifies the byte limit on the size of the output table for each read.

Parameters
chunk_read_limit  Limit on the total number of bytes to be returned per read, or 0 if there is no limit
options  The options used to read the Parquet file
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource to use for device memory allocation
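
As an illustration of how a per-read byte limit partitions the output, the following toy model splits a list of row sizes into chunks. It is a sketch and not cuDF's algorithm: split_rows_by_byte_limit is a hypothetical helper, rows are modeled as plain byte counts, and the handling of a single row larger than the limit is an assumption of the sketch.

```cpp
#include <cstddef>
#include <vector>

// Toy model (NOT cuDF's algorithm): returns the number of rows placed in each
// chunk so that the summed row sizes of a chunk stay within `chunk_read_limit`
// bytes. A limit of 0 means "no limit", so all rows land in a single chunk.
// Assumption of this sketch: a single row larger than the limit still forms a
// chunk of its own.
std::vector<std::size_t> split_rows_by_byte_limit(std::vector<std::size_t> const& row_bytes,
                                                  std::size_t chunk_read_limit)
{
  std::vector<std::size_t> rows_per_chunk;
  std::size_t rows_in_chunk  = 0;
  std::size_t bytes_in_chunk = 0;
  for (std::size_t const bytes : row_bytes) {
    bool const over_limit = chunk_read_limit != 0 && rows_in_chunk > 0 &&
                            bytes_in_chunk + bytes > chunk_read_limit;
    if (over_limit) {
      // Close the current chunk before adding the row that would overflow it.
      rows_per_chunk.push_back(rows_in_chunk);
      rows_in_chunk  = 0;
      bytes_in_chunk = 0;
    }
    ++rows_in_chunk;
    bytes_in_chunk += bytes;
  }
  if (rows_in_chunk > 0) { rows_per_chunk.push_back(rows_in_chunk); }
  return rows_per_chunk;
}
```

With five rows of 40 bytes each and a limit of 100 bytes, the model yields chunks of 2, 2, and 1 rows; with a limit of 0 it yields one chunk of 5 rows.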

◆ chunked_parquet_reader() [3/3]

cudf::io::chunked_parquet_reader::chunked_parquet_reader ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
parquet_reader_options const &  options,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Constructor for chunked reader.

This constructor takes the same parquet_reader_options parameter as cudf::read_parquet(), plus two additional parameters: a byte limit on the size of the output table for each read, and a byte limit on the amount of temporary memory to use when reading. pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. pass_read_limit is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded.

Parameters
chunk_read_limit  Limit on the total number of bytes to be returned per read, or 0 if there is no limit
pass_read_limit  Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit
options  The options used to read the Parquet file
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource to use for device memory allocation
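
The "hint, not an absolute limit" behavior of pass_read_limit can be pictured with a small toy model that groups row groups into passes. This is only a sketch under stated assumptions: plan_passes is a hypothetical helper, each row group is reduced to a single memory estimate in bytes, and the greedy grouping is not claimed to match cuDF's internal planning.

```cpp
#include <cstddef>
#include <vector>

// Toy model (NOT cuDF's implementation): greedily groups row groups into
// passes so that the estimated memory of a pass stays within
// `pass_read_limit` bytes. A limit of 0 means "no limit". The limit is soft:
// a row group that is larger than the limit is still loaded, alone in its pass.
// Returns, for each pass, the indices of the row groups it contains.
std::vector<std::vector<std::size_t>> plan_passes(std::vector<std::size_t> const& row_group_bytes,
                                                  std::size_t pass_read_limit)
{
  std::vector<std::vector<std::size_t>> passes;
  std::vector<std::size_t> current;
  std::size_t current_bytes = 0;
  for (std::size_t i = 0; i < row_group_bytes.size(); ++i) {
    bool const over_limit = pass_read_limit != 0 && !current.empty() &&
                            current_bytes + row_group_bytes[i] > pass_read_limit;
    if (over_limit) {
      // Finish the current pass before starting a new one with this row group.
      passes.push_back(current);
      current.clear();
      current_bytes = 0;
    }
    current.push_back(i);
    current_bytes += row_group_bytes[i];
  }
  if (!current.empty()) { passes.push_back(current); }
  return passes;
}
```

For row groups estimated at 60, 30, 200, 20, and 20 bytes with a limit of 100, the model produces three passes: row groups 0 and 1 together, row group 2 alone (even though it exceeds the limit), and row groups 3 and 4 together.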

◆ ~chunked_parquet_reader()

cudf::io::chunked_parquet_reader::~chunked_parquet_reader ( )

Destructor, destroying the internal reader instance.

Because the internal reader object is not declared in this header, this destructor must be defined in a separate source file that has access to that object's declaration.
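
The general C++ pattern behind this requirement can be sketched as follows, assuming the internal object is owned through a std::unique_ptr to a forward-declared type (an assumption made for illustration; this page does not state how the member is held). The names toy_reader, toy_reader_impl, and live_toy_impls are hypothetical, and both "files" are shown in one listing so the sketch compiles on its own.

```cpp
#include <memory>

// Counts live implementation objects so the destructor's effect is observable.
int live_toy_impls = 0;

// ---- What the public header would contain ----
class toy_reader_impl;  // forward declaration: the type is incomplete here

class toy_reader {
 public:
  toy_reader();
  ~toy_reader();  // only declared; defining it here would need the complete type

 private:
  std::unique_ptr<toy_reader_impl> impl_;
};

// ---- What the separate source file would contain ----
class toy_reader_impl {
 public:
  toy_reader_impl() { ++live_toy_impls; }
  ~toy_reader_impl() { --live_toy_impls; }
};

toy_reader::toy_reader() : impl_(std::make_unique<toy_reader_impl>()) {}

// toy_reader_impl is complete at this point, so std::unique_ptr can delete it.
toy_reader::~toy_reader() = default;
```

Defining the destructor inline in the header would instead force std::unique_ptr to delete an incomplete type, which does not compile.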

Member Function Documentation

◆ has_next()

bool cudf::io::chunked_parquet_reader::has_next ( ) const

Check whether any data in the given file has not yet been read.

Returns
A boolean value indicating if there is any data left to read

◆ read_chunk()

table_with_metadata cudf::io::chunked_parquet_reader::read_chunk ( ) const

Read a chunk of rows in the given Parquet file.

The returned tables, concatenated in the order they were returned, are guaranteed to form the same complete dataset as reading the entire given file at once.

An empty table is returned if the given file is empty, or if all the data in the file has already been read and returned by previous calls.

Returns
An output cudf::table along with its metadata
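
The has_next() / read_chunk() contract can be exercised with a loop such as the one below. It is a minimal sketch that compiles without cuDF: ToyChunkedReader is a hypothetical stand-in that serves integers instead of tables, and read_all_chunks is a hypothetical helper written as a template so that it accepts any type with these two member functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a chunked reader (NOT the cuDF class): serves a vector of
// row values in pieces of at most `max_rows` rows (0 means no limit).
class ToyChunkedReader {
 public:
  ToyChunkedReader(std::vector<int> rows, std::size_t max_rows)
    : rows_(std::move(rows)), max_rows_(max_rows)
  {
  }

  // True while some rows have not been returned yet.
  bool has_next() const { return pos_ < rows_.size(); }

  // Returns the next piece, or an empty vector if the input is empty or
  // everything has already been returned by previous calls.
  std::vector<int> read_chunk()
  {
    std::size_t const remaining = rows_.size() - pos_;
    std::size_t const count     = (max_rows_ == 0) ? remaining : std::min(max_rows_, remaining);
    auto const first            = rows_.begin() + static_cast<std::ptrdiff_t>(pos_);
    std::vector<int> chunk(first, first + static_cast<std::ptrdiff_t>(count));
    pos_ += count;
    return chunk;
  }

 private:
  std::vector<int> rows_;
  std::size_t max_rows_;
  std::size_t pos_ = 0;
};

// Hypothetical helper: concatenates all chunks in the order they are returned.
template <typename Reader>
std::vector<int> read_all_chunks(Reader& reader)
{
  std::vector<int> all;
  while (reader.has_next()) {
    auto chunk = reader.read_chunk();
    all.insert(all.end(), chunk.begin(), chunk.end());
  }
  return all;
}
```

With the real class, the loop has the same shape: the reader would be a cudf::io::chunked_parquet_reader, and each read_chunk() call would return a table_with_metadata rather than a vector.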

The documentation for this class was generated from the following file: