The chunked parquet reader class to read a Parquet file iteratively into a series of tables, chunk by chunk. More...
#include <parquet.hpp>
Public Member Functions

chunked_parquet_reader ()
    Default constructor; this should never be used. More...

chunked_parquet_reader (std::size_t chunk_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
    Constructor for the chunked reader. More...

chunked_parquet_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
    Constructor for the chunked reader. More...

~chunked_parquet_reader ()
    Destructor, destroying the internal reader instance. More...

bool has_next () const
    Checks whether the given file has any data that has not yet been read. More...

table_with_metadata read_chunk () const
    Reads a chunk of rows from the given Parquet file. More...
The chunked parquet reader class to read a Parquet file iteratively into a series of tables, chunk by chunk.

This class is designed to address the problem of reading very large Parquet files whose column sizes exceed the limit that can be stored in a cudf column. By reading the file contents in chunks with this class, each chunk is guaranteed to keep its sizes within the given limit.
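A minimal usage sketch of the chunk-by-chunk loop described above. The file path and the 500 MB limit are illustrative assumptions, and running this requires a CUDA device plus the cudf and RMM libraries:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

int main()
{
  // Build reader options for the input file (the path is a placeholder).
  auto const source  = cudf::io::source_info{"example.parquet"};
  auto const options = cudf::io::parquet_reader_options::builder(source).build();

  // Limit each returned table to roughly 500 MB of output.
  std::size_t const chunk_read_limit = 500'000'000;
  cudf::io::chunked_parquet_reader reader{chunk_read_limit, options};

  // Read the file chunk by chunk until no unread data remains.
  std::vector<std::unique_ptr<cudf::table>> chunks;
  while (reader.has_next()) {
    auto chunk = reader.read_chunk();       // table_with_metadata
    chunks.emplace_back(std::move(chunk.tbl));
  }
  return 0;
}
```

Each iteration yields a `table_with_metadata`; the tables, taken in order, together cover the whole file.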
Definition at line 479 of file parquet.hpp.
cudf::io::chunked_parquet_reader::chunked_parquet_reader ()

Default constructor; this should never be used.

This is added only to satisfy Cython and to avoid leaking the detail API.
cudf::io::chunked_parquet_reader::chunked_parquet_reader (std::size_t chunk_read_limit,
                                                          parquet_reader_options const &options,
                                                          rmm::cuda_stream_view stream = cudf::get_default_stream(),
                                                          rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Constructor for the chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), plus an additional parameter specifying the byte limit on the size of the output table for each read.
Parameters
    chunk_read_limit    Limit on total number of bytes to be returned per read, or 0 if there is no limit
    options             The options used to read the Parquet file
    stream              CUDA stream used for device memory operations and kernel launches
    mr                  Device memory resource to use for device memory allocation
cudf::io::chunked_parquet_reader::chunked_parquet_reader (std::size_t chunk_read_limit,
                                                          std::size_t pass_read_limit,
                                                          parquet_reader_options const &options,
                                                          rmm::cuda_stream_view stream = cudf::get_default_stream(),
                                                          rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Constructor for the chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), with additional parameters specifying the byte limit on the size of the output table for each read and a byte limit on the amount of temporary memory to use while reading. pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. pass_read_limit is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded.
Parameters
    chunk_read_limit    Limit on total number of bytes to be returned per read, or 0 if there is no limit
    pass_read_limit     Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit
    options             The options used to read the Parquet file
    stream              CUDA stream used for device memory operations and kernel launches
    mr                  Device memory resource to use for device memory allocation
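A hedged sketch of constructing the reader with both limits. The file path and the byte values are illustrative assumptions, not recommendations, and the code requires cudf/RMM and a CUDA device to run:

```cpp
#include <cudf/io/parquet.hpp>

#include <cstddef>

int main()
{
  auto const source  = cudf::io::source_info{"example.parquet"};  // placeholder path
  auto const options = cudf::io::parquet_reader_options::builder(source).build();

  std::size_t const chunk_read_limit = 500'000'000;   // ~500 MB per returned table
  std::size_t const pass_read_limit  = 1'000'000'000; // ~1 GB for decompression space (a hint)

  // The reader tries to keep temporary/decompression memory under pass_read_limit,
  // but a single row group larger than that limit is still loaded in full.
  cudf::io::chunked_parquet_reader reader{chunk_read_limit, pass_read_limit, options};

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // each table stays within chunk_read_limit
    // ... process chunk.tbl ...
  }
  return 0;
}
```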
cudf::io::chunked_parquet_reader::~chunked_parquet_reader ()

Destructor, destroying the internal reader instance.

Since the declaration of the internal reader object does not exist in this header, this destructor must be defined in a separate source file that can access that object's declaration.
bool cudf::io::chunked_parquet_reader::has_next () const

Checks whether the given file has any data that has not yet been read.
table_with_metadata cudf::io::chunked_parquet_reader::read_chunk () const

Reads a chunk of rows from the given Parquet file.

The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire file at once.

An empty table is returned if the given file is empty, or if all the data in the file has already been read and returned by previous calls.
Returns
    A cudf::table along with its metadata
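The concatenation guarantee above can be sketched as follows. This is a hedged illustration, not library-provided code: the file path and chunk limit are assumptions, and it requires cudf/RMM plus a CUDA device:

```cpp
#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

#include <memory>
#include <utility>
#include <vector>

int main()
{
  auto const source  = cudf::io::source_info{"example.parquet"};  // placeholder path
  auto const options = cudf::io::parquet_reader_options::builder(source).build();

  cudf::io::chunked_parquet_reader reader{500'000'000, options};

  // Keep the chunk tables alive while collecting views of them.
  std::vector<std::unique_ptr<cudf::table>> owned;
  std::vector<cudf::table_view> views;
  while (reader.has_next()) {
    auto chunk = reader.read_chunk();
    owned.emplace_back(std::move(chunk.tbl));
    views.push_back(owned.back()->view());
  }

  // Concatenating the chunks in order yields the same dataset that
  // a single cudf::io::read_parquet(options) call would return.
  if (!views.empty()) { auto whole = cudf::concatenate(views); }
  return 0;
}
```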