The chunked ORC reader class to read an ORC file iteratively into a series of tables, chunk by chunk.
#include <orc.hpp>
Public Member Functions

chunked_orc_reader ()
 Default constructor; this should never be used.
chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, size_type output_row_granularity, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Construct the reader from input/output size limits and output row granularity, along with other ORC reader options.
chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Construct the reader from input/output size limits along with other ORC reader options.
chunked_orc_reader (std::size_t chunk_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Construct the reader from output size limits along with other ORC reader options.
~chunked_orc_reader ()
 Destructor, destroying the internal reader instance.
bool has_next () const
 Check whether there is any data in the given data sources that has not yet been read.
table_with_metadata read_chunk () const
 Read a chunk of rows from the given data sources.
Detailed Description

The chunked ORC reader class to read an ORC file iteratively into a series of tables, chunk by chunk.
This class is designed to address the problem of reading very large ORC files whose column sizes exceed the limit that can be stored in cudf columns. By reading the file content in chunks using this class, each chunk is guaranteed to have a size that stays within the given limit.
cudf::io::chunked_orc_reader::chunked_orc_reader ( )
Default constructor; this should never be used.
This is added just to satisfy Cython.
explicit cudf::io::chunked_orc_reader::chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, size_type output_row_granularity, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Construct the reader from input/output size limits and output row granularity, along with other ORC reader options.
The typical usage should be similar to this:
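The following is a minimal sketch of this pattern; the file path input.orc, the helper function name and the 100 MB / 512 MB limits are illustrative assumptions, not values prescribed by the library:

#include <cudf/io/orc.hpp>

#include <cstddef>

void read_orc_in_chunks()
{
  // Build reader options for a hypothetical input file (the path is an assumption).
  auto const options =
    cudf::io::orc_reader_options::builder(cudf::io::source_info{"input.orc"}).build();

  // Illustrative soft limits: ~100 MB per output chunk, ~512 MB of temporary memory.
  cudf::io::chunked_orc_reader reader{
    std::size_t{100} * 1024 * 1024, std::size_t{512} * 1024 * 1024, options};

  // Read the data sources chunk by chunk until they are exhausted.
  do {
    auto chunk = reader.read_chunk();  // returns a table_with_metadata
    // ... process chunk.tbl (and chunk.metadata) here ...
  } while (reader.has_next());
}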
If chunk_read_limit == 0 (i.e., no output limit) and pass_read_limit == 0 (no temporary memory size limit), a call to read_chunk() will read the whole data source and return a table containing all rows.
The chunk_read_limit parameter controls the size of the output table to be returned per read_chunk() call. If the user specifies a 100 MB limit, the reader will attempt to return tables that have a total byte size (over all columns) of 100 MB or less. This is a soft limit and the code will not fail if it cannot satisfy the limit.
The pass_read_limit parameter controls how much temporary memory is used in the entire process of loading, decompressing and decoding the data. Again, this is also a soft limit and the reader will make a best effort to stay within it.
Finally, the output_row_granularity parameter controls the row counts of the output chunks. For each call to read_chunk(), with respect to the given pass_read_limit, a subset of stripes may be loaded, decompressed and decoded into an intermediate table. The reader will then subdivide that table into smaller tables for final output, using output_row_granularity as the subdivision step.
Parameters
  chunk_read_limit | Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
  pass_read_limit | Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
  output_row_granularity | The granularity parameter used for subdividing the decoded table for final output
  options | Settings for controlling reading behaviors
  stream | CUDA stream used for device memory operations and kernel launches
  mr | Device memory resource to use for device memory allocation
Exceptions
  cudf::logic_error | if output_row_granularity is non-positive
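As an illustrative sketch only (the helper name and all numeric values are assumptions), constructing a reader with an explicit subdivision step might look like this:

#include <cudf/io/orc.hpp>

#include <cstddef>

void make_reader_with_granularity(cudf::io::orc_reader_options const& options)
{
  // Illustrative values only: ~100 MB output-chunk limit, ~512 MB temporary-memory limit,
  // and output tables subdivided in steps of 100'000 rows (the granularity must be positive).
  cudf::io::chunked_orc_reader reader{std::size_t{100} * 1024 * 1024,  // chunk_read_limit
                                      std::size_t{512} * 1024 * 1024,  // pass_read_limit
                                      100'000,                         // output_row_granularity
                                      options};
  // ... call reader.read_chunk() in a loop, as in the usage sketch above ...
}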
explicit cudf::io::chunked_orc_reader::chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Construct the reader from input/output size limits along with other ORC reader options.
This constructor implicitly calls the other constructor with output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.
Parameters
  chunk_read_limit | Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
  pass_read_limit | Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
  options | Settings for controlling reading behaviors
  stream | CUDA stream used for device memory operations and kernel launches
  mr | Device memory resource to use for device memory allocation
explicit cudf::io::chunked_orc_reader::chunked_orc_reader (std::size_t chunk_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Construct the reader from output size limits along with other ORC reader options.
This constructor implicitly calls the other constructor with pass_read_limit set to 0 and output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.
Parameters
  chunk_read_limit | Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
  options | Settings for controlling reading behaviors
  stream | CUDA stream used for device memory operations and kernel launches
  mr | Device memory resource to use for device memory allocation
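A brief sketch under the same assumptions (hypothetical helper name, illustrative 100 MB limit):

#include <cudf/io/orc.hpp>

#include <cstddef>

void make_reader_with_output_limit_only(cudf::io::orc_reader_options const& options)
{
  // Only an output size limit is supplied; internally pass_read_limit is 0 (unlimited)
  // and the default output row granularity is used.
  cudf::io::chunked_orc_reader reader{std::size_t{100} * 1024 * 1024, options};
  // ... call reader.read_chunk() in a loop, as in the usage sketch above ...
}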
bool cudf::io::chunked_orc_reader::has_next ( ) const
Check whether there is any data in the given data sources that has not yet been read.
table_with_metadata cudf::io::chunked_orc_reader::read_chunk ( ) const
Read a chunk of rows from the given data sources.
The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire given data sources at once.
An empty table will be returned if the given sources are empty, or if all the data has already been read and returned by previous calls.
Returns
  A cudf::table along with its metadata
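For illustration, one way the chunks could be reassembled into a single table is with cudf::concatenate; the helper name read_all_chunks and the overall structure below are assumptions, not part of this class's API:

#include <cudf/concatenate.hpp>
#include <cudf/io/orc.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <utility>
#include <vector>

std::unique_ptr<cudf::table> read_all_chunks(cudf::io::chunked_orc_reader& reader)
{
  // Keep the chunk tables alive so that their views stay valid until concatenation.
  std::vector<std::unique_ptr<cudf::table>> chunks;
  std::vector<cudf::table_view> views;
  do {
    auto chunk = reader.read_chunk();
    views.push_back(chunk.tbl->view());
    chunks.push_back(std::move(chunk.tbl));
  } while (reader.has_next());

  // Concatenating the chunks in order yields the same dataset as a single full read.
  return cudf::concatenate(views);
}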