cudf::io::chunked_orc_reader Class Reference

Chunked ORC reader class that reads an ORC file iteratively into a series of tables, chunk by chunk.

#include <orc.hpp>

Public Member Functions

 chunked_orc_reader ()
 Default constructor; this should never be used.
 
 chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, size_type output_row_granularity, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Construct the reader from input/output size limits, output row granularity, along with other ORC reader options.
 
 chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Construct the reader from input/output size limits along with other ORC reader options.
 
 chunked_orc_reader (std::size_t chunk_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Construct the reader from output size limits along with other ORC reader options.
 
 ~chunked_orc_reader ()
 Destructor, destroying the internal reader instance.
 
bool has_next () const
 Check whether any data in the given data sources has not yet been read.
 
table_with_metadata read_chunk () const
 Read a chunk of rows from the given data sources.
 

Detailed Description

Chunked ORC reader class that reads an ORC file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large ORC files whose columns would exceed the size that can be stored in cudf columns. When the file content is read in chunks using this class, the size of each chunk stays within the given limit, subject to the soft-limit behavior described for the constructors.

Definition at line 421 of file orc.hpp.

Constructor & Destructor Documentation

◆ chunked_orc_reader() [1/4]

cudf::io::chunked_orc_reader::chunked_orc_reader ( )

Default constructor; this should never be used.

It exists only to satisfy Cython.

◆ chunked_orc_reader() [2/4]

cudf::io::chunked_orc_reader::chunked_orc_reader ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
size_type  output_row_granularity,
orc_reader_options const &  options,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)
explicit

Construct the reader from input/output size limits, output row granularity, along with other ORC reader options.

The typical usage should be similar to this:

do {
  auto const chunk = reader.read_chunk();
  // Process chunk
} while (reader.has_next());
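The loop above can be wrapped in a small generic helper. The following is a minimal sketch, not part of cudf: `drain` and `mock_chunked_reader` are hypothetical names introduced here for illustration, and the mock represents a chunk as a plain row count so that the example compiles without cudf or a GPU. Only the `read_chunk()` / `has_next()` shape mirrors the documented interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a chunked reader (not a cudf type).
// A "chunk" is modeled as the number of rows it contains.
class mock_chunked_reader {
 public:
  mock_chunked_reader(std::size_t total_rows, std::size_t rows_per_chunk)
    : remaining_{total_rows}, rows_per_chunk_{rows_per_chunk}
  {
    assert(rows_per_chunk > 0);
  }

  bool has_next() const { return remaining_ > 0; }

  // Returns the row count of the next chunk, or 0 once nothing is left.
  std::size_t read_chunk()
  {
    auto const n = remaining_ < rows_per_chunk_ ? remaining_ : rows_per_chunk_;
    remaining_ -= n;
    return n;
  }

 private:
  std::size_t remaining_;
  std::size_t rows_per_chunk_;
};

// Drains any reader exposing read_chunk()/has_next() with the do/while
// pattern and returns the chunks in read order.
template <typename Reader>
auto drain(Reader& reader)
{
  std::vector<decltype(reader.read_chunk())> chunks;
  do {
    chunks.push_back(reader.read_chunk());
  } while (reader.has_next());
  return chunks;
}
```

Because the loop is a do/while, `read_chunk()` is called at least once, so draining an empty source yields a single empty chunk rather than no chunks at all.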

If chunk_read_limit == 0 (i.e., no output limit) and pass_read_limit == 0 (no temporary memory size limit), a call to read_chunk() will read the whole data source and return a table containing all rows.

The chunk_read_limit parameter controls the size of the output table returned per read_chunk() call. If the user specifies a 100 MB limit, the reader will attempt to return tables whose total size in bytes (over all columns) is 100 MB or less. This is a soft limit, and the code will not fail if it cannot satisfy the limit.

The pass_read_limit parameter controls how much temporary memory is used during the entire process of loading, decompressing and decoding the data. This is also a soft limit, which the reader satisfies on a best-effort basis.

Finally, the output_row_granularity parameter controls the step by which the number of rows in an output chunk can change. For each call to read_chunk(), subject to the given pass_read_limit, a subset of stripes may be loaded, decompressed and decoded into an intermediate table. The reader then subdivides that table into smaller tables for final output, using output_row_granularity as the subdivision step.
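To make the interaction between chunk_read_limit and output_row_granularity concrete, here is a simplified model of the subdivision step. It is a sketch under stated assumptions, not cudf's implementation: `split_rows` and `bytes_per_row` are hypothetical, every row is assumed to have the same size, and `std::invalid_argument` stands in for `cudf::logic_error` so the example has no cudf dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simplified model (assumption: uniform row size, so a chunk occupies
// rows * bytes_per_row bytes). Returns the row count of each output chunk.
std::vector<std::size_t> split_rows(std::size_t total_rows,
                                    std::size_t bytes_per_row,
                                    std::size_t chunk_read_limit,
                                    int output_row_granularity)
{
  // Stand-in for the documented cudf::logic_error on non-positive granularity.
  if (output_row_granularity <= 0) {
    throw std::invalid_argument("output_row_granularity must be positive");
  }
  auto const step = static_cast<std::size_t>(output_row_granularity);

  std::vector<std::size_t> chunks;
  std::size_t remaining = total_rows;
  while (remaining > 0) {
    // Always take at least one step: the limit is soft, so a single step
    // that already exceeds it is still emitted instead of failing.
    std::size_t rows = step < remaining ? step : remaining;
    // Grow by whole steps while the enlarged chunk still fits the limit.
    // A limit of 0 means "no limit", so the chunk grows to all rows.
    while (rows < remaining) {
      std::size_t const next = (rows + step < remaining) ? rows + step : remaining;
      if (chunk_read_limit != 0 && next * bytes_per_row > chunk_read_limit) { break; }
      rows = next;
    }
    chunks.push_back(rows);
    remaining -= rows;
  }
  return chunks;
}
```

In this model, 1000 rows of 100 bytes each with a 25000-byte limit and a granularity of 100 produce five chunks of 200 rows: 200 rows fit under the limit, while 300 rows would not. Real ORC data has variable-width rows, so actual chunk boundaries will differ.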

Parameters
    chunk_read_limit        Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
    pass_read_limit         Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
    output_row_granularity  The granularity parameter used for subdividing the decoded table for final output
    options                 Settings for controlling reading behaviors
    stream                  CUDA stream used for device memory operations and kernel launches
    mr                      Device memory resource to use for device memory allocation
Exceptions
    cudf::logic_error       if output_row_granularity is non-positive

◆ chunked_orc_reader() [3/4]

cudf::io::chunked_orc_reader::chunked_orc_reader ( std::size_t  chunk_read_limit,
std::size_t  pass_read_limit,
orc_reader_options const &  options,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)
explicit

Construct the reader from input/output size limits along with other ORC reader options.

This constructor implicitly calls the constructor that takes output_row_granularity, with output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters
    chunk_read_limit        Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
    pass_read_limit         Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
    options                 Settings for controlling reading behaviors
    stream                  CUDA stream used for device memory operations and kernel launches
    mr                      Device memory resource to use for device memory allocation

◆ chunked_orc_reader() [4/4]

cudf::io::chunked_orc_reader::chunked_orc_reader ( std::size_t  chunk_read_limit,
orc_reader_options const &  options,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)
explicit

Construct the reader from output size limits along with other ORC reader options.

This constructor implicitly calls the constructor that takes all three limits, with pass_read_limit set to 0 and output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters
    chunk_read_limit        Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
    options                 Settings for controlling reading behaviors
    stream                  CUDA stream used for device memory operations and kernel launches
    mr                      Device memory resource to use for device memory allocation

Member Function Documentation

◆ has_next()

bool cudf::io::chunked_orc_reader::has_next ( ) const

Check whether any data in the given data sources has not yet been read.

Returns
A boolean value indicating if there is any data left to read

◆ read_chunk()

table_with_metadata cudf::io::chunked_orc_reader::read_chunk ( ) const

Read a chunk of rows from the given data sources.

The returned tables, concatenated in the order they were read, are guaranteed to form the same complete dataset as reading the entire given data sources at once.

An empty table is returned if the given sources are empty, or if all the data has already been read and returned by previous calls.

Returns
An output cudf::table along with its metadata
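The concatenation guarantee and the empty-table behavior can be checked with a small model. This is a sketch only: `mock_reader`, `rows_t` and `read_all` are hypothetical names, and a vector of integers stands in for a table so that no cudf types are needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using rows_t = std::vector<int>;  // hypothetical stand-in for a table

// Hypothetical reader over an in-memory source (not a cudf type).
class mock_reader {
 public:
  mock_reader(rows_t source, std::size_t rows_per_chunk)
    : source_{std::move(source)}, rows_per_chunk_{rows_per_chunk}
  {
    assert(rows_per_chunk_ > 0);
  }

  bool has_next() const { return pos_ < source_.size(); }

  // Returns the next slice of rows, or an empty chunk once exhausted.
  rows_t read_chunk()
  {
    auto const n     = std::min(rows_per_chunk_, source_.size() - pos_);
    auto const first = source_.begin() + static_cast<std::ptrdiff_t>(pos_);
    rows_t chunk(first, first + static_cast<std::ptrdiff_t>(n));
    pos_ += n;
    return chunk;
  }

 private:
  rows_t source_;
  std::size_t rows_per_chunk_;
  std::size_t pos_{0};
};

// Appends every chunk, in read order, into a single result.
rows_t read_all(mock_reader& reader)
{
  rows_t all;
  do {
    auto const chunk = reader.read_chunk();
    all.insert(all.end(), chunk.begin(), chunk.end());
  } while (reader.has_next());
  return all;
}
```

With the real reader, the equivalent check would compare the concatenated chunk tables against a single full read of the same sources.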
