The chunked ORC reader class, which reads an ORC file iteratively into a series of tables, chunk by chunk.

#include <orc.hpp>

Public Member Functions
	chunked_orc_reader ()=default
	Default constructor; this should never be used.

	chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, size_type output_row_granularity, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
	Construct the reader from input/output size limits, output row granularity, along with other ORC reader options.

	chunked_orc_reader (std::size_t chunk_read_limit, std::size_t pass_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
	Construct the reader from input/output size limits along with other ORC reader options.

	chunked_orc_reader (std::size_t chunk_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
	Construct the reader from output size limits along with other ORC reader options.

	~chunked_orc_reader ()
	Destructor, destroying the internal reader instance.

bool	has_next () const
	Check whether any data in the given data sources has not yet been read.

table_with_metadata	read_chunk () const
	Read a chunk of rows from the given data sources.

Detailed Description

The chunked ORC reader class, which reads an ORC file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large ORC files whose column sizes exceed the limit that can be stored in cudf columns. When the file content is read in chunks using this class, the reader attempts to keep the size of each chunk within the given limit. The limits are soft, as described for the constructors.

Definition at line 418 of file orc.hpp.

Constructor & Destructor Documentation

◆ chunked_orc_reader() [1/4]

cudf::io::chunked_orc_reader::chunked_orc_reader ( )

default

Default constructor; this should never be used.

It exists only to satisfy Cython.

◆ chunked_orc_reader() [2/4]

cudf::io::chunked_orc_reader::chunked_orc_reader	(	std::size_t	chunk_read_limit,
		std::size_t	pass_read_limit,
		size_type	output_row_granularity,
		orc_reader_options const &	options,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `rmm::mr::get_current_device_resource()`
	)

explicit

Construct the reader from input/output size limits, output row granularity, along with other ORC reader options.

The typical usage should be similar to this:

do {
  auto const chunk = reader.read_chunk();
  // Process chunk
} while (reader.has_next());

If chunk_read_limit == 0 (i.e., no output limit) and pass_read_limit == 0 (no temporary memory size limit), a call to read_chunk() will read the whole data source and return a table containing all rows.

The chunk_read_limit parameter controls the size of the output table returned per read_chunk() call. If the user specifies a 100 MB limit, the reader will attempt to return tables whose total size in bytes (over all columns) is 100 MB or less. This is a soft limit: the code will not fail if it cannot satisfy the limit.

The pass_read_limit parameter controls how much temporary memory is used in the entire process of loading, decompressing and decoding the data. This is also a soft limit; the reader makes a best effort to stay within it.

Finally, the output_row_granularity parameter controls the step by which the row count of an output chunk can change. For each call to read_chunk(), subject to the given pass_read_limit, a subset of stripes may be loaded, decompressed and decoded into an intermediate table. The reader then subdivides that table into smaller tables for final output, using output_row_granularity as the subdivision step.
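The subdivision step can be pictured with a small standalone model. The sketch below is not the cudf implementation: split_rows and row_range are hypothetical names, and it assumes every row has the same size in bytes, which real ORC data does not. It groups the decoded rows into segments of output_row_granularity rows and greedily packs whole segments into an output chunk while the chunk stays within chunk_read_limit. It always emits at least one segment per chunk, which mirrors the soft-limit behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of cudf: a half-open row range [begin, end).
struct row_range {
  std::size_t begin;
  std::size_t end;
};

// Simplified model of subdividing an intermediate table of `num_rows` rows.
// Assumption: every row occupies `bytes_per_row` bytes.
std::vector<row_range> split_rows(std::size_t num_rows,
                                  std::size_t bytes_per_row,
                                  std::size_t chunk_read_limit,
                                  std::size_t granularity)
{
  assert(granularity > 0);
  std::vector<row_range> out;

  // A limit of 0 means "no limit": return everything as a single range.
  if (chunk_read_limit == 0) {
    if (num_rows > 0) { out.push_back({0, num_rows}); }
    return out;
  }

  std::size_t begin = 0;
  while (begin < num_rows) {
    // Always take at least one segment, even if it alone exceeds the limit.
    std::size_t end = std::min(begin + granularity, num_rows);

    // Keep adding whole segments while the chunk stays within the limit.
    while (end < num_rows) {
      std::size_t const next = std::min(end + granularity, num_rows);
      if ((next - begin) * bytes_per_row > chunk_read_limit) { break; }
      end = next;
    }

    out.push_back({begin, end});
    begin = end;
  }
  return out;
}
```

In this model, split_rows(25, 10, 100, 4) yields the ranges [0, 8), [8, 16) and [16, 25): chunk boundaries move in steps of 4 rows, and only the final chunk picks up the leftover row. With split_rows(10, 50, 100, 4) a single segment is already 200 bytes, so every chunk exceeds the limit and the function still returns normally.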

Parameters

chunk_read_limit	Limit on total number of bytes to be returned per `read_chunk()` call, or `0` if there is no limit
pass_read_limit	Limit on temporary memory usage for reading the data sources, or `0` if there is no limit
output_row_granularity	The granularity parameter used for subdividing the decoded table for final output
options	Settings for controlling reading behaviors
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource to use for device memory allocation

Exceptions

cudf::logic_error if output_row_granularity is non-positive

◆ chunked_orc_reader() [3/4]

cudf::io::chunked_orc_reader::chunked_orc_reader	(	std::size_t	chunk_read_limit,
		std::size_t	pass_read_limit,
		orc_reader_options const &	options,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `rmm::mr::get_current_device_resource()`
	)

explicit

Construct the reader from input/output size limits along with other ORC reader options.

This constructor implicitly calls the other constructor with output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters

chunk_read_limit	Limit on total number of bytes to be returned per `read_chunk()` call, or `0` if there is no limit
pass_read_limit	Limit on temporary memory usage for reading the data sources, or `0` if there is no limit
options	Settings for controlling reading behaviors
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource to use for device memory allocation

◆ chunked_orc_reader() [4/4]

cudf::io::chunked_orc_reader::chunked_orc_reader	(	std::size_t	chunk_read_limit,
		orc_reader_options const &	options,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `rmm::mr::get_current_device_resource()`
	)

explicit

Construct the reader from output size limits along with other ORC reader options.

This constructor implicitly calls the other constructor with pass_read_limit set to 0 and output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters

chunk_read_limit	Limit on total number of bytes to be returned per `read_chunk()` call, or `0` if there is no limit
options	Settings for controlling reading behaviors
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource to use for device memory allocation
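The relationship between the three size-limit constructors can be sketched with C++ delegating constructors. The sketch below is one possible arrangement rather than the library's actual code. reader_limits, rejects_granularity and kDefaultOutputRowGranularity are hypothetical names, the default value shown is a placeholder (the real constant lives in orc.hpp), and std::logic_error stands in for cudf::logic_error.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Placeholder value for illustration only; see orc.hpp for the real constant.
constexpr int kDefaultOutputRowGranularity = 10000;

// Hypothetical settings holder, not part of cudf.
struct reader_limits {
  std::size_t chunk_read_limit;
  std::size_t pass_read_limit;
  int output_row_granularity;

  // Full form: stores all three values and rejects a non-positive granularity.
  reader_limits(std::size_t chunk_limit, std::size_t pass_limit, int granularity)
    : chunk_read_limit(chunk_limit),
      pass_read_limit(pass_limit),
      output_row_granularity(granularity)
  {
    if (granularity <= 0) {
      throw std::logic_error("output_row_granularity must be positive");
    }
  }

  // Two-limit form: delegates with the default granularity.
  reader_limits(std::size_t chunk_limit, std::size_t pass_limit)
    : reader_limits(chunk_limit, pass_limit, kDefaultOutputRowGranularity)
  {
  }

  // Output-limit-only form: no pass limit (0) and the default granularity.
  explicit reader_limits(std::size_t chunk_limit)
    : reader_limits(chunk_limit, std::size_t{0})
  {
  }
};

// Returns true if constructing with `granularity` throws std::logic_error.
bool rejects_granularity(int granularity)
{
  try {
    reader_limits const limits(0, 0, granularity);
    (void)limits;
  } catch (std::logic_error const&) {
    return true;
  }
  return false;
}
```

Routing every overload through the full form keeps the granularity check in one place, so the shorter overloads cannot bypass it.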

Member Function Documentation

◆ has_next()

bool cudf::io::chunked_orc_reader::has_next ( ) const

Check whether any data in the given data sources has not yet been read.

Returns: A boolean value indicating if there is any data left to read

◆ read_chunk()

table_with_metadata cudf::io::chunked_orc_reader::read_chunk ( ) const

Read a chunk of rows from the given data sources.

The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading all of the given data sources at once.

An empty table will be returned if the given sources are empty, or if all the data has already been read and returned by previous calls.

Returns: An output cudf::table along with its metadata
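The contract between has_next() and read_chunk() can be illustrated with a toy stand-in that needs no GPU. Everything in the sketch below is hypothetical: mock_chunked_reader models the data source as a vector of integers and a table as a vector of rows, and its read_chunk() is non-const, unlike the real member function. drain applies the do/while loop shown in the constructor documentation and concatenates the chunks in order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reader; not the cudf implementation.
class mock_chunked_reader {
 public:
  mock_chunked_reader(std::vector<int> source, std::size_t rows_per_chunk)
    : source_(std::move(source)), rows_per_chunk_(rows_per_chunk)
  {
    assert(rows_per_chunk_ > 0);
  }

  // True while some rows have not been returned yet.
  bool has_next() const { return pos_ < source_.size(); }

  // Return the next chunk, or an empty table when nothing is left.
  std::vector<int> read_chunk()
  {
    std::size_t const end = std::min(pos_ + rows_per_chunk_, source_.size());
    std::vector<int> chunk(source_.begin() + static_cast<std::ptrdiff_t>(pos_),
                           source_.begin() + static_cast<std::ptrdiff_t>(end));
    pos_ = end;
    return chunk;
  }

 private:
  std::vector<int> source_;
  std::size_t rows_per_chunk_;
  std::size_t pos_{0};
};

// Result of draining a reader: the concatenated rows and the number of
// read_chunk() calls that were made.
struct drain_result {
  std::vector<int> rows;
  std::size_t calls;
};

// Read everything with a do/while loop, concatenating chunks in order.
drain_result drain(std::vector<int> source, std::size_t rows_per_chunk)
{
  mock_chunked_reader reader(std::move(source), rows_per_chunk);
  drain_result result{{}, 0};
  do {
    auto const chunk = reader.read_chunk();
    ++result.calls;
    result.rows.insert(result.rows.end(), chunk.begin(), chunk.end());
  } while (reader.has_next());
  return result;
}
```

In this model, draining five rows two at a time takes three calls and reproduces the source exactly. Draining an empty source makes exactly one call, which returns an empty table; a do/while loop therefore still yields one (empty) result for empty sources, whereas a plain while loop on has_next() would yield none.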

The documentation for this class was generated from the following file:

orc.hpp
