Io Readers#

group Readers

Enums

enum class use_data_page_mask : bool#

Whether to compute and use a page mask using the row mask to skip decompression and decoding of the masked pages.

Values:

enumerator YES#: Compute and use a data page mask.

enumerator NO#: Do not compute or use a data page mask.

enum class json_recovery_mode_t#

Control the error recovery behavior of the json parser.

Values:

enumerator FAIL#: Does not recover from an error when encountering an invalid format.

enumerator RECOVER_WITH_NULL#: Recovers from an error, replacing invalid records with null.
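
To make the difference between the two recovery modes concrete, the following is a minimal host-side sketch. It is not the cudf JSON parser: `recovery_mode` and `parse_records` are hypothetical names, and each "record" is just a string that must parse as an integer. In FAIL mode the first invalid record aborts the parse; in RECOVER_WITH_NULL mode the invalid record becomes a null entry and parsing continues.

```cpp
#include <charconv>
#include <optional>
#include <stdexcept>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical stand-in for json_recovery_mode_t.
enum class recovery_mode { FAIL, RECOVER_WITH_NULL };

// Toy parser: every record must be a complete decimal integer.
// An empty optional plays the role of a null row.
std::vector<std::optional<int>> parse_records(std::vector<std::string> const& records,
                                              recovery_mode mode)
{
  std::vector<std::optional<int>> out;
  for (auto const& rec : records) {
    int value         = 0;
    auto const* first = rec.data();
    auto const* last  = first + rec.size();
    auto const [ptr, ec] = std::from_chars(first, last, value);
    bool const valid     = (ec == std::errc{}) && (ptr == last);
    if (valid) {
      out.emplace_back(value);
    } else if (mode == recovery_mode::RECOVER_WITH_NULL) {
      out.emplace_back(std::nullopt);  // replace the invalid record with null
    } else {
      throw std::invalid_argument("invalid record: " + rec);  // FAIL: no recovery
    }
  }
  return out;
}
```

The enum is strongly typed, so a caller has to name the mode explicitly rather than pass a bare boolean.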

Functions

table_with_metadata read_avro(avro_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Reads an Avro dataset into a set of columns.

The following code snippet demonstrates how to read a dataset from a file:

auto source  = cudf::io::source_info("dataset.avro");
auto options = cudf::io::avro_reader_options::builder(source);
auto result  = cudf::io::read_avro(options);

Parameters:

options – Settings for controlling reading behavior
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the table in the returned table_with_metadata

Returns:

The set of columns along with metadata

table_with_metadata read_csv(csv_reader_options options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Reads a CSV dataset into a set of columns.

The following code snippet demonstrates how to read a dataset from a file:

auto source  = cudf::io::source_info("dataset.csv");
auto options = cudf::io::csv_reader_options::builder(source);
auto result  = cudf::io::read_csv(options);

Parameters:

options – Settings for controlling reading behavior
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the table in the returned table_with_metadata

Returns:

The set of columns along with metadata

table_with_metadata read_parquet(parquet_reader_options const &options, cudf::host_span<cuda::std::byte const> serialized_roaring64, cudf::host_span<size_t const> row_group_offsets, cudf::host_span<size_type const> row_group_num_rows, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource_ref())#

Reads a table from parquet source, prepends an index column to it, deserializes the roaring64 deletion vector and applies it to the read table.

Reads a table from a Parquet source, builds a row index column from the specified row group offsets and row counts, and prepends it to the table. It then deserializes the specified roaring64 deletion vector and applies it to the table. If the row group offsets and row counts are empty, the index column is simply a UINT64 sequence from 0 to the total number of rows in the table. If the serialized roaring64 bitmap span is empty, the table (with the prepended index column) is returned as is.

Parameters:

options – Parquet reader options
serialized_roaring64 – Host span of portable serialized 64-bit roaring bitmap
row_group_offsets – Host span of row index offsets for each row group
row_group_num_rows – Host span of number of rows in each row group
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the returned table

Returns:

Read table with a prepended index column filtered using the deletion vector, along with its metadata
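
The index-then-filter behavior described above can be modeled on the host as follows. This is only a sketch under stated assumptions: `build_row_index` and `apply_deletion_vector` are hypothetical helpers, a `std::set` stands in for the deserialized roaring64 bitmap, and only the index column is modeled (the real function filters every column of the table by the same rows).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <set>
#include <vector>

// Builds the row index: for each row group, offset + 0, offset + 1, ...
// If both inputs are empty, falls back to the sequence 0 .. total_rows - 1.
std::vector<std::uint64_t> build_row_index(std::vector<std::size_t> const& row_group_offsets,
                                           std::vector<std::int32_t> const& row_group_num_rows,
                                           std::size_t total_rows)
{
  std::vector<std::uint64_t> index;
  if (row_group_offsets.empty() && row_group_num_rows.empty()) {
    index.resize(total_rows);
    std::iota(index.begin(), index.end(), std::uint64_t{0});
    return index;
  }
  assert(row_group_offsets.size() == row_group_num_rows.size());
  for (std::size_t g = 0; g < row_group_offsets.size(); ++g) {
    for (std::int32_t r = 0; r < row_group_num_rows[g]; ++r) {
      index.push_back(static_cast<std::uint64_t>(row_group_offsets[g]) +
                      static_cast<std::uint64_t>(r));
    }
  }
  return index;
}

// Keeps only the index values that are not marked as deleted.
// An empty deletion set returns the index unchanged.
std::vector<std::uint64_t> apply_deletion_vector(std::vector<std::uint64_t> const& index,
                                                 std::set<std::uint64_t> const& deleted)
{
  std::vector<std::uint64_t> kept;
  for (auto const row : index) {
    if (deleted.count(row) == 0) { kept.push_back(row); }
  }
  return kept;
}
```

For example, two row groups at offsets 0 and 100 with 3 and 2 rows give the index 0, 1, 2, 100, 101; deleting rows 1 and 100 leaves 0, 2, 101.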

table_with_metadata read_json(json_reader_options options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Reads a JSON dataset into a set of columns.

The following code snippet demonstrates how to read a dataset from a file:

auto source  = cudf::io::source_info("dataset.json");
auto options = cudf::io::json_reader_options::builder(source);
auto result  = cudf::io::read_json(options);

Parameters:

options – Settings for controlling reading behavior
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the table in the returned table_with_metadata.

Returns:

The set of columns along with metadata

bool is_supported_read_orc(compression_type compression)#

Check if the compression type is supported for reading ORC files.

Note

This is a runtime check. Some compression types may not be supported because of the current system configuration.

Parameters:: compression – Compression type
Returns:: Boolean indicating if the compression type is supported

bool is_supported_write_orc(compression_type compression)#

Check if the compression type is supported for writing ORC files.

Note

This is a runtime check. Some compression types may not be supported because of the current system configuration.

Parameters:: compression – Compression type
Returns:: Boolean indicating if the compression type is supported

table_with_metadata read_orc(orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Reads an ORC dataset into a set of columns.

The following code snippet demonstrates how to read a dataset from a file:

auto source  = cudf::io::source_info("dataset.orc");
auto options = cudf::io::orc_reader_options::builder(source);
auto result  = cudf::io::read_orc(options);

Parameters:

options – Settings for controlling reading behavior
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the table in the returned table_with_metadata.

Returns:

The set of columns along with metadata

raw_orc_statistics read_raw_orc_statistics(source_info const &src_info, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Reads file-level and stripe-level statistics of an ORC dataset.

The following code snippet demonstrates how to read statistics of a dataset from a file:

auto result = cudf::io::read_raw_orc_statistics(cudf::io::source_info("dataset.orc"));

Parameters:

src_info – Dataset source
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Column names and encoded ORC statistics

parsed_orc_statistics read_parsed_orc_statistics(source_info const &src_info, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Reads file-level and stripe-level statistics of an ORC dataset.

Parameters:

src_info – Dataset source
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Column names and decoded ORC statistics

orc_metadata read_orc_metadata(source_info const &src_info, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Reads metadata of an ORC dataset.

Parameters:

src_info – Dataset source
stream – CUDA stream used for device memory operations and kernel launches

Returns:

orc_metadata with ORC schema, number of rows and number of stripes.

bool is_supported_read_parquet(compression_type compression)#

Check if the compression type is supported for reading Parquet files.

Note

This is a runtime check. Some compression types may not be supported because of the current system configuration.

Parameters:: compression – Compression type
Returns:: Boolean indicating if the compression type is supported

bool is_supported_write_parquet(compression_type compression)#

Check if the compression type is supported for writing Parquet files.

Note

This is a runtime check. Some compression types may not be supported because of the current system configuration.

Parameters:: compression – Compression type
Returns:: Boolean indicating if the compression type is supported

table_with_metadata read_parquet(parquet_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Reads a Parquet dataset into a set of columns.

The following code snippet demonstrates how to read a dataset from a file:

auto source  = cudf::io::source_info("dataset.parquet");
auto options = cudf::io::parquet_reader_options::builder(source);
auto result  = cudf::io::read_parquet(options);

Parameters:

options – Settings for controlling reading behavior
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate device memory of the table in the returned table_with_metadata

Returns:

The set of columns along with metadata

parquet_metadata read_parquet_metadata(source_info const &src_info)#

Reads metadata of a Parquet dataset.

Parameters:: src_info – Dataset source
Returns:: parquet_metadata with parquet schema, number of rows, number of row groups and key-value metadata

std::vector<byte_range_info> create_byte_range_infos_consecutive(int64_t total_bytes, int64_t range_count)#

Create a collection of consecutive ranges between [0, total_bytes).

Each range will be the same size, unless total_bytes is not evenly divisible by range_count, in which case the last range holds the remainder.

Parameters:

total_bytes – total number of bytes in all ranges
range_count – total number of ranges in which to divide bytes

Returns:

Vector of range objects
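
The splitting rule can be sketched in plain C++ as below. This is a model, not the library code: `range_model` and `consecutive_ranges` are hypothetical names, and the sketch assumes the common range size is total_bytes divided by range_count, rounded up, which the text above does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for byte_range_info: a half-open range [offset, offset + size).
struct range_model {
  std::int64_t offset;
  std::int64_t size;
};

// Splits [0, total_bytes) into range_count consecutive ranges.
// Assumption: the common size is the ceiling of total_bytes / range_count,
// so only the trailing range(s) can be shorter.
std::vector<range_model> consecutive_ranges(std::int64_t total_bytes, std::int64_t range_count)
{
  assert(total_bytes >= 0 && range_count > 0);
  std::int64_t const range_size = (total_bytes + range_count - 1) / range_count;

  std::vector<range_model> ranges;
  ranges.reserve(static_cast<std::size_t>(range_count));
  for (std::int64_t i = 0; i < range_count; ++i) {
    // Clamp so that ranges never extend past total_bytes.
    std::int64_t const offset = std::min(i * range_size, total_bytes);
    std::int64_t const size   = std::min(range_size, total_bytes - offset);
    ranges.push_back({offset, size});
  }
  return ranges;
}
```

With total_bytes = 10 and range_count = 3, this model yields sizes 4, 4 and 2 at offsets 0, 4 and 8.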

byte_range_info create_byte_range_info_max()#

Create a byte_range_info which represents as much of a file as possible. Specifically, [0, numeric_limits<int64_t>::max()).

Returns:: Byte range info of size [0, numeric_limits<int64_t>::max())

std::unique_ptr<cudf::column> multibyte_split(data_chunk_source const &source, std::string_view delimiter, parse_options options = {}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Splits the source text into a strings column using a multiple byte delimiter.

Providing a byte range allows multibyte_split to read a file partially, only returning the offsets of delimiters which begin within the range. If thinking in terms of “records”, where each delimiter dictates the end of a record, all records which begin within the byte range provided will be returned, including any record which may begin in the range but end outside of the range. Records which begin outside of the range will be ignored, even if those records end inside the range.

Examples:
 source:     "abc..def..ghi..jkl.."
 delimiter:  ".."

 byte_range: nullopt
 return:     ["abc..", "def..", "ghi..", "jkl..", ""]

 byte_range: [0, 2)
 return:     ["abc.."]

 byte_range: [2, 9)
 return:     ["def..", "ghi.."]

 byte_range: [11, 2)
 return:     []

 byte_range: [13, 7)
 return:     ["jkl..", ""]

Parameters:

source – The source string
delimiter – UTF-8 encoded string for which to find offsets in the source
options – the parsing options to use (including byte range)
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to use for the device memory allocation

Returns:

The strings found by splitting the source by the delimiter within the relevant byte range.
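
The byte-range rule is easier to check against a small host-side model. The sketch below is not the cudf implementation: `split_model` is a hypothetical function, and it reads each byte range in the examples above as an (offset, size) pair. Under that reading, the first record is kept only when the range starts at offset 0, and every later record is kept when the delimiter that precedes it begins inside the range. This reproduces the five example results, but behavior for other inputs is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Host-side model of the record selection rule; byte_range is (offset, size).
std::vector<std::string> split_model(
  std::string_view source,
  std::string_view delimiter,
  std::optional<std::pair<std::size_t, std::size_t>> byte_range = std::nullopt)
{
  assert(!delimiter.empty());
  std::size_t const range_begin = byte_range ? byte_range->first : 0;
  std::size_t const range_end =
    byte_range ? range_begin + byte_range->second : std::string_view::npos;

  std::vector<std::string> rows;
  std::size_t row_start = 0;
  bool keep             = (range_begin == 0);  // first record needs the range to start at 0
  while (true) {
    std::size_t const delim_pos = source.find(delimiter, row_start);
    if (delim_pos == std::string_view::npos) {
      // Trailing record (possibly empty) has no closing delimiter.
      if (keep) { rows.emplace_back(source.substr(row_start)); }
      break;
    }
    std::size_t const row_end = delim_pos + delimiter.size();
    if (keep) { rows.emplace_back(source.substr(row_start, row_end - row_start)); }
    // The next record is kept if this delimiter begins inside the range.
    keep      = (delim_pos >= range_begin) && (delim_pos < range_end);
    row_start = row_end;
  }
  return rows;
}
```

Adjacent byte ranges that tile the whole source therefore return each record exactly once, which is what makes partial reads composable.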

Variables

constexpr size_t default_stripe_size_bytes = 64 * 1024 * 1024#: 64MB default orc stripe size

constexpr size_type default_stripe_size_rows = 1000000#: 1M rows default orc stripe rows

constexpr size_type default_row_index_stride = 10000#: 10K rows default orc row index stride

constexpr size_t default_row_group_size_bytes = std::numeric_limits<size_t>::max()#: Infinite bytes per row group.

constexpr size_type default_row_group_size_rows = 1'000'000#: 1 million rows per row group

constexpr size_t default_max_page_size_bytes = 512 * 1024#: 512KB per page

constexpr size_type default_max_page_size_rows = 20000#: 20k rows per page

constexpr int32_t default_column_index_truncate_length = 64#: truncate to 64 bytes

constexpr size_t default_max_dictionary_size = 1024 * 1024#: 1MB dictionary size

constexpr size_type default_max_page_fragment_size = 5000#: 5000 rows per page fragment

class avro_reader_options#

#include <avro.hpp>

Settings to use for read_avro().

Public Functions

avro_reader_options() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on stack.

inline source_info const &get_source() const#

Returns source info.

Returns:: Source info

inline std::vector<std::string> get_columns() const#

Returns names of the columns to be read.

Returns:: Names of the columns to be read

inline size_type get_skip_rows() const#

Returns number of rows to skip from the start.

Returns:: Number of rows to skip from the start

inline size_type get_num_rows() const#

Returns number of rows to read.

Returns:: Number of rows to read

inline void set_source(source_info src)#

Sets source info.

Parameters:: src – The source info.

inline void set_columns(std::vector<std::string> col_names)#

Sets names of the columns to be read.

Parameters:: col_names – Vector of column names

inline void set_skip_rows(size_type val)#

Sets number of rows to skip.

Parameters:: val – Number of rows to skip from start

inline void set_num_rows(size_type val)#

Sets number of rows to read.

Parameters:: val – Number of rows to read after skip

Public Static Functions

static avro_reader_options_builder builder(source_info src)#

Creates an avro_reader_options_builder which will build avro_reader_options.

Parameters:: src – Source information used to read the Avro file
Returns:: Builder to build reader options

class avro_reader_options_builder#

#include <avro.hpp>

Builder to build options for read_avro().

Public Functions

avro_reader_options_builder() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on stack.

inline explicit avro_reader_options_builder(source_info src)#

Constructor from source info.

Parameters:: src – The source information used to read avro file

inline avro_reader_options_builder &columns(std::vector<std::string> col_names)#

Sets names of the columns to be read.

Parameters:: col_names – Vector of column names
Returns:: this for chaining

inline avro_reader_options_builder &skip_rows(size_type val)#

Sets number of rows to skip.

Parameters:: val – Number of rows to skip from start
Returns:: this for chaining

inline avro_reader_options_builder &num_rows(size_type val)#

Sets number of rows to read.

Parameters:: val – Number of rows to read after skip
Returns:: this for chaining

inline operator avro_reader_options&&()#: Moves the avro_reader_options member once it’s built.

inline avro_reader_options &&build()#

Moves the avro_reader_options member once it’s built.

This has been added since Cython does not support overloading of conversion operators.

Returns:: Built avro_reader_options object’s r-value reference
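
The builder classes in this group all share one shape: chained setters that return the builder by reference, plus a conversion operator and a build() method that move the finished options out. The sketch below reproduces that shape with hypothetical stand-in types (`demo_options`, `demo_options_builder`) so it compiles without cudf; the default values are assumptions chosen for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

class demo_options_builder;

// Hypothetical stand-in for a *_reader_options class.
class demo_options {
  std::vector<std::string> _columns;
  int _skip_rows = 0;   // assumed default
  int _num_rows  = -1;  // assumed default meaning "read all rows"
  friend class demo_options_builder;

 public:
  demo_options() = default;
  std::vector<std::string> const& get_columns() const { return _columns; }
  int get_skip_rows() const { return _skip_rows; }
  int get_num_rows() const { return _num_rows; }
  static demo_options_builder builder();
};

// Hypothetical stand-in for a *_reader_options_builder class.
class demo_options_builder {
  demo_options _options;

 public:
  demo_options_builder() = default;
  // Each setter returns *this so that calls can be chained.
  demo_options_builder& columns(std::vector<std::string> names)
  {
    _options._columns = std::move(names);
    return *this;
  }
  demo_options_builder& skip_rows(int val)
  {
    _options._skip_rows = val;
    return *this;
  }
  demo_options_builder& num_rows(int val)
  {
    _options._num_rows = val;
    return *this;
  }
  // Implicit conversion: lets a builder be used where the options are expected.
  operator demo_options&&() { return std::move(_options); }
  // Named equivalent of the conversion operator.
  demo_options&& build() { return std::move(_options); }
};

inline demo_options_builder demo_options::builder() { return demo_options_builder{}; }
```

Both `demo_options opts = demo_options::builder().skip_rows(1);` and an explicit `.build()` at the end of the chain produce the same options object. Because the member is moved out, a builder should not be reused after conversion.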

class csv_reader_options#

#include <csv.hpp>

Settings to use for read_csv().

Public Functions

csv_reader_options() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on stack.

inline source_info const &get_source() const#

Returns source info.

Returns:: Source info

inline compression_type get_compression() const#

Returns compression format of the source.

Returns:: Compression format of the source

inline std::size_t get_byte_range_offset() const#

Returns number of bytes to skip from source start.

Returns:: Number of bytes to skip from source start

inline std::size_t get_byte_range_size() const#

Returns number of bytes to read.

Returns:: Number of bytes to read

inline std::size_t get_byte_range_size_with_padding() const#

Returns number of bytes to read with padding.

Returns:: Number of bytes to read with padding

inline std::size_t get_byte_range_padding() const#

Returns number of bytes to pad when reading.

Returns:: Number of bytes to pad when reading

inline std::vector<std::string> const &get_names() const#

Returns names of the columns.

Returns:: Names of the columns

inline std::string get_prefix() const#

Returns prefix to be used for column ID.

Returns:: Prefix to be used for column ID

inline bool is_enabled_mangle_dupe_cols() const#

Whether to rename duplicate column names.

Returns:: true if duplicate column names are renamed

inline std::vector<std::string> const &get_use_cols_names() const#

Returns names of the columns to be read.

Returns:: Names of the columns to be read

inline std::vector<int> const &get_use_cols_indexes() const#

Returns indexes of columns to read.

Returns:: Indexes of columns to read

inline size_type get_nrows() const#

Returns number of rows to read.

Returns:: Number of rows to read

inline size_type get_skiprows() const#

Returns number of rows to skip from start.

Returns:: Number of rows to skip from start

inline size_type get_skipfooter() const#

Returns number of rows to skip from end.

Returns:: Number of rows to skip from end

inline size_type get_header() const#

Returns header row index.

Returns:: Header row index

inline char get_lineterminator() const#

Returns line terminator.

Returns:: Line terminator

inline char get_delimiter() const#

Returns field delimiter.

Returns:: Field delimiter

inline char get_thousands() const#

Returns numeric data thousands separator.

Returns:: Numeric data thousands separator

inline char get_decimal() const#

Returns decimal point character.

Returns:: Decimal point character
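
As an illustration of how a thousands separator and a decimal point character interact, here is a small host-side sketch. `parse_number` is a hypothetical helper, not part of cudf, and it assumes the simplest possible rule: thousands separators are dropped and the decimal character is treated as the decimal point.

```cpp
#include <string>

// Toy numeric parser: removes the thousands separator, maps the decimal
// character to '.', then defers to std::stod (default "C" locale assumed).
double parse_number(std::string const& text, char thousands, char decimal)
{
  std::string normalized;
  normalized.reserve(text.size());
  for (char const c : text) {
    if (c == thousands) { continue; }  // drop grouping characters
    normalized.push_back(c == decimal ? '.' : c);
  }
  return std::stod(normalized);
}
```

With thousands = '.' and decimal = ',', the text "1.234,5" parses to the same value as "1,234.5" does with the two characters swapped.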

inline char get_comment() const#

Returns comment line start character.

Returns:: Comment line start character

inline bool is_enabled_windowslinetermination() const#

Whether to treat \r\n as line terminator.

Returns:: true if \r\n is treated as line terminator

inline bool is_enabled_delim_whitespace() const#

Whether to treat whitespace as field delimiter.

Returns:: true if whitespace is treated as field delimiter

inline bool is_enabled_skipinitialspace() const#

Whether to skip whitespace after the delimiter.

Returns:: true if whitespace is skipped after the delimiter

inline bool is_enabled_skip_blank_lines() const#

Whether to ignore empty lines rather than parsing them as rows of invalid values.

Returns:: true if empty lines are ignored

inline quote_style get_quoting() const#

Returns quoting style.

Returns:: Quoting style

inline char get_quotechar() const#

Returns quoting character.

Returns:: Quoting character

inline bool is_enabled_doublequote() const#

Whether a quote inside a value is double-quoted.

Returns:: true if a quote inside a value is double-quoted

inline bool is_enabled_detect_whitespace_around_quotes() const#

Whether to detect quotes surrounded by spaces e.g. "data". This flag has no effect when _doublequote is true.

Returns:: true if detect_whitespace_around_quotes is enabled

inline std::vector<std::string> const &get_parse_dates_names() const#

Returns names of columns to read as datetime.

Returns:: Names of columns to read as datetime

inline std::vector<int> const &get_parse_dates_indexes() const#

Returns indexes of columns to read as datetime.

Returns:: Indexes of columns to read as datetime

inline std::vector<std::string> const &get_parse_hex_names() const#

Returns names of columns to read as hexadecimal.

Returns:: Names of columns to read as hexadecimal

inline std::vector<int> const &get_parse_hex_indexes() const#

Returns indexes of columns to read as hexadecimal.

Returns:: Indexes of columns to read as hexadecimal

inline std::variant<std::vector<data_type>, std::map<std::string, data_type>> const &get_dtypes() const#

Returns per-column types.

Returns:: Per-column types
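
Because get_dtypes() returns a std::variant of a vector and a map, callers have to check which alternative is active before looking up a column's type. The sketch below shows one way to do that with std::get_if. `demo_type`, `demo_dtypes` and `requested_type` are hypothetical names that stand in for data_type and the variant above; the lookup-by-position and lookup-by-name semantics are an assumption based on the two set_dtypes overloads.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for cudf::data_type.
enum class demo_type { INT32, FLOAT64, STRING };

// Same shape as the variant returned by get_dtypes().
using demo_dtypes = std::variant<std::vector<demo_type>, std::map<std::string, demo_type>>;

// Returns the requested type for a column, or nullopt if none was given.
// A vector is indexed by column position, a map by column name.
std::optional<demo_type> requested_type(demo_dtypes const& dtypes,
                                        std::size_t index,
                                        std::string const& name)
{
  if (auto const* by_position = std::get_if<std::vector<demo_type>>(&dtypes)) {
    if (index < by_position->size()) { return (*by_position)[index]; }
    return std::nullopt;
  }
  auto const& by_name = std::get<std::map<std::string, demo_type>>(dtypes);
  auto const it       = by_name.find(name);
  if (it == by_name.end()) { return std::nullopt; }
  return it->second;
}
```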

inline std::vector<std::string> const &get_true_values() const#

Returns additional values to recognize as boolean true values.

Returns:: Additional values to recognize as boolean true values

inline std::vector<std::string> const &get_false_values() const#

Returns additional values to recognize as boolean false values.

Returns:: Additional values to recognize as boolean false values

inline std::vector<std::string> const &get_na_values() const#

Returns additional values to recognize as null values.

Returns:: Additional values to recognize as null values

inline bool is_enabled_keep_default_na() const#

Whether to keep the built-in default NA values.

Returns:: true if the built-in default NA values are kept

inline bool is_enabled_na_filter() const#

Whether the null filter is enabled.

Returns:: true if null filter is enabled

inline bool is_enabled_dayfirst() const#

Whether to parse dates as DD/MM versus MM/DD.

Returns:: True if dates are parsed as DD/MM, false if MM/DD
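
The effect of the dayfirst flag on an ambiguous date can be shown with a tiny host-side sketch. `day_month` and `parse_day_month` are hypothetical and handle only a two-field "NN/NN" string; the actual date formats accepted by the CSV reader are not covered here.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

struct day_month {
  int day;
  int month;
};

// Parses "NN/NN" and assigns the two fields according to dayfirst:
// true  -> DD/MM, false -> MM/DD.
day_month parse_day_month(std::string const& text, bool dayfirst)
{
  int first  = 0;
  int second = 0;
  if (std::sscanf(text.c_str(), "%d/%d", &first, &second) != 2) {
    throw std::invalid_argument("expected NN/NN: " + text);
  }
  return dayfirst ? day_month{first, second} : day_month{second, first};
}
```

The same text "03/04" is read as 3 April when dayfirst is true and as 4 March when it is false.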

inline data_type get_timestamp_type() const#

Returns timestamp_type to which all timestamp columns will be cast.

Returns:: timestamp_type to which all timestamp columns will be cast

inline void set_source(source_info src)#

Sets source info.

Parameters:: src – The source info.

inline void set_compression(compression_type comp)#

Sets compression format of the source.

Parameters:: comp – Compression type

inline void set_byte_range_offset(std::size_t offset)#

Sets number of bytes to skip from source start.

Parameters:: offset – Number of bytes of offset

inline void set_byte_range_size(std::size_t size)#

Sets number of bytes to read.

Parameters:: size – Number of bytes to read

inline void set_names(std::vector<std::string> col_names)#

Sets names of the columns.

Parameters:: col_names – Vector of column names

inline void set_prefix(std::string pfx)#

Sets prefix to be used for column ID.

Parameters:: pfx – String used as prefix for each column name

inline void enable_mangle_dupe_cols(bool val)#

Sets whether to rename duplicate column names.

Parameters:: val – Boolean value to enable/disable

inline void set_use_cols_names(std::vector<std::string> col_names)#

Sets names of the columns to be read.

Parameters:: col_names – Vector of column names that are needed

inline void set_use_cols_indexes(std::vector<int> col_indices)#

Sets indexes of columns to read.

Parameters:: col_indices – Vector of column indices that are needed

inline void set_nrows(size_type nrows)#

Sets number of rows to read.

Parameters:: nrows – Number of rows to read

inline void set_skiprows(size_type skiprows)#

Sets number of rows to skip from start.

Parameters:: skiprows – Number of rows to skip

inline void set_skipfooter(size_type skipfooter)#

Sets number of rows to skip from end.

Parameters:: skipfooter – Number of rows to skip

inline void set_header(size_type hdr)#

Sets header row index.

Parameters:: hdr – Index where header row is located

inline void set_lineterminator(char term)#

Sets line terminator.

Parameters:: term – A character to indicate line termination

inline void set_delimiter(char delim)#

Sets field delimiter.

Parameters:: delim – A character to indicate delimiter

inline void set_thousands(char val)#

Sets numeric data thousands separator.

Parameters:: val – A character that separates thousands

inline void set_decimal(char val)#

Sets decimal point character.

Parameters:: val – A character that indicates decimal values

inline void set_comment(char val)#

Sets comment line start character.

Parameters:: val – A character that indicates comment

inline void enable_windowslinetermination(bool val)#

Sets whether to treat \r\n as line terminator.

Parameters:: val – Boolean value to enable/disable

inline void enable_delim_whitespace(bool val)#

Sets whether to treat whitespace as field delimiter.

Parameters:: val – Boolean value to enable/disable

inline void enable_skipinitialspace(bool val)#

Sets whether to skip whitespace after the delimiter.

Parameters:: val – Boolean value to enable/disable

inline void enable_skip_blank_lines(bool val)#

Sets whether to ignore empty lines rather than parsing them as rows of invalid values.

Parameters:: val – Boolean value to enable/disable

inline void set_quoting(quote_style quoting)#

Sets the expected quoting style used in the input CSV data.

Note: Only the following quoting styles are supported:

MINIMAL: String columns containing special characters such as row delimiters, field delimiters, or quotes will be quoted.
NONE: No quoting is done for any columns.

Parameters:: quoting – Quoting style used

inline void set_quotechar(char ch)#

Sets quoting character.

Parameters:: ch – A character to indicate quoting

inline void enable_doublequote(bool val)#

Sets whether a quote inside a value is double-quoted.

Parameters:: val – Boolean value to enable/disable

inline void enable_detect_whitespace_around_quotes(bool val)#

Sets whether to detect quotes surrounded by spaces e.g. "data". This flag has no effect when _doublequote is true.

Parameters:: val – Boolean value to enable/disable

inline void set_parse_dates(std::vector<std::string> col_names)#

Sets names of columns to read as datetime.

Parameters:: col_names – Vector of column names to infer as datetime

inline void set_parse_dates(std::vector<int> col_indices)#

Sets indexes of columns to read as datetime.

Parameters:: col_indices – Vector of column indices to infer as datetime

inline void set_parse_hex(std::vector<std::string> col_names)#

Sets names of columns to parse as hexadecimal.

Parameters:: col_names – Vector of column names to parse as hexadecimal

inline void set_parse_hex(std::vector<int> col_indices)#

Sets indexes of columns to parse as hexadecimal.

Parameters:: col_indices – Vector of column indices to parse as hexadecimal

inline void set_dtypes(std::map<std::string, data_type> types)#

Sets per-column types.

Parameters:: types – Column name -> data type map specifying the columns’ target data types

inline void set_dtypes(std::vector<data_type> types)#

Sets per-column types.

Parameters:: types – Vector specifying the columns’ target data types

inline void set_true_values(std::vector<std::string> vals)#

Sets additional values to recognize as boolean true values.

Parameters:: vals – Vector of values to be considered to be true

inline void set_false_values(std::vector<std::string> vals)#

Sets additional values to recognize as boolean false values.

Parameters:: vals – Vector of values to be considered to be false

inline void set_na_values(std::vector<std::string> vals)#

Sets additional values to recognize as null values.

Parameters:: vals – Vector of values to be considered to be null

inline void enable_keep_default_na(bool val)#

Sets whether to keep the built-in default NA values.

Parameters:: val – Boolean value to enable/disable

inline void enable_na_filter(bool val)#

Sets whether to enable the null filter.

Parameters:: val – Boolean value to enable/disable

inline void enable_dayfirst(bool val)#

Sets whether to parse dates as DD/MM versus MM/DD.

Parameters:: val – Boolean value to enable/disable

inline void set_timestamp_type(data_type type)#

Sets timestamp_type to which all timestamp columns will be cast.

Parameters:: type – Dtype to which all timestamp columns will be cast

Public Static Functions

static csv_reader_options_builder builder(source_info src)#

Creates a csv_reader_options_builder which will build csv_reader_options.

Parameters:: src – Source information to read csv file
Returns:: Builder to build reader options

class csv_reader_options_builder#

#include <csv.hpp>

Builder to build options for read_csv().

Public Functions

csv_reader_options_builder() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on stack.

inline csv_reader_options_builder(source_info src)#

Constructor from source info.

Parameters:: src – The source information used to read csv file

inline csv_reader_options_builder &compression(compression_type comp)#

Sets compression format of the source.

Parameters:: comp – Compression type
Returns:: this for chaining

inline csv_reader_options_builder &byte_range_offset(std::size_t offset)#

Sets number of bytes to skip from source start.

Parameters:: offset – Number of bytes of offset
Returns:: this for chaining

inline csv_reader_options_builder &byte_range_size(std::size_t size)#

Sets number of bytes to read.

Parameters:: size – Number of bytes to read
Returns:: this for chaining

inline csv_reader_options_builder &names(std::vector<std::string> col_names)#

Sets names of the columns.

Parameters:: col_names – Vector of column names
Returns:: this for chaining

inline csv_reader_options_builder &prefix(std::string pfx)#

Sets prefix to be used for column ID.

Parameters:: pfx – String used as prefix for each column name
Returns:: this for chaining

inline csv_reader_options_builder &mangle_dupe_cols(bool val)#

Sets whether to rename duplicate column names.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &use_cols_names(std::vector<std::string> col_names)#

Sets names of the columns to be read.

Parameters:: col_names – Vector of column names that are needed
Returns:: this for chaining

inline csv_reader_options_builder &use_cols_indexes(std::vector<int> col_indices)#

Sets indexes of columns to read.

Parameters:: col_indices – Vector of column indices that are needed
Returns:: this for chaining

inline csv_reader_options_builder &nrows(size_type rows)#

Sets number of rows to read.

Parameters:: rows – Number of rows to read
Returns:: this for chaining

inline csv_reader_options_builder &skiprows(size_type skip)#

Sets number of rows to skip from start.

Parameters:: skip – Number of rows to skip
Returns:: this for chaining

inline csv_reader_options_builder &skipfooter(size_type skip)#

Sets number of rows to skip from end.

Parameters:: skip – Number of rows to skip
Returns:: this for chaining

inline csv_reader_options_builder &header(size_type hdr)#

Sets header row index.

Parameters:: hdr – Index where header row is located
Returns:: this for chaining

inline csv_reader_options_builder &lineterminator(char term)#

Sets line terminator.

Parameters:: term – A character to indicate line termination
Returns:: this for chaining

inline csv_reader_options_builder &delimiter(char delim)#

Sets field delimiter.

Parameters:: delim – A character to indicate delimiter
Returns:: this for chaining

inline csv_reader_options_builder &thousands(char val)#

Sets numeric data thousands separator.

Parameters:: val – A character that separates thousands
Returns:: this for chaining

inline csv_reader_options_builder &decimal(char val)#

Sets decimal point character.

Parameters:: val – A character that indicates decimal values
Returns:: this for chaining

inline csv_reader_options_builder &comment(char val)#

Sets comment line start character.

Parameters:: val – A character that indicates comment
Returns:: this for chaining

inline csv_reader_options_builder &windowslinetermination(bool val)#

Sets whether to treat \r\n as line terminator.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &delim_whitespace(bool val)#

Sets whether to treat whitespace as field delimiter.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &skipinitialspace(bool val)#

Sets whether to skip whitespace after the delimiter.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &skip_blank_lines(bool val)#

Sets whether to ignore empty lines rather than parsing them as rows of invalid values.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &quoting(quote_style style)#

Sets quoting style.

Parameters:: style – Quoting style used
Returns:: this for chaining

inline csv_reader_options_builder &quotechar(char ch)#

Sets quoting character.

Parameters:: ch – A character to indicate quoting
Returns:: this for chaining

inline csv_reader_options_builder &doublequote(bool val)#

Sets whether a quote inside a value is double-quoted.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining
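As a rough model of what double-quoting means, the sketch below strips the enclosing quote characters from a field and collapses each doubled quote into a single one. It is plain standard C++ and does not call libcudf; `unquote_field` is a hypothetical helper and the real parser's behavior may differ in detail.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of libcudf): removes the enclosing quote characters from a
// quoted field and, when `doublequote` is set, turns each doubled quote into a single one.
std::string unquote_field(std::string_view field, char quotechar, bool doublequote)
{
  bool const quoted =
    field.size() >= 2 && field.front() == quotechar && field.back() == quotechar;
  if (!quoted) { return std::string{field}; }

  std::string_view const inner = field.substr(1, field.size() - 2);
  std::string out;
  for (std::size_t i = 0; i < inner.size(); ++i) {
    out.push_back(inner[i]);
    // Skip the second character of a doubled quote.
    if (doublequote && inner[i] == quotechar && i + 1 < inner.size() &&
        inner[i + 1] == quotechar) {
      ++i;
    }
  }
  return out;
}
```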

inline csv_reader_options_builder &detect_whitespace_around_quotes(bool val)#

Sets whether to detect quotes surrounded by spaces e.g. "data". This flag has no effect when _doublequote is true.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &parse_dates(std::vector<std::string> col_names)#

Sets names of columns to read as datetime.

Parameters:: col_names – Vector of column names to read as datetime
Returns:: this for chaining

inline csv_reader_options_builder &parse_dates(std::vector<int> col_indices)#

Sets indexes of columns to read as datetime.

Parameters:: col_indices – Vector of column indices to read as datetime
Returns:: this for chaining

inline csv_reader_options_builder &parse_hex(std::vector<std::string> col_names)#

Sets names of columns to parse as hexadecimal.

Parameters:: col_names – Vector of column names to parse as hexadecimal
Returns:: this for chaining

inline csv_reader_options_builder &parse_hex(std::vector<int> col_indices)#

Sets indexes of columns to parse as hexadecimal.

Parameters:: col_indices – Vector of column indices to parse as hexadecimal
Returns:: this for chaining

inline csv_reader_options_builder &dtypes(std::map<std::string, data_type> types)#

Sets per-column types.

Parameters:: types – Column name -> data type map specifying the columns’ target data types
Returns:: this for chaining

inline csv_reader_options_builder &dtypes(std::vector<data_type> types)#

Sets per-column types.

Parameters:: types – Vector of data types in which the column needs to be read
Returns:: this for chaining

inline csv_reader_options_builder &true_values(std::vector<std::string> vals)#

Sets additional values to recognize as boolean true values.

Parameters:: vals – Vector of values to be considered to be true
Returns:: this for chaining

inline csv_reader_options_builder &false_values(std::vector<std::string> vals)#

Sets additional values to recognize as boolean false values.

Parameters:: vals – Vector of values to be considered to be false
Returns:: this for chaining

inline csv_reader_options_builder &na_values(std::vector<std::string> vals)#

Sets additional values to recognize as null values.

Parameters:: vals – Vector of values to be considered to be null
Returns:: this for chaining

inline csv_reader_options_builder &keep_default_na(bool val)#

Sets whether to keep the built-in default NA values.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &na_filter(bool val)#

Sets whether to disable null filter.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining

inline csv_reader_options_builder &dayfirst(bool val)#

Sets whether to parse dates as DD/MM versus MM/DD.

Parameters:: val – Boolean value to enable/disable
Returns:: this for chaining
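The DD/MM versus MM/DD distinction can be shown with a small sketch in plain standard C++. `day_month` and `parse_day_month` are hypothetical names used only for illustration, and the sketch handles just the `NN/NN/YYYY` layout.

```cpp
#include <cstdio>

// Hypothetical result type for this sketch.
struct day_month {
  int day;
  int month;
};

// Hypothetical helper (not part of libcudf): reads "NN/NN/YYYY" and assigns the first two
// numbers to day and month according to `dayfirst`. Returns false if the text does not match.
bool parse_day_month(char const* text, bool dayfirst, day_month& out)
{
  int first = 0, second = 0, year = 0;
  if (std::sscanf(text, "%d/%d/%d", &first, &second, &year) != 3) { return false; }
  out = dayfirst ? day_month{first, second} : day_month{second, first};
  return true;
}
```

With `dayfirst` enabled, `03/04/2021` is read as 3 April; with it disabled, the same text is read as 4 March.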

inline csv_reader_options_builder &timestamp_type(data_type type)#

Sets timestamp_type to which all timestamp columns will be cast.

Parameters:: type – Dtype to which all timestamp columns will be cast
Returns:: this for chaining

inline operator csv_reader_options&&()#: Moves the csv_reader_options member once it’s built.

inline csv_reader_options &&build()#

Moves the csv_reader_options member once it’s built.

This has been added since Cython does not support overloading of conversion operators.

Returns:: Built csv_reader_options object’s r-value reference
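The chaining setters, the conversion operator and `build()` follow a common fluent builder pattern. The sketch below models that pattern with toy types in plain standard C++; `toy_options` and `toy_options_builder` are illustrative stand-ins and not the libcudf classes.

```cpp
#include <utility>

// Toy stand-in for an options type; the fields are illustrative only.
struct toy_options {
  char delimiter   = ',';
  char quotechar   = '"';
  bool doublequote = true;
};

// Toy stand-in for a builder that owns the options until they are moved out.
class toy_options_builder {
  toy_options _options;

 public:
  // Each setter returns *this so that calls can be chained.
  toy_options_builder& delimiter(char delim) { _options.delimiter = delim; return *this; }
  toy_options_builder& quotechar(char ch) { _options.quotechar = ch; return *this; }
  toy_options_builder& doublequote(bool val) { _options.doublequote = val; return *this; }

  // Implicit conversion: lets the builder be used where the options type is expected.
  operator toy_options&&() { return std::move(_options); }

  // Named equivalent of the conversion operator.
  toy_options&& build() { return std::move(_options); }
};
```

A typical use under these assumptions would be `toy_options opts = toy_options_builder{}.delimiter('|').doublequote(false).build();`.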

class chunked_parquet_reader#

#include <deletion_vectors.hpp>

The chunked parquet reader class to read a Parquet source iteratively in a series of tables, chunk by chunk. Each chunk is prepended with a row index column built using the specified row group offsets and row counts. The resultant table chunk is filtered using the supplied serialized roaring64 bitmap deletion vector and returned.

This class is designed to address the problem of reading a very large Parquet source whose row count exceeds the cudf column size limit, or of reading under device memory constraints. By reading the source content in chunks using this class, each chunk is guaranteed to stay within the given size limit. Note that the given memory limits do not account for the device memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.

Public Functions

chunked_parquet_reader(std::size_t chunk_read_limit, parquet_reader_options const &options, cudf::host_span<cuda::std::byte const> serialized_roaring64, cudf::host_span<size_t const> row_group_offsets, cudf::host_span<size_type const> row_group_num_rows, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Constructor for the chunked reader.

Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), plus an additional parameter specifying the byte size limit of each output table chunk.

Parameters:

chunk_read_limit – Byte limit on the returned table chunk size, 0 if there is no limit
options – Parquet reader options
serialized_roaring64 – Host span of portable serialized 64-bit roaring bitmap
row_group_offsets – Host span of row offsets of each row group
row_group_num_rows – Host span of number of rows in each row group
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

chunked_parquet_reader(std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, cudf::host_span<cuda::std::byte const> serialized_roaring64, cudf::host_span<size_t const> row_group_offsets, cudf::host_span<size_type const> row_group_num_rows, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Constructor for the chunked reader.

Requires the same arguments as cudf::io::parquet::experimental::read_parquet(), with additional parameters to specify the byte size limit of each output table chunk and a byte limit on the amount of temporary memory to use when reading. The pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. The pass_read_limit is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded. Also note that the pass_read_limit does not include the memory needed to deserialize and construct the roaring64 bitmap deletion vector, which stays alive throughout the lifetime of the reader.

Parameters:

chunk_read_limit – Byte limit on the returned table chunk size, 0 if there is no limit
pass_read_limit – Byte limit on the amount of memory used for decompressing and decoding data, 0 if there is no limit
options – Parquet reader options
serialized_roaring64 – Host span of portable serialized 64-bit roaring bitmap
row_group_offsets – Host span of row offsets of each row group
row_group_num_rows – Host span of number of rows in each row group
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation
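The "hint, not an absolute limit" behavior of pass_read_limit can be pictured with a simple greedy planner. This is a sketch in plain standard C++ under the assumption that row groups are packed in order by a single byte size each; `plan_passes` is hypothetical, and the reader's actual planning is not described here and may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical planner (not part of libcudf): packs consecutive row groups into passes whose
// combined size stays within `pass_read_limit` bytes. A limit of 0 means "no limit". A row
// group larger than the limit still gets a pass of its own, so the limit is only a hint.
std::vector<std::vector<std::size_t>> plan_passes(std::vector<std::size_t> const& row_group_bytes,
                                                  std::size_t pass_read_limit)
{
  std::vector<std::vector<std::size_t>> passes;
  std::size_t bytes_in_pass = 0;
  for (std::size_t rg = 0; rg < row_group_bytes.size(); ++rg) {
    bool const over_limit = pass_read_limit != 0 && !passes.empty() &&
                            bytes_in_pass + row_group_bytes[rg] > pass_read_limit;
    if (passes.empty() || over_limit) {
      passes.emplace_back();  // start a new pass
      bytes_in_pass = 0;
    }
    passes.back().push_back(rg);
    bytes_in_pass += row_group_bytes[rg];
  }
  return passes;
}
```

For row group sizes of 40, 50, 200, 30 and 30 bytes and a limit of 100, this sketch yields three passes, with the oversized 200-byte row group alone in the second one.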

~chunked_parquet_reader()#: Destructor, destroying the internal reader instance and the roaring bitmap deletion vector.

bool has_next() const#

Check if there is any data in the given source that has not yet been read.

Returns:: Boolean value indicating if there is any data left to be read

table_with_metadata read_chunk()#

Reads a chunk of the table from the Parquet source, prepends an index column to it, and filters the resulting table chunk using the 64-bit roaring bitmap deletion vector, if provided.

The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire given source at once.

An empty table will be returned if the given source is empty, or all the data in the source has been read and returned by the previous calls.

Returns:: An output cudf::table along with its metadata
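The has_next()/read_chunk() protocol, together with the prepended row index and the deletion vector, can be modeled with a toy reader in plain standard C++. Everything here is an assumption made for illustration: `toy_chunked_reader` is not the libcudf class, the chunk limit counts rows rather than bytes, a single row group starting at offset 0 is assumed, and a `std::set` stands in for the roaring64 bitmap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Toy reader (not the libcudf class): yields (row index, value) pairs in chunks of at most
// `rows_per_chunk` source rows and drops rows whose index is in the deletion set.
class toy_chunked_reader {
 public:
  using row = std::pair<std::uint64_t, int>;  // (row index, value)

  toy_chunked_reader(std::vector<int> values,
                     std::set<std::uint64_t> deleted,
                     std::size_t rows_per_chunk)
    : _values(std::move(values)), _deleted(std::move(deleted)), _rows_per_chunk(rows_per_chunk)
  {
  }

  bool has_next() const { return _next < _values.size(); }

  std::vector<row> read_chunk()
  {
    // A limit of 0 means "no limit": read everything that is left.
    std::size_t const step = _rows_per_chunk == 0 ? _values.size() : _rows_per_chunk;
    std::size_t const end  = std::min(_next + step, _values.size());
    std::vector<row> chunk;
    for (; _next < end; ++_next) {
      if (_deleted.count(_next) == 0) { chunk.emplace_back(_next, _values[_next]); }
    }
    return chunk;  // empty once the source is exhausted
  }

 private:
  std::vector<int> _values;
  std::set<std::uint64_t> _deleted;
  std::size_t _rows_per_chunk;
  std::size_t _next = 0;
};

// Drains the reader with the usual has_next()/read_chunk() loop and concatenates the chunks.
std::vector<toy_chunked_reader::row> read_all(toy_chunked_reader& reader)
{
  std::vector<toy_chunked_reader::row> rows;
  while (reader.has_next()) {
    auto chunk = reader.read_chunk();
    rows.insert(rows.end(), chunk.begin(), chunk.end());
  }
  return rows;
}
```

Note that a chunk may come back empty even while data remains, for example when every row in it is deleted, so the loop is driven by has_next() rather than by the chunk size.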

class hybrid_scan_reader#

#include <hybrid_scan.hpp>

The experimental parquet reader class to optimally read parquet files subject to highly selective filters, called a Hybrid Scan operation.

This class is designed to best exploit reductive optimization techniques to speed up reading Parquet files subject to highly selective filters. The parquet file contents are read in two passes. In the first pass, only the filter columns (i.e. columns that appear in the filter expression) are read allowing pruning of row groups and filter column data pages using the filter expression. In the second pass, only the payload columns (i.e. columns that do not appear in the filter expression) are optimally read by applying the surviving row mask from the first pass to prune payload column data pages.

The following code snippets demonstrate how to use the experimental parquet reader.

Start with an instance of the experimental reader with a span of parquet file footer bytes and parquet reader options.

// Example filter expression `A < 100`
auto filter_expression = cudf::ast::operation(cudf::ast::ast_operator::LESS,
                           column_name_reference{"A"}, literal{100});

using namespace cudf::io;

// Parquet reader options with empty source info
auto options = parquet_reader_options::builder(source_info(nullptr, 0))
                 .filter(filter_expression);

// Fetch parquet file footer bytes from the file
cudf::host_span<uint8_t const> footer_bytes = fetch_parquet_footer_bytes();

// Create the reader
auto reader =
  std::make_unique<parquet::experimental::hybrid_scan_reader>(footer_bytes, options);

Metadata handling (OPTIONAL): Get a materialized parquet file footer metadata struct (FileMetaData) from the reader to get insights into the parquet data as needed. Optionally, set up the page index to materialize page level stats used for data page pruning.

// Get Parquet file metadata from the reader
auto metadata = reader->parquet_metadata();

// Example metadata use: Calculate the number of rows in the file
auto nrows = std::accumulate(metadata.row_groups.begin(),
                             metadata.row_groups.end(),
                             size_type{0},
                             [](auto sum, auto const& rg) {
                               return sum + rg.num_rows;
                             });

// Get the page index byte range from the reader
auto page_index_byte_range = reader->page_index_byte_range();

// Fetch the page index bytes from the parquet file
cudf::host_span<uint8_t const> page_index_bytes = fetch_parquet_bytes(page_index_byte_range);

// Set up the page index
reader->setup_page_index(page_index_bytes);

// A new `FileMetaData` struct with populated page index structs may be obtained
// using `parquet_metadata()` at this point. Page index may be set up at any time.
auto metadata_with_page_index = reader->parquet_metadata();

Row group pruning (OPTIONAL): Start with either a custom list of row group indices or all row group indices in the parquet file, and optionally prune it against the filter expression using column chunk statistics, dictionaries and bloom filters. Byte ranges for column chunk dictionary pages and bloom filters within the parquet file may be obtained via the secondary_filters_byte_ranges() function. The byte ranges may be read into a corresponding vector of device buffers and passed to the corresponding row group filtration function.

// Start with a list of all parquet row group indices from the file footer
auto all_row_group_indices = reader->all_row_groups(options);

// Span to track the indices of row groups currently at hand
auto current_row_group_indices = cudf::host_span<size_type>(all_row_group_indices);

// Optional: Prune row group indices subject to filter expression using row group statistics
auto stats_filtered_row_group_indices =
  reader->filter_row_groups_with_stats(current_row_group_indices, options, stream);

// Update current row group indices to now track the stats-filtered row group indices
current_row_group_indices = stats_filtered_row_group_indices;

// Get byte ranges of bloom filters and dictionaries for the current row groups
auto [bloom_filter_byte_ranges, dict_page_byte_ranges] =
  reader->secondary_filters_byte_ranges(current_row_group_indices, options);

// Optional: Prune row groups if we have valid dictionary pages
auto dictionary_page_filtered_row_group_indices = std::vector<size_type>{};

if (dict_page_byte_ranges.size()) {
  // Fetch dictionary page byte ranges into device buffers
  std::vector<rmm::device_buffer> dictionary_page_data =
    fetch_device_buffers(dict_page_byte_ranges);

  // Prune row groups using dictionaries
  dictionary_page_filtered_row_group_indices = reader->filter_row_groups_with_dictionary_pages(
    dictionary_page_data, current_row_group_indices, options, stream);

  // Update current row group indices to dictionary page filtered row group indices
  current_row_group_indices = dictionary_page_filtered_row_group_indices;
}

// Optional: Prune row groups if we have valid bloom filters
auto bloom_filtered_row_group_indices = std::vector<size_type>{};

if (bloom_filter_byte_ranges.size()) {
  // Fetch bloom filter byte ranges into device buffers
  std::vector<rmm::device_buffer> bloom_filter_data =
    fetch_device_buffers(bloom_filter_byte_ranges);

  // Prune row groups using bloom filters
  bloom_filtered_row_group_indices = reader->filter_row_groups_with_bloom_filters(
    bloom_filter_data, current_row_group_indices, options, stream);

  // Update current row group indices to bloom filtered row group indices
  current_row_group_indices = bloom_filtered_row_group_indices;
}

Build an initial row mask: Once the row groups are filtered, the next step is to build an initial row mask column to indicate which rows in the current span of row groups will survive in the read table. This initial row mask may be a BOOL8 cudf column of size equal to the total number of rows in the current span of row groups (computed by total_rows_in_row_groups()) containing all true values. Alternatively, the row mask may be built with the build_row_mask_with_page_index_stats() function and contain a true value for only the rows that survive the page-level statistics from the page index subject to the same filter as row groups. Note that this step requires the page index to be set up using the setup_page_index() function.

// If not already done, get the page index byte range
auto page_index_byte_range = reader->page_index_byte_range();

// If not already done, fetch the page index bytes from the parquet file
cudf::host_span<uint8_t const> page_index_bytes = fetch_parquet_bytes(page_index_byte_range);

// If not already done, set up the page index now
reader->setup_page_index(page_index_bytes);

// Allocate a BOOL8 row mask column (its values must all be set to `true` before use)
auto const num_rows = reader->total_rows_in_row_groups(current_row_group_indices);
auto row_mask = cudf::make_numeric_column(
    cudf::data_type{cudf::type_id::BOOL8}, num_rows, rmm::device_buffer{}, 0, stream, mr);

// Alternatively, build a row mask column indicating only the rows that survive the page-level
// statistics in the page index
row_mask = reader->build_row_mask_with_page_index_stats(current_row_group_indices, options,
                                                        stream, mr);

Materialize filter columns: Once row groups are pruned and the row mask is constructed, the next step is to materialize the filter columns into a table (first reader pass). This is done using the materialize_filter_columns() function. This function requires a vector of device buffers containing column chunk data for the current list of row groups, and a mutable view of the current row mask. Depending on the mask_data_pages argument, the function optionally builds a mask for the current data pages from the input row mask to skip decompression and decoding of the pruned pages. The filter columns are then read into a table and filtered based on the filter expression, and the row mask is updated to indicate only the rows that survive in the read table. The final table is returned. The byte ranges for the required column chunk data may be obtained using the filter_column_chunks_byte_ranges() function and read into a corresponding vector of device buffers.

// Get byte ranges of the filter column chunks from the reader
auto const filter_column_chunk_byte_ranges =
  reader->filter_column_chunks_byte_ranges(current_row_group_indices, options);

// Fetch column chunk device buffers from the input buffer
auto filter_column_chunk_buffers =
  fetch_device_buffers(filter_column_chunk_byte_ranges);

// Materialize the table with only the filter columns
auto [filter_table, filter_metadata] =
  reader->materialize_filter_columns(current_row_group_indices,
                                     std::move(filter_column_chunk_buffers),
                                     row_mask->mutable_view(),
                                     use_data_page_mask::YES,  // or use_data_page_mask::NO
                                     options,
                                     stream);

Materialize payload columns: Once the filter columns are materialized, the final step is to materialize the payload columns into another table (second reader pass). This is done using the materialize_payload_columns() function, which is functionally identical to materialize_filter_columns() except that it accepts an immutable view of the row mask and uses it to filter the read output table before returning it. The byte ranges for the required column chunk data may be obtained using the payload_column_chunks_byte_ranges() function and read into a corresponding vector of device buffers.

// Get column chunk byte ranges from the reader
auto const payload_column_chunk_byte_ranges =
  reader->payload_column_chunks_byte_ranges(current_row_group_indices, options);

// Fetch column chunk device buffers from the input buffer
auto payload_column_chunk_buffers =
  fetch_device_buffers(payload_column_chunk_byte_ranges);

// Materialize the table with only the payload columns
auto [payload_table, payload_metadata] =
  reader->materialize_payload_columns(current_row_group_indices,
                                      std::move(payload_column_chunk_buffers),
                                      row_mask->view(),
                                      use_data_page_mask::YES,  // or use_data_page_mask::NO
                                      options,
                                      stream);

Once both reader passes are complete, the filter and payload column tables may be trivially combined by releasing the columns from both tables and moving them into a new cudf table.
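The way the row mask links the two passes can be modeled in a few lines of plain standard C++. This is a conceptual sketch only: the functions below are hypothetical stand-ins operating on host vectors, the filter is fixed to `A < threshold`, and nothing here calls the libcudf reader.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// First pass (toy): evaluates `A < threshold` on the filter column, narrows the row mask to
// the surviving rows, and returns the filtered filter column. `row_mask` must have one entry
// per row of `column_a`.
std::vector<int> materialize_filter(std::vector<int> const& column_a,
                                    int threshold,
                                    std::vector<bool>& row_mask)
{
  std::vector<int> out;
  for (std::size_t i = 0; i < column_a.size(); ++i) {
    row_mask[i] = row_mask[i] && column_a[i] < threshold;
    if (row_mask[i]) { out.push_back(column_a[i]); }
  }
  return out;
}

// Second pass (toy): applies the final, read-only row mask to a payload column.
std::vector<std::string> materialize_payload(std::vector<std::string> const& column_b,
                                             std::vector<bool> const& row_mask)
{
  std::vector<std::string> out;
  for (std::size_t i = 0; i < column_b.size(); ++i) {
    if (row_mask[i]) { out.push_back(column_b[i]); }
  }
  return out;
}
```

Because both passes apply the same final mask, the two outputs have the same number of rows and line up row by row, which is what makes combining them into one table straightforward.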

Note

The performance advantage of this reader is most prominent when the filter expression is highly selective, i.e. when the data in filter columns are at least partially ordered and the number of rows that survive the filter is small compared to the total number of rows in the parquet file. Otherwise, the performance is identical to the cudf::io::read_parquet() function.

Public Functions

explicit hybrid_scan_reader(cudf::host_span<uint8_t const> footer_bytes, parquet_reader_options const &options)#

Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters.

Parameters:

footer_bytes – Host span of parquet file footer bytes
options – Parquet reader options

explicit hybrid_scan_reader(FileMetaData const &parquet_metadata, parquet_reader_options const &options)#

Constructor for the experimental parquet reader class to optimally read Parquet files subject to highly selective filters.

Parameters:

parquet_metadata – Pre-populated Parquet file metadata
options – Parquet reader options

~hybrid_scan_reader()#: Destructor for the experimental parquet reader class.

FileMetaData parquet_metadata() const#

Get the Parquet file footer metadata.

Returns the materialized Parquet file footer metadata struct. The footer will contain the materialized page index if called after setup_page_index().

Returns:: Parquet file footer metadata

byte_range_info page_index_byte_range() const#

Get the byte range of the page index in the Parquet file.

Returns:: Byte range of the page index

void setup_page_index(cudf::host_span<uint8_t const> page_index_bytes) const#

Sets up the page index within the Parquet file metadata (FileMetaData).

Materializes the ColumnIndex and OffsetIndex structs (collectively called the page index) within the Parquet file metadata struct (returned by parquet_metadata()). The statistics contained in the page index can be used to prune data pages before decoding.

Parameters:: page_index_bytes – Host span of Parquet page index buffer bytes

std::vector<size_type> all_row_groups(parquet_reader_options const &options) const#

Get all available row groups from the parquet file.

Parameters:: options – Parquet reader options
Returns:: Vector of row group indices

size_type total_rows_in_row_groups(cudf::host_span<size_type const> row_group_indices) const#

Get the total number of top-level rows in the row groups.

Parameters:: row_group_indices – Input row group indices
Returns:: Total number of top-level rows in the row groups

std::vector<size_type> filter_row_groups_with_stats(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Filter the input row groups using column chunk statistics.

Parameters:

row_group_indices – Input row group indices
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Filtered row group indices
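Statistics-based pruning can be sketched for a single `value < threshold` predicate using per-row-group minimum and maximum values. The code is plain standard C++; `min_max` and `prune_row_groups` are hypothetical names, and the real function evaluates the full filter expression from the reader options rather than one fixed comparison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-row-group statistics for one column.
struct min_max {
  int min;
  int max;
};

// Toy pruning for the predicate `value < threshold`: a row group can be skipped only when
// even its minimum fails the predicate, since then no row in it can pass.
std::vector<std::size_t> prune_row_groups(std::vector<std::size_t> const& row_group_indices,
                                          std::vector<min_max> const& stats,
                                          int threshold)
{
  std::vector<std::size_t> kept;
  for (std::size_t rg : row_group_indices) {
    if (stats[rg].min < threshold) { kept.push_back(rg); }
  }
  return kept;
}
```

Pruning of this kind is conservative: a kept row group may still contain no matching rows, but a pruned one is certain to contain none.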

std::pair<std::vector<byte_range_info>, std::vector<byte_range_info>> secondary_filters_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options) const#

Get byte ranges of bloom filters and dictionary pages (secondary filters) for row group pruning.

Note

Device buffers for bloom filter byte ranges must be allocated using a 32 byte aligned memory resource

Parameters:

row_group_indices – Input row group indices
options – Parquet reader options

Returns:

Pair of vectors of byte ranges of the bloom filters and dictionary pages of the column chunks subject to the filter predicate

std::vector<size_type> filter_row_groups_with_dictionary_pages(cudf::host_span<rmm::device_buffer> dictionary_page_data, cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Filter the row groups using column chunk dictionary pages.

Parameters:

dictionary_page_data – Device buffers containing dictionary page data of column chunks with (in)equality predicate
row_group_indices – Input row group indices
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Filtered row group indices

std::vector<size_type> filter_row_groups_with_bloom_filters(cudf::host_span<rmm::device_buffer> bloom_filter_data, cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Filter the row groups using column chunk bloom filters.

Note

The bloom_filter_data device buffers must be allocated using a 32 byte aligned memory resource

Parameters:

bloom_filter_data – Device buffers containing bloom filter data of column chunks with an equality predicate
row_group_indices – Input row group indices
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Filtered row group indices

std::unique_ptr<cudf::column> build_row_mask_with_page_index_stats(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) const#

Builds a boolean column indicating surviving rows using page-level statistics in the page index.

Parameters:

row_group_indices – Input row group indices
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A boolean column indicating which filter column rows survive the statistics in the page index
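A row mask derived from page-level statistics can be pictured as one boolean per row, where every row of a page shares that page's verdict. The sketch below is plain standard C++ for a single `value < threshold` predicate; `page_stats` and `build_row_mask` are hypothetical and only approximate the idea, not the libcudf implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical page-level statistics for one filter column.
struct page_stats {
  std::size_t num_rows;
  int min;
  int max;
};

// Toy row mask for the predicate `value < threshold`: all rows of a page are marked false
// when the page minimum already fails the predicate, and true otherwise.
std::vector<bool> build_row_mask(std::vector<page_stats> const& pages, int threshold)
{
  std::vector<bool> mask;
  for (auto const& page : pages) {
    mask.insert(mask.end(), page.num_rows, page.min < threshold);
  }
  return mask;
}
```

Rows marked true are only candidates; the filter expression still has to be evaluated on them when the filter columns are materialized.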

std::vector<byte_range_info> filter_column_chunks_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options) const#

Get byte ranges of column chunks of filter columns.

Parameters:

row_group_indices – Input row group indices
options – Parquet reader options

Returns:

Vector of byte ranges to column chunks of filter columns

table_with_metadata materialize_filter_columns(cudf::host_span<size_type const> row_group_indices, std::vector<rmm::device_buffer> &&column_chunk_buffers, cudf::mutable_column_view &row_mask, use_data_page_mask mask_data_pages, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Materializes filter columns and updates the input row mask to only the rows that exist in the output table.

Parameters:

row_group_indices – Input row group indices
column_chunk_buffers – Device buffers containing column chunk data of filter columns
row_mask – [inout] Mutable boolean column indicating surviving rows from page pruning
mask_data_pages – Whether to build and use a data page mask using the row mask
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Table of materialized filter columns and metadata

std::vector<byte_range_info> payload_column_chunks_byte_ranges(cudf::host_span<size_type const> row_group_indices, parquet_reader_options const &options) const#

Get byte ranges of column chunks of payload columns.

Parameters:

row_group_indices – Input row group indices
options – Parquet reader options

Returns:

Vector of byte ranges to column chunks of payload columns

table_with_metadata materialize_payload_columns(cudf::host_span<size_type const> row_group_indices, std::vector<rmm::device_buffer> &&column_chunk_buffers, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Materializes payload columns and applies the row mask to the output table.

Parameters:

row_group_indices – Input row group indices
column_chunk_buffers – Device buffers containing column chunk data of payload columns
row_mask – Boolean column indicating which rows need to be read. All rows read if empty
mask_data_pages – Whether to build and use a data page mask using the row mask
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Table of materialized payload columns and metadata

void setup_chunking_for_filter_columns(std::size_t chunk_read_limit, std::size_t pass_read_limit, cudf::host_span<size_type const> row_group_indices, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, std::vector<rmm::device_buffer> &&column_chunk_buffers, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Sets up chunking information for filter columns and preprocesses the input data pages.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per table chunk. 0 if there is no limit
pass_read_limit – Limit on the memory used for reading and decompressing data. 0 if there is no limit
row_group_indices – Input row group indices
row_mask – Boolean column indicating which rows need to be read. All rows read if empty
mask_data_pages – Whether to build and use a data page mask using the row mask
column_chunk_buffers – Device buffers containing column chunk data of filter columns
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

table_with_metadata materialize_filter_columns_chunk(cudf::mutable_column_view &row_mask, rmm::cuda_stream_view stream) const#

Materializes a chunk of filter columns and updates the corresponding range of input row mask to only the rows that exist in the output table.

Parameters:

row_mask – [inout] Mutable boolean column indicating surviving rows from page pruning
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Table chunk of materialized filter columns and metadata

void setup_chunking_for_payload_columns(std::size_t chunk_read_limit, std::size_t pass_read_limit, cudf::host_span<size_type const> row_group_indices, cudf::column_view const &row_mask, use_data_page_mask mask_data_pages, std::vector<rmm::device_buffer> &&column_chunk_buffers, parquet_reader_options const &options, rmm::cuda_stream_view stream) const#

Sets up chunking information for payload columns and preprocesses the input data pages.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per table chunk. 0 if there is no limit
pass_read_limit – Limit on the memory used for reading and decompressing data. 0 if there is no limit
row_group_indices – Input row group indices
row_mask – Boolean column indicating which rows need to be read. All rows read if empty
mask_data_pages – Whether to build and use a data page mask using the row mask
column_chunk_buffers – Device buffers containing column chunk data of payload columns
options – Parquet reader options
stream – CUDA stream used for device memory operations and kernel launches

table_with_metadata materialize_payload_columns_chunk(cudf::column_view const &row_mask, rmm::cuda_stream_view stream) const#

Materializes a chunk of payload columns and applies the corresponding range of input row mask to the output table chunk.

Parameters:

row_mask – Boolean column indicating which rows need to be read. All rows read if empty
stream – CUDA stream used for device memory operations and kernel launches

Returns:

Table chunk of materialized payload columns and metadata

bool has_next_table_chunk() const#

Check if there is any parquet data left to read for the current setup.

Returns:: Boolean indicating if there is any data left to read

struct schema_element#

#include <json.hpp>

Allows specifying the target types for nested JSON data via json_reader_options’ set_dtypes method.

Public Members

data_type type#: The type that this column should be converted to.

std::map<std::string, schema_element> child_types#: Allows specifying this column’s child columns target type.

std::optional<std::vector<std::string>> column_order#: Allows specifying the order of the columns.

class json_reader_options#

#include <json.hpp>

Input arguments to the read_json interface.

Available parameters are closely patterned after PANDAS’ read_json API. Not all parameters are supported. If the matching PANDAS’ parameter has a default value of None, then a default value of -1 or 0 may be used as the equivalent.

Parameters in PANDAS that are unavailable in cudf:

Name	Description
`orient`	currently fixed-format
`typ`	data is always returned as a cudf::table
`convert_axes`	use column functions for axes operations instead
`convert_dates`	dates are detected automatically
`keep_default_dates`	dates are detected automatically
`numpy`	data is always returned as a cudf::table
`precise_float`	there is only one converter
`date_unit`	only millisecond units are supported
`encoding`	only ASCII-encoded data is supported
`chunksize`	use `byte_range_xxx` for chunking instead

Public Types

using dtype_variant = std::variant<std::vector<data_type>, std::map<std::string, data_type>, std::map<std::string, schema_element>, schema_element>#: Variant type holding dtypes information for the columns.

Public Functions

json_reader_options() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on the stack.

inline source_info const &get_source() const#

Returns source info.

Returns:: Source info

inline dtype_variant const &get_dtypes() const#

Returns data types of the columns.

Returns:: Data types of the columns

inline compression_type get_compression() const#

Returns compression format of the source.

Returns:: Compression format of the source

inline size_t get_byte_range_offset() const#

Returns number of bytes to skip from source start.

Returns:: Number of bytes to skip from source start

inline size_t get_byte_range_size() const#

Returns number of bytes to read.

Returns:: Number of bytes to read

inline size_t get_byte_range_size_with_padding() const#

Returns number of bytes to read with padding.

Returns:: Number of bytes to read with padding

inline size_t get_byte_range_padding() const#

Returns number of bytes to pad when reading.

Returns:: Number of bytes to pad

inline char get_delimiter() const#

Returns delimiter separating records in JSON lines.

Returns:: Delimiter separating records in JSON lines

inline bool is_enabled_lines() const#

Whether to read the file as a json object per line.

Returns:: true if reading the file as a json object per line

inline bool is_enabled_mixed_types_as_string() const#

Whether to parse mixed types as a string column.

Returns:: true if mixed types are parsed as a string column

inline bool is_enabled_prune_columns() const#

Whether to prune columns on read, selected based on the set_dtypes option.

When set as true, if the reader options include set_dtypes, then the reader will only return those columns which are mentioned in set_dtypes. If false, then all columns are returned, independent of the set_dtypes setting.

Returns:: True if column pruning is enabled
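The pruning rule described above can be sketched with a small standalone model. This is not libcudf code: `select_output_columns` is a hypothetical helper, type names are plain strings rather than `data_type`, and the sketch assumes that an empty map means set_dtypes was not provided and that the output keeps the column order of the input.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of column pruning; not part of libcudf.
// `file_columns` are the top-level field names found in the JSON input.
// `dtypes` stands in for the name -> type map passed to set_dtypes.
std::vector<std::string> select_output_columns(
  std::vector<std::string> const& file_columns,
  std::map<std::string, std::string> const& dtypes,
  bool prune_columns)
{
  // Assumption: with pruning disabled, or with no dtypes given,
  // every column is returned.
  if (!prune_columns || dtypes.empty()) { return file_columns; }

  // With pruning enabled, keep only the columns mentioned in dtypes.
  std::vector<std::string> selected;
  for (auto const& name : file_columns) {
    if (dtypes.count(name) > 0) { selected.push_back(name); }
  }
  return selected;
}
```

With input columns a, b, c and dtypes for a and c, the sketch returns a and c when pruning is enabled and all three columns otherwise.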

inline bool is_enabled_experimental() const#

Whether to enable experimental features.

When set to true, experimental features, such as the new column tree construction and UTF-8 matching of field names, will be enabled.

Returns:: true if experimental features are enabled

inline bool is_enabled_dayfirst() const#

Whether to parse dates as DD/MM versus MM/DD.

Returns:: true if dates are parsed as DD/MM, false if MM/DD

inline bool is_enabled_keep_quotes() const#

Whether the reader should keep quotes of string values.

Returns:: true if the reader should keep quotes, false otherwise

inline bool is_enabled_normalize_single_quotes() const#

Whether the reader should normalize single quotes around strings.

Returns:: true if the reader should normalize single quotes, false otherwise

inline bool is_enabled_normalize_whitespace() const#

Whether the reader should normalize unquoted whitespace characters.

Returns:: true if the reader should normalize whitespace, false otherwise

inline json_recovery_mode_t recovery_mode() const#

Queries the JSON reader’s behavior on invalid JSON lines.

Returns:: An enum that specifies the JSON reader’s behavior on invalid JSON lines.

inline bool is_strict_validation() const#

Whether json validation should be enforced strictly or not.

Returns:: true if it should be.

inline bool is_allowed_numeric_leading_zeros() const#

Whether leading zeros are allowed in numeric values.

Note

: This validation is enforced only if strict validation is enabled.

Returns:: true if leading zeros are allowed in numeric values

inline bool is_allowed_nonnumeric_numbers() const#

Whether the unquoted number values NaN, +INF, -INF, +Infinity, Infinity, and -Infinity should be allowed.

Note

: This validation is enforced only if strict validation is enabled.

Returns:: true if unquoted nonnumeric number values are allowed

inline bool is_allowed_unquoted_control_chars() const#

Whether characters greater than or equal to 0 and less than 32 should be allowed in a quoted string without some form of escaping.

Note

: This validation is enforced only if strict validation is enabled.

Returns:: true if unquoted control chars are allowed.

inline std::vector<std::string> const &get_na_values() const#

Returns additional values to recognize as null values.

Returns:: Additional values to recognize as null values

inline void set_source(source_info src)#

Sets source info.

Parameters:: src – The source info.

inline void set_dtypes(std::vector<data_type> types)#

Set data types for columns to be read.

Parameters:: types – Vector of dtypes

inline void set_dtypes(std::map<std::string, data_type> types)#

Set data types for columns to be read.

Parameters:: types – Map of column names to dtypes

inline void set_dtypes(std::map<std::string, schema_element> types)#

Set data types for a potentially nested column hierarchy.

Parameters:: types – Map of column names to schema_element to support arbitrary nesting of data types

void set_dtypes(schema_element types)#

Set data types for a potentially nested column hierarchy.

Parameters:: types – schema element with column names and column order to support arbitrary nesting of data types

inline void set_compression(compression_type comp_type)#

Set the compression type.

Parameters:: comp_type – The compression type used

inline void set_byte_range_offset(size_t offset)#

Set number of bytes to skip from source start.

Parameters:: offset – Number of bytes of offset

inline void set_byte_range_size(size_t size)#

Set number of bytes to read.

Parameters:: size – Number of bytes to read

inline void set_delimiter(char delimiter)#

Set delimiter separating records in JSON lines.

Parameters:: delimiter – Delimiter separating records in JSON lines

inline void enable_lines(bool val)#

Set whether to read the file as a json object per line.

Parameters:: val – Boolean value to enable/disable the option to read each line as a json object

inline void enable_mixed_types_as_string(bool val)#

Set whether to parse mixed types as a string column. Also enables forcing to read a struct as string column using schema.

Parameters:: val – Boolean value to enable/disable parsing mixed types as a string column

inline void enable_prune_columns(bool val)#

Set whether to prune columns on read, selected based on the set_dtypes option.

Parameters:: val – Boolean value to enable/disable column pruning

inline void enable_experimental(bool val)#

Set whether to enable experimental features.

When set to true, experimental features, such as the new column tree construction and UTF-8 matching of field names, will be enabled.

Parameters:: val – Boolean value to enable/disable experimental features

inline void enable_dayfirst(bool val)#

Set whether to parse dates as DD/MM versus MM/DD.

Parameters:: val – Boolean value to enable/disable day first parsing format

inline void enable_keep_quotes(bool val)#

Set whether the reader should keep quotes of string values.

Parameters:: val – Boolean value to indicate whether the reader should keep quotes of string values

inline void enable_normalize_single_quotes(bool val)#

Set whether the reader should enable normalization of single quotes around strings.

Parameters:: val – Boolean value to indicate whether the reader should normalize single quotes around strings

inline void enable_normalize_whitespace(bool val)#

Set whether the reader should enable normalization of unquoted whitespace.

Parameters:: val – Boolean value to indicate whether the reader should normalize unquoted whitespace characters i.e. tabs and spaces

inline void set_recovery_mode(json_recovery_mode_t val)#

Specifies the JSON reader’s behavior on invalid JSON lines.

Parameters:: val – An enum value to indicate the JSON reader’s behavior on invalid JSON lines.

inline void set_strict_validation(bool val)#

Set whether strict validation is enabled or not.

Parameters:: val – Boolean value to indicate whether strict validation is enabled.

inline void allow_numeric_leading_zeros(bool val)#

Set whether leading zeros are allowed in numeric values. Strict validation must be enabled for this to work.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate whether leading zeros are allowed in numeric values

inline void allow_nonnumeric_numbers(bool val)#

Set whether the unquoted number values NaN, +INF, -INF, +Infinity, Infinity, and -Infinity should be allowed. Strict validation must be enabled for this to work.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate whether unquoted nonnumeric number values are allowed

inline void allow_unquoted_control_chars(bool val)#

Set whether characters greater than or equal to 0 and less than 32 should be allowed in a quoted string without some form of escaping. Strict validation must be enabled for this to work.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate whether unquoted control chars are allowed.
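The ordering requirement shared by the three allow_* setters can be illustrated with a standalone sketch. `json_validation_options` is a hypothetical stand-in, not the libcudf class; it throws `std::logic_error` where libcudf documents `cudf::logic_error`, and the default values are chosen only for the sketch.

```cpp
#include <stdexcept>

// Hypothetical model of the "strict validation first" rule; not libcudf code.
struct json_validation_options {
  bool strict_validation     = false;  // default chosen for the sketch
  bool numeric_leading_zeros = true;   // default chosen for the sketch

  void set_strict_validation(bool val) { strict_validation = val; }

  // Mirrors the documented contract: the option can only be set
  // after strict validation has been enabled.
  void allow_numeric_leading_zeros(bool val)
  {
    if (!strict_validation) {
      throw std::logic_error("strict validation must be enabled first");
    }
    numeric_leading_zeros = val;
  }
};
```

Calling `allow_numeric_leading_zeros` on a fresh object throws in this model; calling `set_strict_validation(true)` first makes the same call succeed.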

inline void set_na_values(std::vector<std::string> vals)#

Sets additional values to recognize as null values.

Parameters:: vals – Vector of values to be considered to be null

Public Static Functions

static json_reader_options_builder builder(source_info src)#

Creates a json_reader_options_builder, which will build json_reader_options.

Parameters:: src – source information used to read json file
Returns:: builder to build the options

class json_reader_options_builder#

#include <json.hpp>

Builds settings to use for read_json().

Public Functions

explicit json_reader_options_builder() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on the stack.

inline explicit json_reader_options_builder(source_info src)#

Constructor from source info.

Parameters:: src – The source information used to read the JSON file

inline json_reader_options_builder &dtypes(std::vector<data_type> types)#

Set data types for columns to be read.

Parameters:: types – Vector of dtypes
Returns:: this for chaining

inline json_reader_options_builder &dtypes(std::map<std::string, data_type> types)#

Set data types for columns to be read.

Parameters:: types – Column name -> dtype map
Returns:: this for chaining

inline json_reader_options_builder &dtypes(std::map<std::string, schema_element> types)#

Set data types for columns to be read.

Parameters:: types – Column name -> schema_element map
Returns:: this for chaining

inline json_reader_options_builder &dtypes(schema_element types)#

Set data types for columns to be read.

Parameters:: types – Struct schema_element holding a column name -> schema_element map and the column order
Returns:: this for chaining

inline json_reader_options_builder &compression(compression_type comp_type)#

Set the compression type.

Parameters:: comp_type – The compression type used
Returns:: this for chaining

inline json_reader_options_builder &byte_range_offset(size_type offset)#

Set number of bytes to skip from source start.

Parameters:: offset – Number of bytes of offset
Returns:: this for chaining

inline json_reader_options_builder &byte_range_size(size_type size)#

Set number of bytes to read.

Parameters:: size – Number of bytes to read
Returns:: this for chaining

inline json_reader_options_builder &delimiter(char delimiter)#

Set delimiter separating records in JSON lines.

Parameters:: delimiter – Delimiter separating records in JSON lines
Returns:: this for chaining

inline json_reader_options_builder &lines(bool val)#

Set whether to read the file as a json object per line.

Parameters:: val – Boolean value to enable/disable the option to read each line as a json object
Returns:: this for chaining

inline json_reader_options_builder &mixed_types_as_string(bool val)#

Set whether to parse mixed types as a string column. Also enables forcing to read a struct as string column using schema.

Parameters:: val – Boolean value to enable/disable parsing mixed types as a string column
Returns:: this for chaining

inline json_reader_options_builder &prune_columns(bool val)#

Set whether to prune columns on read, selected based on the dtypes option.

When set as true, if the reader options include dtypes, then the reader will only return those columns which are mentioned in dtypes. If false, then all columns are returned, independent of the dtypes setting.

Parameters:: val – Boolean value to enable/disable column pruning
Returns:: this for chaining

inline json_reader_options_builder &experimental(bool val)#

Set whether to enable experimental features.

When set to true, experimental features, such as the new column tree construction and UTF-8 matching of field names, will be enabled.

Parameters:: val – Boolean value to enable/disable experimental features
Returns:: this for chaining

inline json_reader_options_builder &dayfirst(bool val)#

Set whether to parse dates as DD/MM versus MM/DD.

Parameters:: val – Boolean value to enable/disable day first parsing format
Returns:: this for chaining

inline json_reader_options_builder &keep_quotes(bool val)#

Set whether the reader should keep quotes of string values.

Parameters:: val – Boolean value to indicate whether the reader should keep quotes of string values
Returns:: this for chaining

inline json_reader_options_builder &normalize_single_quotes(bool val)#

Set whether the reader should normalize single quotes around strings.

Parameters:: val – Boolean value to indicate whether the reader should normalize single quotes of strings
Returns:: this for chaining

inline json_reader_options_builder &normalize_whitespace(bool val)#

Set whether the reader should normalize unquoted whitespace.

Parameters:: val – Boolean value to indicate whether the reader should normalize unquoted whitespace
Returns:: this for chaining

inline json_reader_options_builder &recovery_mode(json_recovery_mode_t val)#

Specifies the JSON reader’s behavior on invalid JSON lines.

Parameters:: val – An enum value to indicate the JSON reader’s behavior on invalid JSON lines.
Returns:: this for chaining

inline json_reader_options_builder &strict_validation(bool val)#

Set whether json validation should be strict or not.

Parameters:: val – Boolean value to indicate whether json validation should be strict or not.
Returns:: this for chaining

inline json_reader_options_builder &numeric_leading_zeros(bool val)#

Set whether leading zeros are allowed in numeric values. Strict validation must be enabled for this to have any effect.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate whether leading zeros are allowed in numeric values
Returns:: this for chaining

inline json_reader_options_builder &nonnumeric_numbers(bool val)#

Set whether specific unquoted number values are valid JSON. The values are NaN, +INF, -INF, +Infinity, Infinity, and -Infinity. Strict validation must be enabled for this to have any effect.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate if unquoted nonnumeric values are valid json or not.
Returns:: this for chaining

inline json_reader_options_builder &unquoted_control_chars(bool val)#

Set whether chars >= 0 and < 32 are allowed in a quoted string without some form of escaping. Strict validation must be enabled for this to have any effect.

Throws:: cudf::logic_error – if strict_validation is not enabled before setting this option.
Parameters:: val – Boolean value to indicate if unquoted control chars are allowed or not.
Returns:: this for chaining

inline json_reader_options_builder &na_values(std::vector<std::string> vals)#

Sets additional values to recognize as null values.

Parameters:: vals – Vector of values to be considered to be null
Returns:: this for chaining

inline operator json_reader_options&&()#: Moves the json_reader_options member once it’s built.

inline json_reader_options &&build()#

Moves the json_reader_options member once it’s built.

This has been added since Cython does not support overloading of conversion operators.

Returns:: Built json_reader_options object r-value reference

class orc_reader_options#

#include <orc.hpp>

Settings to use for read_orc().

Public Functions

orc_reader_options() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on the stack.

inline source_info const &get_source() const#

Returns source info.

Returns:: Source info

inline auto const &get_columns() const#

Returns names of the columns to read, if set.

Returns:: Names of the columns to read; nullopt if the option is not set

inline auto const &get_stripes() const#

Returns vector of vectors, stripes to read for each input source.

Returns:: Vector of vectors, stripes to read for each input source

inline int64_t get_skip_rows() const#

Returns number of rows to skip from the start.

Returns:: Number of rows to skip from the start

inline std::optional<int64_t> const &get_num_rows() const#

Returns number of rows to read.

Returns:: Number of rows to read; nullopt if the option hasn’t been set (in which case the file is read until the end)

inline bool is_enabled_use_index() const#

Whether to use row index to speed-up reading.

Returns:: true if row index is used to speed-up reading

inline bool is_enabled_use_np_dtypes() const#

Whether to use numpy-compatible dtypes.

Returns:: true if numpy-compatible dtypes are used

inline data_type get_timestamp_type() const#

Returns timestamp type to which timestamp column will be cast.

Returns:: Timestamp type to which timestamp column will be cast

inline std::vector<std::string> const &get_decimal128_columns() const#

Returns fully qualified names of columns that should be read as 128-bit Decimal.

Returns:: Fully qualified names of columns that should be read as 128-bit Decimal

inline bool get_ignore_timezone_in_stripe_footer() const#

Returns whether to ignore writer timezone in the stripe footer.

Returns:: true if the writer timezone in the stripe footer is ignored.

inline void set_source(source_info src)#

Sets source info.

Parameters:: src – The source info.

inline void set_columns(std::vector<std::string> col_names)#

Sets names of the columns to read.

Parameters:: col_names – Vector of column names

inline void set_stripes(std::vector<std::vector<size_type>> stripes)#

Sets list of stripes to read for each input source.

Parameters:

stripes – Vector of vectors, mapping stripes to read to input sources

Throws:

cudf::logic_error – if a non-empty vector is passed, and skip_rows has been previously set
cudf::logic_error – if a non-empty vector is passed, and num_rows has been previously set

inline void set_skip_rows(int64_t rows)#

Sets number of rows to skip from the start.

Parameters:

rows – Number of rows

Throws:

cudf::logic_error – if a negative value is passed
cudf::logic_error – if stripes have been previously set

inline void set_num_rows(int64_t nrows)#

Sets number of rows to read.

Parameters:

nrows – Number of rows

Throws:

cudf::logic_error – if a negative value is passed
cudf::logic_error – if stripes have been previously set
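The Throws lists of set_stripes, set_skip_rows and set_num_rows together describe a mutual-exclusion rule: explicit stripe selection cannot be combined with row-range selection. The standalone sketch below models that rule. `orc_row_selection` is a hypothetical type, it throws `std::logic_error` where libcudf documents `cudf::logic_error`, and it assumes that a non-zero skip_rows counts as "previously set".

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical model of the documented setter checks; not libcudf code.
struct orc_row_selection {
  std::vector<std::vector<std::int32_t>> stripes;
  std::int64_t skip_rows = 0;
  std::int64_t num_rows  = 0;
  bool num_rows_set      = false;

  void set_stripes(std::vector<std::vector<std::int32_t>> s)
  {
    // A non-empty stripe list conflicts with a row range set earlier.
    if (!s.empty() && (skip_rows != 0 || num_rows_set)) {
      throw std::logic_error("cannot combine stripes with skip_rows or num_rows");
    }
    stripes = std::move(s);
  }

  void set_skip_rows(std::int64_t rows)
  {
    if (rows < 0) { throw std::logic_error("skip_rows cannot be negative"); }
    if (!stripes.empty()) { throw std::logic_error("stripes already set"); }
    skip_rows = rows;
  }

  void set_num_rows(std::int64_t nrows)
  {
    if (nrows < 0) { throw std::logic_error("num_rows cannot be negative"); }
    if (!stripes.empty()) { throw std::logic_error("stripes already set"); }
    num_rows     = nrows;
    num_rows_set = true;
  }
};
```

In this model, whichever selection style is configured first wins, and the conflicting setter fails.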

inline void enable_use_index(bool use)#

Enable/Disable use of row index to speed-up reading.

Parameters:: use – Boolean value to enable/disable row index use

inline void enable_use_np_dtypes(bool use)#

Enable/Disable use of numpy-compatible dtypes.

Parameters:: use – Boolean value to enable/disable

inline void set_timestamp_type(data_type type)#

Sets timestamp type to which timestamp column will be cast.

Parameters:: type – Type of timestamp

inline void set_decimal128_columns(std::vector<std::string> val)#

Set columns that should be read as 128-bit Decimal.

Parameters:: val – Vector of fully qualified column names

Public Static Functions

static orc_reader_options_builder builder(source_info src)#

Creates orc_reader_options_builder which will build orc_reader_options.

Parameters:: src – Source information to read orc file
Returns:: Builder to build reader options

class orc_reader_options_builder#

#include <orc.hpp>

Builds settings to use for read_orc().

Public Functions

explicit orc_reader_options_builder() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on the stack.

inline explicit orc_reader_options_builder(source_info src)#

Constructor from source info.

Parameters:: src – The source information used to read orc file

inline orc_reader_options_builder &columns(std::vector<std::string> col_names)#

Sets names of the columns to read.

Parameters:: col_names – Vector of column names
Returns:: this for chaining

inline orc_reader_options_builder &stripes(std::vector<std::vector<size_type>> stripes)#

Sets list of individual stripes to read per source.

Parameters:: stripes – Vector of vectors, mapping stripes to read to input sources
Returns:: this for chaining

inline orc_reader_options_builder &skip_rows(int64_t rows)#

Sets number of rows to skip from the start.

Parameters:: rows – Number of rows
Returns:: this for chaining

inline orc_reader_options_builder &num_rows(int64_t nrows)#

Sets number of rows to read.

Parameters:: nrows – Number of rows
Returns:: this for chaining

inline orc_reader_options_builder &use_index(bool use)#

Enable/Disable use of row index to speed-up reading.

Parameters:: use – Boolean value to enable/disable row index use
Returns:: this for chaining

inline orc_reader_options_builder &use_np_dtypes(bool use)#

Enable/Disable use of numpy-compatible dtypes.

Parameters:: use – Boolean value to enable/disable
Returns:: this for chaining

inline orc_reader_options_builder &timestamp_type(data_type type)#

Sets timestamp type to which timestamp column will be cast.

Parameters:: type – Type of timestamp
Returns:: this for chaining

inline orc_reader_options_builder &decimal128_columns(std::vector<std::string> val)#

Columns that should be read as 128-bit Decimal.

Parameters:: val – Vector of column names
Returns:: this for chaining

inline orc_reader_options_builder &ignore_timezone_in_stripe_footer(bool ignore)#

Set whether to ignore writer timezone in the stripe footer.

Parameters:: ignore – Boolean value to enable/disable ignoring writer timezone
Returns:: this for chaining

inline operator orc_reader_options&&()#: Moves the orc_reader_options member once it’s built.

inline orc_reader_options &&build()#

Moves the orc_reader_options member once it’s built.

This has been added since Cython does not support overloading of conversion operators.

Returns:: Built orc_reader_options object’s r-value reference

class chunked_orc_reader#

#include <orc.hpp>

The chunked orc reader class to read an ORC file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large ORC files whose column sizes exceed the limit that can be stored in cudf columns. When the file content is read in chunks using this class, the size of each chunk is guaranteed to stay within the given limit.

Public Functions

chunked_orc_reader()#

Default constructor; this should never be used.

This is added just to satisfy Cython.

explicit chunked_orc_reader(std::size_t chunk_read_limit, std::size_t pass_read_limit, size_type output_row_granularity, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Construct the reader from input/output size limits, output row granularity, along with other ORC reader options.

The typical usage should be similar to this:

do {
  auto const chunk = reader.read_chunk();
  // Process chunk
} while (reader.has_next());

If chunk_read_limit == 0 (i.e., no output limit) and pass_read_limit == 0 (no temporary memory size limit), a call to read_chunk() will read the whole data source and return a table containing all rows.

The chunk_read_limit parameter controls the size of the output table to be returned per read_chunk() call. If the user specifies a 100 MB limit, the reader will attempt to return tables that have a total byte size (over all columns) of 100 MB or less. This is a soft limit, and the code will not fail if it cannot satisfy the limit.

The pass_read_limit parameter controls how much temporary memory is used in the entire process of loading, decompressing and decoding of data. This is also a soft limit, and the reader makes a best effort to stay within it.

Finally, the output_row_granularity parameter controls the step by which the row count of the output chunks changes. For each call to read_chunk(), with respect to the given pass_read_limit, a subset of stripes may be loaded, decompressed and decoded into an intermediate table. The reader will then subdivide that table into smaller tables for final output using output_row_granularity as the subdivision step.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
pass_read_limit – Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
output_row_granularity – The granularity parameter used for subdividing the decoded table for final output
options – Settings for controlling reading behaviors
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

Throws:

cudf::logic_error – if output_row_granularity is non-positive
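The interplay of chunk_read_limit and output_row_granularity can be pictured with a simplified standalone model. It is not the reader's actual algorithm: `subdivide_rows` is a hypothetical helper that assumes every row has the same byte size, and it treats the limit as soft by always emitting at least one granularity step per chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model: split `num_rows` decoded rows into output chunks whose
// row counts are multiples of `output_row_granularity` (except the last one).
std::vector<std::int64_t> subdivide_rows(std::int64_t num_rows,
                                         std::size_t bytes_per_row,
                                         std::size_t chunk_read_limit,
                                         std::int64_t output_row_granularity)
{
  if (output_row_granularity <= 0) {
    throw std::invalid_argument("output_row_granularity must be positive");
  }
  if (num_rows <= 0) { return {}; }
  if (chunk_read_limit == 0) { return {num_rows}; }  // 0 means no output limit

  // Rows that fit in the byte limit, under the uniform row size assumption.
  auto const rows_in_limit = static_cast<std::int64_t>(
    chunk_read_limit / std::max<std::size_t>(bytes_per_row, 1));

  // Soft limit: never go below one granularity step per chunk.
  auto const steps =
    std::max<std::int64_t>(rows_in_limit / output_row_granularity, 1);
  auto const rows_per_chunk = steps * output_row_granularity;

  std::vector<std::int64_t> chunks;
  for (auto remaining = num_rows; remaining > 0; remaining -= rows_per_chunk) {
    chunks.push_back(std::min(remaining, rows_per_chunk));
  }
  return chunks;
}
```

For example, 2500 rows of 100 bytes each with a 100000-byte limit and a granularity of 400 yield chunks of 800, 800, 800 and 100 rows in this model: 1000 rows fit in the limit, which rounds down to two granularity steps.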

explicit chunked_orc_reader(std::size_t chunk_read_limit, std::size_t pass_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Construct the reader from input/output size limits along with other ORC reader options.

This constructor implicitly calls the other constructor with output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
pass_read_limit – Limit on temporary memory usage for reading the data sources, or 0 if there is no limit
options – Settings for controlling reading behaviors
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

explicit chunked_orc_reader(std::size_t chunk_read_limit, orc_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Construct the reader from output size limits along with other ORC reader options.

This constructor implicitly calls the other constructor with pass_read_limit set to 0 and output_row_granularity set to DEFAULT_OUTPUT_ROW_GRANULARITY rows.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per read_chunk() call, or 0 if there is no limit
options – Settings for controlling reading behaviors
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

~chunked_orc_reader()#: Destructor, destroying the internal reader instance.

bool has_next() const#

Checks whether any data in the given data sources has not yet been read.

Returns:: A boolean value indicating if there is any data left to read

table_with_metadata read_chunk() const#

Read a chunk of rows in the given data sources.

The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire given data sources at once.

An empty table will be returned if the given sources are empty, or all the data has been read and returned by the previous calls.

Returns:: An output cudf::table along with its metadata

class parquet_reader_options#

#include <parquet.hpp>

Settings for read_parquet().

Public Functions

explicit parquet_reader_options() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on the stack. The hybrid_scan_reader also uses this to create parquet_reader_options without a source.

inline source_info const &get_source() const#

Returns source info.

Returns:: Source info

inline bool is_enabled_convert_strings_to_categories() const#

Returns boolean depending on whether strings should be converted to categories.

Returns:: true if strings should be converted to categories

inline bool is_enabled_use_pandas_metadata() const#

Returns boolean depending on whether to use pandas metadata while reading.

Returns:: true if pandas metadata is used while reading

inline bool is_enabled_use_arrow_schema() const#

Returns boolean depending on whether to use arrow schema while reading.

Returns:: true if arrow schema is used while reading

inline bool is_enabled_allow_mismatched_pq_schemas() const#

Returns boolean depending on whether to read matching projected and filter columns from mismatched Parquet sources.

Returns:: true if matching projected and filter columns will be read from mismatched Parquet sources.

inline bool is_enabled_ignore_missing_columns() const#

Returns boolean depending on whether to ignore non-existent projected columns while reading.

Returns:: true if non-existent projected columns will be ignored while reading.

inline std::optional<std::vector<reader_column_schema>> get_column_schema() const#

Returns optional tree of metadata.

Returns:: vector of reader_column_schema objects.

inline int64_t get_skip_rows() const#

Returns number of rows to skip from the start.

Returns:: Number of rows to skip from the start

inline std::optional<int64_t> const &get_num_rows() const#

Returns number of rows to read.

Returns:: Number of rows to read; nullopt if the option hasn’t been set (in which case the file is read until the end)

inline size_t get_skip_bytes() const#

Returns bytes to skip before starting reading row groups.

Returns:: Bytes to skip before starting reading row groups; only valid for single parquet source case

inline std::optional<size_t> const &get_num_bytes() const#

Returns the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

Returns:: Number of bytes, counted after the skipped bytes, at which to stop reading row groups; only valid for single parquet source case

inline auto const &get_columns() const#

Returns names of columns to be read, if set.

Returns:: Names of columns to be read; nullopt if the option is not set

inline auto const &get_row_groups() const#

Returns list of individual row groups to be read.

Returns:: List of individual row groups to be read

inline auto const &get_filter() const#

Returns AST based filter for predicate pushdown.

Returns:: AST expression to use as filter

inline data_type get_timestamp_type() const#

Returns timestamp type used to cast timestamp columns.

Returns:: Timestamp type used to cast timestamp columns

inline bool is_enabled_use_jit_filter() const#

Returns whether to use JIT compilation for filtering.

Returns:: true if JIT compilation should be used for filtering

inline void set_source(source_info src)#

Set a new source location.

Parameters:: src – New source_info.

inline void set_columns(std::vector<std::string> col_names)#

Sets the names of columns to be read from all input sources.

Applies the same list of column names across all sources. Unlike set_row_groups, which allows per-source configuration, set_columns applies globally.

Columns that do not exist in the input files will be ignored silently. The output table will only include the columns that are actually found.

To select a nested column (e.g., a struct member), use dot notation.

Example: To read only the bar and baz fields of a struct column foo, call: set_columns({"foo.bar", "foo.baz"});

Note

This function does not currently support per-source column selection.

Parameters:: col_names – A vector of column names to attempt to read from each input source.

void set_row_groups(std::vector<std::vector<size_type>> row_groups)#

Specifies which row groups to read from each input source.

When reading from multiple sources (e.g., multiple files), this function allows selecting specific row groups for each source individually. The outer vector corresponds to the list of input sources, and each inner vector contains the row group indices to read from the respective source.

If no row groups should be read from a given source, its entry should be an empty vector.

Example: To read row groups [0, 2] from the first input and [1] from the second input, call: set_row_groups({{0, 2}, {1}});

Parameters:: row_groups – A vector of vectors, one per input source, each specifying the row group indices to read from that source.
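The nested-vector layout described above can be sketched without libcudf. This is only an illustration: make_row_group_selection is a hypothetical helper, not a libcudf function, and size_type is assumed here to be a 32-bit signed integer standing in for cudf::size_type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Hypothetical helper (not part of libcudf): builds the nested vector shape taken by
// set_row_groups from a sparse map of source index -> row group indices.
// Sources absent from the map keep an empty inner vector, so nothing is read from them.
std::vector<std::vector<size_type>> make_row_group_selection(
  std::size_t num_sources, std::map<std::size_t, std::vector<size_type>> const& selected)
{
  std::vector<std::vector<size_type>> row_groups(num_sources);  // one entry per source
  for (auto const& entry : selected) {
    assert(entry.first < num_sources);  // every key must name an existing source
    row_groups[entry.first] = entry.second;
  }
  return row_groups;
}
```

With three sources, selecting row groups {0, 2} from the first and {1} from the third leaves the middle entry empty, which matches the rule that a source with nothing to read still needs its own (empty) inner vector.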

inline void set_filter(ast::expression const &filter)#

Sets AST based filter for predicate pushdown.

The filter can utilize cudf::ast::column_name_reference to reference a column by its name, even if it’s not necessarily present in the requested projected columns. To refer to output column indices, you can use cudf::ast::column_reference.

For a parquet with columns [“A”, “B”, “C”, … “X”, “Y”, “Z”]:

Example 1: with/without column projection

use_columns({"A", "X", "Z"})
.filter(operation(ast_operator::LESS, column_name_reference{"C"}, literal{100}));

Column “C” need not be present in output table.

Example 2: without column projection

filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column “B” because output will contain all columns in order [“A”, …, “Z”].

Example 3: with column projection

use_columns({"A", "Z", "X"})
.filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column “Z” because output will contain 3 columns in order [“A”, “Z”, “X”].

Parameters:: filter – AST expression to use as filter
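The index rule for cudf::ast::column_reference in Examples 2 and 3 can be modeled in a few lines. The sketch below is a host-only illustration of that rule, not libcudf code; resolve_column_reference is a hypothetical helper, and an empty projection is used here to stand for "no column projection".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): returns the name of the column that a
// column_reference index denotes under the rule described above.
// An empty projection models "no column projection", so the output holds every file column.
std::string resolve_column_reference(std::vector<std::string> const& file_columns,
                                     std::vector<std::string> const& projection,
                                     std::size_t index)
{
  // Indices are positions in the output table, which follows the projection order if given.
  auto const& output_columns = projection.empty() ? file_columns : projection;
  assert(index < output_columns.size());  // the index must name an output column
  return output_columns[index];
}
```

For a file with columns "A" through "Z", index 1 resolves to "B" without a projection and to "Z" with the projection {"A", "Z", "X"}, mirroring Examples 2 and 3.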

inline void enable_convert_strings_to_categories(bool val)#

Enables or disables conversion of strings to categories.

Parameters:: val – Boolean value to enable/disable conversion of string columns to categories

inline void enable_use_pandas_metadata(bool val)#

Enables or disables use of pandas metadata to read.

Parameters:: val – Boolean indicating whether to use pandas metadata

inline void enable_use_arrow_schema(bool val)#

Enables or disables use of arrow schema to read.

Parameters:: val – Boolean indicating whether to use arrow schema

inline void enable_allow_mismatched_pq_schemas(bool val)#

Enables or disables reading of matching projected and filter columns from mismatched Parquet sources.

Parameters:: val – Boolean indicating whether to read matching projected and filter columns from mismatched Parquet sources.

inline void enable_ignore_missing_columns(bool val)#

Enables or disables ignoring of non-existent projected columns while reading.

Parameters:: val – Boolean indicating whether to ignore non-existent projected columns while reading.

inline void set_column_schema(std::vector<reader_column_schema> val)#

Sets reader column schema.

Parameters:: val – Tree of schema nodes to enable/disable conversion of binary to string columns. Note: the default is to convert to string columns.

void set_skip_rows(int64_t val)#

Sets number of rows to skip.

Parameters:: val – Number of rows to skip from start

void set_num_rows(int64_t val)#

Sets number of rows to read.

Note

Although this allows one to request more than size_type::max() rows, if any single read would produce a table larger than this row limit, an error is thrown.

Parameters:: val – Number of rows to read after skip
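How skip_rows and num_rows combine can be pictured as selecting a half-open row range. The following is a minimal host-only sketch, not libcudf code: row_window is a hypothetical helper, clamping to the total row count is an assumption, the point at which the limit is checked is an assumption, and the limit itself assumes that size_type is a 32-bit signed integer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical helper (not part of libcudf): computes the half-open row range [begin, end)
// selected by skip_rows / num_rows for a source holding total_rows rows.
// Clamping to total_rows is an assumption of this sketch.
std::pair<std::int64_t, std::int64_t> row_window(std::int64_t total_rows,
                                                 std::int64_t skip_rows,
                                                 std::optional<std::int64_t> num_rows)
{
  assert(total_rows >= 0 && skip_rows >= 0);  // negative inputs are not modeled
  auto const begin = std::min(skip_rows, total_rows);
  // An unset num_rows means "read until the end".
  auto const end = num_rows.has_value() ? std::min(begin + *num_rows, total_rows) : total_rows;
  // Models the note above: one read cannot produce more rows than size_type can index.
  if (end - begin > std::numeric_limits<std::int32_t>::max()) {
    throw std::overflow_error("single read exceeds the size_type row limit");
  }
  return {begin, end};
}
```

For a 100-row source, skipping 10 rows with num_rows unset selects rows [10, 100), while setting num_rows to 20 selects rows [10, 30).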

void set_skip_bytes(size_t val)#

Sets bytes to skip before starting reading row groups.

Parameters:: val – Bytes to skip before starting reading row groups

void set_num_bytes(size_t val)#

Sets the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

Parameters:: val – Number of bytes, counted after the skipped bytes, at which to stop reading row groups

inline void set_timestamp_type(data_type type)#

Sets timestamp_type used to cast timestamp columns.

Parameters:: type – The timestamp data_type to which all timestamp columns need to be cast

Public Static Functions

static parquet_reader_options_builder builder(source_info src = source_info{})#

Creates a parquet_reader_options_builder to build parquet_reader_options. By default, build with empty data source info.

Parameters:: src – Source information to read parquet file
Returns:: Builder to build reader options

class parquet_reader_options_builder#

#include <parquet.hpp>

Builds parquet_reader_options to use for read_parquet().

Public Functions

parquet_reader_options_builder() = default#

Default constructor.

This has been added since Cython requires a default constructor to create objects on stack. The hybrid_scan_reader also uses this to construct parquet_reader_options without a source.

inline explicit parquet_reader_options_builder(source_info src)#

Constructor from source info.

Parameters:: src – The source information used to read parquet file

inline parquet_reader_options_builder &columns(std::vector<std::string> col_names)#

Sets names of the columns to be read.

Parameters:: col_names – Vector of column names
Returns:: this for chaining

inline parquet_reader_options_builder &row_groups(std::vector<std::vector<size_type>> row_groups)#

Sets vector of individual row groups to read.

Parameters:: row_groups – Vector of row groups to read
Returns:: this for chaining

inline parquet_reader_options_builder &filter(ast::expression const &filter)#

Sets AST based filter for predicate pushdown.

For a parquet with columns [“A”, “B”, “C”, … “X”, “Y”, “Z”]:

Example 1: with/without column projection

use_columns({"A", "X", "Z"})
.filter(operation(ast_operator::LESS, column_name_reference{"C"}, literal{100}));

Column “C” need not be present in output table.

Example 2: without column projection

filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column “B” because output will contain all columns in order [“A”, …, “Z”].

Example 3: with column projection

use_columns({"A", "Z", "X"})
.filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column “Z” because output will contain 3 columns in order [“A”, “Z”, “X”].

Parameters:: filter – AST expression to use as filter
Returns:: this for chaining

inline parquet_reader_options_builder &convert_strings_to_categories(bool val)#

Enables or disables conversion of strings to categories.

Parameters:: val – Boolean value to enable/disable conversion of string columns to categories
Returns:: this for chaining

inline parquet_reader_options_builder &use_pandas_metadata(bool val)#

Enables or disables use of pandas metadata to read.

Parameters:: val – Boolean value whether to use pandas metadata
Returns:: this for chaining

inline parquet_reader_options_builder &use_arrow_schema(bool val)#

Enables or disables use of arrow schema to read.

Parameters:: val – Boolean value whether to use arrow schema
Returns:: this for chaining

inline parquet_reader_options_builder &allow_mismatched_pq_schemas(bool val)#

Enables or disables reading of matching projected and filter columns from mismatched Parquet sources.

Parameters:: val – Boolean value whether to read matching projected and filter columns from mismatched Parquet sources.
Returns:: this for chaining.

inline parquet_reader_options_builder &ignore_missing_columns(bool val)#

Enables or disables ignoring of non-existent projected columns while reading.

Parameters:: val – Boolean indicating whether to ignore non-existent projected columns while reading.
Returns:: this for chaining.

inline parquet_reader_options_builder &set_column_schema(std::vector<reader_column_schema> val)#

Sets reader column schema.

Parameters:: val – Tree of metadata information.
Returns:: this for chaining

inline parquet_reader_options_builder &skip_rows(int64_t val)#

Sets number of rows to skip.

Parameters:: val – Number of rows to skip from start
Returns:: this for chaining

inline parquet_reader_options_builder &num_rows(int64_t val)#

Sets number of rows to read.

Note

Although this allows one to request more than size_type::max() rows, if any single read would produce a table larger than this row limit, an error is thrown.

Parameters:: val – Number of rows to read after skip
Returns:: this for chaining

inline parquet_reader_options_builder &skip_bytes(size_t val)#

Sets bytes to skip before starting reading row groups.

Parameters:: val – Bytes to skip before starting reading row groups
Returns:: this for chaining

inline parquet_reader_options_builder &num_bytes(size_t val)#

Sets the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

Parameters:: val – Number of bytes, counted after the skipped bytes, at which to stop reading row groups
Returns:: this for chaining

inline parquet_reader_options_builder &timestamp_type(data_type type)#

Sets timestamp_type used to cast timestamp columns.

Parameters:: type – The timestamp data_type to which all timestamp columns need to be cast
Returns:: this for chaining

inline parquet_reader_options_builder &use_jit_filter(bool use_jit_filter)#

Enable/disable use of JIT for filter step.

Parameters:: use_jit_filter – Boolean value whether to use JIT filter
Returns:: this for chaining

inline operator parquet_reader_options&&()#: Moves the parquet_reader_options member once it’s built.

inline parquet_reader_options &&build()#

Moves the parquet_reader_options member once it’s built.

This has been added since Cython does not support overloading of conversion operators.

Returns:: Built parquet_reader_options object’s r-value reference
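The builder shape documented above (setters that return this for chaining, plus a conversion operator and build() that both hand out the built member as an r-value reference) can be sketched with stand-in types. Everything below is hypothetical and host-only: mock_reader_options and mock_reader_options_builder are not libcudf classes, and only two setters are modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the options class (not the libcudf type).
struct mock_reader_options {
  std::vector<std::string> columns;
  std::int64_t skip_rows{0};
};

// Hypothetical stand-in for the builder (not the libcudf type).
class mock_reader_options_builder {
 public:
  // Each setter returns *this so that calls can be chained.
  mock_reader_options_builder& columns(std::vector<std::string> col_names)
  {
    options_.columns = std::move(col_names);
    return *this;
  }

  mock_reader_options_builder& skip_rows(std::int64_t val)
  {
    options_.skip_rows = val;
    return *this;
  }

  // Both members expose the built options as an r-value reference so the caller can move
  // from it; build() covers callers that cannot use the conversion operator.
  operator mock_reader_options&&() { return std::move(options_); }
  mock_reader_options&& build() { return std::move(options_); }

 private:
  mock_reader_options options_;
};
```

A chain such as mock_reader_options_builder{}.columns({"foo.bar"}).skip_rows(5).build() yields the options object directly, and assigning the chained builder to a mock_reader_options variable goes through the conversion operator instead.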

class chunked_parquet_reader#

#include <parquet.hpp>

The chunked parquet reader class reads a Parquet file iteratively into a series of tables, chunk by chunk.

This class addresses the problem of reading very large Parquet files whose column sizes exceed the limit of what can be stored in a cudf column. When the file content is read in chunks using this class, each chunk is guaranteed to stay within the given size limit.

Public Functions

chunked_parquet_reader()#

Default constructor; this should never be used.

It exists only to satisfy Cython and to avoid leaking the detail API.

chunked_parquet_reader(std::size_t chunk_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Constructor for chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), and an additional parameter that specifies the byte size limit of the output table for each read.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per read, or 0 if there is no limit
options – The options used to read Parquet file
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

chunked_parquet_reader(std::size_t chunk_read_limit, std::size_t pass_read_limit, parquet_reader_options const &options, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Constructor for chunked reader.

This constructor requires the same parquet_reader_options parameter as cudf::read_parquet(), with additional parameters that specify the byte size limit of the output table for each read and a byte limit on the amount of temporary memory to use when reading. pass_read_limit affects how many row groups can be read at a time by limiting the amount of memory dedicated to decompression space. pass_read_limit is a hint, not an absolute limit: if a single row group cannot fit within the given limit, it will still be loaded.

Parameters:

chunk_read_limit – Limit on total number of bytes to be returned per read, or 0 if there is no limit
pass_read_limit – Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit
options – The options used to read Parquet file
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource to use for device memory allocation

~chunked_parquet_reader()#

Destructor, destroying the internal reader instance.

Since the declaration of the internal reader object does not exist in this header, this destructor needs to be defined in a separate source file that can access that object’s declaration.

bool has_next() const#

Checks whether any data in the given file has not yet been read.

Returns:: A boolean value indicating if there is any data left to read

table_with_metadata read_chunk() const#

Read a chunk of rows in the given Parquet file.

The sequence of returned tables, if concatenated in order, is guaranteed to form the same complete dataset as reading the entire given file at once.

An empty table will be returned if the given file is empty, or all the data in the file has been read and returned by the previous calls.

Returns:: An output cudf::table along with its metadata
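The has_next() / read_chunk() protocol described above amounts to a simple loop. The sketch below uses a hypothetical host-only mock, not the libcudf class: mock_chunked_reader serves 4-byte rows from a vector, treats a chunk_read_limit of 0 as "no limit", and always returns at least one row per chunk (an assumption made so the mock always makes progress).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-only mock (not the libcudf class): serves 4-byte rows in chunks whose
// byte size stays within chunk_read_limit; a limit of 0 means "no limit".
class mock_chunked_reader {
 public:
  mock_chunked_reader(std::size_t chunk_read_limit, std::vector<std::int32_t> rows)
    : limit_{chunk_read_limit}, rows_{std::move(rows)}
  {
  }

  bool has_next() const { return pos_ < rows_.size(); }

  std::vector<std::int32_t> read_chunk()
  {
    auto const remaining = rows_.size() - pos_;
    // At least one row per chunk so that the mock always makes progress (an assumption).
    auto const max_rows =
      limit_ == 0 ? remaining : std::max<std::size_t>(1, limit_ / sizeof(std::int32_t));
    auto const count = std::min(remaining, max_rows);
    auto const first = rows_.begin() + static_cast<std::ptrdiff_t>(pos_);
    std::vector<std::int32_t> chunk(first, first + static_cast<std::ptrdiff_t>(count));
    pos_ += count;
    return chunk;  // empty once all rows have been returned
  }

 private:
  std::size_t limit_;
  std::vector<std::int32_t> rows_;
  std::size_t pos_{0};
};

// The reading loop: keep calling read_chunk() while has_next() reports remaining data.
std::vector<std::vector<std::int32_t>> read_all_chunks(mock_chunked_reader& reader)
{
  std::vector<std::vector<std::int32_t>> chunks;
  while (reader.has_next()) { chunks.push_back(reader.read_chunk()); }
  return chunks;
}
```

Concatenating the chunks in the order returned reproduces the full input, and a further read_chunk() call after the loop returns an empty result, which is the behavior the mock is meant to mirror.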

class byte_range_info#

#include <byte_range_info.hpp>

Stores the offset and size used to indicate a byte range.

Public Functions

byte_range_info(int64_t offset, int64_t size)#

Constructs a byte_range_info object.

Parameters:

offset – offset in bytes
size – size in bytes

byte_range_info(byte_range_info const &other) noexcept = default#

Copy constructor.

Parameters:: other – byte_range_info object to copy

byte_range_info &operator=(byte_range_info const &other) noexcept = default#

Copy assignment operator.

Parameters:: other – byte_range_info object to copy
Returns:: this object after copying

inline int64_t offset() const#

Get the offset in bytes.

Returns:: Offset in bytes

inline int64_t size() const#

Get the size in bytes.

Returns:: Size in bytes

inline bool is_empty() const#

Returns whether the range is empty.

Returns:: true iff the range is empty, i.e. size() == 0

class device_data_chunk#

#include <data_chunk_source.hpp>

A contract guaranteeing stream-ordered memory access to the underlying device data.

This class guarantees access to the underlying data for the stream on which the data was allocated. Possible implementations may own the device data, or may only have a view over the data. Any work enqueued to the stream on which this data was allocated is guaranteed to be performed prior to the destruction of the underlying data, but otherwise no guarantees are made regarding if or when the underlying data gets destroyed.

Public Functions

virtual char const *data() const = 0#

Returns a pointer to the underlying device data.

Returns:: A pointer to the underlying device data

virtual std::size_t size() const = 0#

Returns the size of the underlying device data.

Returns:: The size of the underlying device data

virtual operator device_span<char const>() const = 0#

Returns a span over the underlying device data.

Returns:: A span over the underlying device data

class data_chunk_reader#

#include <data_chunk_source.hpp>

A reader capable of producing views over device memory.

The data chunk reader API encapsulates the idea of statefully traversing and loading a data source. A data source may be a file, a region of device memory, or a region of host memory. Reading data from these data sources efficiently requires different strategies depending on the type of data source, the type of compression, the capabilities of the host and device, and the data’s destination. Whole-file decompression should be hidden behind this interface.

Public Functions

virtual void skip_bytes(std::size_t size) = 0#

Skips the specified number of bytes in the data source.

Parameters:: size – The number of bytes to skip

virtual std::unique_ptr<device_data_chunk> get_next_chunk(std::size_t size, rmm::cuda_stream_view stream) = 0#

Get the next chunk of bytes from the data source.

Performs any necessary work to read and prepare the underlying data source for consumption as a view over device memory. Common implementations may read from a file, copy data from host memory, allocate temporary memory, perform iterative decompression, or even launch device kernels.

Parameters:

size – number of bytes to read
stream – stream to associate allocations or perform work required to obtain chunk

Returns:

A chunk of data of up to size bytes. May return fewer than size bytes if the reader reaches the end of the underlying data source. Returned data must be accessed in stream order relative to the specified stream.
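The stateful skip-then-read behavior of this interface can be illustrated on host memory alone. The class below is a hypothetical mock, not a libcudf type: it reads from a std::string instead of device memory, takes no stream, and clamps skip_bytes at the end of the source (an assumption).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical host-only mock (not a libcudf class): traverses an in-memory string the way
// a data_chunk_reader traverses its source, without any device memory or CUDA stream.
class mock_host_chunk_reader {
 public:
  explicit mock_host_chunk_reader(std::string data) : data_{std::move(data)} {}

  // Advances the read position without producing data; clamped at the end of the source.
  void skip_bytes(std::size_t size) { pos_ = std::min(pos_ + size, data_.size()); }

  // Returns up to size bytes; fewer (possibly none) once the end of the source is reached.
  std::string get_next_chunk(std::size_t size)
  {
    auto const count = std::min(size, data_.size() - pos_);
    auto chunk       = data_.substr(pos_, count);
    pos_ += count;
    return chunk;
  }

 private:
  std::string data_;
  std::size_t pos_{0};
};
```

Reading an 8-byte source in 4-byte requests after skipping 2 bytes yields a full chunk, then a short 2-byte chunk, then an empty one, which is the "may return fewer than size bytes" case.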

class data_chunk_source#

#include <data_chunk_source.hpp>

A data source capable of creating a reader which can produce views of the data source in device memory.

Public Functions

virtual std::unique_ptr<data_chunk_reader> create_reader() const = 0#

Get a reader for the data source.

Returns:: data_chunk_reader object for the data source

struct parse_options#

#include <multibyte_split.hpp>

Parsing options for multibyte_split.

Public Members

byte_range_info byte_range = create_byte_range_info_max()#: Only rows starting inside this byte range will be part of the output column.

bool strip_delimiters = false#: Whether delimiters at the end of rows should be stripped from the output column.
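The effect of the two options can be modeled on a host string. This is a sketch under stated assumptions, not the libcudf implementation: split_rows is a hypothetical helper, the byte range is treated as half-open, each row is taken to include its trailing delimiter unless strip_delimiters is set, and no empty row is emitted after a final delimiter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-only model (not the libcudf implementation) of the two options above.
// Rows end at each occurrence of delimiter; a row is emitted only if its first byte lies in
// the half-open range [range_offset, range_offset + range_size), an assumption of this sketch.
std::vector<std::string> split_rows(std::string const& text,
                                    std::string const& delimiter,
                                    std::size_t range_offset,
                                    std::size_t range_size,
                                    bool strip_delimiters)
{
  assert(!delimiter.empty());
  std::vector<std::string> rows;
  std::size_t row_begin = 0;
  while (row_begin < text.size()) {
    auto const found   = text.find(delimiter, row_begin);
    bool const has_del = found != std::string::npos;
    // The row runs up to and including its trailing delimiter, or to the end of the text.
    auto const row_end = has_del ? found + delimiter.size() : text.size();
    if (row_begin >= range_offset && row_begin - range_offset < range_size) {
      auto const keep_end = (has_del && strip_delimiters) ? found : row_end;
      rows.push_back(text.substr(row_begin, keep_end - row_begin));
    }
    row_begin = row_end;
  }
  return rows;
}
```

For the text "a::bb::ccc" with delimiter "::", rows start at bytes 0, 3, and 7. A byte range of offset 3 and size 4 therefore keeps only the row starting at byte 3, and that row is emitted whole even though the range is defined by where rows start rather than where they end.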
