Settings for read_parquet().

#include <parquet.hpp>

Public Member Functions
	parquet_reader_options ()=default
	Default constructor.

source_info const &	get_source () const
	Returns source info.

bool	is_enabled_convert_strings_to_categories () const
	Returns boolean depending on whether strings should be converted to categories.

bool	is_enabled_use_pandas_metadata () const
	Returns boolean depending on whether to use pandas metadata while reading.

bool	is_enabled_use_arrow_schema () const
	Returns boolean depending on whether to use arrow schema while reading.

bool	is_enabled_allow_mismatched_pq_schemas () const
	Returns boolean depending on whether to read matching projected and filter columns from mismatched Parquet sources.

std::optional< std::vector< reader_column_schema > >	get_column_schema () const
	Returns optional tree of metadata.

int64_t	get_skip_rows () const
	Returns number of rows to skip from the start.

std::optional< size_type > const &	get_num_rows () const
	Returns number of rows to read.

size_t	get_skip_bytes () const
	Returns bytes to skip before starting reading row groups.

std::optional< size_t > const &	get_num_bytes () const
	Returns the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

auto const &	get_columns () const
	Returns names of columns to be read, if set.

auto const &	get_row_groups () const
	Returns list of individual row groups to be read.

auto const &	get_filter () const
	Returns AST based filter for predicate pushdown.

data_type	get_timestamp_type () const
	Returns timestamp type used to cast timestamp columns.

void	set_columns (std::vector< std::string > col_names)
	Sets the names of columns to be read from all input sources.

void	set_row_groups (std::vector< std::vector< size_type >> row_groups)
	Specifies which row groups to read from each input source.

void	set_filter (ast::expression const &filter)
	Sets AST based filter for predicate pushdown.

void	enable_convert_strings_to_categories (bool val)
	Sets to enable/disable conversion of strings to categories.

void	enable_use_pandas_metadata (bool val)
	Sets to enable/disable use of pandas metadata to read.

void	enable_use_arrow_schema (bool val)
	Sets to enable/disable use of arrow schema to read.

void	enable_allow_mismatched_pq_schemas (bool val)
	Sets to enable/disable reading of matching projected and filter columns from mismatched Parquet sources.

void	set_column_schema (std::vector< reader_column_schema > val)
	Sets reader column schema.

void	set_skip_rows (int64_t val)
	Sets number of rows to skip.

void	set_num_rows (size_type val)
	Sets number of rows to read.

void	set_skip_bytes (size_t val)
	Sets bytes to skip before starting reading row groups.

void	set_num_bytes (size_t val)
	Sets the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

void	set_timestamp_type (data_type type)
	Sets timestamp_type used to cast timestamp columns.

Static Public Member Functions
static parquet_reader_options_builder	builder (source_info src=source_info{})
	Creates a `parquet_reader_options_builder` to build `parquet_reader_options`. By default, builds with empty data source info.

Detailed Description

Settings for read_parquet().

Definition at line 78 of file parquet.hpp.

Constructor & Destructor Documentation

◆ parquet_reader_options()

cudf::io::parquet_reader_options::parquet_reader_options ( )

explicit default

Default constructor.

This constructor exists because Cython requires a default constructor to create objects on the stack. The hybrid_scan_reader also uses it to create parquet_reader_options without a source.

Member Function Documentation

◆ builder()

static parquet_reader_options_builder cudf::io::parquet_reader_options::builder ( source_info src = source_info{} )

static

Creates a parquet_reader_options_builder to build parquet_reader_options. By default, builds with empty data source info.

Parameters

src	Source information to read parquet file

Returns: Builder to build reader options

◆ enable_allow_mismatched_pq_schemas()

void cudf::io::parquet_reader_options::enable_allow_mismatched_pq_schemas ( bool val )

inline

Sets to enable/disable reading of matching projected and filter columns from mismatched Parquet sources.

Parameters

val	Boolean indicating whether to read matching projected and filter columns from mismatched Parquet sources.

Definition at line 351 of file parquet.hpp.

◆ enable_convert_strings_to_categories()

void cudf::io::parquet_reader_options::enable_convert_strings_to_categories ( bool val )

inline

Sets to enable/disable conversion of strings to categories.

Parameters

val	Boolean value to enable/disable conversion of string columns to categories

Definition at line 328 of file parquet.hpp.

◆ enable_use_arrow_schema()

void cudf::io::parquet_reader_options::enable_use_arrow_schema ( bool val )

inline

Sets to enable/disable use of arrow schema to read.

Parameters

val	Boolean indicating whether to use arrow schema

Definition at line 342 of file parquet.hpp.

◆ enable_use_pandas_metadata()

void cudf::io::parquet_reader_options::enable_use_pandas_metadata ( bool val )

inline

Sets to enable/disable use of pandas metadata to read.

Parameters

val	Boolean indicating whether to use pandas metadata

Definition at line 335 of file parquet.hpp.

◆ get_column_schema()

std::optional<std::vector<reader_column_schema> > cudf::io::parquet_reader_options::get_column_schema ( ) const

inline

Returns optional tree of metadata.

Returns: vector of reader_column_schema objects.

Definition at line 187 of file parquet.hpp.

◆ get_columns()

auto const& cudf::io::parquet_reader_options::get_columns ( ) const

inline

Returns names of columns to be read, if set.

Returns: Names of columns to be read; nullopt if the option is not set

Definition at line 228 of file parquet.hpp.

◆ get_filter()

auto const& cudf::io::parquet_reader_options::get_filter ( ) const

inline

Returns AST based filter for predicate pushdown.

Returns: AST expression to use as filter

Definition at line 242 of file parquet.hpp.

◆ get_num_bytes()

std::optional<size_t> const& cudf::io::parquet_reader_options::get_num_bytes ( ) const

inline

Returns the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

Returns: Number of bytes, counted after the skipped bytes, at which to stop reading row groups; only valid for single parquet source case

Definition at line 221 of file parquet.hpp.

◆ get_num_rows()

std::optional<size_type> const& cudf::io::parquet_reader_options::get_num_rows ( ) const

inline

Returns number of rows to read.

Returns: Number of rows to read; nullopt if the option hasn't been set (in which case the file is read until the end)

Definition at line 205 of file parquet.hpp.

◆ get_row_groups()

auto const& cudf::io::parquet_reader_options::get_row_groups ( ) const

inline

Returns list of individual row groups to be read.

Returns: List of individual row groups to be read

Definition at line 235 of file parquet.hpp.

◆ get_skip_bytes()

size_t cudf::io::parquet_reader_options::get_skip_bytes ( ) const

inline

Returns bytes to skip before starting reading row groups.

Returns: Bytes to skip before starting reading row groups; only valid for single parquet source case

Definition at line 213 of file parquet.hpp.

◆ get_skip_rows()

int64_t cudf::io::parquet_reader_options::get_skip_rows ( ) const

inline

Returns number of rows to skip from the start.

Returns: Number of rows to skip from the start

Definition at line 197 of file parquet.hpp.

◆ get_source()

source_info const& cudf::io::parquet_reader_options::get_source ( ) const

inline

Returns source info.

Returns: Source info

Definition at line 144 of file parquet.hpp.

◆ get_timestamp_type()

data_type cudf::io::parquet_reader_options::get_timestamp_type ( ) const

inline

Returns timestamp type used to cast timestamp columns.

Returns: Timestamp type used to cast timestamp columns

Definition at line 249 of file parquet.hpp.

◆ is_enabled_allow_mismatched_pq_schemas()

bool cudf::io::parquet_reader_options::is_enabled_allow_mismatched_pq_schemas ( ) const

inline

Returns boolean depending on whether to read matching projected and filter columns from mismatched Parquet sources.

Returns: true if matching projected and filter columns will be read from mismatched Parquet sources.

Definition at line 177 of file parquet.hpp.

◆ is_enabled_convert_strings_to_categories()

bool cudf::io::parquet_reader_options::is_enabled_convert_strings_to_categories ( ) const

inline

Returns boolean depending on whether strings should be converted to categories.

Returns: true if strings should be converted to categories

Definition at line 151 of file parquet.hpp.

◆ is_enabled_use_arrow_schema()

bool cudf::io::parquet_reader_options::is_enabled_use_arrow_schema ( ) const

inline

Returns boolean depending on whether to use arrow schema while reading.

Returns: true if arrow schema is used while reading

Definition at line 168 of file parquet.hpp.

◆ is_enabled_use_pandas_metadata()

bool cudf::io::parquet_reader_options::is_enabled_use_pandas_metadata ( ) const

inline

Returns boolean depending on whether to use pandas metadata while reading.

Returns: true if pandas metadata is used while reading

Definition at line 161 of file parquet.hpp.

◆ set_column_schema()

void cudf::io::parquet_reader_options::set_column_schema ( std::vector< reader_column_schema > val )

inline

Sets reader column schema.

Parameters

val	Tree of schema nodes to enable/disable conversion of binary to string columns. Note default is to convert to string columns.

Definition at line 359 of file parquet.hpp.

◆ set_columns()

void cudf::io::parquet_reader_options::set_columns ( std::vector< std::string > col_names )

inline

Sets the names of columns to be read from all input sources.

Applies the same list of column names across all sources. Unlike set_row_groups, which allows per-source configuration, set_columns applies globally.

Columns that do not exist in the input files will be ignored silently. The output table will only include the columns that are actually found.

To select a nested column (e.g., a struct member), use dot notation.

Example: To read only the bar and baz fields of a struct column foo, call: set_columns({"foo.bar", "foo.baz"});

Note: This function does not currently support per-source column selection.

Parameters

col_names A vector of column names to attempt to read from each input source.

Definition at line 270 of file parquet.hpp.
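The selection rules described for set_columns can be sketched with a small standalone model. This is not libcudf code: select_columns is a hypothetical helper, nested fields are represented as flat dotted paths, and the result is assumed to follow the order of the requested names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy model (hypothetical, not part of libcudf): keeps the requested names
// that exist in the file, in requested order. Names that are absent from the
// file are dropped silently, mirroring the behavior described for set_columns.
// Nested struct members are modeled as flat dotted paths such as "foo.bar".
std::vector<std::string> select_columns(std::vector<std::string> const& available,
                                        std::vector<std::string> const& requested)
{
  std::vector<std::string> selected;
  for (auto const& name : requested) {
    if (std::find(available.begin(), available.end(), name) != available.end()) {
      selected.push_back(name);
    }
  }
  return selected;
}
```

With available paths "id", "foo.bar", "foo.baz" and "foo.qux", requesting "foo.bar", "missing" and "foo.baz" yields only "foo.bar" and "foo.baz" in this model.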

◆ set_filter()

void cudf::io::parquet_reader_options::set_filter ( ast::expression const & filter )

inline

Sets AST based filter for predicate pushdown.

The filter can utilize cudf::ast::column_name_reference to reference a column by its name, even if it's not necessarily present in the requested projected columns. To refer to output column indices, you can use cudf::ast::column_reference.

For a parquet with columns ["A", "B", "C", ... "X", "Y", "Z"]:

Example 1: with/without column projection

use_columns({"A", "X", "Z"})
.filter(operation(ast_operator::LESS, column_name_reference{"C"}, literal{100}));

Column "C" need not be present in output table.

Example 2: without column projection

filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column "B" because output will contain all columns in order ["A", ..., "Z"].

Example 3: with column projection

use_columns({"A", "Z", "X"})
.filter(operation(ast_operator::LESS, column_reference{1}, literal{100}));

Here, 1 will refer to column "Z" because output will contain 3 columns in order ["A", "Z", "X"].

Parameters

filter AST expression to use as filter

Definition at line 321 of file parquet.hpp.
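The difference between Examples 2 and 3 comes down to which column list an index is resolved against. The following standalone sketch models only that lookup; resolve_column_reference is a hypothetical helper and does not use libcudf or its AST types.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy model (hypothetical, not part of libcudf): resolves an output column
// index to a column name. Without a projection the output holds every file
// column in file order; with a projection it holds the projected columns in
// the order they were requested.
std::string resolve_column_reference(std::vector<std::string> const& file_columns,
                                     std::optional<std::vector<std::string>> const& projection,
                                     std::size_t index)
{
  auto const& output = projection.has_value() ? *projection : file_columns;
  assert(index < output.size());  // an index must refer to an output column
  return output[index];
}
```

In this model, index 1 resolves to "B" when no projection is given and to "Z" when the projection is {"A", "Z", "X"}, matching the two examples. A name-based reference avoids this dependence on projection order.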

◆ set_num_bytes()

void cudf::io::parquet_reader_options::set_num_bytes ( size_t val )

Sets the number of bytes, counted after the skipped bytes, at which to stop reading row groups.

Parameters

val	Number of bytes, counted after the skipped bytes, at which to stop reading row groups

◆ set_num_rows()

void cudf::io::parquet_reader_options::set_num_rows ( size_type val )

Sets number of rows to read.

Parameters

val	Number of rows to read after skip

◆ set_row_groups()

void cudf::io::parquet_reader_options::set_row_groups ( std::vector< std::vector< size_type >> row_groups )

Specifies which row groups to read from each input source.

When reading from multiple sources (e.g., multiple files), this function allows selecting specific row groups for each source individually. The outer vector corresponds to the list of input sources, and each inner vector contains the row group indices to read from the respective source.

If no row groups should be read from a given source, its entry should be an empty vector.

Example: To read row groups [0, 2] from the first input and [1] from the second input, call: set_row_groups({{0, 2}, {1}});

Parameters

row_groups A vector of vectors, one per input source, each specifying the row group indices to read from that source.
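To make the nested-vector layout concrete, the sketch below expands a per-source selection into explicit (source index, row group index) pairs. It is a standalone model rather than libcudf code: expand_row_groups is a hypothetical helper, and int stands in for size_type.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model (hypothetical, not part of libcudf): the outer vector has one
// entry per input source and each inner vector lists the row group indices to
// read from that source. An empty inner vector selects nothing from its source.
std::vector<std::pair<std::size_t, int>> expand_row_groups(
  std::vector<std::vector<int>> const& row_groups)
{
  std::vector<std::pair<std::size_t, int>> pairs;
  for (std::size_t source = 0; source < row_groups.size(); ++source) {
    for (int row_group : row_groups[source]) {
      pairs.emplace_back(source, row_group);
    }
  }
  return pairs;
}
```

For the selection {{0, 2}, {}, {1}}, the model produces row groups 0 and 2 of the first source, nothing from the second, and row group 1 of the third.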

◆ set_skip_bytes()

void cudf::io::parquet_reader_options::set_skip_bytes ( size_t val )

Sets bytes to skip before starting reading row groups.

Parameters

val	Bytes to skip before starting reading row groups

◆ set_skip_rows()

void cudf::io::parquet_reader_options::set_skip_rows ( int64_t val )

Sets number of rows to skip.

Parameters

val	Number of rows to skip from start
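The interaction of set_skip_rows and set_num_rows can be pictured as a half-open row window. The sketch below is a standalone model, not libcudf code: row_window is a hypothetical helper, and clamping the window to the total row count is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Toy model (hypothetical, not part of libcudf): returns the half-open range
// [begin, end) of rows that would be read. Rows are skipped from the start,
// then num_rows rows are read; an unset num_rows means reading to the end.
// Clamping to total_rows is an assumption of this sketch.
std::pair<std::int64_t, std::int64_t> row_window(std::int64_t total_rows,
                                                 std::int64_t skip_rows,
                                                 std::optional<std::int64_t> num_rows)
{
  assert(total_rows >= 0 && skip_rows >= 0);
  auto const begin = std::min(skip_rows, total_rows);
  auto const end =
    num_rows.has_value() ? std::min(begin + *num_rows, total_rows) : total_rows;
  return {begin, end};
}
```

With 100 rows in total, skipping 10 and leaving num_rows unset gives the window [10, 100) in this model, while skipping 10 and reading 25 gives [10, 35).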

◆ set_timestamp_type()

void cudf::io::parquet_reader_options::set_timestamp_type ( data_type type )

inline

Sets timestamp_type used to cast timestamp columns.

Parameters

type	The timestamp data_type to which all timestamp columns need to be cast

Definition at line 397 of file parquet.hpp.

The documentation for this class was generated from the following file:

parquet.hpp
