Settings for read_parquet().
#include <parquet.hpp>
Public Member Functions
| | parquet_reader_options ()=default | Default constructor. |
| source_info const & | get_source () const | Returns source info. |
| bool | is_enabled_convert_strings_to_categories () const | Returns boolean depending on whether strings should be converted to categories. |
| bool | is_enabled_use_pandas_metadata () const | Returns boolean depending on whether to use pandas metadata while reading. |
| bool | is_enabled_use_arrow_schema () const | Returns boolean depending on whether to use arrow schema while reading. |
| bool | is_enabled_allow_mismatched_pq_schemas () const | Returns boolean depending on whether to read matching projected and filter columns from mismatched Parquet sources. |
| bool | is_enabled_ignore_missing_columns () const | Returns boolean depending on whether to ignore non-existent projected columns while reading. |
| std::optional< std::vector< reader_column_schema > > | get_column_schema () const | Returns optional tree of metadata. |
| int64_t | get_skip_rows () const | Returns number of rows to skip from the start. |
| std::optional< int64_t > const & | get_num_rows () const | Returns number of rows to read. |
| size_t | get_skip_bytes () const | Returns bytes to skip before starting reading row groups. |
| std::optional< size_t > const & | get_num_bytes () const | Returns number of bytes after skipping to end reading row groups at. |
| auto const & | get_columns () const | Returns names of columns to be read, if set. |
| auto const & | get_column_names () const | Returns names of columns to be read, if set. |
| auto const & | get_column_indices () const | Returns indices of top-level columns to be read, if set. |
| auto const & | get_row_groups () const | Returns list of individual row groups to be read. |
| auto const & | get_filter () const | Returns AST based filter for predicate pushdown. |
| data_type | get_timestamp_type () const | Returns timestamp type used to cast timestamp columns. |
| bool | is_enabled_use_jit_filter () const | Returns whether to use JIT compilation for filtering. |
| void | set_source (source_info src) | Set a new source location. |
| void | set_columns (std::vector< std::string > column_names) | Sets the names of columns to be read from all input sources. |
| void | set_column_names (std::vector< std::string > column_names) | Sets the names of columns to be read from all input sources. |
| void | set_column_indices (std::vector< cudf::size_type > col_indices) | Sets the indices of top-level columns to be read from all input sources. |
| void | set_row_groups (std::vector< std::vector< size_type >> row_groups) | Specifies which row groups to read from each input source. |
| void | set_filter (ast::expression const &filter) | Sets AST based filter for predicate pushdown. |
| void | enable_convert_strings_to_categories (bool val) | Sets to enable/disable conversion of strings to categories. |
| void | enable_use_pandas_metadata (bool val) | Sets to enable/disable use of pandas metadata to read. |
| void | enable_use_arrow_schema (bool val) | Sets to enable/disable use of arrow schema to read. |
| void | enable_allow_mismatched_pq_schemas (bool val) | Sets to enable/disable reading of matching projected and filter columns from mismatched Parquet sources. |
| void | enable_ignore_missing_columns (bool val) | Sets to enable/disable ignoring of non-existent projected columns while reading. |
| void | set_column_schema (std::vector< reader_column_schema > val) | Sets reader column schema. |
| void | set_skip_rows (int64_t val) | Sets number of rows to skip. |
| void | set_num_rows (int64_t val) | Sets number of rows to read. |
| void | set_skip_bytes (size_t val) | Sets bytes to skip before starting reading row groups. |
| void | set_num_bytes (size_t val) | Sets number of bytes after skipping to end reading row groups at. |
| void | set_timestamp_type (data_type type) | Sets timestamp_type used to cast timestamp columns. |
Static Public Member Functions
| static parquet_reader_options_builder | builder (source_info src=source_info{}) | Creates a parquet_reader_options_builder to build parquet_reader_options. By default, build with empty data source info. |
Settings for read_parquet().
Definition at line 66 of file parquet.hpp.
cudf::io::parquet_reader_options::parquet_reader_options () [explicit, default]
Default constructor.
This has been added because Cython requires a default constructor to create objects on the stack. The hybrid_scan_reader also uses it to create parquet_reader_options without a source.
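static parquet_reader_options_builder cudf::io::parquet_reader_options::builder (source_info src=source_info{}) [static]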
Creates a parquet_reader_options_builder to build parquet_reader_options. By default, build with empty data source info.
| src | Source information to read parquet file |
void cudf::io::parquet_reader_options::enable_allow_mismatched_pq_schemas (bool val) [inline]
Sets to enable/disable reading of matching projected and filter columns from mismatched Parquet sources.
| val | Boolean indicating whether to read matching projected and filter columns from mismatched Parquet sources. |
Definition at line 438 of file parquet.hpp.
void cudf::io::parquet_reader_options::enable_convert_strings_to_categories (bool val) [inline]
Sets to enable/disable conversion of strings to categories.
| val | Boolean value to enable/disable conversion of string columns to categories |
Definition at line 415 of file parquet.hpp.
void cudf::io::parquet_reader_options::enable_ignore_missing_columns (bool val) [inline]
Sets to enable/disable ignoring of non-existent projected columns while reading.
| val | Boolean indicating whether to ignore non-existent projected columns while reading. |
Definition at line 446 of file parquet.hpp.
void cudf::io::parquet_reader_options::enable_use_arrow_schema (bool val) [inline]
Sets to enable/disable use of arrow schema to read.
| val | Boolean indicating whether to use arrow schema |
Definition at line 429 of file parquet.hpp.
void cudf::io::parquet_reader_options::enable_use_pandas_metadata (bool val) [inline]
Sets to enable/disable use of pandas metadata to read.
| val | Boolean indicating whether to use pandas metadata |
Definition at line 422 of file parquet.hpp.
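auto const & cudf::io::parquet_reader_options::get_column_indices () const [inline]
Returns indices of top-level columns to be read, if set.
Returns: Indices of top-level columns to be read; nullopt if the option is not set.
Definition at line 249 of file parquet.hpp.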
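auto const & cudf::io::parquet_reader_options::get_column_names () const [inline]
Returns names of columns to be read, if set.
Returns: Names of columns to be read; nullopt if the option is not set.
Definition at line 242 of file parquet.hpp.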
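std::optional< std::vector< reader_column_schema > > cudf::io::parquet_reader_options::get_column_schema () const [inline]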
Returns optional tree of metadata.
Definition at line 191 of file parquet.hpp.
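auto const & cudf::io::parquet_reader_options::get_columns () const [inline]
Returns names of columns to be read, if set.
Returns: Names of columns to be read; nullopt if the option is not set.
Definition at line 232 of file parquet.hpp.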
auto const & cudf::io::parquet_reader_options::get_filter () const [inline]
Returns AST based filter for predicate pushdown.
Definition at line 263 of file parquet.hpp.
std::optional< size_t > const & cudf::io::parquet_reader_options::get_num_bytes () const [inline]
Returns number of bytes after skipping to end reading row groups at.
Definition at line 225 of file parquet.hpp.
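std::optional< int64_t > const & cudf::io::parquet_reader_options::get_num_rows () const [inline]
Returns number of rows to read.
Returns: Number of rows to read; nullopt if the option hasn't been set (in which case the file is read until the end).
Definition at line 209 of file parquet.hpp.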
auto const & cudf::io::parquet_reader_options::get_row_groups () const [inline]
Returns list of individual row groups to be read.
Definition at line 256 of file parquet.hpp.
size_t cudf::io::parquet_reader_options::get_skip_bytes () const [inline]
Returns bytes to skip before starting reading row groups.
Definition at line 217 of file parquet.hpp.
int64_t cudf::io::parquet_reader_options::get_skip_rows () const [inline]
Returns number of rows to skip from the start.
Definition at line 201 of file parquet.hpp.
source_info const & cudf::io::parquet_reader_options::get_source () const [inline]
Returns source info.
data_type cudf::io::parquet_reader_options::get_timestamp_type () const [inline]
Returns timestamp type used to cast timestamp columns.
Definition at line 270 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_allow_mismatched_pq_schemas () const [inline]
Returns boolean depending on whether to read matching projected and filter columns from mismatched Parquet sources.
Returns: true if matching projected and filter columns will be read from mismatched Parquet sources.
Definition at line 172 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_convert_strings_to_categories () const [inline]
Returns boolean depending on whether strings should be converted to categories.
Returns: true if strings should be converted to categories.
Definition at line 146 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_ignore_missing_columns () const [inline]
Returns boolean depending on whether to ignore non-existent projected columns while reading.
Returns: true if non-existent projected columns will be ignored while reading.
Definition at line 184 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_use_arrow_schema () const [inline]
Returns boolean depending on whether to use arrow schema while reading.
Returns: true if arrow schema is used while reading.
Definition at line 163 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_use_jit_filter () const [inline]
Returns whether to use JIT compilation for filtering.
Returns: true if JIT compilation should be used for filtering.
Definition at line 277 of file parquet.hpp.
bool cudf::io::parquet_reader_options::is_enabled_use_pandas_metadata () const [inline]
Returns boolean depending on whether to use pandas metadata while reading.
Returns: true if pandas metadata is used while reading.
Definition at line 156 of file parquet.hpp.
void cudf::io::parquet_reader_options::set_column_indices (std::vector< cudf::size_type > col_indices) [inline]
Sets the indices of top-level columns to be read from all input sources.
Applies the same list of top-level column indices across all sources. Unlike set_row_groups, which allows per-source configuration, set_column_indices applies globally.
Note that set_column_indices can only select top-level columns, unlike set_columns, which can also select nested columns.
| col_indices | A vector of column indices to attempt to read from each input source. |
Definition at line 352 of file parquet.hpp.
void cudf::io::parquet_reader_options::set_column_names (std::vector< std::string > column_names) [inline]
Sets the names of columns to be read from all input sources.
Applies the same list of column names across all sources. Unlike set_row_groups, which allows per-source configuration, set_column_names applies globally.
Columns that do not exist in the input files will be ignored silently and the output table will only include the columns that are actually found. This behavior can be changed by setting enable_ignore_missing_columns to false.
To select a nested column (e.g., a struct member), use dot notation.
Example: To read only the bar and baz fields of a struct column foo, call: set_column_names({"foo.bar", "foo.baz"});
| column_names | A vector of column names to attempt to read from each input source. |
Definition at line 334 of file parquet.hpp.
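The dot notation described above can be pictured as a path from a top-level column down to a nested field. The following is a minimal sketch in plain C++, not libcudf code: split_column_path is a hypothetical helper introduced only for illustration, and it assumes that column names themselves contain no dots.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): splits a dotted column name such
// as "foo.bar" into its path components {"foo", "bar"}.
// Assumption: individual column names do not contain '.' characters.
std::vector<std::string> split_column_path(std::string const& name)
{
  std::vector<std::string> parts;
  std::string::size_type start = 0;
  while (true) {
    auto const dot = name.find('.', start);
    if (dot == std::string::npos) {
      // Last (or only) component: take the rest of the string.
      parts.push_back(name.substr(start));
      break;
    }
    parts.push_back(name.substr(start, dot - start));
    start = dot + 1;
  }
  return parts;
}
```

Under this model, "foo.bar" names the field bar inside the struct column foo, while a name without a dot denotes a top-level column.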
void cudf::io::parquet_reader_options::set_column_schema (std::vector< reader_column_schema > val) [inline]
Sets reader column schema.
| val | Tree of schema nodes to enable/disable conversion of binary to string columns. Note default is to convert to string columns. |
Definition at line 454 of file parquet.hpp.
void cudf::io::parquet_reader_options::set_columns (std::vector< std::string > column_names) [inline]
Sets the names of columns to be read from all input sources.
Deprecated: Use set_column_names instead.
Applies the same list of column names across all sources. Unlike set_row_groups, which allows per-source configuration, set_columns applies globally.
Columns that do not exist in the input files will be ignored silently and the output table will only include the columns that are actually found. This behavior can be changed by setting enable_ignore_missing_columns to false.
To select a nested column (e.g., a struct member), use dot notation.
Example: To read only the bar and baz fields of a struct column foo, call: set_columns({"foo.bar", "foo.baz"});
| column_names | A vector of column names to attempt to read from each input source. |
Definition at line 308 of file parquet.hpp.
void cudf::io::parquet_reader_options::set_filter (ast::expression const &filter) [inline]
Sets AST based filter for predicate pushdown.
The filter can utilize cudf::ast::column_name_reference to reference a column by its name, even if it's not necessarily present in the requested projected columns. To refer to output column indices, you can use cudf::ast::column_reference.
For a Parquet file with columns ["A", "B", "C", ... "X", "Y", "Z"]:
Example 1 (with or without column projection): a filter may reference column "C" by name through cudf::ast::column_name_reference. Column "C" need not be present in the output table.
Example 2 (without column projection): in a filter that uses cudf::ast::column_reference with index 1, the index 1 refers to column "B" because the output contains all columns in order ["A", ..., "Z"].
Example 3 (with column projection ["A", "Z", "X"]): the index 1 refers to column "Z" because the output contains 3 columns in order ["A", "Z", "X"].
| filter | AST expression to use as filter |
Definition at line 408 of file parquet.hpp.
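The index rule in Examples 2 and 3 can be modeled in a few lines of plain C++. This is only a sketch of the rule as stated above, not libcudf's implementation: resolve_output_column is a hypothetical helper, and the projection argument stands in for the list passed to set_column_names.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of libcudf): returns the name of the column
// that an index-based reference denotes in the output table.
// `file_columns` is the column order in the file; `projection` is the list of
// projected column names, or std::nullopt when no projection is requested.
std::string resolve_output_column(std::vector<std::string> const& file_columns,
                                  std::optional<std::vector<std::string>> const& projection,
                                  std::size_t index)
{
  // With a projection, the output follows the projection order;
  // without one, it follows the file order.
  auto const& output = projection.has_value() ? *projection : file_columns;
  if (index >= output.size()) { throw std::out_of_range("column index out of range"); }
  return output[index];
}
```

With no projection, index 1 resolves to the second column of the file; with the projection {"A", "Z", "X"}, the same index resolves to "Z". A name-based reference avoids this dependence on the projection order.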
void cudf::io::parquet_reader_options::set_num_bytes (size_t val)
Sets number of bytes after skipping to end reading row groups at.
| val | Number of bytes after skipping to end reading row groups at |
void cudf::io::parquet_reader_options::set_num_rows (int64_t val)
Sets number of rows to read.
Note: Although val may request more than size_type::max() rows, if any single read would produce a table larger than this row limit, an error is thrown.
| val | Number of rows to read after skip |
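How the skip and row-count settings combine with the row limit can be sketched in plain C++. This is a model under stated assumptions, not libcudf's implementation: row_window is a hypothetical helper, size_type is a local stand-in for cudf::size_type (a 32-bit signed integer), inputs are assumed non-negative, and the model covers a single read only.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <utility>

// Local stand-in for cudf::size_type, a 32-bit signed integer.
using size_type = std::int32_t;

// Hypothetical helper (not part of libcudf): returns {first_row, row_count}
// for a single read. `num_rows` mirrors get_num_rows(): std::nullopt means
// "read until the end". Assumes all inputs are non-negative.
std::pair<std::int64_t, std::int64_t> row_window(std::int64_t total_rows,
                                                 std::int64_t skip_rows,
                                                 std::optional<std::int64_t> num_rows)
{
  // Skipping past the end leaves nothing to read.
  auto const first     = std::min(skip_rows, total_rows);
  auto const remaining = total_rows - first;
  // Clamp the requested count to the rows that remain after skipping.
  auto const count = num_rows.has_value() ? std::min(*num_rows, remaining) : remaining;
  // A single output table cannot hold more than size_type::max() rows.
  if (count > std::numeric_limits<size_type>::max()) {
    throw std::overflow_error("read would exceed size_type::max() rows");
  }
  return {first, count};
}
```

In this model, a large requested row count is harmless as long as the rows actually produced by the read fit within the limit.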
void cudf::io::parquet_reader_options::set_row_groups (std::vector< std::vector< size_type >> row_groups)
Specifies which row groups to read from each input source.
When reading from multiple sources (e.g., multiple files), this function allows selecting specific row groups for each source individually. The outer vector corresponds to the list of input sources, and each inner vector contains the row group indices to read from the respective source.
If no row groups should be read from a given source, its entry should be an empty vector.
Example: To read row groups [0, 2] from the first input and [1] from the second input, call: set_row_groups({{0, 2}, {1}});
| row_groups | A vector of vectors, one per input source, each specifying the row group indices to read from that source. |
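The nested-vector layout described above can be made concrete by flattening it into (source, row group) pairs. The following plain C++ sketch is illustrative only: expand_row_groups is a hypothetical helper, and size_type is a local stand-in for cudf::size_type.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Local stand-in for cudf::size_type.
using size_type = int;

// Hypothetical helper (not part of libcudf): flattens per-source row group
// lists into (source index, row group index) pairs. The outer vector has one
// entry per input source; an empty inner vector selects nothing from it.
std::vector<std::pair<std::size_t, size_type>> expand_row_groups(
  std::vector<std::vector<size_type>> const& row_groups)
{
  std::vector<std::pair<std::size_t, size_type>> selected;
  for (std::size_t source = 0; source < row_groups.size(); ++source) {
    for (size_type row_group : row_groups[source]) {
      selected.emplace_back(source, row_group);
    }
  }
  return selected;
}
```

For the argument {{0, 2}, {1}}, the model selects row groups 0 and 2 from the first source and row group 1 from the second. For {{}, {3}}, only row group 3 of the second source is selected.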
void cudf::io::parquet_reader_options::set_skip_bytes (size_t val)
Sets bytes to skip before starting reading row groups.
| val | Bytes to skip before starting reading row groups |
void cudf::io::parquet_reader_options::set_skip_rows (int64_t val)
Sets number of rows to skip.
| val | Number of rows to skip from start |
void cudf::io::parquet_reader_options::set_source (source_info src) [inline]
Set a new source location.
| src | New source_info. |
Definition at line 284 of file parquet.hpp.
void cudf::io::parquet_reader_options::set_timestamp_type (data_type type) [inline]
Sets timestamp_type used to cast timestamp columns.
| type | The timestamp data_type to which all timestamp columns need to be cast |
Definition at line 495 of file parquet.hpp.