Parquet

class pylibcudf.io.parquet.ChunkedParquetReader(SourceInfo source_info, list columns=None, list row_groups=None, bool use_pandas_metadata=True, bool convert_strings_to_categories=False, int64_t skip_rows=0, size_type nrows=-1, size_t chunk_read_limit=0, size_t pass_read_limit=1024000000, bool allow_mismatched_pq_schemas=False)

Reads chunks of a Parquet file into a TableWithMetadata.

Parameters:
source_info : SourceInfo

The SourceInfo object to read the Parquet file from.

columns : list, default None

The names of the columns to be read.

row_groups : list[list[size_type]], default None

List of row groups to be read, with one inner list of row-group indices per input source.

use_pandas_metadata : bool, default True

If True, return metadata about the index column in the per-file user metadata of the TableWithMetadata.

convert_strings_to_categories : bool, default False

Whether to convert string columns to the category type.

skip_rows : int64_t, default 0

The number of rows to skip from the start of the file.

nrows : size_type, default -1

The number of rows to read. By default, read the entire file.

chunk_read_limit : size_t, default 0

Limit on the total number of bytes to be returned per read, or 0 if there is no limit.

pass_read_limit : size_t, default 1024000000

Limit, in bytes, on the amount of memory used for reading and decompressing data, or 0 if there is no limit.

allow_mismatched_pq_schemas : bool, default False

If True, read the columns listed in the columns parameter that match across the input files, even when the files' schemas otherwise differ.

Methods

has_next(self)

Returns True if there is another chunk in the Parquet file to be read.

read_chunk(self)

Reads the next chunk into a TableWithMetadata.

has_next(self) → bool

Returns True if there is another chunk in the Parquet file to be read.

Returns:
True if we have not finished reading the file.
read_chunk(self) → TableWithMetadata

Reads the next chunk into a TableWithMetadata.

Returns:
TableWithMetadata

The Table and its corresponding metadata (column names) that were read in.
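
The two methods above are meant to be driven from a loop: call has_next() before each read_chunk() until no chunks remain. The following is a minimal sketch of that pattern rather than an official example. iter_chunks and open_chunked_reader are hypothetical helper names, the constructor call follows the signature shown on this page, and the sketch assumes that SourceInfo accepts a list of file paths and that a GPU-enabled pylibcudf is installed. The 64 MiB chunk_read_limit is an arbitrary illustrative value.

```python
from typing import Any, Iterator


def iter_chunks(reader: Any) -> Iterator[Any]:
    """Yield chunks from a reader until it reports that none remain.

    `reader` is any object exposing the has_next() / read_chunk()
    methods documented above, for example a ChunkedParquetReader.
    """
    while reader.has_next():
        yield reader.read_chunk()


def open_chunked_reader(path: str, chunk_read_limit: int = 64 * 1024 * 1024):
    """Hypothetical helper: build a ChunkedParquetReader for one file.

    pylibcudf is imported lazily so that iter_chunks stays usable on
    machines without a GPU.
    """
    import pylibcudf as plc

    # Assumption: SourceInfo accepts a list of file paths.
    source = plc.io.SourceInfo([path])
    return plc.io.parquet.ChunkedParquetReader(
        source, chunk_read_limit=chunk_read_limit
    )
```

With these helpers, `for chunk in iter_chunks(open_chunked_reader("data.parquet")): ...` would visit each TableWithMetadata in turn, so only one chunk needs to be held at a time. The file name here is a placeholder.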

pylibcudf.io.parquet.read_parquet(SourceInfo source_info, list columns=None, list row_groups=None, Expression filters=None, bool convert_strings_to_categories=False, bool use_pandas_metadata=True, int64_t skip_rows=0, size_type nrows=-1, bool allow_mismatched_pq_schemas=False)

Reads a Parquet file into a TableWithMetadata.

Parameters:
source_info : SourceInfo

The SourceInfo object to read the Parquet file from.

columns : list, default None

The string names of the columns to be read.

row_groups : list[list[size_type]], default None

List of row groups to be read, with one inner list of row-group indices per input source.

filters : Expression, default None

An AST pylibcudf.expressions.Expression to use for predicate pushdown.

convert_strings_to_categories : bool, default False

Whether to convert string columns to the category type.

use_pandas_metadata : bool, default True

If True, return metadata about the index column in the per-file user metadata of the TableWithMetadata.

skip_rows : int64_t, default 0

The number of rows to skip from the start of the file.

nrows : size_type, default -1

The number of rows to read. By default, read the entire file.

allow_mismatched_pq_schemas : bool, default False

If True, read the columns listed in the columns parameter that match across the input files, even when the files' schemas otherwise differ.

Returns:
TableWithMetadata

The Table and its corresponding metadata (column names) that were read in.
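
As an illustration of how the column and row-window parameters fit together, here is a small sketch of a single-file read. It is not an official example: parquet_read_kwargs and read_parquet_file are hypothetical helper names, the call follows the read_parquet signature shown on this page, and the sketch assumes that SourceInfo accepts a list of file paths and that a GPU-enabled pylibcudf is installed.

```python
from typing import Any, Dict, Iterable, Optional


def parquet_read_kwargs(
    columns: Optional[Iterable[str]] = None,
    skip_rows: int = 0,
    nrows: int = -1,
) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for read_parquet.

    The signature above types `columns` as a list, so other iterables
    (tuples, generators) are converted to a list here. The defaults
    mirror the documented ones: skip no rows and read to the end.
    """
    return {
        "columns": None if columns is None else list(columns),
        "skip_rows": skip_rows,
        "nrows": nrows,
    }


def read_parquet_file(path: str, **options: Any):
    """Hypothetical helper: read one Parquet file with pylibcudf."""
    import pylibcudf as plc  # imported lazily; needs a GPU-enabled install

    # Assumption: SourceInfo accepts a list of file paths.
    source = plc.io.SourceInfo([path])
    return plc.io.parquet.read_parquet(source, **parquet_read_kwargs(**options))
```

For instance, `read_parquet_file("data.parquet", columns=("a", "b"), skip_rows=100, nrows=1000)` would request columns a and b for 1000 rows starting after the first 100. The file and column names are placeholders.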