Parquet Metadata#

class pylibcudf.io.parquet_metadata.ColumnChunk#

Metadata for a row group’s column chunk.

Attributes

column_index_length

Size of the chunk's ColumnIndex, in bytes.

column_index_offset

File offset of the chunk's ColumnIndex.

file_offset

Deprecated byte offset to column metadata.

file_path

Relative file path for this column chunk.

meta_data

Column metadata for this chunk.

offset_index_length

Size of the chunk's OffsetIndex, in bytes.

offset_index_offset

File offset of the chunk's OffsetIndex.

schema_idx

Derived index in the flattened schema.

column_index_length#

Size of the chunk’s ColumnIndex, in bytes.

column_index_offset#

File offset of the chunk’s ColumnIndex.

file_offset#

Deprecated byte offset to column metadata.

file_path#

Relative file path for this column chunk.

meta_data#

Column metadata for this chunk.

offset_index_length#

Size of the chunk’s OffsetIndex, in bytes.

offset_index_offset#

File offset of the chunk’s OffsetIndex.

schema_idx#

Derived index in the flattened schema.

class pylibcudf.io.parquet_metadata.ColumnChunkMetaData#

Metadata payload for a column chunk.

Attributes

num_values

Number of values in this chunk.

path_in_schema

Column path components in the flattened schema.

total_compressed_size

Total compressed page bytes for this chunk.

total_uncompressed_size

Total uncompressed page bytes for this chunk.

num_values#

Number of values in this chunk.

path_in_schema#

Column path components in the flattened schema.

total_compressed_size#

Total compressed page bytes for this chunk.

total_uncompressed_size#

Total uncompressed page bytes for this chunk.

class pylibcudf.io.parquet_metadata.FileMetaData#

Parquet file footer metadata.

For details, see cudf::io::parquet::FileMetaData.

Attributes

created_by

Get the application that created the file.

num_rows

Get the total number of rows.

row_group_num_rows

Get row counts for each row group in this file.

row_groups

Get row group metadata in this file.

version

Get the file format version.

Methods

from_bytes(cls, const uint8_t[::1] footer_bytes)

Build FileMetaData from parquet footer bytes.

See also

read_parquet_footers

Read one FileMetaData per source directly from pylibcudf.io.types.SourceInfo.

created_by#

Get the application that created the file.

classmethod from_bytes(cls, const uint8_t[::1] footer_bytes)#

Build FileMetaData from parquet footer bytes.

Parameters:
footer_bytes : Buffer

A contiguous bytes-like object containing parquet footer bytes. The bytes are forwarded as-is to cudf::io::parquet::experimental::hybrid_scan_reader without Python-side preprocessing. This method does not strip the parquet footer suffix (4-byte footer length + PAR1 magic), so callers should generally pass only the footer region bytes.

Returns:
FileMetaData

Parsed parquet file footer metadata.
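Because from_bytes expects only the footer region, a caller holding the bytes of a whole file has to slice that region out first. The following is a minimal sketch of one way to do so, assuming an unencrypted file that ends with the standard suffix described above (a 4-byte little-endian footer length followed by the PAR1 magic). The helper name extract_footer_bytes is hypothetical and is not part of pylibcudf.

```python
import struct

PARQUET_MAGIC = b"PAR1"
SUFFIX_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def extract_footer_bytes(file_bytes: bytes) -> bytes:
    """Return only the footer region of a parquet file held in memory.

    Hypothetical helper: strips the trailing footer length and PAR1 magic.
    """
    if len(file_bytes) < SUFFIX_SIZE or file_bytes[-4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The four bytes before the magic hold the footer length.
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_end = len(file_bytes) - SUFFIX_SIZE
    footer_start = footer_end - footer_len
    if footer_start < 0:
        raise ValueError("footer length exceeds file size")
    return file_bytes[footer_start:footer_end]
```

The returned bytes could then be passed as footer_bytes to FileMetaData.from_bytes.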

num_rows#

Get the total number of rows.

row_group_num_rows#

Get row counts for each row group in this file.

Returns:
row_counts

A list with the row count per row group in this file.

Notes

Equivalent to, but faster than, checking each row group’s num_rows:

>>> [rg.num_rows for rg in file_metadata.row_groups]

row_groups#

Get row group metadata in this file.

version#

Get the file format version.

class pylibcudf.io.parquet_metadata.ParquetColumnSchema#

Schema of a parquet column, including the nested columns.

Parameters:
parquet_column_schema

Methods

child(self, int idx)

Returns schema of the child with the given index.

children(self)

Returns schemas of all child columns.

cudf_type(self)

Returns the cudf data type for this column.

name(self)

Returns parquet column name; can be empty.

num_children(self)

Returns the number of child columns.

child(self, int idx) ParquetColumnSchema#

Returns schema of the child with the given index.

Parameters:
idx : int

Child index

Returns:
ParquetColumnSchema

Child schema

children(self) list#

Returns schemas of all child columns.

Returns:
list[ParquetColumnSchema]

Child schemas.

cudf_type(self) DataType#

Returns the cudf data type for this column.

This is the resolved cudf data type mapped from the parquet physical/logical types.

Returns:
DataType

cudf data type

name(self) str#

Returns parquet column name; can be empty.

Returns:
str

Column name

num_children(self) int#

Returns the number of child columns.

Returns:
int

Children count
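The name, num_children and child methods are enough to traverse a nested schema. The sketch below shows a depth-first walk written against those three methods only. StubSchema is a hypothetical stand-in used so the example runs without a GPU or a parquet file. With pylibcudf, the object returned by ParquetSchema.root() would presumably be passed instead.

```python
from dataclasses import dataclass, field


def walk_schema(schema, depth=0):
    """Yield (depth, name) for `schema` and every nested child, depth-first."""
    yield depth, schema.name()
    for idx in range(schema.num_children()):
        yield from walk_schema(schema.child(idx), depth + 1)


@dataclass
class StubSchema:
    """Hypothetical stand-in exposing the same three methods (demo only)."""

    column_name: str
    kids: list = field(default_factory=list)

    def name(self):
        return self.column_name

    def num_children(self):
        return len(self.kids)

    def child(self, idx):
        return self.kids[idx]


# A made-up schema: a root struct with a flat column and a nested struct.
root = StubSchema(
    "schema",
    [StubSchema("id"), StubSchema("point", [StubSchema("x"), StubSchema("y")])],
)
for depth, name in walk_schema(root):
    # Column names can be empty, so fall back to a placeholder when printing.
    print("  " * depth + (name or "<unnamed>"))
```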

class pylibcudf.io.parquet_metadata.ParquetMetadata#

Information about content of a parquet file.

Parameters:
parquet_metadata

Methods

columnchunk_metadata(self)

Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.

metadata(self)

Returns the key-value metadata in the file footer.

num_rowgroups(self)

Returns the total number of rowgroups in the file.

num_rowgroups_per_file(self)

Returns the number of rowgroups in each file.

num_rows(self)

Returns the number of rows of the root column.

rowgroup_metadata(self)

Returns the row group metadata in the file footer.

schema(self)

Returns the parquet schema.

columnchunk_metadata(self) dict#

Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.

Returns:
dict[str, list[int]]

Map of leaf column names to lists of total_uncompressed_size metadata from all their column chunks.
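Since the result is a plain dict[str, list[int]], ordinary Python is enough to aggregate it. As an illustration, the sketch below sums the per-chunk sizes for each column and ranks columns by total uncompressed size. The function names and the sample sizes are made up for the example. In practice the input would be the return value of columnchunk_metadata().

```python
def bytes_per_column(chunk_sizes):
    """Sum the per-chunk uncompressed sizes for each leaf column."""
    return {name: sum(sizes) for name, sizes in chunk_sizes.items()}


def largest_columns(chunk_sizes, top=3):
    """Return up to `top` (name, total_bytes) pairs, largest first."""
    totals = bytes_per_column(chunk_sizes)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up sizes shaped like the documented return value: one list entry
# per column chunk of that leaf column.
example = {"id": [4096, 4096], "name": [20000, 18000], "score": [1024]}
print(largest_columns(example, top=2))
```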

metadata(self) dict#

Returns the key-value metadata in the file footer.

Returns:
dict[str, str]

Key value metadata as a map.

num_rowgroups(self) int#

Returns the total number of rowgroups in the file.

Returns:
int

Number of row groups.

num_rowgroups_per_file(self) list#

Returns the number of rowgroups in each file.

num_rows(self) int#

Returns the number of rows of the root column.

Returns:
int

Number of rows

rowgroup_metadata(self) list#

Returns the row group metadata in the file footer.

Returns:
list[dict[str, int]]

List of row group metadata, one dict per row group.

schema(self) ParquetSchema#

Returns the parquet schema.

Returns:
ParquetSchema

Parquet schema

class pylibcudf.io.parquet_metadata.ParquetSchema#

Schema of a parquet file.

Parameters:
parquet_schema

Methods

column_types(self)

Returns a dictionary mapping column names to their cudf data types.

root(self)

Returns the schema of the struct column that contains all columns as fields.

column_types(self) dict#

Returns a dictionary mapping column names to their cudf data types.

Returns:
dict[str, DataType]

Dictionary mapping column names to DataType objects

root(self) ParquetColumnSchema#

Returns the schema of the struct column that contains all columns as fields.

Returns:
ParquetColumnSchema

Root column schema

class pylibcudf.io.parquet_metadata.RowGroup#

Parquet row group metadata.

Attributes

columns

Column chunk metadata for each column in this row group.

file_offset

Optional byte offset to first page in this row group.

num_rows

Number of rows in this row group.

ordinal

Optional row group ordinal within the file.

sorting_columns

Optional row sort order metadata.

total_byte_size

Total uncompressed byte size in this row group.

total_compressed_size

Optional total compressed bytes for this row group.

columns#

Column chunk metadata for each column in this row group.

file_offset#

Optional byte offset to first page in this row group.

num_rows#

Number of rows in this row group.

ordinal#

Optional row group ordinal within the file.

sorting_columns#

Optional row sort order metadata.

total_byte_size#

Total uncompressed byte size in this row group.

total_compressed_size#

Optional total compressed bytes for this row group.

class pylibcudf.io.parquet_metadata.SortingColumn#

Sort metadata for a row group column.

Attributes

column_idx

Column index (within the row group).

descending

Whether this column is sorted in descending order.

nulls_first

Whether null values are ordered before non-null values.

column_idx#

Column index (within the row group).

descending#

Whether this column is sorted in descending order.

nulls_first#

Whether null values are ordered before non-null values.

pylibcudf.io.parquet_metadata.read_parquet_footers(SourceInfo src_info) list#

Read parquet file footers as FileMetaData objects.

Parameters:
src_info : SourceInfo

Dataset source.

Returns:
list[FileMetaData]

One footer metadata object per input source.

pylibcudf.io.parquet_metadata.read_parquet_metadata(SourceInfo src_info) ParquetMetadata#

Reads metadata of a parquet dataset.

Parameters:
src_info : SourceInfo

Dataset source.

Returns:
ParquetMetadata

ParquetMetadata object containing the parquet schema, number of rows, number of row groups, and key-value metadata.

See also

read_parquet_footers

To read the pre-materialized file footer metadata used in pylibcudf.io.parquet.read_parquet().