Parquet Metadata#

class pylibcudf.io.parquet_metadata.ColumnChunk#

Metadata for a row group’s column chunk.

Attributes

`column_index_length`	Size of the chunk's ColumnIndex, in bytes.
`column_index_offset`	File offset of the chunk's ColumnIndex.
`file_offset`	Deprecated byte offset to column metadata.
`file_path`	Relative file path for this column chunk.
`meta_data`	Column metadata for this chunk.
`offset_index_length`	Size of the chunk's OffsetIndex, in bytes.
`offset_index_offset`	File offset of the chunk's OffsetIndex.
`schema_idx`	Derived index in the flattened schema.

column_index_length#: Size of the chunk’s ColumnIndex, in bytes.

column_index_offset#: File offset of the chunk’s ColumnIndex.

file_offset#: Deprecated byte offset to column metadata.

file_path#: Relative file path for this column chunk.

meta_data#: Column metadata for this chunk.

offset_index_length#: Size of the chunk’s OffsetIndex, in bytes.

offset_index_offset#: File offset of the chunk’s OffsetIndex.

schema_idx#: Derived index in the flattened schema.

class pylibcudf.io.parquet_metadata.ColumnChunkMetaData#

Metadata payload for a column chunk.

Attributes

`num_values`	Number of values in this chunk.
`path_in_schema`	Column path components in the flattened schema.
`total_compressed_size`	Total compressed page bytes for this chunk.
`total_uncompressed_size`	Total uncompressed page bytes for this chunk.

num_values#: Number of values in this chunk.

path_in_schema#: Column path components in the flattened schema.

total_compressed_size#: Total compressed page bytes for this chunk.

total_uncompressed_size#: Total uncompressed page bytes for this chunk.
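As an illustration of how these four attributes combine, the following minimal sketch derives a dotted column path and a compression ratio for one chunk. It assumes `path_in_schema` behaves like a list of strings and that the size attributes are plain integers; `chunk_summary` and the stand-in object are hypothetical names, not part of pylibcudf.

```python
from types import SimpleNamespace


def chunk_summary(meta):
    """Summarize one column chunk's metadata payload (hypothetical helper)."""
    # Assumption: path_in_schema is a sequence of strings.
    path = ".".join(meta.path_in_schema)
    compressed = meta.total_compressed_size
    # Avoid dividing by zero for an empty chunk.
    ratio = meta.total_uncompressed_size / compressed if compressed else None
    return {
        "path": path,
        "num_values": meta.num_values,
        "compression_ratio": ratio,
    }


# Stand-in object for illustration only; real values come from a parsed footer.
example = SimpleNamespace(
    num_values=1000,
    path_in_schema=["address", "city"],
    total_compressed_size=2048,
    total_uncompressed_size=8192,
)
summary = chunk_summary(example)
```

The helper only reads attributes, so it should also accept a real ColumnChunkMetaData object if the assumptions above hold.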

class pylibcudf.io.parquet_metadata.FileMetaData#

Parquet file footer metadata.

For details, see cudf::io::parquet::FileMetaData.

Attributes

`created_by`	Get the application that created the file.
`num_rows`	Get the total number of rows.
`row_group_num_rows`	Get row counts for each row group in this file.
`row_groups`	Get row group metadata in this file.
`version`	Get the file format version.

Methods

from_bytes(cls, const uint8_t[::1] footer_bytes)

Build FileMetaData from parquet footer bytes.

See also

read_parquet_footers: Read one FileMetaData per source directly from pylibcudf.io.types.SourceInfo.

created_by#: Get the application that created the file.

classmethod from_bytes(cls, const uint8_t[::1] footer_bytes)#

Build FileMetaData from parquet footer bytes.

Parameters:

footer_bytes (Buffer): A contiguous bytes-like object containing parquet footer bytes. The bytes are forwarded as-is to cudf::io::parquet::experimental::hybrid_scan_reader without Python-side preprocessing. This method does not strip the parquet footer suffix (the 4-byte footer length followed by the PAR1 magic), so callers should generally pass only the footer region bytes.

Returns:

FileMetaData: Parsed parquet file footer metadata.
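Because from_bytes does not strip the footer suffix, a caller holding a whole file in memory has to isolate the footer region first. The sketch below shows one way to do that, assuming the standard unencrypted parquet layout in which the file ends with the footer, a 4-byte little-endian footer length, and the PAR1 magic. `extract_footer_bytes` is a hypothetical helper, not a pylibcudf function.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def extract_footer_bytes(file_bytes):
    """Return the footer region of a complete parquet file held in memory."""
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(file_bytes) < 12 or file_bytes[-4:] != PARQUET_MAGIC:
        raise ValueError("not a complete parquet file")
    # The footer length is stored little-endian just before the trailing magic.
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    start = len(file_bytes) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    # Drop the 8-byte suffix so only the footer region remains.
    return bytes(file_bytes[start:-8])
```

The returned bytes are the kind of input the parameter description asks for; passing them to FileMetaData.from_bytes additionally requires a working pylibcudf installation.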

num_rows#: Get the total number of rows.

row_group_num_rows#

Get row counts for each row group in this file.

Returns:

row_counts: A list with the row count per row group in this file.

Notes

Equivalent to, but faster than, reading num_rows from each row group:

>>> [rg.num_rows for rg in file_metadata.row_groups]
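One common use of the per-row-group counts is locating where each row group starts within the file. The following sketch works on a plain list of integers such as the one row_group_num_rows is documented to return; `row_group_offsets` is a hypothetical helper name.

```python
from itertools import accumulate


def row_group_offsets(row_counts):
    """Return the starting row index of each row group."""
    # Prefix sums give the end of each group; shifting by one gives the starts.
    return [0, *accumulate(row_counts)][:-1]


offsets = row_group_offsets([100, 100, 50])
```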

row_groups#: Get row group metadata in this file.

version#: Get the file format version.

class pylibcudf.io.parquet_metadata.ParquetColumnSchema#

Schema of a parquet column, including the nested columns.

Parameters:

parquet_column_schema

Methods

`child`(self, int idx)	Returns schema of the child with the given index.
`children`(self)	Returns schemas of all child columns.
`cudf_type`(self)	Returns the cudf data type for this column.
`name`(self)	Returns parquet column name; can be empty.
`num_children`(self)	Returns the number of child columns.

child(self, int idx) → ParquetColumnSchema#

Returns schema of the child with the given index.

Parameters:

idx (int): Child index.

Returns:

ParquetColumnSchema: Child schema

children(self) → list#

Returns schemas of all child columns.

Returns:

list[ParquetColumnSchema]: Child schemas.

cudf_type(self) → DataType#

Returns the cudf data type for this column.

This is the resolved cudf data type mapped from the parquet physical/logical types.

Returns:

DataType: cudf data type

name(self) → str#

Returns parquet column name; can be empty.

Returns:

str: Column name

num_children(self) → int#

Returns the number of child columns.

Returns:

int: Children count
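The methods above are enough to traverse a nested schema recursively. This sketch collects the dotted path of every leaf column using only name(), num_children(), and children(). `leaf_paths` and `FakeColumn` are hypothetical; the stand-in class exists so the example runs without a GPU, and a real ParquetColumnSchema is assumed to behave the same way for these three calls.

```python
def leaf_paths(column, prefix=()):
    """Yield the dotted path of every leaf column under `column`."""
    path = prefix + (column.name(),)
    if column.num_children() == 0:
        # Names can be empty, so skip empty components when joining.
        yield ".".join(part for part in path if part)
        return
    for child in column.children():
        yield from leaf_paths(child, path)


class FakeColumn:
    """Stand-in exposing the same three methods, for illustration only."""

    def __init__(self, name, children=()):
        self._name = name
        self._children = list(children)

    def name(self):
        return self._name

    def children(self):
        return self._children

    def num_children(self):
        return len(self._children)


root = FakeColumn(
    "schema",
    [
        FakeColumn("id"),
        FakeColumn("address", [FakeColumn("city"), FakeColumn("zip")]),
    ],
)
paths = list(leaf_paths(root))
```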

class pylibcudf.io.parquet_metadata.ParquetMetadata#

Information about the content of a parquet file.

Parameters:

parquet_metadata

Methods

`columnchunk_metadata`(self)	Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.
`metadata`(self)	Returns the key-value metadata in the file footer.
`num_rowgroups`(self)	Returns the total number of rowgroups in the file.
`num_rowgroups_per_file`(self)	Returns the number of rowgroups in each file.
`num_rows`(self)	Returns the number of rows of the root column.
`rowgroup_metadata`(self)	Returns the row group metadata in the file footer.
`schema`(self)	Returns the parquet schema.

columnchunk_metadata(self) → dict#

Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.

Returns:

dict[str, list[int]]: Map of leaf column names to lists of total_uncompressed_size metadata from all their column chunks.
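Since the result is an ordinary dict of integer lists, per-column totals follow from a single comprehension. The sketch below uses a hand-written dict in place of a real return value; `uncompressed_bytes_per_column` and the column names are hypothetical.

```python
def uncompressed_bytes_per_column(chunk_sizes):
    """Sum the per-chunk uncompressed sizes for each leaf column."""
    return {name: sum(sizes) for name, sizes in chunk_sizes.items()}


# Hand-written sample shaped like dict[str, list[int]].
sample = {"id": [400, 400, 200], "name": [1200, 1100, 600]}
totals = uncompressed_bytes_per_column(sample)
```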

metadata(self) → dict#

Returns the key-value metadata in the file footer.

Returns:

dict[str, str]: Key value metadata as a map.

num_rowgroups(self) → int#

Returns the total number of rowgroups in the file.

Returns:

int: Number of row groups.

num_rowgroups_per_file(self) → list#: Returns the number of rowgroups in each file.

num_rows(self) → int#

Returns the number of rows of the root column.

Returns:

int: Number of rows

rowgroup_metadata(self) → list#

Returns the row group metadata in the file footer.

Returns:

list[dict[str, int]]: List of row group metadata, one map per row group.
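When a dataset spans several files, the flat list from rowgroup_metadata can be regrouped with the counts from num_rowgroups_per_file. The sketch below assumes the flat list is ordered file by file, which this page does not state. `split_row_groups_by_file` is a hypothetical helper and the dict keys are placeholders.

```python
from itertools import islice


def split_row_groups_by_file(row_groups, counts_per_file):
    """Group a flat row-group list into one sub-list per input file."""
    if sum(counts_per_file) != len(row_groups):
        raise ValueError("counts do not add up to the number of row groups")
    remaining = iter(row_groups)
    # Assumption: row groups appear in file order in the flat list.
    return [list(islice(remaining, count)) for count in counts_per_file]


# Placeholder dicts; the real keys come from rowgroup_metadata().
flat = [{"placeholder": 0}, {"placeholder": 1}, {"placeholder": 2}]
per_file = split_row_groups_by_file(flat, [2, 1])
```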

schema(self) → ParquetSchema#

Returns the parquet schema.

Returns:

ParquetSchema: Parquet schema

class pylibcudf.io.parquet_metadata.ParquetSchema#

Schema of a parquet file.

Parameters:

parquet_schema

Methods

`column_types`(self)	Returns a dictionary mapping column names to their cudf data types.
`root`(self)	Returns the schema of the struct column that contains all columns as fields.

column_types(self) → dict#

Returns a dictionary mapping column names to their cudf data types.

Returns:

dict[str, DataType]: Dictionary mapping column names to DataType objects

root(self) → ParquetColumnSchema#

Returns the schema of the struct column that contains all columns as fields.

Returns:

ParquetColumnSchema: Root column schema

class pylibcudf.io.parquet_metadata.RowGroup#

Parquet row group metadata.

Attributes

`columns`	Column chunk metadata for each column in this row group.
`file_offset`	Optional byte offset to first page in this row group.
`num_rows`	Number of rows in this row group.
`ordinal`	Optional row group ordinal within the file.
`sorting_columns`	Optional row sort order metadata.
`total_byte_size`	Total uncompressed byte size in this row group.
`total_compressed_size`	Optional total compressed bytes for this row group.

columns#: Column chunk metadata for each column in this row group.

file_offset#: Optional byte offset to first page in this row group.

num_rows#: Number of rows in this row group.

ordinal#: Optional row group ordinal within the file.

sorting_columns#: Optional row sort order metadata.

total_byte_size#: Total uncompressed byte size in this row group.

total_compressed_size#: Optional total compressed bytes for this row group.
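The attribute chain from a row group down to a chunk's sizes (RowGroup.columns, then ColumnChunk.meta_data, then ColumnChunkMetaData.total_compressed_size) can be walked with two nested loops. The sketch below totals compressed bytes per column across row groups. It uses stand-in objects so it runs anywhere, and assumes `path_in_schema` is a sequence of strings. `compressed_bytes_by_column` and `_chunk` are hypothetical names.

```python
from collections import defaultdict
from types import SimpleNamespace


def compressed_bytes_by_column(row_groups):
    """Total compressed bytes per column path across all row groups."""
    totals = defaultdict(int)
    for row_group in row_groups:
        for chunk in row_group.columns:
            meta = chunk.meta_data
            totals[".".join(meta.path_in_schema)] += meta.total_compressed_size
    return dict(totals)


def _chunk(path, size):
    # Stand-in for a ColumnChunk, for illustration only.
    return SimpleNamespace(
        meta_data=SimpleNamespace(path_in_schema=path, total_compressed_size=size)
    )


fake_row_groups = [
    SimpleNamespace(columns=[_chunk(["id"], 100), _chunk(["a", "b"], 300)]),
    SimpleNamespace(columns=[_chunk(["id"], 50), _chunk(["a", "b"], 200)]),
]
totals = compressed_bytes_by_column(fake_row_groups)
```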

class pylibcudf.io.parquet_metadata.SortingColumn#

Sort metadata for a row group column.

Attributes

`column_idx`	Column index (within the row group).
`descending`	Whether this column is sorted in descending order.
`nulls_first`	Whether null values are ordered before non-null values.

column_idx#: Column index (within the row group).

descending#: Whether this column is sorted in descending order.

nulls_first#: Whether null values are ordered before non-null values.
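The three attributes map naturally onto an SQL-style ordering clause. This sketch renders a sequence of sorting columns as readable text; `describe_sort_order` is a hypothetical helper, and the stand-in objects mimic only the attributes listed above.

```python
from types import SimpleNamespace


def describe_sort_order(sorting_columns):
    """Render sort metadata as a human-readable ordering clause."""
    parts = []
    for col in sorting_columns:
        direction = "DESC" if col.descending else "ASC"
        nulls = "NULLS FIRST" if col.nulls_first else "NULLS LAST"
        parts.append(f"column {col.column_idx} {direction} {nulls}")
    return ", ".join(parts)


# Stand-in objects for illustration only.
order = describe_sort_order(
    [
        SimpleNamespace(column_idx=0, descending=False, nulls_first=False),
        SimpleNamespace(column_idx=2, descending=True, nulls_first=True),
    ]
)
```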

pylibcudf.io.parquet_metadata.read_parquet_footers(SourceInfo src_info) → list#

Read parquet file footers as FileMetaData objects.

Parameters:

src_info (SourceInfo): Dataset source.

Returns:

list[FileMetaData]: One footer metadata object per input source.

pylibcudf.io.parquet_metadata.read_parquet_metadata(SourceInfo src_info) → ParquetMetadata#

Reads the metadata of a parquet dataset.

Parameters:

src_info (SourceInfo): Dataset source.

Returns:

ParquetMetadata: Metadata object holding the parquet schema, number of rows, number of row groups, and key-value metadata.
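A typical call sequence might look like the sketch below, which reads the metadata for a list of file paths and gathers a few of the values documented on this page. It assumes pylibcudf is installed with a usable GPU, that SourceInfo accepts a list of path strings, and that the import paths match the module names used here. `summarize_parquet` is a hypothetical wrapper, and the imports are deferred so that merely defining the function has no dependencies.

```python
def summarize_parquet(paths):
    """Collect basic facts about a parquet dataset (hypothetical wrapper)."""
    # Imported lazily: these require a working pylibcudf installation.
    from pylibcudf.io import parquet_metadata
    from pylibcudf.io.types import SourceInfo

    # Assumption: SourceInfo takes a list of file paths.
    source = SourceInfo(list(paths))
    meta = parquet_metadata.read_parquet_metadata(source)
    return {
        "num_rows": meta.num_rows(),
        "num_rowgroups": meta.num_rowgroups(),
        "rowgroups_per_file": meta.num_rowgroups_per_file(),
        "column_types": meta.schema().column_types(),
        "key_value_metadata": meta.metadata(),
    }
```

Called with paths to existing files, the function returns a plain dict, which keeps the metadata easy to log or compare across datasets.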