Parquet Metadata
- class pylibcudf.io.parquet_metadata.ColumnChunk#
Metadata for a row group’s column chunk.
Attributes
column_index_length: Size of the chunk's ColumnIndex, in bytes.
column_index_offset: File offset of the chunk's ColumnIndex.
file_offset: Deprecated byte offset to column metadata.
file_path: Relative file path for this column chunk.
meta_data: Column metadata for this chunk.
offset_index_length: Size of the chunk's OffsetIndex, in bytes.
offset_index_offset: File offset of the chunk's OffsetIndex.
schema_idx: Derived index in the flattened schema.
- column_index_length#
Size of the chunk’s ColumnIndex, in bytes.
- column_index_offset#
File offset of the chunk’s ColumnIndex.
- file_offset#
Deprecated byte offset to column metadata.
- file_path#
Relative file path for this column chunk.
- meta_data#
Column metadata for this chunk.
- offset_index_length#
Size of the chunk’s OffsetIndex, in bytes.
- offset_index_offset#
File offset of the chunk’s OffsetIndex.
- schema_idx#
Derived index in the flattened schema.
- class pylibcudf.io.parquet_metadata.ColumnChunkMetaData#
Metadata payload for a column chunk.
Attributes
num_values: Number of values in this chunk.
path_in_schema: Column path components in the flattened schema.
total_compressed_size: Total compressed page bytes for this chunk.
total_uncompressed_size: Total uncompressed page bytes for this chunk.
- num_values#
Number of values in this chunk.
- path_in_schema#
Column path components in the flattened schema.
- total_compressed_size#
Total compressed page bytes for this chunk.
- total_uncompressed_size#
Total uncompressed page bytes for this chunk.
- class pylibcudf.io.parquet_metadata.FileMetaData#
Parquet file footer metadata.
For details, see cudf::io::parquet::FileMetaData.
Attributes
created_by: Get the application that created the file.
num_rows: Get the total number of rows.
row_group_num_rows: Get row counts for each row group in this file.
row_groups: Get row group metadata in this file.
version: Get the file format version.
Methods
from_bytes(cls, const uint8_t[::1] footer_bytes): Build FileMetaData from parquet footer bytes.
See also
read_parquet_footers: Read one FileMetaData per source directly from pylibcudf.io.types.SourceInfo.
- created_by#
Get the application that created the file.
- classmethod from_bytes(cls, const uint8_t[::1] footer_bytes)#
Build FileMetaData from parquet footer bytes.
- Parameters:
- footer_bytes : Buffer
A contiguous bytes-like object containing parquet footer bytes. The bytes are forwarded as-is to cudf::io::parquet::experimental::hybrid_scan_reader without Python-side preprocessing. This method does not strip the parquet footer suffix (4-byte footer length + PAR1 magic), so callers should generally pass only the footer region bytes.
- Returns:
- FileMetaData
Parsed parquet file footer metadata.
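Because from_bytes does not strip the footer suffix, a caller holding a whole file image has to slice out the footer region first. The following is a minimal sketch of that slicing, assuming an unencrypted parquet file whose last 8 bytes are the 4-byte little-endian footer length followed by the PAR1 magic. The helper name extract_footer_bytes is hypothetical and not part of pylibcudf.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of an unencrypted parquet file


def extract_footer_bytes(file_bytes: bytes) -> bytes:
    """Return only the footer region of a complete parquet file image.

    Hypothetical helper (not a pylibcudf API). The layout assumed here is:
    [PAR1][data pages ...][footer][4-byte LE footer length][PAR1]
    """
    # Smallest possible file: leading magic (4) + length (4) + trailing magic (4).
    if len(file_bytes) < 12 or file_bytes[-4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")

    # The footer length sits immediately before the trailing magic.
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_end = len(file_bytes) - 8
    footer_start = footer_end - footer_len

    # The footer cannot overlap the leading 4-byte magic.
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")

    return file_bytes[footer_start:footer_end]
```

The returned bytes contain neither the length field nor the magic, which matches what the parameter description asks callers to pass. Handing the result to FileMetaData.from_bytes is left out of the sketch because it needs a working pylibcudf installation.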
- num_rows#
Get the total number of rows.
- row_group_num_rows#
Get row counts for each row group in this file.
- Returns:
- row_counts
A list with the row count per row group in this file.
Notes
Equivalent to, but faster than, reading num_rows from each row group:
>>> [rg.num_rows for rg in file_metadata.row_groups]
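One common use of the per-row-group counts is to work out which global row range each row group covers. The sketch below takes a plain list of counts, such as the value described for row_group_num_rows, and returns half-open (start, stop) ranges. The function name row_group_ranges is hypothetical.

```python
from itertools import accumulate


def row_group_ranges(row_counts):
    """Map per-row-group row counts to half-open (start, stop) row ranges.

    Hypothetical helper: it only needs a list of integers, so it works on
    any list shaped like the documented row_group_num_rows result.
    """
    stops = list(accumulate(row_counts))  # running total after each group
    starts = [0] + stops[:-1]             # each group starts where the previous one stopped
    return list(zip(starts, stops))


# Three row groups holding 3, 5 and 2 rows.
ranges = row_group_ranges([3, 5, 2])
# ranges == [(0, 3), (3, 8), (8, 10)]
```

The last stop value equals the sum of the counts, which should agree with the file's total num_rows.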
- row_groups#
Get row group metadata in this file.
- version#
Get the file format version.
- class pylibcudf.io.parquet_metadata.ParquetColumnSchema#
Schema of a parquet column, including the nested columns.
- Parameters:
- parquet_column_schema
Methods
child(self, int idx): Returns schema of the child with the given index.
children(self): Returns schemas of all child columns.
cudf_type(self): Returns the cudf data type for this column.
name(self): Returns parquet column name; can be empty.
num_children(self): Returns the number of child columns.
- child(self, int idx) ParquetColumnSchema#
Returns schema of the child with the given index.
- Parameters:
- idx : int
Child index
- Returns:
- ParquetColumnSchema
Child schema
- children(self) list#
Returns schemas of all child columns.
- Returns:
- list[ParquetColumnSchema]
Child schemas.
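Since a column schema exposes its children, a nested schema can be walked recursively. The sketch below collects dotted paths of all leaf columns using only the method names listed above (name, num_children, children). FakeColumnSchema is a hypothetical stand-in that lets the example run without a GPU; with pylibcudf available, the same function would presumably accept the object returned by ParquetSchema.root().

```python
class FakeColumnSchema:
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, name, children=()):
        self._name = name
        self._children = list(children)

    def name(self):
        return self._name

    def num_children(self):
        return len(self._children)

    def children(self):
        return list(self._children)

    def child(self, idx):
        return self._children[idx]


def leaf_paths(col, prefix=()):
    """Return dotted paths of every leaf column below `col`."""
    name = col.name()
    # Column names can be empty (for example the root), so skip empty parts.
    path = prefix + ((name,) if name else ())
    if col.num_children() == 0:
        return [".".join(path)]
    paths = []
    for child in col.children():
        paths.extend(leaf_paths(child, path))
    return paths


root = FakeColumnSchema(
    "",
    [
        FakeColumnSchema("id"),
        FakeColumnSchema("point", [FakeColumnSchema("x"), FakeColumnSchema("y")]),
    ],
)
paths = leaf_paths(root)
# paths == ["id", "point.x", "point.y"]
```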
- class pylibcudf.io.parquet_metadata.ParquetMetadata#
Information about content of a parquet file.
- Parameters:
- parquet_metadata
Methods
columnchunk_metadata(self): Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.
metadata(self): Returns the key-value metadata in the file footer.
num_rowgroups(self): Returns the total number of rowgroups in the file.
num_rowgroups_per_file(self): Returns the number of rowgroups in each file.
num_rows(self): Returns the number of rows of the root column.
rowgroup_metadata(self): Returns the row group metadata in the file footer.
schema(self): Returns the parquet schema.
- columnchunk_metadata(self) dict#
Returns a map of leaf column names to lists of total_uncompressed_size metadata from all column chunks in the file footer.
- Returns:
- dict[str, list[int]]
Map of leaf column names to lists of total_uncompressed_size metadata from all their column chunks.
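The returned mapping holds one size per column chunk, so a per-column total is a simple sum over each list. The sketch below works on any dict[str, list[int]] of that shape; the function name and the sample values are made up for illustration.

```python
def total_uncompressed_by_column(chunk_sizes):
    """Sum the per-chunk uncompressed sizes for each leaf column.

    Hypothetical helper operating on a dict[str, list[int]], the shape
    documented for columnchunk_metadata().
    """
    return {name: sum(sizes) for name, sizes in chunk_sizes.items()}


# Illustrative values only: two row groups, two leaf columns.
sample = {"id": [4096, 4000], "payload": [65536, 70000]}
totals = total_uncompressed_by_column(sample)
largest = max(totals, key=totals.get)  # column with the most uncompressed bytes
```

Such totals can give a rough idea of how much memory each column may need once decoded, although the decoded size in device memory can differ from the uncompressed page size.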
- metadata(self) dict#
Returns the key-value metadata in the file footer.
- Returns:
- dict[str, str]
Key-value metadata as a map.
- num_rowgroups(self) int#
Returns the total number of rowgroups in the file.
- Returns:
- int
Number of row groups.
- rowgroup_metadata(self) list#
Returns the row group metadata in the file footer.
- Returns:
- list[dict[str, int]]
Vector of row group metadata as maps.
- schema(self) ParquetSchema#
Returns the parquet schema.
- Returns:
- ParquetSchema
Parquet schema
- class pylibcudf.io.parquet_metadata.ParquetSchema#
Schema of a parquet file.
- Parameters:
- parquet_schema
Methods
column_types(self): Returns a dictionary mapping column names to their cudf data types.
root(self): Returns the schema of the struct column that contains all columns as fields.
- column_types(self) dict#
Returns a dictionary mapping column names to their cudf data types.
- Returns:
- dict[str, DataType]
Dictionary mapping column names to DataType objects
- root(self) ParquetColumnSchema#
Returns the schema of the struct column that contains all columns as fields.
- Returns:
- ParquetColumnSchema
Root column schema
- class pylibcudf.io.parquet_metadata.RowGroup#
Parquet row group metadata.
Attributes
columns: Column chunk metadata for each column in this row group.
file_offset: Optional byte offset to first page in this row group.
num_rows: Number of rows in this row group.
ordinal: Optional row group ordinal within the file.
sorting_columns: Optional row sort order metadata.
total_byte_size: Total uncompressed byte size in this row group.
total_compressed_size: Optional total compressed bytes for this row group.
- columns#
Column chunk metadata for each column in this row group.
- file_offset#
Optional byte offset to first page in this row group.
- num_rows#
Number of rows in this row group.
- ordinal#
Optional row group ordinal within the file.
- sorting_columns#
Optional row sort order metadata.
- total_byte_size#
Total uncompressed byte size in this row group.
- total_compressed_size#
Optional total compressed bytes for this row group.
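The two size attributes can be combined into a compression ratio for a row group. The sketch below assumes that an absent optional total_compressed_size is exposed as None, which this page does not confirm; the function name is hypothetical.

```python
def compression_ratio(total_byte_size, total_compressed_size):
    """Return uncompressed/compressed size, or None when it cannot be computed.

    Assumption: a missing optional total_compressed_size shows up as None.
    A value of 0 is also treated as "unknown" to avoid dividing by zero.
    """
    if not total_compressed_size:
        return None
    return total_byte_size / total_compressed_size


ratio = compression_ratio(1000, 250)    # 4.0
unknown = compression_ratio(1000, None)  # None
```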
- class pylibcudf.io.parquet_metadata.SortingColumn#
Sort metadata for a row group column.
Attributes
column_idx: Column index (within the row group).
descending: Whether this column is sorted in descending order.
nulls_first: Whether null values are ordered before non-null values.
- column_idx#
Column index (within the row group).
- descending#
Whether this column is sorted in descending order.
- nulls_first#
Whether null values are ordered before non-null values.
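The three attributes together describe one sort key. As a small illustration, the hypothetical helper below renders them in SQL-like form; it takes plain values so it can run without pylibcudf.

```python
def describe_sort(column_idx, descending, nulls_first):
    """Render one sort key as text, for example 'column 2 DESC NULLS LAST'."""
    direction = "DESC" if descending else "ASC"
    nulls = "NULLS FIRST" if nulls_first else "NULLS LAST"
    return f"column {column_idx} {direction} {nulls}"


text = describe_sort(column_idx=2, descending=True, nulls_first=False)
# text == "column 2 DESC NULLS LAST"
```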
- pylibcudf.io.parquet_metadata.read_parquet_footers(SourceInfo src_info) list#
Read parquet file footers as FileMetaData objects.
- Parameters:
- src_info : SourceInfo
Dataset source.
- Returns:
- list[FileMetaData]
One footer metadata object per input source.
- pylibcudf.io.parquet_metadata.read_parquet_metadata(SourceInfo src_info) ParquetMetadata#
Reads metadata of parquet dataset.
- Parameters:
- src_info : SourceInfo
Dataset source.
- Returns:
- ParquetMetadata
ParquetMetadata object holding the parquet schema, number of rows, number of row groups, and key-value metadata.
See also
read_parquet_footers: To read the pre-materialized file footer metadata used in pylibcudf.io.parquet.read_parquet().
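A minimal usage sketch for read_parquet_metadata follows. It assumes a working pylibcudf installation with a GPU, and that SourceInfo accepts a list containing a file path. The function name summarize_parquet and the keys of the returned dict are made up for illustration; the pylibcudf import is deferred so that defining the function does not require the library.

```python
def summarize_parquet(path):
    """Collect a few headline facts about a parquet file.

    Hypothetical helper. Calling it requires pylibcudf; the imports are
    kept inside the function so the module loads without it.
    """
    from pylibcudf.io import parquet_metadata
    from pylibcudf.io.types import SourceInfo

    # Assumption: SourceInfo takes a list of sources, here one file path.
    source = SourceInfo([path])
    meta = parquet_metadata.read_parquet_metadata(source)

    return {
        "num_rows": meta.num_rows(),
        "num_rowgroups": meta.num_rowgroups(),
        "column_names": list(meta.schema().column_types()),
        "key_value_metadata": meta.metadata(),
    }
```

Only methods listed on this page are used on the returned ParquetMetadata object. Calling summarize_parquet("example.parquet") on a real file would return a plain dict that can be printed or logged.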