Class DeletionVector

java.lang.Object
ai.rapids.cudf.DeletionVector

public class DeletionVector extends Object
Provides JNI wrappers for reading Parquet files with deletion vector support. Deletion vectors are used in Delta Lake and other table formats to track deleted rows without physically rewriting data files. This class provides APIs to read Parquet files while applying deletion vectors serialized in the 64-bit roaring bitmap format. The APIs in this class are experimental and subject to change.
  • Constructor Details

    • DeletionVector

      public DeletionVector()
  • Method Details

    • readParquet

      public static Table readParquet(ParquetOptions opts, HostMemoryBuffer[] dataBuffers, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Reads Parquet data held in host memory buffers with deletion vector support. The reader prepends an index column to the table and then applies the deletion vector filter. If row group metadata is not provided, the index column is a simple sequence that starts at 0 and has one entry per row. If the deletion vector is null or empty, the table with the prepended index column is returned as-is without filtering.
      Parameters:
      opts - The options for Parquet reading.
      dataBuffers - Array of HostMemoryBuffers containing the Parquet file data.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.
      Returns:
      A Table containing the filtered data with a prepended UINT64 index column.
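      The behavior described above (prepend a row index, then drop the rows listed in the deletion vector) can be illustrated in plain Java. This is a minimal sketch of the semantics only, not the cuDF implementation: the class DeletionVectorSketch, its Row type, and filterRows are hypothetical names, and a Set<Long> stands in for the serialized 64-bit roaring bitmap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeletionVectorSketch {
    /** One output row: the prepended row index followed by the original value. */
    public static final class Row {
        public final long index;
        public final String value;

        Row(long index, String value) {
            this.index = index;
            this.value = value;
        }
    }

    /**
     * Prepends a 0-based index to every row and keeps only the rows whose index
     * is not in deletedRows. A null or empty set means nothing is filtered.
     * Assumption: deletedRows is a stand-in for a decoded deletion vector.
     */
    public static List<Row> filterRows(String[] values, Set<Long> deletedRows) {
        List<Row> kept = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            long index = i;
            boolean deleted = deletedRows != null && deletedRows.contains(index);
            if (!deleted) {
                kept.add(new Row(index, values[i]));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<Long> deleted = new HashSet<>(Arrays.asList(1L, 3L));
        for (Row r : filterRows(new String[] {"a", "b", "c", "d"}, deleted)) {
            System.out.println(r.index + " " + r.value);
        }
    }
}
```

      Note that the surviving rows keep their original index values, which is what lets a caller map filtered rows back to positions in the source file.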
    • readParquet

      public static Table readParquet(ParquetOptions opts, HostMemoryBuffer[] dataBuffers, int[][] rowGroups, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Reads the selected row groups of Parquet data held in host memory buffers with deletion vector support. The reader prepends an index column to the table and then applies the deletion vector filter. If row group metadata is not provided, the index column is a simple sequence that starts at 0 and has one entry per row. If the deletion vector is null or empty, the table with the prepended index column is returned as-is without filtering.
      Parameters:
      opts - The options for Parquet reading.
      dataBuffers - Array of HostMemoryBuffers containing the Parquet file data.
      rowGroups - Row group indices to read.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.
      Returns:
      A Table containing the filtered data with a prepended UINT64 index column.
    • readParquet

      public static Table readParquet(ParquetOptions opts, String[] inputFilePaths, int[][] rowGroups, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Reads the selected row groups of the Parquet files at the given paths with deletion vector support. The reader prepends an index column to the table and then applies the deletion vector filter. If row group metadata is not provided, the index column is a simple sequence that starts at 0 and has one entry per row. If the deletion vector is null or empty, the table with the prepended index column is returned as-is without filtering.
      Parameters:
      opts - The options for Parquet reading.
      inputFilePaths - Array of input Parquet file paths.
      rowGroups - Row group indices to read.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.
      Returns:
      A Table containing the filtered data with a prepended UINT64 index column.
    • newParquetChunkedReader

      public static DeletionVector.ParquetChunkedReader newParquetChunkedReader(long chunkSizeByteLimit, long passReadLimit, ParquetOptions opts, HostMemoryBuffer[] dataBuffers, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Construct the reader instance from a read limit and data in host memory buffers.
      Parameters:
      chunkSizeByteLimit - Limit on total number of bytes to be returned per read, or 0 if there is no limit.
      passReadLimit - Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit.
      opts - The options for Parquet reading.
      dataBuffers - Array of HostMemoryBuffers containing the Parquet file data.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.
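      To make the meaning of chunkSizeByteLimit concrete, the following plain-Java sketch splits rows into chunks under a byte budget, treating 0 as "no limit". It is an illustration under stated assumptions, not how the native reader chooses chunk boundaries: ChunkLimitSketch, splitIntoChunks, and the per-row size array are hypothetical, and the sketch lets a single oversized row form its own chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkLimitSketch {
    /**
     * Groups row indices into consecutive chunks whose total size stays within
     * chunkSizeByteLimit. A limit of 0 disables chunking, so all rows land in
     * one chunk. A row larger than the limit is emitted as a chunk by itself.
     */
    public static List<List<Integer>> splitIntoChunks(long[] rowSizes, long chunkSizeByteLimit) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        boolean limited = chunkSizeByteLimit > 0;
        for (int row = 0; row < rowSizes.length; row++) {
            boolean wouldOverflow = currentBytes + rowSizes[row] > chunkSizeByteLimit;
            if (limited && !current.isEmpty() && wouldOverflow) {
                // Close the current chunk before it exceeds the byte budget.
                chunks.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(row);
            currentBytes += rowSizes[row];
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        long[] sizes = {4, 4, 4, 4};
        System.out.println(splitIntoChunks(sizes, 8).size() + " chunks with an 8-byte limit");
        System.out.println(splitIntoChunks(sizes, 0).size() + " chunk(s) with no limit");
    }
}
```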
    • newParquetChunkedReader

      public static DeletionVector.ParquetChunkedReader newParquetChunkedReader(long chunkSizeByteLimit, long passReadLimit, ParquetOptions opts, HostMemoryBuffer[] dataBuffers, int[][] rowGroups, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Construct the reader instance from a read limit and data in host memory buffers.
      Parameters:
      chunkSizeByteLimit - Limit on total number of bytes to be returned per read, or 0 if there is no limit.
      passReadLimit - Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit.
      opts - The options for Parquet reading.
      dataBuffers - Array of HostMemoryBuffers containing the Parquet file data.
      rowGroups - Row group indices to read.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.
    • newParquetChunkedReader

      public static DeletionVector.ParquetChunkedReader newParquetChunkedReader(long chunkSizeByteLimit, long passReadLimit, ParquetOptions opts, String[] inputFilePaths, int[][] rowGroups, DeletionVector.DeletionVectorInfo[] deletionVectorInfos)
      Construct the reader instance from a read limit and a set of input Parquet file paths.
      Parameters:
      chunkSizeByteLimit - Limit on total number of bytes to be returned per read, or 0 if there is no limit.
      passReadLimit - Limit on the amount of memory used for reading and decompressing data, or 0 if there is no limit.
      opts - The options for Parquet reading.
      inputFilePaths - Array of input Parquet file paths.
      rowGroups - Row group indices to read.
      deletionVectorInfos - Array of DeletionVectorInfo objects representing deletion vectors for each Parquet file to read.