Class JCudfSerialization

java.lang.Object
ai.rapids.cudf.JCudfSerialization

public class JCudfSerialization extends Object
Serialize and deserialize CUDF tables and columns using a custom format. The goal is to provide an efficient way to serialize and deserialize cudf data for distributed processing within a single application, typically after a partition-like operation. It is not intended for inter-application communication or for long-term storage of data; there are much better standards-based formats for those uses.

The goal is to transfer data from a local GPU to a remote GPU as quickly and efficiently as possible using built-in Java communication channels. There is no guarantee of compatibility between different releases of CUDF. This allows the format to adapt if internal memory layouts and formats change.

This version optimizes for reduced memory transfers, and as such will try to do the fewest transfers possible when putting the data back onto the GPU. This means that it will slice a single large memory buffer into smaller buffers used by the resulting ColumnVectors. The downside is that generally none of the memory can be released until all of the ColumnVectors are closed. It is assumed that this will not be a problem because, for processing efficiency, the transferred data will likely be combined with other similar batches from other processes into a single larger buffer.
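The slicing trade-off described above can be illustrated with plain java.nio buffers. This is only an analogy under stated assumptions: the class name BufferSliceSketch and the helper sliceView are hypothetical, and the code does not use or reproduce the cudf buffer classes. It shows that views carved out of one allocation share its storage, which is why the allocation stays alive for as long as any view is in use.

```java
import java.nio.ByteBuffer;

public class BufferSliceSketch {
    /**
     * Hypothetical helper: returns a view of [offset, offset + length)
     * that shares storage with the parent buffer. No bytes are copied.
     */
    public static ByteBuffer sliceView(ByteBuffer parent, int offset, int length) {
        ByteBuffer dup = parent.duplicate(); // own position/limit, same storage
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice();
    }

    public static void main(String[] args) {
        // One large allocation standing in for the single transferred buffer.
        ByteBuffer whole = ByteBuffer.allocate(16);
        for (int i = 0; i < 16; i++) {
            whole.put(i, (byte) i);
        }

        // Two "column" views over the same allocation.
        ByteBuffer colA = sliceView(whole, 0, 8);
        ByteBuffer colB = sliceView(whole, 8, 8);

        // A write through the parent is visible through the view,
        // which demonstrates that the storage is shared.
        whole.put(8, (byte) 42);
        System.out.println(colA.capacity() + " " + colB.capacity() + " " + colB.get(0));
    }
}
```

Because every view refers to the same backing storage, dropping one view frees nothing; the storage becomes reclaimable only once no view refers to it.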

  • Constructor Details

    • JCudfSerialization

      public JCudfSerialization()
  • Method Details

    • getSerializedSizeInBytes

      public static long getSerializedSizeInBytes(HostColumnVector[] columns, long rowOffset, long numRows)
      Get the size in bytes needed to serialize the given data. The columns should be in host memory before calling this.
      Parameters:
      columns - columns to be serialized.
      rowOffset - the first row to serialize.
      numRows - the number of rows to serialize.
      Returns:
      the size in bytes needed to serialize the data including the header.
    • writeToStream

      public static void writeToStream(Table t, OutputStream out, long rowOffset, long numRows) throws IOException
      Write all or part of a table out in an internal format.
      Parameters:
      t - the table to be written.
      out - the stream to write the serialized table out to.
      rowOffset - the first row to write out.
      numRows - the number of rows to write out.
      Throws:
      IOException
    • writeToStream

      public static void writeToStream(ColumnVector[] columns, OutputStream out, long rowOffset, long numRows) throws IOException
      Write all or part of a set of columns out in an internal format.
      Parameters:
      columns - the columns to be written.
      out - the stream to write the serialized table out to.
      rowOffset - the first row to write out.
      numRows - the number of rows to write out.
      Throws:
      IOException
    • writeToStream

      public static void writeToStream(HostColumnVector[] columns, OutputStream out, long rowOffset, long numRows) throws IOException
      Write all or part of a set of columns out in an internal format.
      Parameters:
      columns - the columns to be written.
      out - the stream to write the serialized table out to.
      rowOffset - the first row to write out.
      numRows - the number of rows to write out.
      Throws:
      IOException
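The rowOffset and numRows parameters of the writeToStream overloads select a contiguous range of rows. As a minimal sketch of how a caller might derive such ranges when writing a table in several pieces, the helper below splits a row count into (rowOffset, numRows) pairs. The class RowRangeSketch and the method splitRows are hypothetical names introduced for illustration; they are not part of JCudfSerialization, and the code makes no cudf calls.

```java
import java.util.ArrayList;
import java.util.List;

public class RowRangeSketch {
    /**
     * Hypothetical helper: splits totalRows into consecutive ranges of at
     * most maxRowsPerChunk rows. Each element is {rowOffset, numRows}.
     */
    public static List<long[]> splitRows(long totalRows, long maxRowsPerChunk) {
        if (totalRows < 0 || maxRowsPerChunk <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and maxRowsPerChunk > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += maxRowsPerChunk) {
            // The last chunk may be shorter than the others.
            long numRows = Math.min(maxRowsPerChunk, totalRows - offset);
            ranges.add(new long[] {offset, numRows});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 rows in chunks of at most 4 rows each.
        for (long[] r : splitRows(10, 4)) {
            System.out.println("rowOffset=" + r[0] + " numRows=" + r[1]);
        }
    }
}
```

Each resulting pair could then be passed as the rowOffset and numRows arguments of one writeToStream call, assuming the caller wants one serialized batch per range.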
    • writeRowsToStream

      public static void writeRowsToStream(OutputStream out, long numRows) throws IOException
      Write a row-count-only header to the output stream, for the case where a columnar batch with no columns but a non-zero row count is received.
      Parameters:
      out - the stream to write the serialized table out to.
      numRows - the number of rows to write out.
      Throws:
      IOException
    • writeConcatedStream

      public static void writeConcatedStream(JCudfSerialization.SerializedTableHeader[] headers, HostMemoryBuffer[] dataBuffers, OutputStream out) throws IOException
      Take the data from multiple batches, described by the parsed headers and held in dataBuffers, and write it to out as if it were a single buffer.
      Parameters:
      headers - the headers parsed from multiple streams.
      dataBuffers - an array of buffers that hold the data, one per header.
      out - what to write the data out to.
      Throws:
      IOException - on any error.
    • readAndConcat

      public static Table readAndConcat(JCudfSerialization.SerializedTableHeader[] headers, HostMemoryBuffer[] dataBuffers) throws IOException
      Throws:
      IOException
    • concatToContiguousTable

      public static ContiguousTable concatToContiguousTable(JCudfSerialization.SerializedTableHeader[] headers, HostMemoryBuffer[] dataBuffers) throws IOException
      Concatenate multiple tables in host memory into a contiguous table in device memory.
      Parameters:
      headers - table headers corresponding to the host table buffers
      dataBuffers - host table buffer for each input table to be concatenated
      Returns:
      contiguous table in device memory
      Throws:
      IOException
    • concatToHostBuffer

      public static JCudfSerialization.HostConcatResult concatToHostBuffer(JCudfSerialization.SerializedTableHeader[] headers, HostMemoryBuffer[] dataBuffers, HostMemoryAllocator hostMemoryAllocator) throws IOException
      Concatenate multiple tables in host memory into a single host table buffer.
      Parameters:
      headers - table headers corresponding to the host table buffers
      dataBuffers - host table buffer for each input table to be concatenated
      hostMemoryAllocator - allocator for host memory buffers
      Returns:
      host table header and buffer
      Throws:
      IOException
    • concatToHostBuffer

      public static JCudfSerialization.HostConcatResult concatToHostBuffer(JCudfSerialization.SerializedTableHeader[] headers, HostMemoryBuffer[] dataBuffers) throws IOException
      Throws:
      IOException
    • unpackHostColumnVectors

      public static HostColumnVector[] unpackHostColumnVectors(JCudfSerialization.SerializedTableHeader header, HostMemoryBuffer hostBuffer)
      Deserialize a serialized contiguous table into an array of host columns.
      Parameters:
      header - serialized table header
      hostBuffer - buffer containing the data for all columns in the serialized table
      Returns:
      array of host columns representing the data from the serialized table
    • readTableIntoBuffer

      public static void readTableIntoBuffer(InputStream in, JCudfSerialization.SerializedTableHeader header, HostMemoryBuffer buffer) throws IOException
      After reading the header for a table, read the data portion into a host-side buffer.
      Parameters:
      in - the stream to read the data from.
      header - the header that was just read.
      buffer - the buffer to write the data into. If there is not enough room to store the data in buffer, it will not be read and header will still have dataRead set to false.
      Throws:
      IOException
    • readTableFrom

      public static JCudfSerialization.TableAndRowCountPair readTableFrom(InputStream in, HostMemoryAllocator hostMemoryAllocator) throws IOException
      Read a serialized table from the given InputStream.
      Parameters:
      in - the stream to read the table data from.
      hostMemoryAllocator - a host memory allocator for an intermediate host memory buffer
      Returns:
      the deserialized table in device memory, or null if the stream has no table to read, that is, if the end of the stream is reached at the very beginning.
      Throws:
      IOException - on any error.
      EOFException - if the data stream ended unexpectedly in the middle of processing.
    • readTableFrom

      public static JCudfSerialization.TableAndRowCountPair readTableFrom(InputStream in) throws IOException
      Throws:
      IOException