cudf::io Namespace Reference

IO interfaces. More...

Namespaces

 orc
 Orc I/O interfaces.
 
 parquet
 Parquet I/O interfaces.
 

Classes

class  arrow_io_source
 Implementation class for reading from an Apache Arrow file. The file could be a memory-mapped file or other implementation supported by Arrow. More...
 
class  avro_reader_options
 Settings to use for read_avro(). More...
 
class  avro_reader_options_builder
 Builder to build options for read_avro(). More...
 
class  csv_reader_options
 Settings to use for read_csv(). More...
 
class  csv_reader_options_builder
 Builder to build options for read_csv(). More...
 
class  csv_writer_options
 Settings to use for write_csv(). More...
 
class  csv_writer_options_builder
 Builder to build options for write_csv(). More...
 
class  data_sink
 Interface class for storing the output data from the writers. More...
 
class  datasource
 Interface class for providing input data to the readers. More...
 
struct  schema_element
 Allows specifying the target types for nested JSON data via json_reader_options' set_dtypes method. More...
 
class  json_reader_options
 Input arguments to the read_json interface. More...
 
class  json_reader_options_builder
 Builds settings to use for read_json(). More...
 
class  json_writer_options
 Settings to use for write_json(). More...
 
class  json_writer_options_builder
 Builder to build options for write_json(). More...
 
struct  host_mr_options
 Options to configure the default host memory resource. More...
 
class  orc_reader_options
 Settings to use for read_orc(). More...
 
class  orc_reader_options_builder
 Builds settings to use for read_orc(). More...
 
class  chunked_orc_reader
 The chunked orc reader class to read an ORC file iteratively into a series of tables, chunk by chunk. More...
 
class  orc_writer_options
 Settings to use for write_orc(). More...
 
class  orc_writer_options_builder
 Builds settings to use for write_orc(). More...
 
class  chunked_orc_writer_options
 Settings to use for write_orc_chunked(). More...
 
class  chunked_orc_writer_options_builder
 Builds settings to use for write_orc_chunked(). More...
 
class  orc_chunked_writer
 Chunked orc writer class writes an ORC file in a chunked/stream form. More...
 
struct  raw_orc_statistics
 Holds column names and buffers containing raw file-level and stripe-level statistics. More...
 
struct  minmax_statistics
 Base class for column statistics that include optional minimum and maximum. More...
 
struct  sum_statistics
 Base class for column statistics that include an optional sum. More...
 
struct  integer_statistics
 Statistics for integral columns. More...
 
struct  double_statistics
 Statistics for floating point columns. More...
 
struct  string_statistics
 Statistics for string columns. More...
 
struct  bucket_statistics
 Statistics for boolean columns. More...
 
struct  decimal_statistics
 Statistics for decimal columns. More...
 
struct  timestamp_statistics
 Statistics for timestamp columns. More...
 
struct  column_statistics
 Contains per-column ORC statistics. More...
 
struct  parsed_orc_statistics
 Holds column names and parsed file-level and stripe-level statistics. More...
 
struct  orc_column_schema
 Schema of an ORC column, including the nested columns. More...
 
struct  orc_schema
 Schema of an ORC file. More...
 
class  orc_metadata
 Information about content of an ORC file. More...
 
class  parquet_reader_options
 Settings for read_parquet(). More...
 
class  parquet_reader_options_builder
 Builds parquet_reader_options to use for read_parquet(). More...
 
class  chunked_parquet_reader
 The chunked parquet reader class to read a Parquet file iteratively into a series of tables, chunk by chunk. More...
 
struct  sorting_column
 Struct used to describe column sorting metadata. More...
 
class  parquet_writer_options
 Settings for write_parquet(). More...
 
class  parquet_writer_options_builder
 Class to build parquet_writer_options. More...
 
class  chunked_parquet_writer_options
 Settings for write_parquet_chunked(). More...
 
class  chunked_parquet_writer_options_builder
 Builds options for chunked_parquet_writer_options. More...
 
class  parquet_chunked_writer
 Chunked parquet writer class to handle options and write tables in chunks. More...
 
struct  parquet_column_schema
 Schema of a parquet column, including the nested columns. More...
 
struct  parquet_schema
 Schema of a parquet file. More...
 
class  parquet_metadata
 Information about content of a parquet file. More...
 
class  writer_compression_statistics
 Statistics about compression performed by a writer. More...
 
struct  column_name_info
 Detailed name (and optionally nullability) information for output columns. More...
 
struct  table_metadata
 Table metadata returned by IO readers. More...
 
struct  table_with_metadata
 Table with table metadata used by io readers to return the metadata by value. More...
 
struct  host_buffer
 Non-owning view of a host memory buffer. More...
 
struct  source_info
 Source information for read interfaces. More...
 
struct  sink_info
 Destination information for write interfaces. More...
 
class  column_in_metadata
 Metadata for a column. More...
 
class  table_input_metadata
 Metadata for a table. More...
 
struct  partition_info
 Information used while writing partitioned datasets. More...
 
class  reader_column_schema
 Schema element for readers. More...
 

Typedefs

using no_statistics = std::monostate
 Monostate type alias for the statistics variant.
 
using date_statistics = minmax_statistics< int32_t >
 Statistics for date(time) columns.
 
using binary_statistics = sum_statistics< int64_t >
 Statistics for binary columns. More...
 
using statistics_type = std::variant< no_statistics, integer_statistics, double_statistics, string_statistics, bucket_statistics, decimal_statistics, date_statistics, binary_statistics, timestamp_statistics >
 Variant type for ORC type-specific column statistics. More...
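A variant such as statistics_type is usually inspected with std::visit or std::holds_alternative. The sketch below shows the general pattern only. The `*_sketch` types are hypothetical stand-ins with far fewer alternatives and fields than the real cudf::io types, and `describe` is an illustrative helper, not part of libcudf.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-ins, NOT the cudf::io definitions.
using no_statistics = std::monostate;

struct integer_statistics_sketch {
  std::optional<std::int64_t> minimum;  // may be absent
  std::optional<std::int64_t> maximum;  // may be absent
};

struct string_statistics_sketch {
  std::optional<std::string> minimum;
  std::optional<std::string> maximum;
};

// Reduced variant: the first alternative means "no statistics available".
using statistics_sketch =
  std::variant<no_statistics, integer_statistics_sketch, string_statistics_sketch>;

// Returns a short description of whichever alternative is held.
inline std::string describe(statistics_sketch const& stats)
{
  return std::visit(
    [](auto const& s) -> std::string {
      using T = std::decay_t<decltype(s)>;
      if constexpr (std::is_same_v<T, no_statistics>) {
        return "none";
      } else if constexpr (std::is_same_v<T, integer_statistics_sketch>) {
        return s.minimum.has_value()
                 ? std::string("integer, min=") + std::to_string(*s.minimum)
                 : std::string("integer, min=unknown");
      } else {
        return s.minimum.has_value() ? std::string("string, min=") + *s.minimum
                                     : std::string("string, min=unknown");
      }
    },
    stats);
}
```

A default-constructed variant holds the monostate alternative, which is why that alternative is listed first. The sketch requires C++17.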
 

Enumerations

enum class  json_recovery_mode_t { FAIL , RECOVER_WITH_NULL }
 Control the error recovery behavior of the json parser. More...
 
enum class  compression_type {
  NONE , AUTO , SNAPPY , GZIP ,
  BZIP2 , BROTLI , ZIP , XZ ,
  ZLIB , LZ4 , LZO , ZSTD
}
 Compression algorithms. More...
 
enum class  io_type {
  FILEPATH , HOST_BUFFER , DEVICE_BUFFER , VOID ,
  USER_IMPLEMENTED
}
 Data source or destination types. More...
 
enum class  quote_style { MINIMAL , ALL , NONNUMERIC , NONE }
 Behavior when handling quotations in field data. More...
 
enum  statistics_freq { STATISTICS_NONE = 0 , STATISTICS_ROWGROUP = 1 , STATISTICS_PAGE = 2 , STATISTICS_COLUMN = 3 }
 Column statistics granularity type for parquet/orc writers. More...
 
enum class  column_encoding {
  USE_DEFAULT = -1 , DICTIONARY , PLAIN , DELTA_BINARY_PACKED ,
  DELTA_LENGTH_BYTE_ARRAY , DELTA_BYTE_ARRAY , BYTE_STREAM_SPLIT , DIRECT ,
  DIRECT_V2 , DICTIONARY_V2
}
 Valid encodings for use with column_in_metadata::set_encoding() More...
 
enum  dictionary_policy { NEVER = 0 , ADAPTIVE = 1 , ALWAYS = 2 }
 Control use of dictionary encoding for parquet writer. More...
 

Functions

table_with_metadata read_avro (avro_reader_options const &options, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Reads an Avro dataset into a set of columns. More...
 
table_with_metadata read_csv (csv_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Reads a CSV dataset into a set of columns. More...
 
void write_csv (csv_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to CSV format. More...
 
table_with_metadata read_json (json_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Reads a JSON dataset into a set of columns. More...
 
void write_json (json_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to JSON format. More...
 
rmm::host_async_resource_ref set_host_memory_resource (rmm::host_async_resource_ref mr)
 Set the rmm resource to be used for host memory allocations by cudf::detail::hostdevice_vector. More...
 
rmm::host_async_resource_ref get_host_memory_resource ()
 Get the rmm resource being used for host memory allocations by cudf::detail::hostdevice_vector. More...
 
bool config_default_host_memory_resource (host_mr_options const &opts)
 Configure the size of the default host memory resource. More...
 
table_with_metadata read_orc (orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Reads an ORC dataset into a set of columns. More...
 
void write_orc (orc_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Writes a set of columns to ORC format. More...
 
raw_orc_statistics read_raw_orc_statistics (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads raw file-level and stripe-level statistics of an ORC dataset. More...
 
parsed_orc_statistics read_parsed_orc_statistics (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads and parses file-level and stripe-level statistics of an ORC dataset. More...
 
orc_metadata read_orc_metadata (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads metadata of an ORC dataset. More...
 
table_with_metadata read_parquet (parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Reads a Parquet dataset into a set of columns. More...
 
std::unique_ptr< std::vector< uint8_t > > write_parquet (parquet_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Writes a set of columns to parquet format. More...
 
std::unique_ptr< std::vector< uint8_t > > merge_row_group_metadata (std::vector< std::unique_ptr< std::vector< uint8_t >>> const &metadata_list)
 Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob. More...
 
parquet_metadata read_parquet_metadata (source_info const &src_info)
 Reads metadata of a Parquet dataset. More...
 
template<typename T >
constexpr auto is_byte_like_type ()
 Returns true if the type is byte-like, meaning it is reasonable to pass as a pointer to bytes. More...
 

Variables

constexpr size_t default_stripe_size_bytes = 64 * 1024 * 1024
 64MB default orc stripe size
 
constexpr size_type default_stripe_size_rows = 1000000
 1M rows default orc stripe rows
 
constexpr size_type default_row_index_stride = 10000
 10K rows default orc row index stride
 
constexpr size_t default_row_group_size_bytes = 128 * 1024 * 1024
 128MB per row group
 
constexpr size_type default_row_group_size_rows = 1000000
 1 million rows per row group
 
constexpr size_t default_max_page_size_bytes = 512 * 1024
 512KB per page
 
constexpr size_type default_max_page_size_rows = 20000
 20k rows per page
 
constexpr int32_t default_column_index_truncate_length = 64
 truncate to 64 bytes
 
constexpr size_t default_max_dictionary_size = 1024 * 1024
 1MB dictionary size
 
constexpr size_type default_max_page_fragment_size = 5000
 5000 rows per page fragment
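To make the magnitudes of these defaults concrete, the sketch below restates a few of them in a separate `sketch` namespace and checks the arithmetic at compile time. The two derived ratios are illustrative only and say nothing about how the writers actually split data. The `size_type` alias assumes a 32-bit signed integer.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {
// Assumption: cudf::size_type is a 32-bit signed integer.
using size_type = std::int32_t;

// Values copied from the variable list above.
constexpr std::size_t default_stripe_size_bytes      = 64 * 1024 * 1024;
constexpr std::size_t default_row_group_size_bytes   = 128 * 1024 * 1024;
constexpr std::size_t default_max_page_size_bytes    = 512 * 1024;
constexpr size_type default_max_page_size_rows       = 20000;
constexpr size_type default_max_page_fragment_size   = 5000;

// Illustrative ratios derived from the defaults (hypothetical names).
constexpr std::size_t max_size_pages_per_row_group =
  default_row_group_size_bytes / default_max_page_size_bytes;
constexpr size_type fragments_per_full_page =
  default_max_page_size_rows / default_max_page_fragment_size;

static_assert(default_stripe_size_bytes == 67108864, "64 MiB in bytes");
static_assert(default_row_group_size_bytes == 134217728, "128 MiB in bytes");
static_assert(max_size_pages_per_row_group == 256, "128 MiB / 512 KiB");
static_assert(fragments_per_full_page == 4, "20000 rows / 5000 rows");
}  // namespace sketch
```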
 

Detailed Description

IO interfaces.

Function Documentation

◆ config_default_host_memory_resource()

bool cudf::io::config_default_host_memory_resource ( host_mr_options const &  opts)

Configure the size of the default host memory resource.

Exceptions
cudf::logic_error  if called after the default host memory resource has been created
Parameters
opts  Options to configure the default host memory resource
Returns
True if this call successfully configured the host memory resource, false if a resource was already configured.

◆ get_host_memory_resource()

rmm::host_async_resource_ref cudf::io::get_host_memory_resource ( )

Get the rmm resource being used for host memory allocations by cudf::detail::hostdevice_vector.

Returns
The rmm resource used for host-side allocations

◆ set_host_memory_resource()

rmm::host_async_resource_ref cudf::io::set_host_memory_resource ( rmm::host_async_resource_ref  mr)

Set the rmm resource to be used for host memory allocations by cudf::detail::hostdevice_vector.

hostdevice_vector is a utility class that uses a pair of host-side and device-side buffers for bouncing state between the CPU and the GPU. The resource set with this function (typically a pinned memory allocator) is what it uses to allocate space for its host-side buffer.

Parameters
mr  The rmm resource to be used for host-side allocations
Returns
The previous resource that was in use
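Because the setter returns the previous resource, a caller can swap in a resource temporarily and restore the old one afterwards. The sketch below illustrates that set-and-restore pattern with an RAII guard. Everything in it is a hypothetical stand-in: `resource_ref_sketch`, `set_resource`, `get_resource` and `scoped_resource` model the behavior described above and are not the rmm or cudf::io API.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a host memory resource handle.
struct resource_ref_sketch {
  std::string name;
};

// Process-wide "current" resource, starting with a default.
inline resource_ref_sketch& current_resource()
{
  static resource_ref_sketch resource{"default"};
  return resource;
}

// Mirrors the documented contract: install `mr`, return the previous one.
inline resource_ref_sketch set_resource(resource_ref_sketch mr)
{
  auto previous      = current_resource();
  current_resource() = std::move(mr);
  return previous;
}

inline resource_ref_sketch get_resource() { return current_resource(); }

// RAII guard: installs a resource on construction and restores the
// previously active one on destruction.
class scoped_resource {
 public:
  explicit scoped_resource(resource_ref_sketch mr) : previous_{set_resource(std::move(mr))} {}
  ~scoped_resource() { set_resource(previous_); }

  scoped_resource(scoped_resource const&)            = delete;
  scoped_resource& operator=(scoped_resource const&) = delete;

 private:
  resource_ref_sketch previous_;
};
```

With the real API, the same idea would be to keep the value returned by set_host_memory_resource() and pass it back once the temporary resource is no longer needed. Whether that is safe while allocations from the temporary resource are still alive is not covered here.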