text

class pylibcudf.io.text.ByteRangeInfo(size_t offset, size_t size)

Information about a byte range in a file.

For details, see cudf::io::text::byte_range_info

Parameters:

offset (int): Offset in bytes from the start of the file.
size (int): Size of the range in bytes.

Attributes

offset: Get the offset in bytes.
size: Get the size in bytes.
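A byte range is just an (offset, size) pair, so a common preparatory step is to cut a file into contiguous, non-overlapping ranges that can be processed one at a time. The following is a minimal pure-Python sketch of that arithmetic. The helper name `partition_byte_ranges` is hypothetical and not part of pylibcudf; the pairs it returns are plain tuples of the kind that could be supplied as a byte range.

```python
def partition_byte_ranges(total_size: int, num_parts: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into contiguous (offset, size) pairs.

    Hypothetical helper for illustration only; not a pylibcudf function.
    """
    if total_size < 0 or num_parts <= 0:
        raise ValueError("total_size must be >= 0 and num_parts must be > 0")
    # Spread the remainder over the first `extra` parts so sizes differ by at most 1.
    base, extra = divmod(total_size, num_parts)
    ranges = []
    offset = 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        ranges.append((offset, size))
        offset += size
    return ranges


print(partition_byte_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Because the ranges are contiguous and do not overlap, every byte of the file falls in exactly one of them.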

class pylibcudf.io.text.DataChunkSource

Data source for multibyte_split

Parameters:

data (str): Filename or data itself.

class pylibcudf.io.text.ParseOptions(byte_range=None, *, strip_delimiters=False)

Parsing options for multibyte_split

Parameters:

byte_range (list | tuple, default None): Only rows starting inside this byte range will be part of the output column.
strip_delimiters (bool, default False): Whether delimiters at the end of rows should be stripped from the output column.
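The two options above can be illustrated with a small CPU-only model. This is a sketch of the semantics as described in the parameter text, assuming that each row keeps its trailing delimiter unless `strip_delimiters` is set and that `byte_range` is an (offset, size) pair. The function `model_multibyte_split` is hypothetical; it is not the library's implementation and does not try to reproduce its edge-case behavior (for example, input that ends with a delimiter).

```python
def model_multibyte_split(text: bytes, delimiter: bytes,
                          byte_range=None, strip_delimiters=False) -> list[bytes]:
    """Pure-Python model of the described options; illustration only."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")

    # Collect (start_offset, row) pairs; each row ends just after a delimiter.
    rows = []
    start = 0
    while start < len(text):
        hit = text.find(delimiter, start)
        end = len(text) if hit == -1 else hit + len(delimiter)
        rows.append((start, text[start:end]))
        start = end

    # Keep only rows whose first byte lies inside [offset, offset + size).
    if byte_range is not None:
        offset, size = byte_range
        rows = [(s, r) for s, r in rows if offset <= s < offset + size]

    # Optionally drop the delimiter from the end of each row.
    out = []
    for _, row in rows:
        if strip_delimiters and row.endswith(delimiter):
            row = row[: -len(delimiter)]
        out.append(row)
    return out


text = b"a::b::c"
print(model_multibyte_split(text, b"::"))                         # [b'a::', b'b::', b'c']
print(model_multibyte_split(text, b"::", strip_delimiters=True))  # [b'a', b'b', b'c']
print(model_multibyte_split(text, b"::", byte_range=(3, 4)))      # [b'b::', b'c']
```

In the last call the rows start at byte offsets 0, 3 and 6, so the range covering bytes 3 through 6 keeps the second and third rows only.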

pylibcudf.io.text.make_source(str data) → DataChunkSource

Creates a data source capable of producing device-buffered views of the given string.

Parameters:

data (str): The host data to be exposed as a data chunk source.

Returns:

DataChunkSource: The data chunk source for the provided host data.

pylibcudf.io.text.make_source_from_bgzip_file(str filename, int virtual_begin=-1, int virtual_end=-1) → DataChunkSource

Creates a data source capable of producing device-buffered views of a BGZIP compressed file with virtual record offsets.

Parameters:

filename (str): The filename of the BGZIP-compressed file to be exposed as a data chunk source.
virtual_begin (int, default -1): The virtual (Tabix) offset of the first byte to be read. Its upper 48 bits describe the offset into the compressed file; its lower 16 bits describe the block-local offset.
virtual_end (int, default -1): The virtual (Tabix) offset one past the last byte to be read.

Returns:

DataChunkSource: The data chunk source for the provided filename.
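The bit layout of a virtual offset described above (upper 48 bits for the position in the compressed file, lower 16 bits for the position inside the block) can be worked through in plain Python. The helper names `pack_virtual_offset` and `unpack_virtual_offset` are hypothetical and only illustrate the arithmetic; they are not pylibcudf functions.

```python
def pack_virtual_offset(compressed_offset: int, block_offset: int) -> int:
    """Combine a compressed-file offset and a block-local offset into one integer."""
    if not 0 <= compressed_offset < (1 << 48):
        raise ValueError("compressed_offset must fit in 48 bits")
    if not 0 <= block_offset < (1 << 16):
        raise ValueError("block_offset must fit in 16 bits")
    # Upper 48 bits: offset into the compressed file. Lower 16 bits: offset in the block.
    return (compressed_offset << 16) | block_offset


def unpack_virtual_offset(virtual_offset: int) -> tuple[int, int]:
    """Split a virtual offset back into (compressed_offset, block_offset)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


v = pack_virtual_offset(1000, 42)
print(v)                         # 65536042
print(unpack_virtual_offset(v))  # (1000, 42)
```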

pylibcudf.io.text.make_source_from_file(str filename) → DataChunkSource

Creates a data source capable of producing device-buffered views of the file.

Parameters:

filename (str): The filename of the file to be exposed as a data chunk source.

Returns:

DataChunkSource: The data chunk source for the provided filename.

pylibcudf.io.text.multibyte_split(DataChunkSource source, str delimiter, ParseOptions options=None, Stream stream=None, DeviceMemoryResource mr=None) → Column

Splits the source text into a strings column using a multiple byte delimiter.

For details, see multibyte_split()

Parameters:

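source (DataChunkSource): The source text to split.
delimiter (str): UTF-8 encoded string for which to find offsets in the source.
options (ParseOptions, optional): The parsing options to use (including byte range).
stream (Stream, optional): CUDA stream for device memory operations and kernel launches.
mr (DeviceMemoryResource, optional): Device memory resource used to allocate the returned column's device memory.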

Returns:

Column: The strings found by splitting the source by the delimiter within the relevant byte range.
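Putting the pieces on this page together, a call sequence might look like the sketch below. It uses only the signatures listed above (`make_source`, `ParseOptions`, `multibyte_split`); the wrapper name `split_text` is hypothetical, and the import guard is an assumption made so the snippet can be loaded on a machine without pylibcudf or a GPU.

```python
try:
    import pylibcudf as plc
except ImportError:  # pylibcudf is only available in a CUDA-enabled environment
    plc = None


def split_text(text, delimiter, byte_range=None, strip_delimiters=False):
    """Split an in-memory string with multibyte_split (hypothetical wrapper)."""
    if plc is None:
        raise RuntimeError("pylibcudf is not installed in this environment")
    # Expose the host string as a data chunk source.
    source = plc.io.text.make_source(text)
    # byte_range is a list or tuple, or None to read everything.
    options = plc.io.text.ParseOptions(
        byte_range=byte_range, strip_delimiters=strip_delimiters
    )
    # Returns a pylibcudf Column of strings.
    return plc.io.text.multibyte_split(source, delimiter, options)
```

For file input, the same wrapper could be adapted by swapping `make_source` for `make_source_from_file` or `make_source_from_bgzip_file`; the rest of the call sequence stays the same.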
