text

class pylibcudf.io.text.DataChunkSource

Data source for multibyte_split.

Parameters:
data : str

Filename or data itself.

class pylibcudf.io.text.ParseOptions(byte_range=None, *, strip_delimiters=False)

Parsing options for multibyte_split.

Parameters:
byte_range : list | tuple, default None

Only rows starting inside this byte range will be part of the output column.

strip_delimiters : bool, default False

Whether delimiters at the end of rows should be stripped from the output column.
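Because only rows that start inside the byte range are emitted, a large input can in principle be processed in pieces without duplicating rows, as long as the ranges tile the input without gaps or overlap. The sketch below computes such ranges in plain Python. It assumes that byte_range is an (offset, size) pair, which this page does not state, and make_byte_ranges is a hypothetical helper, not part of pylibcudf.

```python
def make_byte_ranges(total_size, num_chunks):
    """Tile [0, total_size) with at most num_chunks contiguous ranges.

    Hypothetical helper. Each range is returned as (offset, size), which is
    an ASSUMPTION about the layout ParseOptions(byte_range=...) expects.
    """
    if total_size < 0:
        raise ValueError("total_size must be non-negative")
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")

    # Ceiling division so that the ranges cover the whole input.
    chunk_size = -(-total_size // num_chunks)

    ranges = []
    offset = 0
    while offset < total_size:
        size = min(chunk_size, total_size - offset)
        ranges.append((offset, size))
        offset += size
    return ranges
```

Each resulting pair could then be passed as byte_range to a separate ParseOptions, one per chunk of work.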

pylibcudf.io.text.make_source(str data) → DataChunkSource

Creates a data source capable of producing device-buffered views of the given string.

Parameters:
data : str

The host data to be exposed as a data chunk source.

Returns:
DataChunkSource

The data chunk source for the provided host data.

pylibcudf.io.text.make_source_from_bgzip_file(str filename, int virtual_begin=-1, int virtual_end=-1) → DataChunkSource

Creates a data source capable of producing device-buffered views of a BGZIP compressed file with virtual record offsets.

Parameters:
filename : str

The filename of the BGZIP-compressed file to be exposed as a data chunk source.

virtual_begin : int, default -1

The virtual (Tabix) offset of the first byte to be read. Its upper 48 bits describe the offset into the compressed file; its lower 16 bits describe the block-local offset.

virtual_end : int, default -1

The virtual (Tabix) offset one past the last byte to be read.

Returns:
DataChunkSource

The data chunk source for the provided filename.
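The bit layout of a virtual offset described above (upper 48 bits for the offset into the compressed file, lower 16 bits for the block-local offset) can be illustrated with plain integer arithmetic. This is a minimal sketch; pack_virtual_offset and unpack_virtual_offset are hypothetical helper names, not pylibcudf functions.

```python
def pack_virtual_offset(compressed_offset, block_offset):
    """Combine a compressed-file offset and a block-local offset.

    Hypothetical helper: the compressed-file offset occupies the upper
    48 bits and the block-local offset the lower 16 bits.
    """
    if not 0 <= compressed_offset < (1 << 48):
        raise ValueError("compressed_offset must fit in 48 bits")
    if not 0 <= block_offset < (1 << 16):
        raise ValueError("block_offset must fit in 16 bits")
    return (compressed_offset << 16) | block_offset


def unpack_virtual_offset(virtual_offset):
    """Split a virtual offset into (compressed_offset, block_offset)."""
    if virtual_offset < 0:
        raise ValueError("virtual_offset must be non-negative")
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

A value built this way could be passed as virtual_begin or virtual_end, assuming the offsets come from an index that matches the file being read.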

pylibcudf.io.text.make_source_from_file(str filename) → DataChunkSource

Creates a data source capable of producing device-buffered views of the file.

Parameters:
filename : str

The filename of the file to be exposed as a data chunk source.

Returns:
DataChunkSource

The data chunk source for the provided filename.

pylibcudf.io.text.multibyte_split(DataChunkSource source, str delimiter, ParseOptions options=None, Stream stream=None) → Column

Splits the source text into a strings column using a multiple byte delimiter.

For details, see cudf::io::text::multibyte_split().

Parameters:
source : DataChunkSource

The data chunk source holding the text to split.

delimiter : str

UTF-8 encoded string for which to find offsets in the source.

options : ParseOptions, optional

The parsing options to use (including byte range).

stream : Stream, optional

CUDA stream for device memory operations and kernel launches.

Returns:
Column

The strings found by splitting the source by the delimiter within the relevant byte range.
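Putting the pieces on this page together, a call might look like the following minimal sketch. It uses only the signatures documented above and assumes that pylibcudf is installed and a CUDA-capable GPU is available; split_text is a hypothetical wrapper introduced for illustration.

```python
def split_text(text, delimiter, strip_delimiters=False):
    """Split host text into a strings Column with multibyte_split.

    Hypothetical convenience wrapper around the functions documented above.
    """
    # Imported lazily so this module can be loaded on machines without
    # pylibcudf; calling the function still requires pylibcudf and a GPU.
    import pylibcudf as plc

    # Expose the host string as a data chunk source.
    source = plc.io.text.make_source(text)

    # strip_delimiters is keyword-only in the documented signature.
    options = plc.io.text.ParseOptions(strip_delimiters=strip_delimiters)

    # stream is left at its default of None.
    return plc.io.text.multibyte_split(source, delimiter, options)
```

For file input, make_source_from_file(filename) could be substituted for make_source(text), since both return a DataChunkSource.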