text

class pylibcudf.io.text.DataChunkSource

Data source for multibyte_split.

Parameters:
data : str

The filename or the data itself.

class pylibcudf.io.text.ParseOptions(byte_range=None, *, strip_delimiters=False)

Parsing options for multibyte_split.

Parameters:
byte_range : list | tuple, default None

Only rows starting inside this byte range will be part of the output column.

strip_delimiters : bool, default False

Whether delimiters at the end of rows should be stripped from the output column.
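
The effect of strip_delimiters can be illustrated with a small pure-Python model. This is only a sketch of the behaviour described above, not a call into pylibcudf: the function name `model_split` is hypothetical, byte ranges are ignored, and keeping the final piece as its own row is an assumption of the model.

```python
def model_split(text: str, delimiter: str, strip_delimiters: bool = False) -> list[str]:
    """Pure-Python model of splitting text into rows on a delimiter.

    Hypothetical helper for illustration only; it does not use pylibcudf.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")

    # str.split drops the delimiter from every piece.
    pieces = text.split(delimiter)
    if strip_delimiters:
        return pieces

    # Otherwise each row that was terminated by a delimiter keeps it at its end.
    # Assumption of this model: the last piece is kept as a row without one.
    return [piece + delimiter for piece in pieces[:-1]] + [pieces[-1]]


rows_kept = model_split("a::b::c", "::")
rows_stripped = model_split("a::b::c", "::", strip_delimiters=True)
```

In this model, `rows_kept` retains the `::` at the end of the first two rows, while `rows_stripped` contains only the text between delimiters.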

pylibcudf.io.text.make_source(unicode data) → DataChunkSource

Creates a data source capable of producing device-buffered views of the given string.

Parameters:
data : str

The host data to be exposed as a data chunk source.

Returns:
DataChunkSource

The data chunk source for the provided host data.

pylibcudf.io.text.make_source_from_bgzip_file(unicode filename, int virtual_begin=-1, int virtual_end=-1) → DataChunkSource

Creates a data source capable of producing device-buffered views of a BGZIP-compressed file with virtual record offsets.

Parameters:
filename : str

The filename of the BGZIP-compressed file to be exposed as a data chunk source.

virtual_begin : int, default -1

The virtual (Tabix) offset of the first byte to be read. Its upper 48 bits give the offset into the compressed file, and its lower 16 bits give the block-local offset.

virtual_end : int, default -1

The virtual (Tabix) offset one past the last byte to be read.

Returns:
DataChunkSource

The data chunk source for the provided filename.
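
The 48-bit/16-bit layout described for virtual_begin can be sketched in plain Python. The helper names `make_virtual_offset` and `split_virtual_offset` are hypothetical and are not part of pylibcudf; they only show the bit arithmetic implied by the parameter description.

```python
def make_virtual_offset(compressed_offset: int, block_offset: int) -> int:
    """Pack a compressed-file offset and a block-local offset into one integer.

    Hypothetical helper: upper 48 bits hold the offset into the compressed
    file, lower 16 bits hold the offset inside the decompressed block.
    """
    if not 0 <= compressed_offset < (1 << 48):
        raise ValueError("compressed_offset must fit in 48 bits")
    if not 0 <= block_offset < (1 << 16):
        raise ValueError("block_offset must fit in 16 bits")
    return (compressed_offset << 16) | block_offset


def split_virtual_offset(virtual_offset: int) -> tuple[int, int]:
    """Unpack a virtual offset into (compressed_offset, block_offset)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A block that starts 1 byte into the compressed file, read from its 6th byte.
example = make_virtual_offset(1, 5)
```

An integer built this way is the kind of value the description above implies for virtual_begin and virtual_end; whether a given offset points at a valid block depends on the file itself.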

pylibcudf.io.text.make_source_from_file(unicode filename) → DataChunkSource

Creates a data source capable of producing device-buffered views of the file.

Parameters:
filename : str

The filename of the file to be exposed as a data chunk source.

Returns:
DataChunkSource

The data chunk source for the provided filename.

pylibcudf.io.text.multibyte_split(DataChunkSource source, unicode delimiter, ParseOptions options=None) → Column

Splits the source text into a strings column using a multi-byte delimiter.

For details, see cudf::io::text::multibyte_split().

Parameters:
source : DataChunkSource

The data chunk source containing the text to split.

delimiter : str

UTF-8 encoded string for which to find offsets in the source.

options : ParseOptions

The parsing options to use (including byte range).

Returns:
Column

The strings found by splitting the source by the delimiter within the relevant byte range.