text

class pylibcudf.io.text.DataChunkSource

Data source for multibyte_split

Parameters:

class pylibcudf.io.text.ParseOptions(byte_range=None, *, strip_delimiters=False)

Parsing options for multibyte_split

Parameters:

byte_range (list | tuple, default None): Only rows starting inside this byte range will be part of the output column.
strip_delimiters (bool, default False): Whether delimiters at the end of rows should be stripped from the output column.
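The effect of the two options can be modeled in plain Python. The sketch below is not the library's implementation: `split_rows` is a hypothetical helper, it runs on the host, and it assumes that `byte_range` is an `(offset, size)` pair measured in bytes. Edge cases such as input ending in a delimiter are not modeled and may differ from the library.

```python
def split_rows(text, delimiter, byte_range=None, strip_delimiters=False):
    """Host-side model of byte_range and strip_delimiters (illustrative only)."""
    data = text.encode("utf-8")
    delim = delimiter.encode("utf-8")
    if not delim:
        raise ValueError("delimiter must be non-empty")

    # Assumption: byte_range is (offset, size), both in bytes.
    if byte_range is None:
        lo, hi = 0, len(data)
    else:
        lo, hi = byte_range[0], byte_range[0] + byte_range[1]

    rows = []
    start = 0
    while start < len(data):
        hit = data.find(delim, start)
        # A row runs up to and including its trailing delimiter, if any.
        end = len(data) if hit == -1 else hit + len(delim)
        # Only rows that *start* inside the byte range are kept.
        if lo <= start < hi:
            row = data[start:end]
            if strip_delimiters and hit != -1:
                row = row[: -len(delim)]
            rows.append(row.decode("utf-8"))
        start = end
    return rows
```

Under this model, `split_rows("a::b::c", "::")` gives `["a::", "b::", "c"]`, passing `strip_delimiters=True` gives `["a", "b", "c"]`, and `byte_range=(3, 3)` keeps only the row that starts at byte 3, `["b::"]`.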

pylibcudf.io.text.make_source(unicode data) → DataChunkSource

Creates a data source capable of producing device-buffered views of the given string.

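Parameters:

data (str): The string to be exposed as a data chunk source.

Returns:

DataChunkSource: A data chunk source for the given string.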

pylibcudf.io.text.make_source_from_bgzip_file(unicode filename, int virtual_begin=-1, int virtual_end=-1) → DataChunkSource

Creates a data source capable of producing device-buffered views of a BGZIP compressed file with virtual record offsets.

Parameters:

filename (str): The filename of the BGZIP-compressed file to be exposed as a data chunk source.
virtual_begin (int, default -1): The virtual (Tabix) offset of the first byte to be read. Its upper 48 bits describe the offset into the compressed file; its lower 16 bits describe the block-local offset.
virtual_end (int, default -1): The virtual (Tabix) offset one past the last byte to be read.

Returns:

DataChunkSource: A data chunk source for the given BGZIP-compressed file.
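The 48/16-bit layout of a virtual offset described above can be packed and unpacked with plain integer arithmetic. This is a minimal sketch; `make_virtual_offset` and `split_virtual_offset` are hypothetical helper names and are not part of pylibcudf.

```python
BLOCK_OFFSET_BITS = 16   # lower bits: offset inside the decompressed block
FILE_OFFSET_BITS = 48    # upper bits: offset of the block in the compressed file


def make_virtual_offset(compressed_offset, block_local_offset):
    """Pack a compressed-file offset and a block-local offset into one integer."""
    if not 0 <= compressed_offset < (1 << FILE_OFFSET_BITS):
        raise ValueError("compressed_offset must fit in 48 bits")
    if not 0 <= block_local_offset < (1 << BLOCK_OFFSET_BITS):
        raise ValueError("block_local_offset must fit in 16 bits")
    return (compressed_offset << BLOCK_OFFSET_BITS) | block_local_offset


def split_virtual_offset(virtual_offset):
    """Return (compressed_offset, block_local_offset) for a virtual offset."""
    mask = (1 << BLOCK_OFFSET_BITS) - 1
    return virtual_offset >> BLOCK_OFFSET_BITS, virtual_offset & mask
```

For example, a block that starts at byte 1024 of the compressed file, read from byte 10 of its decompressed contents, has the virtual offset `make_virtual_offset(1024, 10)`, which is `67108874`.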

pylibcudf.io.text.make_source_from_file(unicode filename) → DataChunkSource

Creates a data source capable of producing device-buffered views of the file.

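Parameters:

filename (str): The filename of the file to be exposed as a data chunk source.

Returns:

DataChunkSource: A data chunk source for the given file.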

pylibcudf.io.text.multibyte_split(DataChunkSource source, unicode delimiter, ParseOptions options=None) → Column

Splits the source text into a strings column using a multiple byte delimiter.

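Parameters:

source (DataChunkSource): The source text to split.
delimiter (str): The delimiter, possibly spanning multiple bytes, at which the source is split.
options (ParseOptions, default None): The parsing options to use (see ParseOptions).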

Returns:

Column: The strings found by splitting the source by the delimiter within the relevant byte range.
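Putting the pieces on this page together, a call sequence might look like the sketch below. It uses only the names and signatures documented above; `split_on_device` is a hypothetical wrapper, and running it is assumed to require a working pylibcudf installation with a CUDA-capable GPU.

```python
def split_on_device(text, delimiter, strip_delimiters=False):
    """Split an in-memory string into a strings Column (illustrative wrapper)."""
    # Imported lazily so that defining this helper does not itself
    # require a GPU environment.
    from pylibcudf.io import text as plc_text

    # Wrap the host string in a DataChunkSource.
    source = plc_text.make_source(text)
    # strip_delimiters is keyword-only in the documented signature.
    options = plc_text.ParseOptions(strip_delimiters=strip_delimiters)
    # Returns a Column of strings, one row per delimited segment.
    return plc_text.multibyte_split(source, delimiter, options)
```

A file-backed source from `make_source_from_file` or `make_source_from_bgzip_file` could be passed to `multibyte_split` in place of the in-memory source in the same way.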