text
- class pylibcudf.io.text.DataChunkSource
Data source for multibyte_split.
- Parameters:
- data : str
A filename, or the text data itself.
- class pylibcudf.io.text.ParseOptions(byte_range=None, *, strip_delimiters=False)
Parsing options for multibyte_split.
- Parameters:
- byte_range : list | tuple, default None
Only rows starting inside this byte range will be part of the output column.
- strip_delimiters : bool, default False
Whether delimiters at the end of rows should be stripped from the output column.
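The effect of the two options can be illustrated on the host with plain Python. This is only a sketch of the documented behavior, not the library's GPU implementation: `split_rows` is a hypothetical helper, and it assumes `byte_range` is an `(offset, size)` pair measured in bytes, which the text above does not spell out. Edge cases such as input that ends with a delimiter are not modeled and may differ from pylibcudf.

```python
def split_rows(data, delimiter, byte_range=None, strip_delimiters=False):
    """Host-side illustration of byte_range and strip_delimiters (hypothetical helper)."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")

    # Cut the input into rows; each row keeps its trailing delimiter.
    rows = []
    start = 0
    while start < len(data):
        hit = data.find(delimiter, start)
        end = len(data) if hit == -1 else hit + len(delimiter)
        rows.append((start, data[start:end]))
        start = end

    # Assumption: byte_range is (offset, size). Keep only rows that START inside it.
    if byte_range is not None:
        offset, size = byte_range
        rows = [(pos, row) for pos, row in rows if offset <= pos < offset + size]

    # Optionally drop the delimiter from the end of each row.
    result = []
    for _, row in rows:
        if strip_delimiters and row.endswith(delimiter):
            row = row[: -len(delimiter)]
        result.append(row)
    return result


text = b"a::bb::ccc"
print(split_rows(text, b"::"))
print(split_rows(text, b"::", byte_range=(3, 4)))
print(split_rows(text, b"::", strip_delimiters=True))
```

In this sketch the row `bb::` starts at byte 3, so a range covering bytes 3 to 6 keeps it even though the next row begins right at the range's end.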
- pylibcudf.io.text.make_source(unicode data) -> DataChunkSource
Creates a data source capable of producing device-buffered views of the given string.
- Parameters:
- data : str
The host data to be exposed as a data chunk source.
- Returns:
- DataChunkSource
The data chunk source for the provided host data.
- pylibcudf.io.text.make_source_from_bgzip_file(unicode filename, int virtual_begin=-1, int virtual_end=-1) -> DataChunkSource
Creates a data source capable of producing device-buffered views of a BGZIP-compressed file with virtual record offsets.
- Parameters:
- filename : str
The filename of the BGZIP-compressed file to be exposed as a data chunk source.
- virtual_begin : int, default -1
The virtual (Tabix) offset of the first byte to be read. Its upper 48 bits describe the offset into the compressed file; its lower 16 bits describe the block-local offset.
- virtual_end : int, default -1
The virtual (Tabix) offset one past the last byte to be read.
- Returns:
- DataChunkSource
The data chunk source for the provided filename.
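The 48-bit/16-bit layout of a virtual offset described above can be sketched with plain integer arithmetic. The helper names `make_virtual_offset` and `split_virtual_offset` are hypothetical and are not part of pylibcudf; they only show how such a value could be packed and unpacked before passing it as `virtual_begin` or `virtual_end`.

```python
BLOCK_OFFSET_BITS = 16
BLOCK_OFFSET_MASK = (1 << BLOCK_OFFSET_BITS) - 1
MAX_COMPRESSED_OFFSET = (1 << 48) - 1


def make_virtual_offset(compressed_offset, block_offset):
    """Pack a compressed-file offset and a block-local offset into one integer."""
    if not 0 <= compressed_offset <= MAX_COMPRESSED_OFFSET:
        raise ValueError("compressed_offset must fit in 48 bits")
    if not 0 <= block_offset <= BLOCK_OFFSET_MASK:
        raise ValueError("block_offset must fit in 16 bits")
    # Upper 48 bits: offset into the compressed file. Lower 16 bits: offset in the block.
    return (compressed_offset << BLOCK_OFFSET_BITS) | block_offset


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (compressed_offset, block_offset)."""
    return virtual_offset >> BLOCK_OFFSET_BITS, virtual_offset & BLOCK_OFFSET_MASK


packed = make_virtual_offset(1, 2)
print(packed)                        # 65538
print(split_virtual_offset(packed))  # (1, 2)
```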
- pylibcudf.io.text.make_source_from_file(unicode filename) -> DataChunkSource
Creates a data source capable of producing device-buffered views of the file.
- Parameters:
- filename : str
The filename of the file to be exposed as a data chunk source.
- Returns:
- DataChunkSource
The data chunk source for the provided filename.
- pylibcudf.io.text.multibyte_split(DataChunkSource source, unicode delimiter, ParseOptions options=None) -> Column
Splits the source text into a strings column using a multiple byte delimiter.
For details, see cudf::io::text::multibyte_split().
- Parameters:
- source : DataChunkSource
The data chunk source to split.
- delimiter : str
UTF-8 encoded string for which to find offsets in the source.
- options : ParseOptions, default None
The parsing options to use (including byte range).
- Returns:
- Column
The strings found by splitting the source by the delimiter within the relevant byte range.
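Putting the pieces together, a call sequence might look like the sketch below. It uses only the functions documented on this page; the wrapper name `split_text` is hypothetical, and the import is deferred into the function because pylibcudf needs a working CUDA environment. The sketch has not been run here, so treat it as an outline rather than verified usage.

```python
def split_text(text, delimiter, strip_delimiters=False):
    """Split an in-memory string with pylibcudf (hypothetical wrapper).

    Requires pylibcudf and a CUDA-capable GPU at call time.
    """
    # Deferred import so that merely defining this function has no GPU dependency.
    from pylibcudf.io import text as plc_text

    # Expose the host string as a data chunk source.
    source = plc_text.make_source(text)

    # strip_delimiters is keyword-only in the documented signature.
    options = plc_text.ParseOptions(strip_delimiters=strip_delimiters)

    # Returns a pylibcudf Column of strings.
    return plc_text.multibyte_split(source, delimiter, options)


# Example call (only meaningful on a machine with pylibcudf installed):
# column = split_text("a::bb::ccc", "::", strip_delimiters=True)
```

For file input, `make_source_from_file` or `make_source_from_bgzip_file` would take the place of `make_source`; the rest of the sequence stays the same.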