cudf.read_orc#
- cudf.read_orc(filepath_or_buffer, engine='cudf', columns=None, filters=None, stripes=None, skiprows=None, num_rows=None, use_index=True, timestamp_type=None, use_python_file_object=None, storage_options=None, bytes_per_thread=None)[source]#
Load an ORC dataset into a DataFrame
- Parameters:
- filepath_or_buffer : str, path object, bytes, or file-like object
Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as a file object returned by the builtin open() function, or a BytesIO).
- engine : {‘cudf’, ‘pyarrow’}, default ‘cudf’
Parser engine to use.
- columns : list, default None
If not None, only these columns will be read from the file.
- filters : list of tuple or list of lists of tuples, default None
If not None, specifies a filter predicate used to filter out row groups using statistics stored for each row group in the ORC metadata. Row groups that do not match the given filter predicate are not read. The predicate is expressed in disjunctive normal form (DNF), like [[(‘x’, ‘=’, 0), …], …]. DNF allows arbitrary boolean logical combinations of single-column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multiple-column predicate. Finally, the outermost list combines these filters as a disjunction (OR). Predicates may also be passed as a single list of tuples, which is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) notation of list of lists of tuples.
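For illustration, the AND/OR semantics of the filters argument can be sketched in plain Python. Here may_match and keep_row_group are hypothetical helpers, and the dict of (min, max) stats is made up; this is a sketch of the pruning logic, not cudf's implementation:

```python
# A minimal sketch of how the DNF filter form is interpreted. The
# "stats" dict of (min, max) pairs stands in for real per-row-group
# ORC statistics.

def may_match(stats, predicate):
    """Could a row group with the given (min, max) stats satisfy one
    (column, op, value) predicate?"""
    col, op, val = predicate
    lo, hi = stats[col]
    if op in ("=", "=="):
        return lo <= val <= hi
    if op == "<":
        return lo < val
    if op == "<=":
        return lo <= val
    if op == ">":
        return hi > val
    if op == ">=":
        return hi >= val
    return True  # "!=" can rarely be pruned from min/max stats alone

def keep_row_group(stats, filters):
    """Apply a filters argument given in either accepted form."""
    # A bare list of tuples is a single conjunction (AND).
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list: OR across inner lists; inner list: AND of predicates.
    return any(all(may_match(stats, p) for p in conj) for conj in filters)
```

With stats = {"x": (0, 10)}, the single-conjunction form [("x", ">", 10)] prunes the row group, while the list-of-lists form [[("x", ">", 10)], [("x", "=", 3)]] keeps it, because the outer list is an OR.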
- stripes : list, default None
If not None, only these stripes will be read from the file. Stripes are concatenated with the index ignored.
- skiprows : int, default None
If not None, the number of rows to skip from the start of the file. This parameter is deprecated.
- num_rows : int, default None
If not None, the total number of rows to read. This parameter is deprecated.
- use_index : bool, default True
If True, use row index if available for faster seeking.
- use_python_file_object : bool, default True
If True, Arrow-backed PythonFile objects will be used in place of fsspec AbstractBufferedFile objects at IO time.
Deprecated since version 24.08: use_python_file_object is deprecated and will be removed in a future version of cudf, as PyArrow NativeFiles will no longer be accepted as input/output in cudf readers/writers in the future.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” or “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
- bytes_per_thread : int, default None
Determines the number of bytes to be allocated per thread to read the files in parallel. When there is a file of large size, we get slightly better throughput by decomposing it and transferring multiple “blocks” in parallel (using a Python thread pool). The default allocation is 268435456 bytes (256 MiB). This parameter is functional only when use_python_file_object=False.
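The block decomposition implied by bytes_per_thread can be sketched as a simple calculation; num_read_blocks is a hypothetical helper for illustration, not part of the cudf API:

```python
import math

def num_read_blocks(file_size_bytes, bytes_per_thread=268435456):
    """Number of parallel read "blocks" a file of the given size is
    split into, at one block per bytes_per_thread bytes."""
    return math.ceil(file_size_bytes / bytes_per_thread)

# e.g. a 1 GB file read with the default allocation is fetched as 4
# blocks by the thread pool, while smaller files read as one block.
```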
- Returns:
- DataFrame
Notes
cuDF supports local and remote data stores. See the cuDF documentation for configuration details and the list of available sources.
Examples
>>> import cudf
>>> df = cudf.read_orc(filename)
>>> df
   num1                 datetime  text
0   123  2018-11-13T12:00:00.000  5451
1   456  2018-11-14T12:35:01.000  5784
2   789  2018-11-15T18:02:59.000  6117