cudf.read_parquet(filepath_or_buffer, engine='cudf', columns=None, storage_options=None, filters=None, row_groups=None, strings_to_categorical=False, use_pandas_metadata=True, use_python_file_object=True, categorical_partitions=True, open_file_options=None, bytes_per_thread=None, dataset_kwargs=None, *args, **kwargs)

Load a Parquet dataset into a DataFrame

filepath_or_buffer : str, path object, bytes, file-like object, or a list of such objects

Contains one or more of the following: a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), Python bytes of raw binary data, or any object with a read() method (such as a file handle returned by the builtin open() function, or a BytesIO).

engine : {'cudf', 'pyarrow'}, default 'cudf'

Parser engine to use.

columns : list, default None

If not None, only these columns will be read.

storage_options : dict, optional, default None

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.

filters : list of tuple, list of lists of tuples, default None

If not None, specifies a filter predicate used to skip row groups based on the statistics stored for each row group in the Parquet metadata. Row groups that do not match the predicate are not read. The filters are also applied to the rows of the in-memory DataFrame after IO.

The predicate is expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean combinations of single-column predicates. Each innermost tuple describes a single-column predicate. Each inner list of tuples is interpreted as a conjunction (AND), forming a more selective, multi-column predicate. The outermost list combines these conjunctions as a disjunction (OR). Predicates may also be passed as a flat list of tuples, which is interpreted as a single conjunction. To express OR, use the (preferred) list-of-lists-of-tuples notation.
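The DNF semantics described above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not cuDF's implementation: the helper names normalize_filters and row_matches are hypothetical, and only the basic comparison operators are handled. It shows how a flat list of tuples is treated as a single conjunction and how a list of lists is treated as an OR of ANDs.

```python
import operator

# Comparison operators handled by this sketch (an assumption; the set
# accepted by cudf.read_parquet may be larger).
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def normalize_filters(filters):
    """Return filters in full DNF form: a list of lists of tuples."""
    if filters and isinstance(filters[0], tuple):
        # A flat list of tuples is a single conjunction (AND).
        return [list(filters)]
    return [list(group) for group in filters]


def row_matches(row, filters):
    """True if `row` (a dict) satisfies at least one AND-group in `filters`."""
    return any(
        all(_OPS[op](row[col], value) for col, op, value in group)
        for group in normalize_filters(filters)
    )


rows = [{"x": 0, "y": 5}, {"x": 1, "y": 20}, {"x": 2, "y": 1}]

# Flat list: x == 0 AND y > 1
and_filter = [("x", "=", 0), ("y", ">", 1)]
# List of lists: (x == 0 AND y > 1) OR (y >= 20)
or_filter = [[("x", "=", 0), ("y", ">", 1)], [("y", ">=", 20)]]

print([r for r in rows if row_matches(r, and_filter)])
print([r for r in rows if row_matches(r, or_filter)])
```

The same lists can be passed as the filters argument, for example cudf.read_parquet(path, filters=or_filter), where path is a placeholder for a Parquet file that contains columns x and y.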

row_groups : int, or list, or a list of lists, default None

If not None, specifies, for each input file, which row groups to read. If reading multiple inputs, a list of lists should be passed, one list for each input.
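When reading several files, the row-group selection has to line up with the list of inputs. The sketch below builds the two aligned arguments from a single mapping. The helper build_row_group_args, the read_selected wrapper, and the file names are hypothetical; calling read_selected assumes cuDF is installed and that the files exist.

```python
def build_row_group_args(selection):
    """Split {path: [row-group indices]} into aligned (paths, row_groups)."""
    paths = list(selection)
    row_groups = [list(selection[path]) for path in paths]
    return paths, row_groups


def read_selected(selection):
    """Read only the requested row groups from each file."""
    import cudf  # imported lazily so the helper above works without a GPU

    paths, row_groups = build_row_group_args(selection)
    return cudf.read_parquet(paths, row_groups=row_groups)


# Hypothetical files: row groups 0 and 1 of the first, row group 2 of the second.
selection = {"part-0.parquet": [0, 1], "part-1.parquet": [2]}
paths, row_groups = build_row_group_args(selection)
print(paths)
print(row_groups)
```

Keeping the selection in one mapping avoids the two lists drifting out of step when inputs are added or removed.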

strings_to_categorical : boolean, default False

If True, return string columns as GDF_CATEGORY dtype; if False, return them as GDF_STRING dtype.

Deprecated since version 23.08: This parameter is deprecated and will be removed in a future version of cudf.

categorical_partitions : boolean, default True

Whether directory-partitioned columns should be interpreted as categorical or raw dtypes.

use_pandas_metadata : boolean, default True

If True and dataset has custom PANDAS schema metadata, ensure that index columns are also loaded.

use_python_file_object : boolean, default True

If True, Arrow-backed PythonFile objects will be used in place of fsspec AbstractBufferedFile objects at IO time. Setting this argument to False will require the entire file to be copied to host memory, and is highly discouraged.

open_file_options : dict, optional

Dictionary of key-value pairs to pass to the function used to open remote files. By default, this will be fsspec.parquet.open_parquet_file. To deactivate optimized precaching, set the “method” to None under the “precache_options” key. Note that the open_file_func key can also be used to specify a custom file-open function.
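As a sketch of the options described above, the dictionary below deactivates optimized precaching for a remote read. Only the "precache_options" and "method" keys come from the description; the helper names no_precache_options and read_without_precache are hypothetical, and calling the wrapper assumes cuDF is installed and the URL is reachable.

```python
def no_precache_options():
    """Build open_file_options that turn off optimized precaching."""
    return {"precache_options": {"method": None}}


def read_without_precache(url, storage_options=None):
    """Read a remote Parquet file with precaching deactivated."""
    import cudf  # requires cuDF and a GPU at call time

    return cudf.read_parquet(
        url,
        storage_options=storage_options,
        open_file_options=no_precache_options(),
    )


print(no_precache_options())
```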

bytes_per_thread : int, default None

Determines the number of bytes allocated per thread when reading files in parallel. For large files, throughput is slightly better when the file is split into multiple “blocks” that are transferred in parallel (using a Python thread pool). The default allocation is 268435456 bytes (256 MiB). This parameter has an effect only when use_python_file_object=False.
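To make the block arithmetic concrete, here is an illustrative sketch that splits a file into byte ranges of at most bytes_per_thread bytes each. It is not cuDF's internal code; the function name block_ranges is hypothetical and only the default value comes from the description above.

```python
DEFAULT_BYTES_PER_THREAD = 268_435_456  # 256 MiB, the documented default


def block_ranges(file_size, bytes_per_thread=DEFAULT_BYTES_PER_THREAD):
    """Return (start, stop) byte ranges covering a file of `file_size` bytes."""
    return [
        (start, min(start + bytes_per_thread, file_size))
        for start in range(0, file_size, bytes_per_thread)
    ]


one_gib = 1024**3
ranges = block_ranges(one_gib)
print(len(ranges))  # 4 blocks of 256 MiB each
print(ranges[0])    # (0, 268435456)
```

With the default setting, a 1 GiB file would be covered by four such blocks; a smaller bytes_per_thread yields more, smaller blocks.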



  • cuDF supports local and remote data stores. See configuration details for available sources here.


>>> import cudf
>>> df = cudf.read_parquet(filename)
>>> df
  num1                datetime text
0  123 2018-11-13T12:00:00.000 5451
1  456 2018-11-14T12:35:01.000 5784
2  789 2018-11-15T18:02:59.000 6117