cudf.read_csv
- cudf.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, prefix=None, mangle_dupe_cols=True, dtype=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=0, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, skip_blank_lines=True, parse_dates=None, dayfirst=False, compression='infer', thousands=None, decimal='.', lineterminator='\n', quotechar='"', quoting=0, doublequote=True, comment=None, delim_whitespace=False, byte_range=None, use_python_file_object=True, storage_options=None, bytes_per_thread=None)
Load a comma-separated-values (CSV) dataset into a DataFrame.
- Parameters:
- filepath_or_buffer : str, path object, or file-like object
Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as a file handle returned by the builtin open() function, or a StringIO).
- sep : char, default ','
Delimiter to be used.
- delimiter : char, default None
Alternative argument name for sep.
- header : int, default 'infer'
Row number to use as the column names. Default behavior is to infer the column names: if no names are passed, header=0; if column names are passed explicitly, header=None.
- names : list of str, default None
List of column names to be used. Needs to include names of all columns in the file, or names of all columns selected using usecols (only when usecols holds integer indices). When usecols is not used to select column indices, names can contain more names than there are columns in the file. In this case the extra columns will contain only null rows.
- index_col : int, string or False, default None
Column to use as the row labels of the DataFrame. Passing index_col=False explicitly disables index column inference and discards the last column.
- usecols : list of int or str, default None
Returns the subset of the columns given in the list. All elements must be either integer indices (column numbers) or strings that correspond to column names. When an integer index is passed for each name in the names parameter, the names are interpreted as names in the output table, not as names in the input file.
- prefix : str, default None
Prefix to add to column numbers when parsing without a header row.
- mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X', 'X.1', ... 'X.N'.
- dtype : type, str, list of types, or dict of column -> type, default None
Data type(s) for data or columns. If dtype is a type/str, all columns are mapped to the particular type passed. If list, types are applied in the same order as the column names. If dict, types are mapped to the column names, e.g. {'a': np.float64, 'b': np.int32, 'c': 'float'}. If None, dtypes are inferred from the dataset. Use str to preserve the data as strings rather than inferring or interpreting a dtype.
- true_values : list, default None
Values to consider as boolean True.
- false_values : list, default None
Values to consider as boolean False.
- skipinitialspace : bool, default False
Skip spaces after delimiter.
- skiprows : int, default 0
Number of rows to be skipped from the start of file.
- skipfooter : int, default 0
Number of rows to be skipped at the bottom of file.
- nrows : int, default None
If specified, maximum number of rows to read.
- na_values : scalar, str, or list-like, optional
Additional strings to recognize as nulls. By default the following values are interpreted as nulls: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
- keep_default_na : bool, default True
Whether or not to include the default NA values when parsing the data.
- na_filter : bool, default True
Detect missing values (empty strings and the values in na_values). Passing False can improve performance.
- skip_blank_lines : bool, default True
If True, discard and do not parse empty lines. If False, interpret empty lines as NaN values.
- parse_dates : list of int or names, default None
If list of columns, then attempt to parse each entry as a date. Columns may not always be recognized as dates, for instance due to unusual or non-standard formats. To guarantee a date and increase parsing speed, explicitly specify dtype='date' for the desired columns.
- dayfirst : bool, default False
Interpret dates as DD/MM format (international and European format).
- compression : {'infer', 'gzip', 'zip', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then detect compression from the following extensions: '.gz', '.zip' (otherwise no decompression). If using 'zip', the ZIP file must contain only one data file to be read in, otherwise the first non-zero-sized file will be used. Set to None for no decompression.
- thousands : char, default None
Character used as a thousands delimiter.
- decimal : char, default '.'
Character used as a decimal point.
- lineterminator : char, default '\n'
Character to indicate end of line.
- quotechar : char, default '"'
Character to indicate start and end of quote item.
- quoting : str or int, default 0
Controls quoting behavior. Set to one of 0 (csv.QUOTE_MINIMAL), 1 (csv.QUOTE_ALL), 2 (csv.QUOTE_NONNUMERIC) or 3 (csv.QUOTE_NONE). Quoting is enabled with all values except 3.
- doublequote : bool, default True
When quoting is enabled, indicates whether to interpret two consecutive quotechar inside fields as a single quotechar.
- comment : char, default None
Character used as a comments indicator. If found at the beginning of a line, the line will be ignored altogether.
- delim_whitespace : bool, default False
Determines whether to use whitespace as delimiter.
- byte_range : list or tuple, default None
Byte range within the input file to be read. The first number is the offset in bytes, the second number is the range size in bytes. Set the size to zero to read all data after the offset location. Reads the row that starts before or at the end of the range, even if it ends after the end of the range.
- use_python_file_object : boolean, default True
If True, Arrow-backed PythonFile objects will be used in place of fsspec AbstractBufferedFile objects at IO time. This option is likely to improve performance when making small reads from larger CSV files.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
- bytes_per_thread : int, default None
Determines the number of bytes to be allocated per thread to read the files in parallel. When there is a file of large size, we get slightly better throughput by decomposing it and transferring multiple "blocks" in parallel (using a Python thread pool). Default allocation is 268435456 bytes. This parameter is functional only when use_python_file_object=False.
- Returns:
- GPU DataFrame object.
Notes
cuDF supports local and remote data stores. See configuration details for available sources here.
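The byte_range parameter takes an (offset, size) pair, and a size of zero means "read everything after the offset". As a minimal sketch of how such pairs could be computed for reading a file in pieces, the helper below splits a known file size into ranges with strictly positive sizes. The name split_byte_ranges and the chunk size are assumptions made for illustration; they are not part of cuDF, and the block uses plain Python only.

```python
def split_byte_ranges(file_size, chunk_size):
    """Return (offset, size) pairs that together cover file_size bytes.

    Hypothetical helper, not part of cuDF. Every size is positive,
    because a size of zero is documented to mean "read all data
    after the offset".
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ranges = []
    offset = 0
    while offset < file_size:
        # The last range is shortened so it stops at the end of the file.
        size = min(chunk_size, file_size - offset)
        ranges.append((offset, size))
        offset += size
    return ranges


# A 16-byte CSV payload, split into 8-byte ranges.
csv_bytes = b"a,b\n1,2\n3,4\n5,6\n"
ranges = split_byte_ranges(len(csv_bytes), chunk_size=8)
print(ranges)
```

Each pair could then be passed as byte_range to cudf.read_csv. Ranges after the first do not begin with the header row, so column names would presumably need to be supplied through names; that, and the handling of rows that straddle a boundary, should be checked against the cuDF version in use.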
Examples
Create a test csv file
>>> import cudf
>>> filename = 'foo.csv'
>>> lines = [
...   "num1,datetime,text",
...   "123,2018-11-13T12:00:00,abc",
...   "456,2018-11-14T12:35:01,def",
...   "789,2018-11-15T18:02:59,ghi"
... ]
>>> with open(filename, 'w') as fp:
...     fp.write('\n'.join(lines)+'\n')
Read the file with cudf.read_csv
>>> cudf.read_csv(filename)
   num1                 datetime  text
0   123  2018-11-13T12:00:00.000  5451
1   456  2018-11-14T12:35:01.000  5784
2   789  2018-11-15T18:02:59.000  6117
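The integer codes listed for the quoting parameter are the constants of Python's standard csv module, and doublequote follows the usual CSV convention for escaped quotes. The sketch below illustrates both with the standard library only. It does not call cuDF, so it shows the convention that the parameter descriptions refer to, under the assumption that cuDF's parser behaves the same way.

```python
import csv
import io

# Integer codes accepted by the quoting parameter, by their csv names.
quoting_codes = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,
    "QUOTE_ALL": csv.QUOTE_ALL,
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,
    "QUOTE_NONE": csv.QUOTE_NONE,
}
print(quoting_codes)

# With quoting enabled and doublequote=True, two consecutive quote
# characters inside a quoted field are read as a single quote character.
line = '"say ""hi""",2\n'
reader = csv.reader(
    io.StringIO(line),
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,
    doublequote=True,
)
parsed = next(reader)
print(parsed)
```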