train_test_split

cuml.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Split arrays or matrices into random train and test subsets.

Parameters:
*arrays : sequence of indexables with same length / shape[0]

Allowed inputs are cudf DataFrames/Series, cupy arrays, numba device arrays, numpy arrays, pandas DataFrames/Series, or any array-like objects with a shape attribute.

test_size : float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, default=None

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=True

Whether to shuffle the data before splitting.

stratify : array-like, default=None

If not None, data is split in a stratified fashion, using this as the class labels.
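
The interaction between test_size and train_size described above can be sketched in plain Python. The helper below is hypothetical and not part of cuML. It follows the documented rules (floats are proportions, ints are absolute counts, a missing value is the complement of the other, and 0.25 is the test default when both are None). The rounding direction for fractional counts (ceil for test, floor for train) is an assumption and may differ from what cuML does internally.

```python
import math


def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for a dataset with n_samples rows.

    Hypothetical helper for illustration only; not a cuML function.
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both are None

    def to_count(size, rounder):
        # None stays None so it can be filled in as the complement later.
        if size is None:
            return None
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("float sizes must lie between 0.0 and 1.0")
            # ASSUMPTION: rounding direction is chosen here for illustration.
            return int(rounder(size * n_samples))
        return int(size)  # ints are absolute sample counts

    n_test = to_count(test_size, math.ceil)
    n_train = to_count(train_size, math.floor)

    # Whichever size was not given becomes the complement of the other.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test

    if n_train + n_test > n_samples:
        raise ValueError("train and test sizes exceed the number of samples")
    return n_train, n_test


default_sizes = resolve_split_sizes(100)                  # both None
int_sizes = resolve_split_sizes(10, test_size=3)          # absolute count
train_only = resolve_split_sizes(100, train_size=0.5)     # test is the complement
```

Under these assumptions, a call with neither size given on 100 rows yields 75 train and 25 test samples.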

Returns:
splitting : list, length=2 * len(arrays)

List containing the train-test split of the inputs. Output types match input types (cudf inputs return cudf outputs, cupy inputs return cupy outputs, etc.).

Examples

>>> import cupy as cp
>>> from cuml.model_selection import train_test_split
>>> X = cp.arange(10).reshape((5, 2))
>>> y = cp.array([0, 0, 1, 1, 1])
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.2, random_state=42
... )
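
To illustrate what the shuffle, random_state, and stratify parameters mean conceptually, here is a minimal CPU-only sketch in NumPy. It is not cuML's implementation. The function name stratified_split_indices and its arguments are hypothetical, and allocating test samples per class by rounding is an assumption made for this illustration.

```python
import numpy as np


def stratified_split_indices(labels, test_fraction, seed=None):
    """Return (train_idx, test_idx) that keep class proportions similar.

    Hypothetical helper for illustration only; not a cuML function.
    """
    rng = np.random.default_rng(seed)  # an int seed gives reproducible shuffles
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # Row positions belonging to this class, in shuffled order.
        members = rng.permutation(np.flatnonzero(labels == cls))
        # ASSUMPTION: per-class test count is obtained by simple rounding.
        n_test = int(round(test_fraction * members.size))
        test_parts.append(members[:n_test])
        train_parts.append(members[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


y = np.array([0] * 6 + [1] * 4)  # 60% class 0, 40% class 1
train_idx, test_idx = stratified_split_indices(y, test_fraction=0.5, seed=42)

test_counts = np.bincount(y[test_idx])    # class counts in the test split
train_counts = np.bincount(y[train_idx])  # class counts in the train split
```

Both splits keep the 60/40 class ratio of y, and repeating the call with the same seed returns the same indices, which is the behaviour an int random_state is documented to provide.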