MBSGDClassifier

class cuml.linear_model.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, verbose=False, output_type=None)

Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGDClassifier implementation is experimental and uses a different algorithm than sklearn’s SGDClassifier. To improve the results obtained from cuML’s MBSGDClassifier:

Reduce the batch size
Increase the eta0
Increase the number of iterations

Because cuML processes the data in batches, a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
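To make the cost of a smaller batch size concrete, the sketch below counts mini-batch updates for a full run. It is a back-of-the-envelope illustration in plain Python, not cuML code: the helper name `gradient_updates` and the sample count are hypothetical, and it assumes every epoch visits each sample once and that early stopping does not trigger.

```python
import math

def gradient_updates(n_samples, batch_size, epochs):
    """Count mini-batch updates for a full run (assumes no early stopping)."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

n_samples = 10_000  # hypothetical dataset size
default_run = gradient_updates(n_samples, batch_size=32, epochs=1000)
smaller_batches = gradient_updates(n_samples, batch_size=8, epochs=1000)

print(default_run, smaller_batches)
```

Cutting the batch size from 32 to 8 roughly quadruples the number of updates per epoch, which is consistent with the longer fit times mentioned above.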

Parameters:

loss: {‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)

‘hinge’ uses linear SVM

‘log’ uses logistic regression

‘squared_loss’ uses linear regression
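For orientation, the three options correspond to the textbook per-sample losses sketched below. This is a plain-Python illustration, not cuML's implementation: the function names are hypothetical, labels are assumed to be encoded as -1/+1 for the hinge and logistic losses, and the 0.5 factor in the squared loss is a common convention that cuML may or may not follow.

```python
import math

def hinge_loss(y, score):
    """Linear SVM loss; y is assumed to be -1 or +1."""
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    """Logistic regression loss; y is assumed to be -1 or +1."""
    return math.log1p(math.exp(-y * score))

def squared_loss(y, score):
    """Linear regression loss (0.5 scaling is an assumed convention)."""
    return 0.5 * (y - score) ** 2

# A sample on the correct side of the margin incurs no hinge loss.
print(hinge_loss(1, 2.0), hinge_loss(1, 0.5))
```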

penalty: {‘l1’, ‘l2’, ‘elasticnet’, None} (default = ‘l2’)

The penalty (aka regularization term) to apply.

‘l1’: L1 norm (Lasso) regularization
‘l2’: L2 norm (Ridge) regularization (the default)
‘elasticnet’: Elastic Net regularization, a weighted average of L1 and L2
None: no penalty is added

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization

l1_ratio: float (default = 0.15)

Used only when penalty = 'elasticnet'. The value must satisfy 0 <= l1_ratio <= 1. With l1_ratio = 0 the penalty is equivalent to 'l2'; with l1_ratio = 1 it is equivalent to 'l1'.
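The interaction between alpha and l1_ratio can be sketched as a weighted mix of the L1 and L2 terms. The helper below is a hypothetical plain-Python illustration; the exact scaling cuML applies to each term (for example a 0.5 factor on the L2 term) is not stated here and is an assumption.

```python
def elastic_net_penalty(weights, alpha=0.0001, l1_ratio=0.15):
    """Illustrative Elastic Net penalty; term scaling is an assumption."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be between 0 and 1")
    l1 = sum(abs(w) for w in weights)   # Lasso term
    l2 = sum(w * w for w in weights)    # Ridge term
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)

w = [1.0, -2.0]
pure_l2 = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # only the L2 term
pure_l1 = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # only the L1 term
print(pure_l2, pure_l1)
```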

batch_size: int (default = 32)

It sets the number of samples that will be included in each batch.

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochs: int (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tol: float (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol
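The stopping rule can be read as "stop once an epoch fails to lower the loss by at least tol". The sketch below applies that inequality to a list of per-epoch losses. It is a hypothetical plain-Python helper, and it assumes the check is applied once per epoch on its own; whether cuML also combines it with n_iter_no_change is not stated here.

```python
def first_stopping_epoch(losses, tol=1e-3):
    """Return the first epoch index where the stopping inequality holds."""
    for epoch in range(1, len(losses)):
        # Stop when the loss did not drop by at least `tol`.
        if losses[epoch] > losses[epoch - 1] - tol:
            return epoch
    return None  # the criterion never triggered

stop_at = first_stopping_epoch([1.0, 0.5, 0.4995, 0.3], tol=1e-3)
print(stop_at)
```

In this made-up sequence the improvement from 0.5 to 0.4995 is smaller than tol, so the rule triggers at epoch index 2 even though a later epoch would have improved further.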

shuffle: boolean (default = True)

If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.
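How batch_size and shuffle interact can be sketched with a small index generator. This is a plain-Python illustration under stated assumptions, not cuML's data pipeline: the function name and the `seed` argument are hypothetical, and the reshuffle is placed after each epoch to match the description above.

```python
import random

def iterate_minibatches(n_samples, batch_size=32, epochs=1, shuffle=True, seed=0):
    """Yield (epoch, batch_indices) pairs, reshuffling after each epoch."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for epoch in range(epochs):
        for start in range(0, n_samples, batch_size):
            # The final batch of an epoch may hold fewer samples.
            yield epoch, indices[start:start + batch_size]
        if shuffle:
            rng.shuffle(indices)

batches = list(iterate_minibatches(10, batch_size=4, epochs=2))
sizes_epoch0 = [len(b) for e, b in batches if e == 0]
print(sizes_epoch0)
```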

eta0: float (default = 0.001)

Initial learning rate

power_t: float (default = 0.5)

The exponent used for calculating the invscaling learning rate
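As a rough guide to what power_t controls, the sketch below uses the inverse-scaling schedule that scikit-learn documents for its SGD estimators, eta = eta0 / t^power_t. That cuML applies exactly this formula, and what it counts as the step `t`, are assumptions; the function name is hypothetical.

```python
def invscaling_rate(eta0, power_t, t):
    """Inverse-scaling learning rate at step t >= 1 (assumed formula)."""
    return eta0 / (t ** power_t)

# With the defaults eta0=0.001 and power_t=0.5 the rate halves by step 4.
rates = [invscaling_rate(0.001, 0.5, t) for t in (1, 4, 16)]
print(rates)
```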

learning_rate: {‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5
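The adaptive rule can be sketched as a counter over epochs without improvement. The helper below is a hypothetical plain-Python illustration: it tracks only the training loss, treats any decrease as an improvement, and always divides by 5, whereas the description above says the rate is "generally" divided by 5 and also mentions validation accuracy.

```python
def adaptive_rates(losses, eta0=0.001, n_iter_no_change=5):
    """Return the learning rate in effect after each epoch (illustrative)."""
    eta = eta0
    best = float("inf")
    stalled = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            stalled = 0
        else:
            stalled += 1
        if stalled >= n_iter_no_change:
            eta /= 5.0   # shrink the rate after a stalled stretch
            stalled = 0
        history.append(eta)
    return history

history = adaptive_rates([1.0, 0.9, 0.9, 0.9, 0.9], eta0=0.1, n_iter_no_change=2)
print(history)
```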

n_iter_no_change: int (default = 5)

The number of epochs to train without any improvement in the model.

verbose: int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

coef_: array, shape=(n_features,): The model coefficients.
intercept_: float: The independent term. If fit_intercept is False, will be 0.
classes_: np.ndarray, shape=(n_classes,): Array of the class labels.

Methods

`fit`(X, y, *[, convert_dtype])	Fit the model with X and y.
`predict`(X, *[, convert_dtype])	Predicts the y for X.

Notes

For additional docs, see scikit-learn’s SGDClassifier.

Examples

>>> import cupy as cp
>>> import cuml
>>> X = cp.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=cp.float32)
>>> y = cp.array([1, 1, 2, 2])
>>> X_test = cp.asarray([[3, 5], [2, 5]], dtype=cp.float32)
>>> model = cuml.MBSGDClassifier().fit(X, y)
>>> model.predict(X_test)
array([2, 2])

fit(X, y, *, convert_dtype=True) → MBSGDClassifier

Fit the model with X and y.

Parameters:

X: array-like (device or host) shape = (n_samples, n_features): Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
y: array-like (device or host) shape = (n_samples, 1): Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtype: bool, optional (default = True): When set to True, the fit method will, when necessary, convert y to the same data type as X if they differ. This increases the memory used by the method.

predict(X, *, convert_dtype=True)

Predicts the y for X.

Parameters:

X: array-like (device or host) shape = (n_samples, n_features): Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtype: bool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type that was used to train the model. This increases the memory used by the method.

Returns:

preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.