MBSGDClassifier
- class cuml.linear_model.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, verbose=False, output_type=None)
Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGDClassifier implementation is experimental, and it uses a different algorithm than scikit-learn’s SGDClassifier. To improve the results obtained from cuML’s MBSGDClassifier:
Reduce the batch size
Increase the eta0
Increase the number of iterations
Because cuML processes the data in batches, a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
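As a rough illustration of the trade-off described above, the sketch below counts weight updates under the assumption that mini-batch SGD performs one update per batch. This is a general property of mini-batch SGD, not a statement about cuML's internals, and the helper name `updates_per_fit` is hypothetical.

```python
import math


def updates_per_fit(n_samples: int, batch_size: int, epochs: int) -> int:
    """Number of weight updates, assuming one update per mini-batch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs


# Halving the batch size roughly doubles the number of updates
# performed over the same data and the same number of epochs.
print(updates_per_fit(10_000, 32, 1000))  # 313000
print(updates_per_fit(10_000, 16, 1000))  # 625000
```

More updates give the model more chances to move the weights, which is one reason a smaller batch size can help accuracy while lengthening the fit.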
- Parameters:
- loss: {‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)
‘hinge’ uses linear SVM
‘log’ uses logistic regression
‘squared_loss’ uses linear regression
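For reference, the three loss options correspond to the following textbook per-sample losses. This is a minimal sketch, assuming labels encoded as -1 or +1 for the two classification losses and a 1/2 scaling on the squared loss. Both are common conventions and are not confirmed to match cuML's internal encoding or scaling.

```python
import math


def hinge_loss(y: int, score: float) -> float:
    """Linear SVM loss; y is assumed to be -1 or +1."""
    return max(0.0, 1.0 - y * score)


def log_loss(y: int, score: float) -> float:
    """Logistic regression loss; y is assumed to be -1 or +1."""
    return math.log1p(math.exp(-y * score))


def squared_loss(y: float, score: float) -> float:
    """Linear regression loss, scaled by 1/2 (an assumed convention)."""
    return 0.5 * (score - y) ** 2


print(hinge_loss(1, 0.25))      # 0.75: correct side, but inside the margin
print(hinge_loss(-1, -2.0))     # 0.0: correct side, outside the margin
print(squared_loss(1.0, 3.0))   # 2.0
```

Here `score` stands for the linear model output, the dot product of the weights with a sample plus the intercept.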
- penalty: {‘l1’, ‘l2’, ‘elasticnet’, None} (default = ‘l2’)
The penalty (aka regularization term) to apply.
‘l1’: L1 norm (Lasso) regularization
‘l2’: L2 norm (Ridge) regularization (the default)
‘elasticnet’: Elastic Net regularization, a weighted average of L1 and L2
None: no penalty is added
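The penalty options above can be sketched as a single function of the weight vector. This is an illustrative sketch that uses the `alpha` and `l1_ratio` names from the class signature. The 1/2 factor on the L2 term is an assumed convention, and the function name `penalty_value` is hypothetical.

```python
def penalty_value(weights, penalty="l2", alpha=0.0001, l1_ratio=0.15):
    """Regularization term added to the loss for a given weight vector."""
    if penalty is None:
        return 0.0
    l1 = sum(abs(w) for w in weights)
    l2 = 0.5 * sum(w * w for w in weights)  # 1/2 scaling is an assumption
    if penalty == "l1":
        return alpha * l1
    if penalty == "l2":
        return alpha * l2
    if penalty == "elasticnet":
        # Weighted average of the L1 and L2 terms.
        return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    raise ValueError(f"unknown penalty: {penalty!r}")


w = [1.0, -2.0]
print(penalty_value(w, "l1", alpha=1.0))                        # 3.0
print(penalty_value(w, "l2", alpha=1.0))                        # 2.5
print(penalty_value(w, "elasticnet", alpha=1.0, l1_ratio=0.5))  # 2.75
```

Setting `l1_ratio` to 0 or 1 in the elastic net branch reproduces the pure L2 or pure L1 value, respectively.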
- alpha: float (default = 0.0001)
The constant value that determines the degree of regularization.
- l1_ratio: float (default = 0.15)
The l1_ratio is used only when penalty = 'elasticnet'. The value for l1_ratio should be 0 <= l1_ratio <= 1. When l1_ratio = 0, the penalty is equivalent to penalty = 'l2'; when l1_ratio = 1, it is equivalent to penalty = 'l1'.
- batch_size: int (default = 32)
It sets the number of samples that will be included in each batch.
- fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
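If `fit_intercept=False` is used, the data is expected to be centered beforehand. The sketch below shows one way to center feature columns in plain Python. The helper `center_columns` is hypothetical, and in practice the same operation would usually be done on the array library that holds the data.

```python
def center_columns(rows):
    """Subtract each column's mean; returns (centered_rows, column_means)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return centered, means


X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
X_centered, means = center_columns(X)
print(means)          # [1.5, 2.0]
print(X_centered[0])  # [-0.5, -1.0]
```

The same column means should be reused to center any data passed to predict later.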
- epochs: int (default = 1000)
The number of times the model should iterate through the entire dataset during training.
- tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol.
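The stopping rule above can be read as "stop once the loss fails to drop by at least tol". The sketch below applies that inequality to a made-up sequence of per-epoch losses. The helper name and the loss values are hypothetical, and how this check interacts with n_iter_no_change inside cuML is not stated here.

```python
def first_stopping_epoch(losses, tol=1e-3):
    """Return the first epoch index where current_loss > previous_loss - tol."""
    for epoch in range(1, len(losses)):
        if losses[epoch] > losses[epoch - 1] - tol:
            return epoch
    return None  # the rule never triggered


# Illustrative per-epoch losses (made up for this example).
losses = [1.0, 0.6, 0.4, 0.3995, 0.39]
print(first_stopping_epoch(losses))  # 3: the loss improved by only 0.0005
```

A larger tol stops training earlier; a smaller tol lets training continue through slow improvements.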
- shuffle: boolean (default = True)
If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.
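To make the interaction between batch_size, epochs, and shuffle concrete, here is a minimal sketch of a mini-batch index generator. It is an illustration only: the function name `iter_minibatches` and the `seed` argument are hypothetical, and cuML performs its batching on the GPU rather than this way.

```python
import random


def iter_minibatches(n_samples, batch_size=32, epochs=1, shuffle=True, seed=0):
    """Yield (epoch, batch_indices) pairs covering every sample once per epoch."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for epoch in range(epochs):
        for start in range(0, n_samples, batch_size):
            yield epoch, indices[start:start + batch_size]
        if shuffle:
            # Reorder the samples after each epoch, as described above.
            rng.shuffle(indices)


for epoch, batch in iter_minibatches(10, batch_size=4, epochs=2):
    print(epoch, batch)
```

With 10 samples and a batch size of 4, each epoch yields batches of 4, 4, and 2 samples; the last batch is smaller when batch_size does not divide the sample count.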
- eta0: float (default = 0.001)
Initial learning rate.
- power_t: float (default = 0.5)
The exponent used for calculating the invscaling learning rate.
- learning_rate: {‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)
‘constant’ keeps the learning rate constant.
‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5.
- n_iter_no_change: int (default = 5)
The number of epochs to train without any improvement in the model.
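The learning rate options can be sketched as follows. The invscaling formula, eta0 / t ** power_t, is the definition used by scikit-learn and is assumed to apply here as well. The adaptive rule is a simplified reading of the description above (divide by 5 after n_iter_no_change epochs without improvement). Both function names are hypothetical.

```python
def invscaling_rate(eta0, t, power_t=0.5):
    """Learning rate at update step t >= 1 (assumed scikit-learn formula)."""
    return eta0 / (t ** power_t)


def adaptive_rates(epoch_losses, eta0=0.001, n_iter_no_change=5, factor=5.0):
    """Learning rate used in each epoch under a simplified adaptive rule."""
    eta = eta0
    best = float("inf")
    stale = 0  # consecutive epochs without a new best loss
    rates = []
    for loss in epoch_losses:
        rates.append(eta)
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                eta /= factor
                stale = 0
    return rates


print(invscaling_rate(0.001, t=4))  # 0.0005
print(adaptive_rates([1.0, 1.0, 1.0, 1.0], eta0=0.5, n_iter_no_change=2))
# [0.5, 0.5, 0.5, 0.1]
```

With ‘constant’, the rate simply stays at eta0 for every update.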
- verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- Attributes:
- coef_: array, shape=(n_features,)
The model coefficients.
- intercept_: float
The independent term. If fit_intercept is False, will be 0.
- classes_: np.ndarray, shape=(n_classes,)
Array of the class labels.
Methods
fit(X, y, *[, convert_dtype]): Fit the model with X and y.
predict(X, *[, convert_dtype]): Predicts the y for X.
Notes
For additional docs, see scikit-learn’s SGDClassifier.
Examples
>>> import cupy as cp
>>> import cuml
>>> X = cp.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=cp.float32)
>>> y = cp.array([1, 1, 2, 2])
>>> X_test = cp.asarray([[3, 5], [2, 5]], dtype=cp.float32)
>>> model = cuml.MBSGDClassifier().fit(X, y)
>>> model.predict(X_test)
array([2, 2])
- fit(X, y, *, convert_dtype=True) -> MBSGDClassifier
Fit the model with X and y.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- predict(X, *, convert_dtype=True)
Predicts the y for X.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns:
- preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.