CLX Predictive Maintenance
This is an introduction to CLX Predictive Maintenance.
Like any other Linux-based machine, DGX systems generate a vast amount of logs. Analysts spend hours trying to identify the root cause of each failure, and the number of possible root causes is effectively unbounded. Some patterns can help narrow the search; however, regular expressions can only identify previously known patterns. Moreover, they create another manual task: maintaining the search script.
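To illustrate that limitation, here is a minimal sketch of a regex-based search. The pattern list and both log lines are hypothetical and not taken from a real DGX; the point is only that a line is flagged when it matches a pattern someone has already written down.

```python
import re

# Hypothetical patterns an analyst has already catalogued.
KNOWN_PATTERNS = [
    re.compile(r"Out of memory: Kill(ed)? process \d+"),
    re.compile(r"I/O error, dev \w+, sector \d+"),
]


def matches_known_pattern(line):
    """Return True if the line matches any catalogued pattern."""
    return any(p.search(line) for p in KNOWN_PATTERNS)


# Hypothetical log lines, for illustration only.
known_failure = "kernel: Out of memory: Killed process 4242 (python)"
unseen_failure = "kernel: fan controller reported unexpected state 0x7f"

print(matches_known_pattern(known_failure))   # True
print(matches_known_pattern(unseen_failure))  # False
```

The second line may well indicate a problem, but it passes unnoticed until someone adds a pattern for it.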
The CLX predictive maintenance module shows how GPUs can accelerate the analysis of this enormous volume of logs using machine learning. Another benefit of analyzing logs probabilistically is that previously unseen root causes can be pinned down. To achieve this, we will fine-tune a pre-trained BERT* model with a classification layer using the HuggingFace library.
Once the model is capable of identifying even new root causes, it can also be deployed as a process running on the machines to predict failures before they happen.
*BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found here.
How to train a Predictive Maintenance model
To train a CLX Predictive Maintenance model, you simply need a training dataset that contains a column of log entries (log) and a column with their associated label (label), which can be either 0 or 1.
First initialize your new model
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Next, train your predictive maintenance model. The example below uses a small sample dataset for demonstration only; in practice, you will want a larger training set.
import cudf

df = cudf.DataFrame()
df["log"] = [
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.384311] docker0: port 1(veth3dd105f) entered blocking state",
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.384315] docker0: port 1(veth3dd105f) entered disabled state",
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.384418] device veth3dd105f entered promiscuous mode",
    "Apr 6 15:17:57 local-dt-eno1 kernel: docker0: port 1(veth3dd105f) entered blocking state",
    "Apr 6 15:17:57 local-dt-eno1 kernel: docker0: port 1(veth3dd105f) entered disabled state",
    "Apr 6 15:17:57 local-dt-eno1 kernel: device veth3dd105f entered promiscuous mode",
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.654834] eth0: renamed from veth7677340",
    "Apr 6 15:17:57 local-dt-eno1 kernel: eth0: renamed from veth7677340",
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.686871] IPv6: ADDRCONF(NETDEV_CHANGE): veth3dd105f: link becomes ready",
    "Apr 6 15:17:57 local-dt-eno1 kernel: [ 1021.686944] docker0: port 1(veth3dd105f) entered blocking state",
]
df["label"] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
Split the dataset into training and test sets
from cuml import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, 'label', train_size=0.8)
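To make the effect of train_size=0.8 concrete, the following plain-Python sketch splits ten rows into eight training rows and two test rows. It is a stand-in for illustration, not cuML's implementation; the function name split_rows, the seed parameter, and the toy rows are all hypothetical.

```python
import random


def split_rows(rows, train_size=0.8, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Ten toy (log, label) pairs standing in for the sample dataset.
rows = [(f"log line {i}", i % 2) for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))  # 8 2
```

With only two test rows, any evaluation metric on this sample would be very noisy, which is one more reason to use a larger dataset in practice.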
seq_classifier.train_model(X_train["log"], y_train, epochs=1)
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
/opt/conda/envs/clx_dev/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.12s/it]
Train loss: 1.9852845668792725
Ideally, you will want to train your model over a number of epochs, as detailed in our example Predictive Maintenance notebook.
Save a trained model checkpoint
Save a trained model
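The two save steps can be sketched as a small helper. This is a sketch under an assumption: that the trained classifier exposes save_checkpoint(path) and save_model(path) methods. Check those names against the CLX version you have installed. The helper name save_artifacts is hypothetical.

```python
def save_artifacts(classifier, checkpoint_path, model_path):
    """Save both a checkpoint and a full model from a trained classifier.

    Assumes `classifier` provides save_checkpoint(path) and save_model(path);
    these method names are an assumption about the classifier's API.
    """
    classifier.save_checkpoint(checkpoint_path)
    classifier.save_model(model_path)
    return checkpoint_path, model_path
```

With the classifier trained above, a call might look like save_artifacts(seq_classifier, "clx_pdm_classifier.ckpt", "clx_pdm_classifier.pth"), using the same file names that the loading step in this guide reads.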
Load a model
Let’s create a new sequence classifier instance and load the saved checkpoint.
pdm = BinarySequenceClassifier()
pdm.init_model("clx_pdm_classifier.ckpt")
Let’s create another sequence classifier instance and load the saved model.
pdm2 = BinarySequenceClassifier()
pdm2.init_model("clx_pdm_classifier.pth")
Use your new model for prediction
infer_df = cudf.DataFrame()
infer_df["log"] = ["Apr 6 15:07:07 local-dt-eno1 kernel: [ 371.072371] audit: type=1400 audit(1617721627.183:67): apparmor=\"STATUS\" operation=\"profile_load\" profile=\"unconfined\" name=\"snap-update-ns.cmake\" pid=7066 comm=\"apparmor_parser\""]
pdm.predict(infer_df["log"])
(0    True
Name: 0, dtype: bool,
0    0.523229
Name: 0, dtype: float32)
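The output above is a pair: a boolean prediction and a probability for each input line. As a rough illustration of how the two relate, the sketch below pairs each probability with a decision at a cutoff. The default of 0.5 and the >= comparison are assumptions made for illustration, and the decide helper is hypothetical, not part of CLX.

```python
def decide(probabilities, threshold=0.5):
    """Pair each probability with a boolean decision at the given threshold."""
    return [(p >= threshold, p) for p in probabilities]


# The probability from the output above, at two different cutoffs.
print(decide([0.523229]))                 # [(True, 0.523229)]
print(decide([0.523229], threshold=0.6))  # [(False, 0.523229)]
```

A probability this close to the cutoff is a weak signal, which is unsurprising for a model trained for a single epoch on ten rows.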
This example shows how to use a BERT-based predictive maintenance model. The approach can be deployed on the machines themselves to warn users well before problems occur, so that corrective action can be taken.
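One way such an on-machine process might be structured is sketched below: log lines are scored in batches and only those at or above a threshold are reported. Everything here is hypothetical, including the monitor function, the batch size, and the threshold. In particular, toy_score is a keyword stub standing in for a trained classifier so that the sketch runs without a GPU.

```python
def _flag(batch, score_fn, threshold):
    """Yield (line, score) for every line scoring at or above the threshold."""
    for line, score in zip(batch, score_fn(batch)):
        if score >= threshold:
            yield line, score


def monitor(lines, score_fn, threshold=0.5, batch_size=4):
    """Score log lines in batches and yield the flagged (line, score) pairs.

    `score_fn` is a hypothetical callable mapping a list of lines to a list
    of probabilities; in practice it would wrap a trained classifier.
    """
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield from _flag(batch, score_fn, threshold)
            batch = []
    if batch:  # score any leftover lines
        yield from _flag(batch, score_fn, threshold)


def toy_score(batch):
    """Stand-in scorer for demonstration only; this is NOT a model."""
    return [0.9 if "error" in line.lower() else 0.1 for line in batch]


sample = [
    "eth0: link up",
    "nvme0: I/O error",
    "docker0: port entered blocking state",
]
alerts = list(monitor(sample, toy_score))
print(alerts)  # [('nvme0: I/O error', 0.9)]
```

Batching matters because GPU inference is far more efficient on many lines at once than on one line at a time; the right batch size depends on log volume and how quickly alerts are needed.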