CLX Predictive Maintenance

This is an introduction to CLX Predictive Maintenance.

Introduction

Like any other Linux-based machine, a DGX system generates a vast amount of logs, and analysts can spend hours combing through them to identify the root cause of each failure. Failures can have countless kinds of root causes. Known patterns can help narrow them down, but regular expressions only identify patterns that have been seen before, and maintaining the search scripts becomes yet another manual task.

The CLX predictive maintenance module shows how GPUs can accelerate the analysis of these enormous log volumes using machine learning. A further benefit of this probabilistic approach is that it can surface previously unseen root causes. To achieve this, we fine-tune a pre-trained BERT* model with a classification layer using the Hugging Face library.
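
Under the hood, this amounts to loading the pre-trained BERT encoder together with a freshly initialized two-class classification head. As a rough sketch of the general approach using the Hugging Face transformers API directly (an illustration of the technique, not the exact clx internals):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT encoder plus a new classification head; num_labels=2
# matches the binary ordinary/root-cause labeling used in this module.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize a raw log line and score it with the (not yet fine-tuned) head.
inputs = tokenizer("kernel: eth0: renamed from veth7677340", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2): one score per class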

Once the model can identify even new root causes, it can also be deployed as a process running on the machines to predict failures before they happen.
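
As a minimal sketch of what such a process could look like (the file path, polling interval, and alerting shown here are hypothetical assumptions, not part of the clx API):

import time
import cudf

def monitor(log_path, classifier, interval=60):
    # Hypothetical polling loop: batch newly appended log lines and
    # flag the ones the classifier predicts to be root causes.
    seen = 0
    while True:
        with open(log_path) as f:
            lines = f.readlines()[seen:]
        seen += len(lines)
        if lines:
            batch = cudf.Series([line.rstrip("\n") for line in lines])
            preds, probs = classifier.predict(batch)
            for line, flag in zip(lines, preds.to_pandas()):
                if flag:
                    print(f"ALERT: possible root cause: {line.strip()}")
        time.sleep(interval)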

*BERT stands for Bidirectional Encoder Representations from Transformers. The paper is available at https://arxiv.org/abs/1810.04805.

How to train a Predictive Maintenance model

To train a CLX Predictive Maintenance model, you simply need a training dataset that contains a column of logs and their associated labels, where a label is either 0 for an ordinary log or 1 for a root cause.
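
For example, if your labeled logs are stored in a CSV file, you could load them with cudf (the file name and column layout here are hypothetical):

import cudf

# Hypothetical headerless CSV with two columns: the raw log line, then its 0/1 label.
df = cudf.read_csv("labeled_logs.csv", names=["log", "label"])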

First, initialize your new model

[1]:
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Next, train your predictive maintenance model. The example below uses a small sample dataset for demonstration only; ideally, you will want a much larger training set.

[2]:
import cudf

df = cudf.DataFrame()
df["log"] = [
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.384311] docker0: port 1(veth3dd105f) entered blocking state",
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.384315] docker0: port 1(veth3dd105f) entered disabled state",
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.384418] device veth3dd105f entered promiscuous mode",
    "Apr  6 15:17:57 local-dt-eno1 kernel: docker0: port 1(veth3dd105f) entered blocking state",
    "Apr  6 15:17:57 local-dt-eno1 kernel: docker0: port 1(veth3dd105f) entered disabled state",
    "Apr  6 15:17:57 local-dt-eno1 kernel: device veth3dd105f entered promiscuous mode",
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.654834] eth0: renamed from veth7677340",
    "Apr  6 15:17:57 local-dt-eno1 kernel: eth0: renamed from veth7677340",
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.686871] IPv6: ADDRCONF(NETDEV_CHANGE): veth3dd105f: link becomes ready",
    "Apr  6 15:17:57 local-dt-eno1 kernel: [ 1021.686944] docker0: port 1(veth3dd105f) entered blocking state",
]
df["label"] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

Split the dataset into training and test sets

[3]:
from cuml import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, 'label', train_size=0.8)
[4]:
seq_classifier.train_model(X_train["log"], y_train, epochs=1)
Epoch:   0%|                                                                                                                                                              | 0/1 [00:00<?, ?it/s]/opt/conda/envs/clx_dev/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.12s/it]
Train loss: 1.9852845668792725
[5]:
seq_classifier.evaluate_model(X_test["log"], y_test)
[5]:
0.0

Ideally, you will want to train your model over a number of epochs as detailed in our example Predictive Maintenance notebook.
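
For example, training for more epochs is just a matter of the epochs argument (using the same training split as above):

seq_classifier.train_model(X_train["log"], y_train, epochs=10)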

Save a trained model checkpoint

[6]:
seq_classifier.save_model("clx_pdm_classifier.ckpt")

Save a trained model

[7]:
seq_classifier.save_model("clx_pdm_classifier.pth")

Load a model

Let’s create a new sequence classifier instance and load the saved checkpoint.

[8]:
pdm = BinarySequenceClassifier()
pdm.init_model("clx_pdm_classifier.ckpt")

Let’s create another sequence classifier instance and load the saved model.

[9]:
pdm2 = BinarySequenceClassifier()
pdm2.init_model("clx_pdm_classifier.pth")

PDM Inferencing

Use your new model for prediction

[10]:
infer_df = cudf.DataFrame()
infer_df["log"] = ["Apr  6 15:07:07 local-dt-eno1 kernel: [  371.072371] audit: type=1400 audit(1617721627.183:67): apparmor=\"STATUS\" operation=\"profile_load\" profile=\"unconfined\" name=\"snap-update-ns.cmake\" pid=7066 comm=\"apparmor_parser\""]

pdm.predict(infer_df["log"])
[10]:
(0    True
 Name: 0, dtype: bool,
 0    0.523229
 Name: 0, dtype: float32)
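
predict returns a tuple of predicted labels and their associated probabilities, both aligned with the input. As a small illustrative sketch (the column names here are our own), you could attach the results back to the DataFrame and keep only the flagged lines:

preds, probs = pdm.predict(infer_df["log"])
infer_df["is_root_cause"] = preds  # boolean prediction per log line
infer_df["probability"] = probs    # probability associated with the prediction

# Keep only the lines the model flags as likely root causes.
flagged = infer_df[infer_df["is_root_cause"]]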

Conclusion

This example shows how to use a BERT-based model for predictive maintenance. This approach can be deployed on the machines to warn users well before problems occur, so that corrective action can be taken.