CLX cyBERT

Introduction

One of the most arduous tasks of any security operation (and equally time-consuming for a data scientist) is ETL and parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1,000 previously parsed Apache server logs as labeled data. We will fine-tune a pretrained BERT model from HuggingFace with a classification layer for Named Entity Recognition.

How to train a cyBERT model

For an in-depth example of cyBERT model training, view this Jupyter Notebook.
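At a high level, training attaches a token-classification head to a pretrained BERT model and fine-tunes it on log tokens labeled with their field names. The sketch below is a minimal, illustrative outline of that approach, not the CLX training code itself; NUM_LABELS, the toy input line, and the all-zero placeholder labels are assumptions for demonstration only.

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

NUM_LABELS = 7  # hypothetical: one label per Apache log field, plus "other"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_LABELS
)

# One toy example: a raw log line plus a placeholder label ID per subword.
encoding = tokenizer(
    "109.169.248.247 - -",
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=32,
)
labels = torch.zeros((1, 32), dtype=torch.long)  # placeholder label IDs

# Single optimization step; a real run loops over batches and epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
outputs = model(**encoding, labels=labels)
loss = outputs[0]  # cross-entropy over the per-token label predictions
loss.backward()
optimizer.step()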

Download pre-trained model

Let’s download a pre-trained model from S3.

[1]:
import s3fs

# Public location of the pre-trained cyBERT Apache log parser
S3_BASE_PATH = "models.huggingface.co/bert/raykallen/cybert_apache_parser"
CONFIG_FILENAME = "config.json"
MODEL_FILENAME = "pytorch_model.bin"

# Anonymous access is sufficient for this public bucket
fs = s3fs.S3FileSystem(anon=True)
fs.get(S3_BASE_PATH + "/" + MODEL_FILENAME, MODEL_FILENAME)
fs.get(S3_BASE_PATH + "/" + CONFIG_FILENAME, CONFIG_FILENAME)
[1]:
[None]

Let’s create a Cybert instance and load the pre-trained model.

[2]:
from clx.analytics.cybert import Cybert

cyparse = Cybert()
cyparse.load_model(MODEL_FILENAME, CONFIG_FILENAME)

Sample Apache logs as input

[3]:
import cudf

# Each Series element is one piece of raw Apache log text to parse
input_logs = cudf.Series(['109.169.248.247 - -',
                          'POST /administrator/index.php HTTP/1.1 200 4494'])
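In practice you would typically read raw log lines from a file rather than constructing the Series by hand. A minimal sketch of one way to do that; the filename apache.log and the single-column read_csv trick are illustrative assumptions, not a CLX requirement.

import cudf

# Read each raw line into a single column by using a delimiter that
# should never appear in the logs ("\x01"), so no field splitting occurs.
raw = cudf.read_csv("apache.log", header=None, names=["raw"], sep="\x01")
input_logs = raw["raw"]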

cyBERT Inferencing

Use the model to parse the sample Apache logs.

[4]:
parsed_df, confidence_df = cyparse.inference(input_logs)
[5]:
parsed_df
[5]:
   remote_host      other  request_method  request_url               request_http_ver  status  response_bytes_clf
0  109.169.248.247  -      NaN             NaN                       NaN               NaN     NaN
1  NaN              NaN    POST            /administrator/index.php  HTTP/1.1          200     4494
[6]:
confidence_df
[6]:
   remote_host  other     request_method  request_url  request_http_ver  status    response_bytes_clf
0  0.999628     0.999579  NaN             NaN          NaN               NaN       NaN
1  NaN          NaN       0.99822         0.999629     0.999936          0.999866  0.999751
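A common follow-up is to discard low-confidence fields before using the parsed output downstream. A minimal, hypothetical sketch: the 0.99 threshold is an illustrative value, not a CLX setting, and the frames are converted to pandas here for simple null handling.

# Keep a parsed field only where its confidence meets the threshold;
# all other cells (including unparsed NaN fields) are masked to NaN.
THRESHOLD = 0.99  # arbitrary example cutoff
mask = confidence_df.to_pandas() >= THRESHOLD
reliable_df = parsed_df.to_pandas().where(mask)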

Conclusion

This example shows how a cyBERT-based parser can extract fields from Apache logs. Users can experiment with other log types by training a model on their own labeled datasets as their requirements dictate.