CLX cyBERT
Introduction
One of the most arduous tasks in any security operation (and one of the most time consuming for a data scientist) is ETL and log parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1,000 previously parsed Apache server logs as labeled data. We fine-tune a pretrained BERT model from HuggingFace with a classification layer for Named Entity Recognition.
How to train a cyBERT model
For an in-depth example of cyBERT model training, view this Jupyter Notebook.
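The full training notebook walks through labeling and fine-tuning in detail; the snippet below is only a minimal sketch of that setup, assuming the HuggingFace transformers and torch packages are available. The base checkpoint name and the label list shown here are illustrative placeholders, not the exact ones used to produce the pre-trained cyBERT model.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set; the real cyBERT model has a label per Apache log field.
labels = ["O", "remote_host", "request_method", "request_url"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

# Tokenize one log line; every token is assigned label index 0 ("O") here
# purely to show the shapes involved in a single training step.
encoding = tokenizer("109.169.248.247 - -", return_tensors="pt")
token_labels = torch.zeros_like(encoding["input_ids"])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**encoding, labels=token_labels).loss
loss.backward()
optimizer.step()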
Download pre-trained model
Let’s download a pre-trained model from S3.
[1]:
import s3fs

# Location of the pre-trained cyBERT Apache parser model and its config.
S3_BASE_PATH = "models.huggingface.co/bert/raykallen/cybert_apache_parser"
CONFIG_FILENAME = "config.json"
MODEL_FILENAME = "pytorch_model.bin"

# Download both files anonymously from the public bucket.
fs = s3fs.S3FileSystem(anon=True)
fs.get(S3_BASE_PATH + "/" + MODEL_FILENAME, MODEL_FILENAME)
fs.get(S3_BASE_PATH + "/" + CONFIG_FILENAME, CONFIG_FILENAME)
[1]:
[None]
Let’s create a Cybert instance and load the pre-trained model.
[2]:
from clx.analytics.cybert import Cybert
cyparse = Cybert()
cyparse.load_model(MODEL_FILENAME, CONFIG_FILENAME)
Sample Apache logs as input
[3]:
import cudf
input_logs = cudf.Series(['109.169.248.247 - -',
                          'POST /administrator/index.php HTTP/1.1 200 4494'])
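In practice the input is usually a file of raw access-log lines rather than a hand-written list. A minimal sketch, assuming a local file named access.log (the filename is illustrative):

# Read raw Apache access-log lines from disk into a cudf Series.
with open("access.log") as f:
    input_logs = cudf.Series([line.rstrip("\n") for line in f])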
cyBERT Inferencing
Use the model to parse Apache logs.
[4]:
parsed_df, confidence_df = cyparse.inference(input_logs)
/opt/conda/envs/clx_dev/lib/python3.8/site-packages/cudf/core/subword_tokenizer.py:189: UserWarning: When truncation is not True, the behavior currently differs from HuggingFace as cudf always returns overflowing tokens
warnings.warn(warning_msg)
[5]:
parsed_df
[5]:
|   | remote_host | other | request_method | request_url | request_http_ver | status | response_bytes_clf |
|---|---|---|---|---|---|---|---|
| 0 | 109.169.248.247 | - | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | POST | /administrator/index.php | HTTP/1.1 | 200 | 449 |
[6]:
confidence_df
[6]:
|   | remote_host | other | request_method | request_url | request_http_ver | status | response_bytes_clf |
|---|---|---|---|---|---|---|---|
| 0 | 0.999628 | 0.999579 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 0.99822 | 0.999629 | 0.999936 | 0.999866 | 0.999751 |
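The confidence scores can be used to drop low-quality extractions before downstream use. A minimal post-processing sketch; the 0.9 threshold is an assumption, not part of the CLX API:

# Keep only fields parsed with confidence >= 0.9; everything else becomes null.
threshold = 0.9
high_conf_df = parsed_df.to_pandas().where(confidence_df.to_pandas() >= threshold)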
Conclusion
This example shows how a cyBERT-based parser can extract structured fields from Apache logs. Users can experiment with other log formats by training a model on their own labeled data.