{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CLX cyBERT\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "One of the most arduous tasks of any security operation (and equally as time consuming for a data scientist) is ETL and parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1000 previously parsed apache server logs as a labeled data. We will fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a classification layer for Named Entity Recognition." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to train a cyBERT model\n", "For in-depth example of cyBERT model training view this Jupyter [Notebook](https://github.com/rapidsai/clx/blob/main/notebooks/cybert/cybert_example_training.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download pre-trained model\n", "\n", "Let's download a pre-trained model from s3." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[None]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import s3fs\n", "\n", "S3_BASE_PATH = \"models.huggingface.co/bert/raykallen/cybert_apache_parser\"\n", "CONFIG_FILENAME = \"config.json\"\n", "MODEL_FILENAME = \"pytorch_model.bin\"\n", "\n", "fs = s3fs.S3FileSystem(anon=True)\n", "fs.get(S3_BASE_PATH + \"/\" + MODEL_FILENAME, MODEL_FILENAME)\n", "fs.get(S3_BASE_PATH + \"/\" + CONFIG_FILENAME, CONFIG_FILENAME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's create a cybert instance and load the pre-trained model." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from clx.analytics.cybert import Cybert\n", "\n", "cyparse = Cybert()\n", "cyparse.load_model(MODEL_FILENAME, CONFIG_FILENAME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Apache logs as input" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "input_logs = cudf.Series(['109.169.248.247 - -',\n", " 'POST /administrator/index.php HTTP/1.1 200 4494'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## cyBERT Inferencing\n", "\n", "Use your model to parse apache logs" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/envs/clx_dev/lib/python3.8/site-packages/cudf/core/subword_tokenizer.py:189: UserWarning: When truncation is not True, the behavior currently differs from HuggingFace as cudf always returns overflowing tokens\n", " warnings.warn(warning_msg)\n" ] } ], "source": [ "parsed_df, confidence_df = cyparse.inference(input_logs)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
remote_hostotherrequest_methodrequest_urlrequest_http_verstatusresponse_bytes_clf
0109.169.248.247-NaNNaNNaNNaNNaN
1NaNNaNPOST/administrator/index.phpHTTP/1.1200449
\n", "
" ], "text/plain": [ " remote_host other request_method request_url \\\n", "0 109.169.248.247 - NaN NaN \n", "1 NaN NaN POST /administrator/index.php \n", "\n", " request_http_ver status response_bytes_clf \n", "0 NaN NaN NaN \n", "1 HTTP/1.1 200 449 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parsed_df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
remote_hostotherrequest_methodrequest_urlrequest_http_verstatusresponse_bytes_clf
00.9996280.999579NaNNaNNaNNaNNaN
1NaNNaN0.998220.9996290.9999360.9998660.999751
\n", "
" ], "text/plain": [ " remote_host other request_method request_url request_http_ver \\\n", "0 0.999628 0.999579 NaN NaN NaN \n", "1 NaN NaN 0.99822 0.999629 0.999936 \n", "\n", " status response_bytes_clf \n", "0 NaN NaN \n", "1 0.999866 0.999751 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confidence_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This example shows that using a cyBERT-based parser for extracting apache logs. Users can experiment with other datasets by training model as per the requirements." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }