{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CLX cyBERT\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"One of the most arduous tasks of any security operation (and equally as time consuming for a data scientist) is ETL and parsing. This notebook illustrates how to train a BERT language model using a toy dataset of just 1000 previously parsed apache server logs as a labeled data. We will fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a classification layer for Named Entity Recognition."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to train a cyBERT model\n",
"For in-depth example of cyBERT model training view this Jupyter [Notebook](https://github.com/rapidsai/clx/blob/main/notebooks/cybert/cybert_example_training.ipynb)."
]
},
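{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core of that training workflow is fine-tuning a BERT token-classification head on labeled log tokens. The cell below is a minimal illustrative sketch, not the CLX training API: it assumes labels have already been aligned to subword tokens, and `train_dataloader` and `NUM_LABELS` are hypothetical placeholders for your own data pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the fine-tuning loop (see the linked training\n",
"# notebook for the full CLX workflow). `train_dataloader` and NUM_LABELS\n",
"# are hypothetical placeholders that depend on your labeled dataset.\n",
"from torch.optim import AdamW\n",
"from transformers import BertForTokenClassification\n",
"\n",
"NUM_LABELS = 21  # hypothetical: one class per log field, plus an \"other\" class\n",
"\n",
"model = BertForTokenClassification.from_pretrained(\n",
"    \"bert-base-cased\", num_labels=NUM_LABELS\n",
").cuda()\n",
"optimizer = AdamW(model.parameters(), lr=3e-5)\n",
"\n",
"model.train()\n",
"for input_ids, attention_mask, labels in train_dataloader:  # placeholder loader\n",
"    optimizer.zero_grad()\n",
"    # Passing labels makes the model return the token-level cross-entropy loss\n",
"    loss = model(input_ids=input_ids,\n",
"                 attention_mask=attention_mask,\n",
"                 labels=labels).loss\n",
"    loss.backward()\n",
"    optimizer.step()"
]
},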
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download pre-trained model\n",
"\n",
"Let's download a pre-trained model from s3."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[None]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import s3fs\n",
"\n",
"S3_BASE_PATH = \"models.huggingface.co/bert/raykallen/cybert_apache_parser\"\n",
"CONFIG_FILENAME = \"config.json\"\n",
"MODEL_FILENAME = \"pytorch_model.bin\"\n",
"\n",
"fs = s3fs.S3FileSystem(anon=True)\n",
"fs.get(S3_BASE_PATH + \"/\" + MODEL_FILENAME, MODEL_FILENAME)\n",
"fs.get(S3_BASE_PATH + \"/\" + CONFIG_FILENAME, CONFIG_FILENAME)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's create a cybert instance and load the pre-trained model."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from clx.analytics.cybert import Cybert\n",
"\n",
"cyparse = Cybert()\n",
"cyparse.load_model(MODEL_FILENAME, CONFIG_FILENAME)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sample Apache logs as input"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"input_logs = cudf.Series(['109.169.248.247 - -',\n",
" 'POST /administrator/index.php HTTP/1.1 200 4494'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cyBERT Inferencing\n",
"\n",
"Use your model to parse apache logs"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/envs/clx_dev/lib/python3.8/site-packages/cudf/core/subword_tokenizer.py:189: UserWarning: When truncation is not True, the behavior currently differs from HuggingFace as cudf always returns overflowing tokens\n",
" warnings.warn(warning_msg)\n"
]
}
],
"source": [
"parsed_df, confidence_df = cyparse.inference(input_logs)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" remote_host | \n",
" other | \n",
" request_method | \n",
" request_url | \n",
" request_http_ver | \n",
" status | \n",
" response_bytes_clf | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 109.169.248.247 | \n",
" - | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" POST | \n",
" /administrator/index.php | \n",
" HTTP/1.1 | \n",
" 200 | \n",
" 449 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" remote_host other request_method request_url \\\n",
"0 109.169.248.247 - NaN NaN \n",
"1 NaN NaN POST /administrator/index.php \n",
"\n",
" request_http_ver status response_bytes_clf \n",
"0 NaN NaN NaN \n",
"1 HTTP/1.1 200 449 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parsed_df"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" remote_host | \n",
" other | \n",
" request_method | \n",
" request_url | \n",
" request_http_ver | \n",
" status | \n",
" response_bytes_clf | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.999628 | \n",
" 0.999579 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" 0.99822 | \n",
" 0.999629 | \n",
" 0.999936 | \n",
" 0.999866 | \n",
" 0.999751 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" remote_host other request_method request_url request_http_ver \\\n",
"0 0.999628 0.999579 NaN NaN NaN \n",
"1 NaN NaN 0.99822 0.999629 0.999936 \n",
"\n",
" status response_bytes_clf \n",
"0 NaN NaN \n",
"1 0.999866 0.999751 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"confidence_df"
]
},
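{
"cell_type": "markdown",
"metadata": {},
"source": [
"The confidence scores can be used to flag fields for manual review. The cell below is a minimal sketch of one way to do that; the 0.99 threshold is an assumed value to tune for your own data, not a CLX recommendation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical threshold: null out any parsed field whose confidence\n",
"# falls below it, keeping only high-confidence extractions.\n",
"CONFIDENCE_THRESHOLD = 0.99\n",
"\n",
"high_conf = confidence_df >= CONFIDENCE_THRESHOLD  # elementwise; NaN compares False\n",
"trusted_df = parsed_df.where(high_conf)  # low-confidence cells become null\n",
"trusted_df"
]
},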
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This example shows that using a cyBERT-based parser for extracting apache logs. Users can experiment with other datasets by training model as per the requirements."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}