CLX Phishing Detection Using cyBERT
This is an introduction to CLX Phishing Detection.
What is Phishing?
Phishing is a method used by fraudsters and hackers to obtain sensitive information from email users by pretending to be a legitimate institution or person. Various machine learning methods are in use to detect and filter phishing and spam emails. In this example we show how to train a BERT* language model and analyse its performance. We have fine-tuned a pre-trained BERT model with a classification layer using the HuggingFace library. *BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found here.
How to train a Phishing Detection model
To train a CLX Phishing Detection model, you need a training dataset that contains a column of email content and a column of associated labels, where each label is either 1 (malicious) or 0 (benign).
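The expected layout can be checked before any GPU work begins. The sketch below is a minimal, plain-Python illustration of that check; `validate_rows` is a hypothetical helper introduced here and is not part of CLX.

```python
# Hypothetical helper (not part of CLX): checks that every training row
# pairs a non-empty email string with a binary label.
def validate_rows(emails, labels):
    if len(emails) != len(labels):
        raise ValueError("emails and labels must have the same length")
    for i, (text, label) in enumerate(zip(emails, labels)):
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"row {i}: email content must be a non-empty string")
        if label not in (0, 1):
            raise ValueError(f"row {i}: label must be 0 (benign) or 1 (malicious)")
    return len(emails)


emails = [
    "Not a surprising assessment from Embassy.",
    "Best regards, Ron Sinclear. ronsinclear@netscape.net",
]
labels = [0, 1]
n_rows = validate_rows(emails, labels)
print(n_rows)
```

Running a check like this first surfaces mislabeled or empty rows early, before they reach model training.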
First initialize your new model
[1]:
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier
seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Next, train your phishing detector. The example below uses a small dataset for demonstration only; in practice you will want a larger training set.
[2]:
import cudf
df = cudf.DataFrame()
df["email"] = [
"Dave =96 for IP if you wish. An arrested click-fraudster who claimed to=20==20make $30k a month via Google click-fraud is now freed because of a=20=20possible lack of cooperation from Google.http://yahoo.businessweek.com/technology/content/dec2006/=20tc20061204_923336.htmTony RajakumarVictrio Inc.tonyr@victrio.com-------------------------------------You are subscribed as R@MTo manage your subscription, go to http://v2.listbox.com/member/?listname=3Dip",
"over. SidLet me know. Thx.",
"DR.SOLOMON AZEEZ FEDERAL MINISTRY OF PETROLUEM RESOURCES(F.M.P.R) LAGOS NIGERIA. ATTN:SIR, REQUEST FOR ASSISTANCE- STRICTLYCONFIDENTIAL I am Dr SOLOMON AZEEZ, an accountant in the Ministry of petroleum Resources (MPR) and a member of a three-man Tender Board in charge of contract review and payment approvals. I came to know of you in my search for a reliable person to handle a very confidential transaction that involves the transfer of a huge sum of money to a foreign account. It may sound strange but exercise patience and read on. There were series of contracts executed by a consortium ",
"Not a surprising assessment from Embassy.",
"Monica -Huma Abedin <Huma@clintonemail.com>Tuesday June 29 2010 6:01 AM'hanleymr@state.gov'; HRe:is already is locked for tonite. I am seeing her right before actually.",
"Pis print.H <hrod17@clintonemail.com>Thursday October 8 2009 8:01 PM'JilotyLC@state.gov'Fw: WHI - powder coatingB6",
"Best regards, Ron Sinclear. ronsinclear@netscape.net",
"Yes",
]
df["label"] = [1, 0, 1, 0, 1, 0, 1, 0]
Split the dataset into training and test sets
[3]:
from cuml import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, 'label', train_size=0.8)
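To make the effect of `train_size=0.8` concrete, the sketch below reproduces the idea with the standard library only. It illustrates a shuffle-and-cut split and is not cuML's implementation; `simple_split`, its fixed seed, and the placeholder rows are assumptions introduced here.

```python
import random


def simple_split(rows, labels, train_size=0.8, seed=0):
    # Shuffle the row indices reproducibly, then cut at the train_size fraction.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Placeholder rows standing in for the eight emails above.
rows = [f"email {i}" for i in range(8)]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
X_tr, X_te, y_tr, y_te = simple_split(rows, labels)
print(len(X_tr), len(X_te))
```

In this sketch, eight rows leave six for training and two for testing, which shows how little held-out data a demonstration-sized dataset provides.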
[4]:
seq_classifier.train_model(X_train["email"], y_train, epochs=1)
Epoch: 0%| | 0/1 [00:00<?, ?it/s]/opt/conda/envs/clx_dev/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.18s/it]
Train loss: 2.008297920227051
[5]:
seq_classifier.evaluate_model(X_test["email"], y_test)
[5]:
1.0
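The value returned above appears to be an accuracy score, that is, the fraction of test emails classified correctly. This is an assumption based on the output and is not confirmed on this page. A minimal sketch of that metric in plain Python, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the true label.
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only; these are not results from the model above.
perfect = accuracy([1, 0], [1, 0])
partial = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
print(perfect, partial)
```

On a test set of only a couple of emails, a score of 1.0 carries little statistical weight, so a larger held-out set gives a more reliable estimate.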
Ideally, you will want to train your model over a number of epochs, as detailed in our example Phishing Detection notebook.
Save a trained model checkpoint
[6]:
seq_classifier.save_checkpoint("clx_pd_classifier.ckpt")
Save a trained model
[7]:
seq_classifier.save_model("clx_pd_classifier.pth")
Load a model
Let’s create a new phishing detector instance from the saved checkpoint.
[8]:
phishing_detector = BinarySequenceClassifier()
phishing_detector.init_model("bert-base-uncased")
phishing_detector.load_checkpoint("clx_pd_classifier.ckpt")
Let’s create a new phishing detector instance from the saved model.
[9]:
phishing_detector2 = BinarySequenceClassifier()
phishing_detector2.init_model("clx_pd_classifier.pth")
Phishing Detection Inferencing
Use your new model to predict phishing emails
[10]:
infer_df = cudf.DataFrame()
infer_df["email"] = ["Best regards, Ron Sinclear. ronsinclear@netscape.net","over. SidLet me know. Thx."]
phishing_detector.predict(infer_df["email"])
[10]:
(0 True
1 False
Name: 0, dtype: bool,
0 0.632572
1 0.451712
Name: 0, dtype: float32)
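The output above is a pair of series: a boolean prediction for each email, followed by the probability behind it. The predictions shown are consistent with a decision threshold of 0.5, although that value is an assumption here. The plain-Python sketch below illustrates the thresholding step with a hypothetical `apply_threshold` helper; it is not the CLX implementation.

```python
def apply_threshold(probabilities, threshold=0.5):
    # Flag an email as phishing when its probability meets the threshold.
    # The default of 0.5 is an assumption for illustration.
    return [p >= threshold for p in probabilities]


# Probabilities copied from the output above.
probs = [0.632572, 0.451712]
flags = apply_threshold(probs)
print(flags)  # [True, False]
```

Raising the threshold trades missed phishing emails for fewer false alarms. For example, `apply_threshold(probs, threshold=0.7)` would flag neither email.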
Conclusion
This example shows that a BERT-based phishing detector can perform well at identifying phishing and spam emails in these datasets. Users can experiment with other datasets, increase the coverage, and change the number of epochs to fine-tune the results on their own data. It is also an example of how CLX can be used with HuggingFace and other libraries to create custom solutions.