CLX Phishing Detection Using cyBERT

This is an introduction to CLX Phishing Detection.

What is Phishing?

Phishing is a method used by fraudsters and hackers to obtain sensitive information from email users by pretending to be legitimate institutions or people. Various machine learning methods are used to detect and filter phishing and spam emails. In this notebook we show how to train a *BERT language model and analyse its performance. We fine-tune a pre-trained BERT model with a classification layer using the HuggingFace library. *BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found here.

How to train a Phishing Detection model

To train a CLX Phishing Detection model, you need a training dataset that contains a column of email content and a column of associated labels, where each label is either 1 (malicious) or 0 (benign).
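
The expected layout can be checked before any GPU work starts. The following is a minimal sketch in plain Python, not part of CLX; the helper name `validate_rows` is hypothetical and exists only for illustration. It verifies that every row pairs non-empty email text with a label of 0 or 1.

```python
def validate_rows(emails, labels):
    """Check that emails and labels line up and that labels are binary.

    Hypothetical helper for illustration; CLX does not provide it.
    """
    if len(emails) != len(labels):
        raise ValueError("emails and labels must have the same length")
    for i, (text, label) in enumerate(zip(emails, labels)):
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"row {i}: email content must be non-empty text")
        if label not in (0, 1):
            raise ValueError(f"row {i}: label must be 0 or 1, got {label!r}")
    return len(emails)


emails = ["REQUEST FOR ASSISTANCE - STRICTLY CONFIDENTIAL", "Let me know. Thx."]
labels = [1, 0]
n_rows = validate_rows(emails, labels)
```

A check like this fails early on mismatched lengths or stray label values, which is cheaper than discovering the problem partway through training.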

First initialize your new model

[1]:
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Next, train your phishing detector. The example below uses a small dataset for demonstration only; in practice you will want a much larger training set.

[2]:
import cudf

df = cudf.DataFrame()
df["email"] = [
    "Dave =96 for IP if you wish. An arrested click-fraudster who claimed to=20==20make $30k a month via Google click-fraud is now freed because of a=20=20possible lack of cooperation from Google.http://yahoo.businessweek.com/technology/content/dec2006/=20tc20061204_923336.htmTony RajakumarVictrio Inc.tonyr@victrio.com-------------------------------------You are subscribed as R@MTo manage your subscription, go to  http://v2.listbox.com/member/?listname=3Dip",
    "over. SidLet me know. Thx.",
    "DR.SOLOMON AZEEZ FEDERAL MINISTRY OF PETROLUEM RESOURCES(F.M.P.R) LAGOS NIGERIA. ATTN:SIR, REQUEST FOR ASSISTANCE- STRICTLYCONFIDENTIAL I am Dr SOLOMON AZEEZ, an accountant in the Ministry of petroleum Resources (MPR) and a member of a three-man Tender Board in charge of contract review and payment approvals. I came to know of you in my search for a reliable person to handle a very confidential transaction that involves the transfer of a huge sum of money to a foreign account. It may sound strange but exercise patience and read on. There were series of contracts executed by a consortium ",
    "Not a surprising assessment from Embassy.",
    "Monica -Huma Abedin <Huma@clintonemail.com>Tuesday June 29 2010 6:01 AM'hanleymr@state.gov'; HRe:is already is locked for tonite. I am seeing her right before actually.",
    "Pis print.H <hrod17@clintonemail.com>Thursday October 8 2009 8:01 PM'JilotyLC@state.gov'Fw: WHI - powder coatingB6",
    "Best regards, Ron Sinclear. ronsinclear@netscape.net",
    "Yes",
]
df["label"] = [1, 0, 1, 0, 1, 0, 1, 0]

Split the dataset into training and test sets

[3]:
from cuml import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, 'label', train_size=0.8)
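
What a shuffled 80/20 split does can be sketched in plain Python. This is an illustrative stand-in, not cuML's implementation; the function name `split_rows`, the `seed` parameter, and the rounding rule are assumptions made for the sketch.

```python
import random


def split_rows(rows, train_size=0.8, seed=0):
    """Shuffle rows and split them into train and test lists.

    Plain-Python stand-in for illustration; not cuML's implementation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Assumed rounding rule: truncate the training count toward zero.
    n_train = int(len(shuffled) * train_size)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(8))
train_rows, test_rows = split_rows(rows)
```

With eight rows and this rounding rule, six rows go to training and two to testing. The library's own rounding may differ, so check the sizes of the returned objects rather than assuming them.
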
[4]:
seq_classifier.train_model(X_train["email"], y_train, epochs=1)
Epoch:   0%|                                                                                                                                                              | 0/1 [00:00<?, ?it/s]/opt/conda/envs/clx_dev/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.18s/it]
Train loss: 2.008297920227051
[5]:
seq_classifier.evaluate_model(X_test["email"], y_test)
[5]:
1.0
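
A score of 1.0 on a handful of test emails says little about real-world performance. Assuming the value above is plain classification accuracy (the fraction of correct predictions), the sketch below enumerates every score that a two-email test set can produce. The function `accuracy` is written here for illustration and is not the CLX implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Every possible prediction for a two-email test set with true labels [1, 0].
candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]
possible_scores = sorted({accuracy([1, 0], list(pred)) for pred in candidates})
```

Only three distinct scores are possible here (0.0, 0.5, and 1.0), so a perfect result can arise by chance. A larger held-out set gives a much finer-grained estimate.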

Ideally, you will want to train your model over a number of epochs as detailed in our example Phishing Detection notebook.
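
To illustrate why several epochs help, here is a toy sketch in plain Python. It fits a one-feature logistic model by gradient descent and records the loss at each epoch. It is a stand-in for illustration only and has nothing to do with BERT fine-tuning internals; the function name, data, and learning rate are all assumptions.

```python
import math


def train_logistic(xs, ys, epochs=5, learning_rate=0.5):
    """Fit a one-feature logistic model by gradient descent.

    Toy stand-in for illustration only; it is not BERT fine-tuning.
    Returns the mean log loss measured at the start of each epoch.
    """
    weight, bias = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        # Forward pass: predicted probability for every example.
        probs = [1.0 / (1.0 + math.exp(-(weight * x + bias))) for x in xs]
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p)
            for p, y in zip(probs, ys)
        ) / len(xs)
        losses.append(loss)
        # Backward pass: one full-batch gradient step per epoch.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
        grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return losses


losses = train_logistic([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1], epochs=5)
```

On this toy data the recorded loss falls with every epoch. A single pass, as in the demonstration above, stops after the first of those steps.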

Save a trained model checkpoint

[6]:
seq_classifier.save_model("clx_pd_classifier.ckpt")

Save a trained model

[7]:
seq_classifier.save_model("clx_pd_classifier.pth")

Load a model

Let’s create a new phishing detector instance from the saved checkpoint.

[8]:
phishing_detector = BinarySequenceClassifier()
phishing_detector.init_model("clx_pd_classifier.ckpt")

Let’s create a new phishing detector instance from the saved model.

[9]:
phishing_detector2 = BinarySequenceClassifier()
phishing_detector2.init_model("clx_pd_classifier.pth")

Phishing Detection Inference

Use your new model to predict phishing emails

[10]:
infer_df = cudf.DataFrame()
infer_df["email"] = ["Best regards, Ron Sinclear. ronsinclear@netscape.net","over. SidLet me know. Thx."]

phishing_detector.predict(infer_df["email"])
[10]:
(0     True
 1    False
 Name: 0, dtype: bool,
 0    0.632572
 1    0.451712
 Name: 0, dtype: float32)
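
The output above is a pair: a boolean series and a series of probabilities. Assuming each boolean comes from comparing the probability with a cut-off of 0.5 (an assumption that is consistent with the values shown, not something the output confirms), the relationship can be sketched in plain Python. The helper `apply_threshold` is hypothetical.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Flag an email as phishing when its probability exceeds the threshold.

    Hypothetical helper; the strict '>' comparison is an assumption.
    """
    return [p > threshold for p in probabilities]


probs = [0.632572, 0.451712]  # probabilities copied from the output above
flags = apply_threshold(probs)
stricter_flags = apply_threshold(probs, threshold=0.7)
```

Here `flags` matches the boolean column shown above, while the stricter cut-off of 0.7 flags neither email. Raising the cut-off trades missed phishing emails for fewer false alarms.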

Conclusion

This example shows how a BERT-based phishing detector can be trained and used to identify phishing emails. The dataset used here is far too small to measure real performance, so users should experiment with larger datasets, broaden their coverage, and change the number of epochs to fine-tune the results on their own data. It is also an example of how CLX can be used with HuggingFace and other libraries to create custom solutions.