CLX Phishing Detection Using cyBERT
This is an introduction to CLX Phishing Detection.
What is Phishing?
Phishing is a method used by fraudsters and hackers to obtain sensitive information from email users by pretending to be legitimate institutions or people. Various machine learning methods are used to detect and filter phishing and spam emails. In this example we show how to train a *BERT language model and analyse its performance. We fine-tuned a pre-trained BERT model with a classification layer using the HuggingFace library. *BERT stands for Bidirectional Encoder Representations from Transformers. The paper can be found here.
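To make the idea of "a pre-trained BERT model with a classification layer" more concrete, here is a minimal, library-free sketch of what such a layer computes: a linear map from one pooled sentence vector to one score per class, followed by a softmax. The function names, the 4-dimensional vector, and the weights are invented for illustration only. In bert-base the vector has 768 dimensions and the weights are learned during fine-tuning.

```python
import math


def classification_head(pooled, weights, biases):
    """Linear layer: one logit per class from a pooled sentence vector."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values, chosen only to show the shapes involved.
pooled = [0.2, -0.5, 0.1, 0.7]            # stand-in for the encoder output
weights = [[0.1, 0.3, -0.2, 0.0],         # row for class 0
           [-0.4, 0.2, 0.5, 0.6]]         # row for class 1
biases = [0.0, 0.1]

logits = classification_head(pooled, weights, biases)
probs = softmax(logits)
print(probs)
```

Fine-tuning adjusts both the encoder and these newly added weights, which is why the classifier weights are reported as newly initialized when the model is first created.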
How to train a Phishing Detection model
To train a CLX Phishing Detection model you simply need a training dataset that contains a column of email content and a column with the associated label, which can be either 1 (malicious) or 0 (not malicious).
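Before training, it can help to check that the label column really is binary and to see how balanced the classes are. The sketch below does this in plain Python on two parallel lists. The helper name `summarize_labels` is hypothetical, and treating 0 as the non-malicious class is an assumption drawn from the example data used in this guide.

```python
from collections import Counter


def summarize_labels(emails, labels):
    """Check an (email, label) dataset and count each class.

    Assumes label 1 means malicious and label 0 means not malicious.
    """
    if len(emails) != len(labels):
        raise ValueError("emails and labels must have the same length")
    unexpected = sorted(set(labels) - {0, 1})
    if unexpected:
        raise ValueError(f"labels must be 0 or 1, found: {unexpected}")
    counts = Counter(labels)
    return {"malicious": counts[1], "not_malicious": counts[0]}


# Hypothetical toy data, for illustration only.
summary = summarize_labels(
    ["Claim your prize now", "Meeting moved to 3pm", "Verify your account"],
    [1, 0, 1],
)
print(summary)
```

A strongly skewed count is a hint that accuracy alone will be a misleading measure of the trained model.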
First initialize your new model
from clx.analytics.binary_sequence_classifier import BinarySequenceClassifier

seq_classifier = BinarySequenceClassifier()
seq_classifier.init_model("bert-base-uncased")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Next, train your phishing detector. The example below uses a small dataset for demonstration only. Ideally you will want a larger training set.
import cudf

df = cudf.DataFrame()
df["email"] = [
    "Dave =96 for IP if you wish. An arrested click-fraudster who claimed to=20==20make $30k a month via Google click-fraud is now freed because of a=20=20possible lack of cooperation from Google.http://yahoo.businessweek.com/technology/content/dec2006/=20tc20061204_923336.htmTony RajakumarVictrio Inc.email@example.com-------------------------------------You are subscribed as R@MTo manage your subscription, go to http://v2.listbox.com/member/?listname=3Dip",
    "over. SidLet me know. Thx.",
    "DR.SOLOMON AZEEZ FEDERAL MINISTRY OF PETROLUEM RESOURCES(F.M.P.R) LAGOS NIGERIA. ATTN:SIR, REQUEST FOR ASSISTANCE- STRICTLYCONFIDENTIAL I am Dr SOLOMON AZEEZ, an accountant in the Ministry of petroleum Resources (MPR) and a member of a three-man Tender Board in charge of contract review and payment approvals. I came to know of you in my search for a reliable person to handle a very confidential transaction that involves the transfer of a huge sum of money to a foreign account. It may sound strange but exercise patience and read on. There were series of contracts executed by a consortium ",
    "Not a surprising assessment from Embassy.",
    "Monica -Huma Abedin <Huma@clintonemail.com>Tuesday June 29 2010 6:01 AMfirstname.lastname@example.org'; HRe:is already is locked for tonite. I am seeing her right before actually.",
    "Pis print.H <email@example.com>Thursday October 8 2009 8:01 PM'JilotyLC@state.gov'Fw: WHI - powder coatingB6",
    "Best regards, Ron Sinclear. firstname.lastname@example.org",
    "Yes",
]
df["label"] = [1, 0, 1, 0, 1, 0, 1, 0]
Split the dataset into training and test sets
from cuml import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, 'label', train_size=0.8)
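For readers unfamiliar with the idea, the split above can be pictured with the following plain-Python sketch. It is not cuML's implementation: the function name, the seeded shuffle, and the rounding rule (truncating the training count) are assumptions made for illustration.

```python
import random


def simple_train_test_split(rows, labels, train_size=0.8, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    if not 0.0 < train_size < 1.0:
        raise ValueError("train_size must be between 0 and 1")
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    cut = int(len(indices) * train_size)      # assumption: truncate, not round
    train_idx, test_idx = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Eight placeholder emails with the same label pattern as the demo dataset.
emails = [f"email {i}" for i in range(8)]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
X_tr, X_te, y_tr, y_te = simple_train_test_split(emails, labels, train_size=0.8)
print(len(X_tr), len(X_te))
```

With only eight rows, the test part holds just two emails, which is one reason a larger dataset is recommended for any real evaluation.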
seq_classifier.train_model(X_train["email"], y_train, epochs=1)
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
/opt/conda/envs/clx_dev/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Epoch: 100%|██████████| 1/1 [00:06<00:00, 6.18s/it]
Train loss: 2.008297920227051
Ideally, you will want to train your model over a number of
epochs as detailed in our example Phishing Detection notebook.
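To put the printed training loss in context, the sketch below works through cross-entropy for a single example. It assumes the reported loss is a cross-entropy, which is the usual choice for sequence classification but is not stated in this guide. How the printed value is aggregated over batches or devices is also not stated here, so treat the numbers as intuition only.

```python
import math


def cross_entropy(prob_of_true_class):
    """Loss for one example: negative log of the probability
    the model assigned to the correct label."""
    if not 0.0 < prob_of_true_class <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(prob_of_true_class)


def mean_cross_entropy(probs_of_true_class):
    """Average the per-example losses."""
    losses = [cross_entropy(p) for p in probs_of_true_class]
    return sum(losses) / len(losses)


confident_right = cross_entropy(0.9)  # small loss
coin_flip = cross_entropy(0.5)        # equals ln(2), about 0.693
confident_wrong = cross_entropy(0.1)  # large loss
print(confident_right, coin_flip, confident_wrong)
```

Under these assumptions, the loss should fall as training proceeds over more epochs, because the model assigns higher probability to the correct labels.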
Save a trained model checkpoint
Save a trained model
Load a model
Let’s create a new phishing detector instance from the saved checkpoint.
phishing_detector = BinarySequenceClassifier()
phishing_detector.init_model("clx_pd_classifier.ckpt")
Let’s create a new phishing detector instance from the saved model.
phishing_detector2 = BinarySequenceClassifier()
phishing_detector2.init_model("clx_pd_classifier.pth")
Use your new model to predict phishing emails
infer_df = cudf.DataFrame()
infer_df["email"] = ["Best regards, Ron Sinclear. email@example.com", "over. SidLet me know. Thx."]
phishing_detector.predict(infer_df["email"])
(0     True
 1    False
 Name: 0, dtype: bool,
 0    0.632572
 1    0.451712
 Name: 0, dtype: float32)
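The output above is a pair: a boolean prediction per email and the score behind it. The relationship between the two can be sketched as a simple cut-off. The 0.5 threshold and the `>=` comparison are assumptions. They are consistent with the values shown, but this guide does not state the rule used internally.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Flag an email as phishing when its score reaches the threshold."""
    return [p >= threshold for p in probabilities]


# Scores copied from the prediction output shown above.
scores = [0.632572, 0.451712]
flags = apply_threshold(scores)
print(flags)  # → [True, False]
```

Raising the threshold trades missed phishing emails for fewer false alarms. For example, a hypothetical threshold of 0.7 would flag neither of the two emails.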
This example shows that a BERT-based phishing detector can identify spam emails in these datasets. Users can experiment with other datasets, broaden the coverage of the training data, and change the number of epochs to fine-tune the results on their own datasets. It is also an example of how CLX can be used with HuggingFace and other libraries to create custom solutions.
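When experimenting with other datasets, performance can be analysed by comparing predictions on the held-out test set with the true labels. The following is a minimal plain-Python sketch of the usual metrics. The helper name and the sample labels are hypothetical and are not results from the model trained in this guide.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels 1 (malicious) and 0."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
metrics = binary_metrics([1, 0, 1, 0], [1, 0, 0, 0])
print(metrics)
```

For phishing detection, recall on the malicious class is often the figure to watch, since a missed phishing email is usually costlier than a false alarm.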