CLX DGA Detection
This is an introduction to CLX DGA Detection.
What is DGA Detection?
Domain Generation Algorithms (DGAs) generate domain names that malware can use to communicate with its command and control servers. Because IP addresses and static domain names are easy to block, a DGA gives malware a simple way to generate a large number of domain names and rotate through them, circumventing traditional block lists.
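To make the idea concrete, here is a toy sketch of how a DGA might derive domains from a shared seed and the current date. It is an illustration only and does not reproduce any real malware family; the function name `toy_dga`, the seed string, and the TLD list are hypothetical choices made for this example.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12) -> List[str]:
    """Generate `count` pseudo-random domains from a shared seed and a date."""
    tlds = ["com", "net", "biz", "info"]  # hypothetical TLD pool
    domains = []
    for i in range(count):
        # Anyone who knows the seed and the date can recompute this digest,
        # so both sides arrive at the same list without communicating.
        digest = hashlib.sha256(f"{seed}-{day.isoformat()}-{i}".encode()).digest()
        # Map each of the first `length` bytes to a lowercase letter.
        name = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        domains.append(f"{name}.{tlds[digest[-1] % len(tlds)]}")
    return domains


for domain in toy_dga("example-seed", date(2020, 1, 1)):
    print(domain)
```

Because the output depends only on the seed and the date, the list changes every day, which is why a static block list struggles to keep up.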
When to use CLX DGA Detection?
Use CLX DGA Detection to build your own DGA detection model, which can then predict whether a given domain is malicious. This example uses a type of recurrent neural network called the Gated Recurrent Unit (GRU). The CLX and RAPIDS libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production environments.
For a more advanced, in-depth example of CLX DGA Detection, view this Jupyter notebook.
How to train a CLX DGA Detection model
To train a CLX DGA Detection model, you need a training data set that contains a column of domains and their associated type, which can be either 1 (benign) or 0 (DGA-generated).
First initialize your new model
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2  # Will be 2 since there are a total of 2 different types

from clx.analytics.dga_detector import DGADetector
from clx.analytics.detector_dataset import DetectorDataset

dd = DGADetector(lr=LR)
dd.init_model(
    n_layers=N_LAYERS,
    char_vocab=CHAR_VOCAB,
    hidden_size=HIDDEN_SIZE,
    n_domain_type=N_DOMAIN_TYPE,
)
Next, train your DGA detector. The example below uses a small dataset for demonstration only; in practice you will want a much larger training set.
To develop a more expansive training set, these resources are available:
import cudf

train_df = cudf.DataFrame()
train_df["domain"] = [
    "google.com",
    "youtube.com",
    "tmall.com",
    "duiwlqeejymdb.com",
    "kofsmyaiufarb.net",
    "xskphhmrlcihr.biz",
    "yahoo.com",
    "linkedin.com",
    "twitter.com",
    "wejaecjhycwss.co.uk",
    "xtorhktvpblmr.info",
    "xvljisbfalkts.com",
]
train_df["type"] = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# DetectorDataset converts domains from string to ascii and creates
# partitioned dataframes based on given batch size
train_df = DetectorDataset(train_df, 6)
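The comment above says that DetectorDataset converts domains from strings to ASCII and partitions the data by batch size. The following CPU-only sketch illustrates that general idea in plain Python. It is not CLX's implementation: the helper names `encode_domain` and `make_batches`, the zero padding, and the per-batch maximum length are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def encode_domain(domain: str, max_len: int) -> List[int]:
    """Convert a domain to ASCII codes, right-padded with zeros to max_len."""
    codes = [ord(c) for c in domain]
    if any(code >= 128 for code in codes):
        # A character vocabulary of 128 only covers the ASCII range.
        raise ValueError(f"non-ASCII character in {domain!r}")
    return codes + [0] * (max_len - len(codes))


def make_batches(
    domains: List[str], labels: List[int], batch_size: int
) -> Iterator[Tuple[List[List[int]], List[int]]]:
    """Yield (encoded_domains, labels) batches of at most batch_size rows."""
    for start in range(0, len(domains), batch_size):
        chunk = domains[start:start + batch_size]
        # Assumption: pad each batch to the length of its longest domain.
        max_len = max(len(d) for d in chunk)
        encoded = [encode_domain(d, max_len) for d in chunk]
        yield encoded, labels[start:start + batch_size]


domains = ["google.com", "duiwlqeejymdb.com", "yahoo.com", "xtorhktvpblmr.info"]
labels = [1, 0, 1, 0]
batches = list(make_batches(domains, labels, batch_size=2))
print(len(batches), "batches")
```

Each domain becomes a fixed-length row of integers below 128, which is the kind of input a character-level model with CHAR_VOCAB = 128 could consume.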
When we train a model, the total loss is returned
Ideally, you will want to train your model over a number of epochs, as detailed in our example DGA Detection notebook.
Save a trained model
Load a model
Let’s create a new DGA detector and load the saved model from above.
dga_detector = DGADetector(lr=0.001)
dga_detector.load_model("clx_dga_classifier.pth")
Use your new model to predict malicious domains
test_df = cudf.DataFrame()
test_df["domain"] = ["facebook.com", "ylqblbltqkynb.net"]
dga_detector.predict(test_df["domain"])
0    1
1    1
Name: is_dga, dtype: int64
The DGA detector in CLX lets users train their own detection models and also use existing models. This capability could also be used in conjunction with log parsing efforts if the logs contain domain names. DGA detection done with CLX and RAPIDS keeps data in GPU memory, removing unnecessary copies and conversions and providing a 4x speed advantage over CPU-only implementations. This is especially true with large batch sizes.
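As a rough illustration of the log parsing step mentioned above, the sketch below pulls candidate domain names out of raw log lines using only the Python standard library. The log format, the `extract_domains` helper, and the regular expression are hypothetical and deliberately simple; a real pipeline would need a pattern matched to its own log schema. The resulting list could then be loaded into a cuDF column and passed to a trained detector.

```python
import re
from typing import List

# Simplified hostname pattern: dot-separated labels ending in an alphabetic TLD.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_domains(log_lines: List[str]) -> List[str]:
    """Return unique lowercase domains in order of first appearance."""
    seen = []
    for line in log_lines:
        for match in DOMAIN_RE.findall(line):
            domain = match.lower()
            if domain not in seen:
                seen.append(domain)
    return seen


# Hypothetical DNS-style log lines, for illustration only.
log_lines = [
    "2020-01-01T00:00:01 client=10.0.0.5 query=google.com type=A",
    "2020-01-01T00:00:02 client=10.0.0.7 query=xskphhmrlcihr.biz type=A",
    "2020-01-01T00:00:03 client=10.0.0.5 query=GOOGLE.COM type=AAAA",
]
print(extract_domains(log_lines))
```

IP addresses such as 10.0.0.5 are skipped because the pattern requires the final label to be alphabetic.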