# CLX DGA Detection

This is an introduction to CLX DGA Detection.

## What is DGA Detection?

Domain Generation Algorithms (DGAs) generate domain names that malware can use to communicate with its command and control servers. Because IP addresses and static domain names are easy to block, a DGA gives malware a simple way to generate a large number of domain names and rotate through them to circumvent traditional block lists.
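To make the idea concrete, here is a minimal toy sketch of how such an algorithm might derive domains from a shared seed and the current date. It does not correspond to any real malware family; the function name `toy_dga`, the seed string, and the TLD list are all hypothetical and chosen only for illustration.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12) -> List[str]:
    """Generate `count` reproducible pseudo-random domains from a seed and a date."""
    tlds = ["com", "net", "biz", "info"]  # illustrative TLD list (assumption)
    domains = []
    for i in range(count):
        # Hash the seed, the date, and a counter so both sides can recompute the list.
        digest = hashlib.sha256(f"{seed}-{day.isoformat()}-{i}".encode()).hexdigest()
        # Map each hex digit (0-15) to a letter a-p to get a hostname-like label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(f"{label}.{tlds[i % len(tlds)]}")
    return domains


print(toy_dga("example-seed", date(2021, 1, 1)))
```

Because the output depends only on the seed and the date, the malware and its operator can compute the same list independently, while a defender who blocks one day's domains has blocked nothing the next day.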

## When to use CLX DGA Detection?

Use CLX DGA Detection to build your own DGA detection model, which can then be used to predict whether a given domain is malicious. We will use a type of recurrent neural network called the Gated Recurrent Unit (GRU) for this example. The CLX and RAPIDS libraries enable users to train their models with up-to-date domain names representative of both benign and DGA-generated strings. Using a CLX workflow, this capability could also be used in production environments.

For a more advanced, in-depth example of CLX DGA Detection, see this Jupyter notebook.

## How to train a CLX DGA Detection model

To train a CLX DGA Detection model, you need a training data set with a column of domains and a column holding each domain's type, which is either 1 (benign) or 0 (malicious).

[1]:

from clx.analytics.dga_detector import DGADetector

LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2  # Two domain types: benign and DGA-generated

# Create the detector, then initialize its GRU model
dd = DGADetector(lr=LR)
dd.init_model(
    n_layers=N_LAYERS,
    char_vocab=CHAR_VOCAB,
    hidden_size=HIDDEN_SIZE,
    n_domain_type=N_DOMAIN_TYPE,
)
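A character vocabulary of 128 suggests that each character of a domain is treated as an ASCII code point before it reaches the network. The sketch below shows one way such an encoding could look in plain Python. It is an assumption made for illustration, not CLX's actual preprocessing, and `encode_domain` is a hypothetical helper.

```python
CHAR_VOCAB = 128  # same value as in the model configuration


def encode_domain(domain, max_len=None):
    """Turn a domain into a list of ASCII codes, optionally truncated or zero-padded."""
    codes = [ord(ch) for ch in domain.lower()]
    # Every code must be a valid index into a vocabulary of CHAR_VOCAB entries.
    if any(code >= CHAR_VOCAB for code in codes):
        raise ValueError(f"non-ASCII character in domain: {domain!r}")
    if max_len is not None:
        # Truncate long domains and pad short ones with 0 to a fixed length.
        codes = codes[:max_len] + [0] * (max_len - len(codes))
    return codes


print(encode_domain("yahoo.com", max_len=12))
```

Padding to a fixed length is what lets many domains be stacked into one batch; a real pipeline would typically also record the original lengths so the padding can be ignored.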


Next, train your DGA detector. The example below uses a small dataset for demonstration only; in practice you will want a much larger training set.

To develop a more expansive training set, these resources are available:

[2]:

import cudf

train_df = cudf.DataFrame()
train_df["domain"] = [
    "tmall.com",
    "duiwlqeejymdb.com",
    "kofsmyaiufarb.net",
    "xskphhmrlcihr.biz",
    "yahoo.com",
    "wejaecjhycwss.co.uk",
    "xtorhktvpblmr.info",
    "xvljisbfalkts.com",
]
# One label per domain, in the same order: 1 = benign, 0 = malicious (DGA-generated)
train_df["type"] = [1, 0, 0, 0, 1, 0, 0, 0]
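When assembling a larger training set from separate benign and DGA sources, it helps to build the two columns together so that each domain always has exactly one label. The following is a minimal sketch in plain Python; `build_training_columns` is a hypothetical helper, and the label constants simply follow the 1 (benign) / 0 (malicious) convention used in this guide.

```python
BENIGN = 1     # label convention from this guide
MALICIOUS = 0


def build_training_columns(benign_domains, dga_domains):
    """Return parallel (domains, labels) lists of equal length."""
    domains, labels = [], []
    for label, source in ((BENIGN, benign_domains), (MALICIOUS, dga_domains)):
        for raw in source:
            name = raw.strip().lower()
            if name:  # skip blank entries
                domains.append(name)
                labels.append(label)
    return domains, labels


domains, labels = build_training_columns(
    ["tmall.com", "yahoo.com"],
    ["xskphhmrlcihr.biz", "wejaecjhycwss.co.uk"],
)
print(list(zip(domains, labels)))
```

The resulting lists can then be assigned to the `domain` and `type` columns of a `cudf.DataFrame` in the same way as in the cell above.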


Training the model returns the total loss.

[3]:

dd.train_model(train_df['domain'], train_df['type'])

/opt/conda/envs/clx_dev/lib/python3.8/site-packages/cudf/core/column/string.py:3538: FutureWarning: The expand parameter is deprecated and will be removed in a future version. Set expand=False to match future behavior.
warnings.warn(
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.13s/it]


Ideally, you will want to train your model over a number of epochs as detailed in our example DGA Detection notebook.

## Save a trained model checkpoint

[4]:

dd.save_checkpoint("clx_dga_classifier.pth")


Let’s create a new DGA detector and load the saved model from above.

[5]:

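dga_detector = DGADetector(lr=0.001)
# Restore the weights saved in the previous step
dga_detector.load_checkpoint("clx_dga_classifier.pth")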


## DGA Inferencing

Use your new model to predict malicious domains.

[6]:

test_df = cudf.DataFrame()

[6]:

0    1