CLX Asset Classification (Supervised)

This is an introduction to CLX Asset Classification.

Introduction

In this example, we show how to use CLX to perform asset classification on a randomly generated dataset using cudf and cuml. This work could be extended by using different log types (e.g., Windows Events) or different events from the machines as features to improve accuracy. Labels can be chosen to cover different types of machines or data centres.

Train Asset Classification model

First, initialize your new model

[1]:
from clx.analytics.asset_classification import AssetClassification

ac = AssetClassification()

Next, train your asset classification model. The example below uses a small sample dataset for demonstration only. In practice you will want a larger training set.

[2]:
import cudf

train_gdf = cudf.DataFrame()
train_gdf["eventcode"] = [0, 14, 14, 14, 14, 14, 14, 14, 9, 14, 9, 3, 3, 20, 3, 20, 9, 20, 20, 3]
train_gdf["keywords"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["privileges"] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_gdf["message"] = [15, 7, 7, 7, 7, 7, 7, 7, 6, 7, 6, 8, 8, 24, 8, 24, 6, 24, 24, 8]
train_gdf["sourcename"] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_gdf["taskcategory"] = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 5, 7, 7, 7]
train_gdf["account_for_which_logon_failed_account_domain"] = [22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
train_gdf["detailed_authentication_information_authentication_package"] = [0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
train_gdf["detailed_authentication_information_key_length"] = [0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
train_gdf["detailed_authentication_information_logon_process"] = [5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["detailed_authentication_information_package_name_ntlm_only"] = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["logon_type"] = [1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["network_information_workstation_name"] = [932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932]
train_gdf["new_logon_security_id"] = [38, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]
train_gdf["impersonation_level"] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
train_gdf["network_information_protocol"] = [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["network_information_direction"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["filter_information_layer_name"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["cont1"] = [-1.7320297079999998, -1.731987508, -1.7319453079999998, -1.7319031079999998, -1.7318609079999998, -1.731818709, -1.731776509, -1.7317343089999997, -1.731692109, -1.731649909, -1.731607709, -1.731565509, -1.7315233099999996, -1.7314811099999998, -1.7314389099999998, -1.7313967099999996, -1.7313545099999998, -1.7313123099999999, -1.7312701099999999, -1.731227911]
train_gdf["label"] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
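The categorical columns above already hold integer codes. With real log data, string fields such as a logon type would first have to be mapped to integers. The following is a minimal sketch of one way to do that in plain Python. The `encode_column` helper and the sample values are hypothetical; they are not part of CLX and do not describe how this dataset was produced.

```python
def encode_column(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # Assign the next unused integer to a value seen for the first time.
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical raw values for a logon-type style column.
raw_logon_type = ["Interactive", "Network", "Network", "Service", "Network"]
encoded, mapping = encode_column(raw_logon_type)
print(encoded)
print(mapping)
```

Whatever encoding is used, the same value-to-code mapping would need to be reused at prediction time so that a given code always means the same thing.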

Split the dataset into training and test sets

[3]:
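from cuml import train_test_split

# Hold out 20% of the rows for testing; 'label' names the target column.
X_train, X_test, Y_train, Y_test = train_test_split(train_gdf, 'label', train_size=0.8)
# train_model reads the label from the training frame, so add it back.
X_train["label"] = Y_train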
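To illustrate what an 80/20 split does to a dataset this small, here is a plain-Python sketch of a shuffled split applied to the label column alone. It is not cuml's implementation; the `split_rows` helper and the fixed seed are assumptions made for the example.

```python
import random


def split_rows(rows, train_size=0.8, seed=0):
    """Shuffle row indices and cut them into a train part and a test part."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Same label distribution as the sample data: one positive, nineteen negatives.
labels = [1] + [0] * 19
train_labels, test_labels = split_rows(labels, train_size=0.8)

print(len(train_labels), len(test_labels))
print("positives in train:", sum(train_labels))
print("positives in test:", sum(test_labels))
```

With only one positive example, it can land in either part. If it falls in the test part, the model sees no positive rows during training, which is one reason a larger and more balanced dataset is preferable.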

Initialize variables

  • Categorical and continuous feature columns

  • Batch size

  • Number of epochs

[4]:
cat_cols = ["eventcode", "keywords", "privileges", "message", "sourcename", "taskcategory", "account_for_which_logon_failed_account_domain", "detailed_authentication_information_authentication_package", "detailed_authentication_information_key_length", "detailed_authentication_information_logon_process", "detailed_authentication_information_package_name_ntlm_only", "logon_type", "network_information_workstation_name", "new_logon_security_id", "impersonation_level", "network_information_protocol", "network_information_direction", "filter_information_layer_name"]
cont_cols = ["cont1"]
batch_size = 1000
epochs = 2
[5]:
ac.train_model(X_train, cat_cols, cont_cols, "label", batch_size, epochs, lr=0.01, wd=0.0)
training loss:  0.7478979229927063
valid loss 0.588 and accuracy 1.000
training loss:  0.5800438523292542
valid loss 0.445 and accuracy 1.000
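The relationship between batch size, epochs, and the number of batches processed can be worked out with simple arithmetic. The sketch below uses standard mini-batch reasoning and is not taken from the CLX source. It assumes all 16 rows of the 80% split are used for training; the actual count may be lower if `train_model` holds rows out for validation, as the "valid loss" output suggests.

```python
import math


def batches_per_epoch(n_rows, batch_size):
    """Number of mini-batches needed to cover n_rows once."""
    return math.ceil(n_rows / batch_size)


n_rows = 16        # assumed: 80% of the 20 sample rows
batch_size = 1000  # value used in this example
epochs = 2         # value used in this example

per_epoch = batches_per_epoch(n_rows, batch_size)
total = per_epoch * epochs

print("batches per epoch:", per_epoch)
print("total batches:", total)
```

Because the batch size is far larger than the dataset, each epoch amounts to a single batch here. On a realistically sized dataset the same batch size would produce many batches per epoch.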

Ideally, you will want to train your model over a number of epochs as detailed in our example Asset Classification notebook.

Save a trained model

[6]:
ac.save_model("clx_asset_classifier.pth")

Let’s create a new asset classifier instance and load the saved model.

[7]:
asset_classifier = AssetClassification()
asset_classifier.load_model('clx_asset_classifier.pth')

Asset Classification Inference

Use the loaded model for prediction

[8]:
pred_results = asset_classifier.predict(X_test, cat_cols, cont_cols).to_array()
true_results = Y_test.to_array()
true_results
[8]:
array([0, 0, 0, 0])
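Once predictions and true labels are both available as arrays, they can be compared. The following plain-Python sketch computes accuracy and counts how many positive rows were found. The helper names are hypothetical, and the prediction values are placeholders for illustration only, not output from the model above.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty result set")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def positives_found(y_true, y_pred, positive=1):
    """Return (correctly predicted positives, actual positives)."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return found, actual


true_results = [0, 0, 0, 0]   # the test labels shown above
example_preds = [0, 0, 0, 0]  # placeholder predictions, for illustration only

print(accuracy(true_results, example_preds))
print(positives_found(true_results, example_preds))
```

Since every test label here is 0, accuracy alone says nothing about whether the model can detect the positive class. The second helper makes that visible by reporting that there were no positive rows to find.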