CLX Asset Classification (Supervised)
This is an introduction to CLX Asset Classification.
Introduction
In this example, we show how to use CLX to perform asset classification on a small, randomly generated dataset using cuDF and cuML. This work could be extended by using different log types (e.g., Windows Events) or different events from the machines as features to improve accuracy. Various labels can be selected to cover different types of machines or data centers.
Train Asset Classification model
First initialize your new model
[1]:
from clx.analytics.asset_classification import AssetClassification
ac = AssetClassification()
Next, train your asset classification model. The example below uses a small sample dataset for demonstration purposes only; ideally you will want a much larger training set.
[2]:
import random
import cudf
train_gdf = cudf.DataFrame()
train_gdf["eventcode"] = [0, 14, 14, 14, 14, 14, 14, 14, 9, 14, 9, 3, 3, 20, 3, 20, 9, 20, 20, 3]
train_gdf["keywords"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["privileges"] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_gdf["message"] = [15, 7, 7, 7, 7, 7, 7, 7, 6, 7, 6, 8, 8, 24, 8, 24, 6, 24, 24, 8]
train_gdf["sourcename"] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
train_gdf["taskcategory"] = [4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 5, 7, 7, 7]
train_gdf["account_for_which_logon_failed_account_domain"] = [22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22]
train_gdf["detailed_authentication_information_authentication_package"] = [0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
train_gdf["detailed_authentication_information_key_length"] = [0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
train_gdf["detailed_authentication_information_logon_process"] = [5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["detailed_authentication_information_package_name_ntlm_only"] = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["logon_type"] = [1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["network_information_workstation_name"] = [932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932, 932]
train_gdf["new_logon_security_id"] = [38, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]
train_gdf["impersonation_level"] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
train_gdf["network_information_protocol"] = [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
train_gdf["network_information_direction"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["filter_information_layer_name"] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
train_gdf["cont1"] = [-1.7320297079999998, -1.731987508, -1.7319453079999998, -1.7319031079999998, -1.7318609079999998, -1.731818709, -1.731776509, -1.7317343089999997, -1.731692109, -1.731649909, -1.731607709, -1.731565509, -1.7315233099999996, -1.7314811099999998, -1.7314389099999998, -1.7313967099999996, -1.7313545099999998, -1.7313123099999999, -1.7312701099999999, -1.731227911]
train_gdf["label"] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
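The integer values above stand in for label-encoded categorical features. As a sketch of how raw log values could be turned into such codes, the snippet below uses `pandas.factorize` on a hypothetical column of Windows Event logon types; cuDF mirrors much of the pandas API, so the same idea applies on GPU.

```python
import pandas as pd  # CPU stand-in; cuDF offers a similar factorize API

# Hypothetical raw values for one categorical column
raw = pd.Series(["Logoff", "Logon", "Logon", "Special Logon", "Logon"])

# factorize assigns an integer code to each distinct category,
# producing integer columns like the ones built above
codes, categories = pd.factorize(raw)
print(codes.tolist())    # [0, 1, 1, 2, 1]
print(list(categories))  # ['Logoff', 'Logon', 'Special Logon']
```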
Split the dataset into training and test sets
[3]:
from cuml import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(train_gdf, 'label', train_size=0.8)
X_train["label"] = Y_train
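cuML's `train_test_split` accepts the label column name directly and returns GPU DataFrames. As a CPU sketch of the same 80/20 random split (with a hypothetical toy frame, using pandas in place of cuDF):

```python
import pandas as pd  # CPU sketch; cuDF/cuML provide the GPU equivalents used above

df = pd.DataFrame({"eventcode": range(20), "label": [1] + [0] * 19})

# An 80/20 random split, analogous to train_test_split(..., train_size=0.8)
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

print(len(train), len(test))  # 16 4
```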
Initialize variables
Categorical and Continuous feature columns
Batchsize
Number of epochs
[4]:
cat_cols = ["eventcode", "keywords", "privileges", "message", "sourcename", "taskcategory", "account_for_which_logon_failed_account_domain", "detailed_authentication_information_authentication_package", "detailed_authentication_information_key_length", "detailed_authentication_information_logon_process", "detailed_authentication_information_package_name_ntlm_only", "logon_type", "network_information_workstation_name", "new_logon_security_id", "impersonation_level", "network_information_protocol", "network_information_direction", "filter_information_layer_name"]
cont_cols = ["cont1"]
batch_size = 1000
epochs = 2
[5]:
ac.train_model(X_train, cat_cols, cont_cols, "label", batch_size, epochs, lr=0.01, wd=0.0)
training loss: 0.7893539071083069
valid loss 0.593 and accuracy 1.000
training loss: 0.6023843884468079
valid loss 0.446 and accuracy 1.000
Ideally, you will want to train your model over a number of epochs, as detailed in our example Asset Classification notebook.
Save a trained model
[6]:
ac.save_model("clx_asset_classifier.pth")
Let's create a new asset classifier instance and load the saved model.
[7]:
asset_classifier = AssetClassification()
asset_classifier.load_model('clx_asset_classifier.pth')
Asset Classification Inference
Use your new model for prediction
[8]:
pred_results = asset_classifier.predict(X_test, cat_cols, cont_cols)
true_results = Y_test
true_results
[8]:
17 0
6 0
9 0
18 0
Name: label, dtype: int64
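With predictions and true labels in hand, accuracy is simply the fraction of matching entries. A minimal sketch in plain Python, using hypothetical label lists standing in for the cuDF Series above:

```python
# Hypothetical predicted and true labels, standing in for pred_results and Y_test
pred_results = [0, 0, 0, 1]
true_results = [0, 0, 0, 0]

# Fraction of matching predictions; with cuDF Series you could instead
# write (pred_results == true_results).mean()
accuracy = sum(p == t for p, t in zip(pred_results, true_results)) / len(true_results)
print(accuracy)  # 0.75
```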