API Reference

IP

clx.ip.hostmask(ips, prefixlen=16)

Compute a column of hostmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

hostmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.hostmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    0.0.255.255
1    0.0.255.255
Name: hostmask, dtype: object
clx.ip.int_to_ip(values)

Convert integer column to IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – Integers to be converted

Returns

IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.int_to_ip(cudf.Series([3232235521, 167772161]))
0    192.168.0.1
1       10.0.0.1
dtype: object
clx.ip.ip_to_int(values)

Convert string column of IP addresses to integer values. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – IP addresses to be converted

Returns

Integer representations of IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.ip_to_int(cudf.Series(["192.168.0.1","10.0.0.1"]))
0    3232235521
1     167772161
dtype: int64
clx.ip.is_global(ips)

Indicates whether each address is global. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_global(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    False
1    True
dtype: bool
clx.ip.is_ip(ips)

Indicates whether each value is a valid IP address string. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_ip(cudf.Series(["192.168.0.1","10.123.0"]))
0     True
1    False
dtype: bool
clx.ip.is_link_local(ips)

Indicates whether each address is link local. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_link_local(cudf.Series(["127.0.0.1","169.254.123.123"]))
0    False
1    True
dtype: bool
clx.ip.is_loopback(ips)

Indicates whether each address is loopback. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_loopback(cudf.Series(["127.0.0.1","10.0.0.1"]))
0     True
1    False
dtype: bool
clx.ip.is_multicast(ips)

Indicates whether each address is multicast. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_multicast(cudf.Series(["127.0.0.1","224.0.0.0"]))
0    False
1    True
dtype: bool
clx.ip.is_private(ips)

Indicates whether each address is private. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_private(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    True
1    False
dtype: bool
clx.ip.is_reserved(ips)

Indicates whether each address is reserved. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_reserved(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
clx.ip.is_unspecified(ips)

Indicates whether each address is unspecified. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_unspecified(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
clx.ip.mask(ips, masks)

Apply a mask to a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • masks (cudf.Series) – The host or subnet masks to be applied

Returns

masked IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> input_ips = cudf.Series(["192.168.0.1","10.0.0.1"])
>>> input_masks = cudf.Series(["255.255.0.0", "255.255.0.0"])
>>> clx.ip.mask(input_ips, input_masks)
0    192.168.0.0
1       10.0.0.0
Name: mask, dtype: object
clx.ip.netmask(ips, prefixlen=16)

Compute a column of netmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

netmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.netmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    255.255.0.0
1    255.255.0.0
Name: net_mask, dtype: object

Features

clx.features.binary(dataframe, entity_id, feature_id)

Create binary feature dataframe using provided dataset, entity, and feature.

Parameters
  • dataframe (cudf.DataFrame) – Input dataset

  • entity_id (str) – Entity column name

  • feature_id (str) – Feature column name

Returns

dataframe

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> import clx.features
>>> df = cudf.DataFrame(
...     {
...         "time": [1, 2, 3],
...         "user": ["u1", "u2", "u1"],
...         "computer": ["c1", "c1", "c3"],
...     }
... )
>>> output = clx.features.binary(df, "user", "computer")
>>> output
      c1   c3
user
u1   1.0  1.0
u2   1.0  0.0
clx.features.frequency(dataframe, entity_id, feature_id)

Create frequency feature dataframe using provided dataset, entity, and feature.

Parameters
  • dataframe (cudf.DataFrame) – Input dataset

  • entity_id (str) – Entity column name

  • feature_id (str) – Feature column name

Returns

dataframe

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> import clx.features
>>> df = cudf.DataFrame(
...     {
...         "time": [1, 2, 3],
...         "user": ["u1", "u2", "u1"],
...         "computer": ["c1", "c1", "c3"],
...     }
... )
>>> output = clx.features.frequency(df, "user", "computer")
>>> output
      c1   c3
user
u1   0.5  0.5
u2   1.0  0.0

Analytics

class clx.analytics.asset_classification.AssetClassification(layers=[200, 100], drops=[0.001, 0.01], emb_drop=0.04, is_reg=False, is_multi=True, use_bn=True)

Supervised asset classification on tabular data containing categorical and/or continuous features.

Parameters
  • layers – sizes of the linear layers that follow the input layer

  • drops – dropout percentage for each linear layer

  • emb_drop – dropout percentage at the embedding layers

  • is_reg – whether the model is a regression model

  • is_multi – whether the model is a classification model

  • use_bn – whether to use batch normalization

Methods

load_model(fname)

Load a saved model.

predict(gdf, cat_cols, cont_cols)

Predict the class with the trained model

save_model(fname)

Save trained model

train_model(train_gdf, cat_cols, cont_cols, …)

This function is used for training fastai tabular model with a given training dataset.

load_model(fname)

Load a saved model.

Parameters

fname (str) – directory path to model

Examples

>>> from clx.analytics.asset_classification import AssetClassification
>>> ac = AssetClassification()
>>> ac.load_model("ac.mdl")
predict(gdf, cat_cols, cont_cols)

Predict the class with the trained model

Parameters
  • gdf (cudf.DataFrame) – prediction input dataset with categorized int16 feature columns

  • cat_cols – array of categorical column names in gdf

  • cont_cols – array of continuous column names in gdf

Examples

>>> cat_cols = ["1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> cont_cols = ["10"]
>>> ac.predict(X_test, cat_cols, cont_cols).to_array()
0       0
1       0
2       0
3       0
4       2
    ..
8204    0
8205    4
8206    0
8207    3
8208    0
Length: 8209, dtype: int64
save_model(fname)

Save trained model

Parameters

fname (str) – directory path to save model

Examples

>>> from clx.analytics.asset_classification import AssetClassification
>>> ac = AssetClassification()
>>> cat_cols = ["1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> cont_cols = ["10"]
>>> ac.train_model(X_train, cat_cols, cont_cols, "label", batch_size, epochs, lr=0.01, wd=0.0)
>>> ac.save_model("ac.mdl")
train_model(train_gdf, cat_cols, cont_cols, label_col, batch_size, epochs, lr=0.01, wd=0.0)

This function is used for training fastai tabular model with a given training dataset.

Parameters
  • train_gdf (cudf.DataFrame) – training dataset with categorized and/or continuous feature columns

  • cat_cols – array of categorical column names in train_gdf

  • cont_cols – array of continuous column names in train_gdf

  • label_col (str) – column name of label column in train_gdf

  • batch_size (int) – train_gdf will be partitioned into multiple dataframes of this size

  • epochs (int) – number of epochs to be adjusted depending on convergence for a specific dataset

  • lr (float) – learning rate

  • wd (float) – weight decay

Examples

>>> from clx.analytics.asset_classification import AssetClassification
>>> ac = AssetClassification()
>>> cat_cols = ["1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> cont_cols = ["10"]
>>> ac.train_model(X_train, cat_cols, cont_cols, "label", batch_size, epochs, lr=0.01, wd=0.0)
class clx.analytics.cybert.Cybert

Cyber log parsing using BERT, DistilBERT, or ELECTRA. This class provides methods for loading models, prediction, and postprocessing.

Methods

inference(raw_data_col[, batch_size])

Cybert inference and postprocessing on dataset.

load_model(model_filepath, config_filepath)

Load cybert model.

preprocess(raw_data_col[, stride_len, …])

Preprocess and tokenize data for cybert model inference.

inference(raw_data_col, batch_size=160)

Cybert inference and postprocessing on dataset.

Parameters
  • raw_data_col (cudf.Series) – logs to be processed

  • batch_size (int) – Log data is processed in batches using a PyTorch dataloader. The batch size parameter refers to the batch size indicated in torch.utils.data.DataLoader.

Returns

parsed_df, confidence_df

Return type

pandas.DataFrame, pandas.DataFrame

Examples

>>> import cudf
>>> from clx.analytics.cybert import Cybert
>>> cyparse = Cybert()
>>> cyparse.load_model('/path/to/model.pth', '/path/to/config.json')
>>> raw_data_col = cudf.Series(['Log event 1', 'Log event 2'])
>>> parsed_df, confidence_df = cyparse.inference(raw_data_col)
load_model(model_filepath, config_filepath)

Load cybert model.

Parameters
  • model_filepath (str) – Filepath of the model (.pth or .bin) to be loaded

  • config_filepath (str) – Config file (.json) to be used

Examples

>>> from clx.analytics.cybert import Cybert
>>> cyparse = Cybert()
>>> cyparse.load_model('/path/to/model.bin', '/path/to/config.json')
preprocess(raw_data_col, stride_len=116, max_seq_len=128)

Preprocess and tokenize data for cybert model inference.

Parameters
  • raw_data_col (cudf.Series) – logs to be processed

  • stride_len (int) – Max stride length for processing, default is 116

  • max_seq_len (int) – Max sequence length for processing, default is 128

Examples

>>> import cudf
>>> from clx.analytics.cybert import Cybert
>>> cyparse = Cybert()
>>> cyparse.load_model('/path/to/model.pth', '/path/to/config.json')
>>> raw_df = cudf.Series(['Log event 1', 'Log event 2'])
>>> input_ids, attention_masks, meta_data = cyparse.preprocess(raw_df)
class clx.analytics.detector.Detector(lr=0.001)
Attributes
criterion
model
optimizer

Methods

leverage_model(model)

This function leverages the model by setting parallelism parameters.

init_model

load_model

predict

save_model

train_model

leverage_model(model)

This function leverages the model by setting parallelism parameters.

Parameters

model (RNNClassifier) – Model instance.

class clx.analytics.dga_dataset.DGADataset(df)
class clx.analytics.dga_detector.DGADetector(lr=0.001)

This class provides multiple functionalities, such as building, training, and evaluating the RNNClassifier model, to distinguish legitimate and DGA domain names.

Methods

evaluate_model(dataloader)

This function evaluates the trained model to verify its accuracy.

init_model([char_vocab, hidden_size, …])

This function instantiates RNNClassifier model to train.

load_model(file_path)

This function loads an already saved model and sets CUDA parameters.

predict(domains[, probability])

This function accepts a cudf series of domains and classifies each domain name as benign or malicious, returning the learned label for each domain as a cudf series.

save_model(file_path)

This function saves model to given location.

train_model(train_data, labels[, …])

This function is used for training RNNClassifier model with a given training dataset.

evaluate_model(dataloader)

This function evaluates the trained model to verify its accuracy.

Parameters

dataloader (DataLoader) – Instance holds preprocessed data.

Returns

Model accuracy

Return type

decimal

Examples

>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.evaluate_model(dataloader)
Evaluating trained model ...
Test set: Accuracy: 3/4 (0.75)
init_model(char_vocab=128, hidden_size=100, n_domain_type=2, n_layers=3)

This function instantiates the RNNClassifier model to train. It also optimizes the model to scale and run in parallel.

Parameters
  • char_vocab (int) – Vocabulary size is set to 128 ASCII characters.

  • hidden_size (int) – Hidden size of the network.

  • n_domain_type (int) – Number of domain types.

  • n_layers (int) – Number of network layers.

load_model(file_path)

This function loads an already saved model and sets CUDA parameters.

Parameters

file_path (string) – File path of the model to be loaded.

predict(domains, probability=False)

This function accepts a cudf series of domains and classifies each domain name as benign or malicious, returning the learned label for each domain as a cudf series.

Parameters

domains (cudf.Series) – List of domains.

Returns

Predicted results with respect to given domains.

Return type

cudf.Series

Examples

>>> dd.predict(['nvidia.com', 'dgadomain'])
0    0.010
1    0.924
Name: dga_probability, dtype: decimal

save_model(file_path)

This function saves model to given location.

Parameters

file_path (string) – File path to save model.
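
A hedged save/load round trip (the file path is illustrative; dd is the DGADetector instance from the examples above):

>>> dd.save_model("clx_dga.mdl")
>>> dd2 = DGADetector()
>>> dd2.load_model("clx_dga.mdl")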

train_model(train_data, labels, batch_size=1000, epochs=5, train_size=0.7)

This function is used for training RNNClassifier model with a given training dataset. It returns total loss to determine model prediction accuracy.

Parameters
  • train_data (cudf.Series) – Training data

  • labels (cudf.Series) – labels data

  • batch_size (int) – batch size

  • epochs (int) – Number of epochs for training

  • train_size (float) – Training size for splitting training and test data

Examples

>>> from clx.analytics.dga_detector import DGADetector
>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.train_model(train_data, labels)
1.5728906989097595
class clx.eda.EDA(dataframe)
Attributes
analysis
dataframe

Methods

cuxfilter_dashboard()

Create cuxfilter dashboard for Exploratory Data Analysis.

save_analysis(dirpath)

Save analysis output to directory path.

cuxfilter_dashboard()

Create cuxfilter dashboard for Exploratory Data Analysis.

save_analysis(dirpath)

Save analysis output to directory path.

Parameters

dirpath (str) – Directory path to save analysis output.
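
A minimal usage sketch (the dataframe contents and output directory are illustrative, not taken from the library's documentation):

>>> import cudf
>>> from clx.eda import EDA
>>> df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})
>>> eda = EDA(df)
>>> eda.save_analysis("./eda_output")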

class clx.analytics.loda.Loda(n_bins=None, n_random_cuts=100)

Anomaly detection using Lightweight Online Detector of Anomalies (LODA). LODA detects anomalies in a dataset by computing the likelihood of data points using an ensemble of one-dimensional histograms.

Parameters
  • n_bins (int) – Number of bins for each histogram. If None a heuristic is used to compute the number of bins.

  • n_random_cuts (int) – Number of random projections to use.

Methods

explain(anomaly[, scaled])

Explain anomaly based on contributions (t-scores) of each feature across histograms.

fit(train_data)

Fit training data and construct histograms.

score(input_data)

Calculate anomaly scores using negative likelihood across n_random_cuts histograms.

explain(anomaly, scaled=True)

Explain anomaly based on contributions (t-scores) of each feature across histograms.

Parameters
  • anomaly (cupy.ndarray) – selected anomaly from input dataset

  • scaled (boolean) – set to scale output feature importance scores

Examples

>>> loda_ad.explain(x[5]) # x[5] is found anomaly
array([[1.        ],
[0.        ],
[0.69850349],
[0.91081035],
[0.78774349]])
fit(train_data)

Fit training data and construct histograms.

Parameters

train_data (cupy.ndarray) – NxD training sample

Examples

>>> from clx.analytics.loda import Loda
>>> import cupy as cp
>>> x = cp.random.randn(100,5) # 5-D multivariate synthetic dataset
>>> loda_ad = Loda(n_bins=None, n_random_cuts=100)
>>> loda_ad.fit(x)
score(input_data)

Calculate anomaly scores using negative likelihood across n_random_cuts histograms.

Parameters

input_data (cupy.ndarray) – NxD training sample

Examples

>>> from clx.analytics.loda import Loda
>>> import cupy as cp
>>> x = cp.random.randn(100,5) # 5-D multivariate synthetic dataset
>>> loda_ad = Loda(n_bins=None, n_random_cuts=100)
>>> loda_ad.fit(x)
>>> loda_ad.score(x)
array([0.04295848, 0.02853553, 0.04587308, 0.03750692, 0.05050418,
0.02671958, 0.03538646, 0.05606504, 0.03418612, 0.04040502,
0.03542846, 0.02801463, 0.04884918, 0.02943411, 0.02741364,
0.02702433, 0.03064191, 0.02575712, 0.03957355, 0.02729784,
...
0.03943715, 0.02701243, 0.02880341, 0.04086408, 0.04365477])
class clx.analytics.phishing_detector.PhishingDetector

Phishing detection using BERT. This class provides methods for training/loading BERT models, evaluation and prediction.

DEPRECATED: The phishing detection module will be removed in 0.19. Please use equivalent clx.analytics.sequence_classifier.

Methods

evaluate_model(emails, labels[, …])

Evaluate trained BERT model

init_model([model_or_path])

Load a pretrained BERT model.

predict(emails[, max_seq_len, threshold])

Predict the class with the trained model

save_model([save_to_path])

Save trained model

train_model(emails, labels[, learning_rate, …])

Train the classifier

evaluate_model(emails, labels, max_seq_len=128, batch_size=32)

Evaluate trained BERT model

Parameters
  • emails (cudf.DataFrame) – dataframe where each row contains one column holding email text

  • labels (cudf.Series) – series holding labels for each row in email dataframe

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.evaluate_model(emails_test, labels_test)
init_model(model_or_path='bert-base-uncased')

Load a pretrained BERT model. Default is bert-base-uncased.

Parameters

model_or_path (str) – directory path to model, default is bert-base-uncased

Examples

>>> from clx.analytics.phishing_detector import PhishingDetector
>>> phish_detect = PhishingDetector()
>>> phish_detect.init_model()  # bert-base-uncased
>>> phish_detect.init_model(model_path)
predict(emails, max_seq_len=128, threshold=0.5)

Predict the class with the trained model

Parameters
  • emails (cudf.Series) – series where each element is text from single email

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • threshold (float) – results with probabilities higher than this will be labeled as positive

Returns

predictions: predicted labels (False or True) for each email

Return type

cudf.Series

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
>>> predictions = phish_detect.predict(new_emails, threshold=0.8)
save_model(save_to_path='.')

Save trained model

Parameters

save_to_path (str) – directory path to save model, default is current directory

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
>>> phish_detect.save_model()
train_model(emails, labels, learning_rate=3e-05, max_seq_len=128, batch_size=32, epochs=5)

Train the classifier

Parameters
  • emails (cudf.DataFrame) – dataframe where each row contains one column holding email text

  • labels (cudf.Series) – series holding labels for each row in email dataframe

  • learning_rate (float) – learning rate

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

  • epochs (int) – number of training epochs, default is 5

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
class clx.analytics.model.rnn_classifier.RNNClassifier(input_size, hidden_size, output_size, n_layers, bidirectional=True)

Methods

forward(input, seq_lengths)

Defines the computation performed at every call.

forward(input, seq_lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class clx.analytics.model.tabular_model.TabularModel(emb_szs, n_cont, out_sz, layers, drops, emb_drop, use_bn, is_reg, is_multi)

Basic model for tabular data

Methods

forward(x_cat, x_cont)

Defines the computation performed at every call.

forward(x_cat, x_cont)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class clx.analytics.sequence_classifier.SequenceClassifier

Sequence Classifier using BERT. This class provides methods for training/loading BERT models, evaluation and prediction.

Methods

evaluate_model(test_data, labels[, …])

Evaluate trained model

init_model(model_or_path)

Load model from huggingface or locally saved model.

predict(input_data[, max_seq_len, …])

Predict the class with the trained model

save_model([save_to_path])

Save trained model

train_model(train_data, labels[, …])

Train the classifier

evaluate_model(test_data, labels, max_seq_len=128, batch_size=32)

Evaluate trained model

Parameters
  • test_data (cudf.Series) – test data to evaluate model

  • labels (cudf.Series) – labels for each element in test_data

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> sc.evaluate_model(emails_test, labels_test)
init_model(model_or_path)

Load model from huggingface or locally saved model.

Parameters

model_or_path (str) – huggingface pretrained model name or directory path to model

Examples

>>> from clx.analytics.sequence_classifier import SequenceClassifier
>>> sc = SequenceClassifier()
>>> sc.init_model("bert-base-uncased")  # huggingface pre-trained model
>>> sc.init_model(model_path) # locally saved model
predict(input_data, max_seq_len=128, batch_size=32, threshold=0.5)

Predict the class with the trained model

Parameters
  • input_data (cudf.Series) – input text data for prediction

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

  • threshold (float) – results with probabilities higher than this will be labeled as positive

Returns

predictions, probabilities: predictions are labels (0 or 1) based on minimum threshold

Return type

cudf.Series, cudf.Series

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> sc.train_model(emails_train, labels_train)
>>> predictions = sc.predict(emails_test, threshold=0.8)
save_model(save_to_path='.')

Save trained model

Parameters

save_to_path (str) – directory path to save model, default is current directory

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> sc.train_model(emails_train, labels_train)
>>> sc.save_model()
train_model(train_data, labels, learning_rate=3e-05, max_seq_len=128, batch_size=32, epochs=5)

Train the classifier

Parameters
  • train_data (cudf.Series) – text data for training

  • labels (cudf.Series) – labels for each element in train_data

  • learning_rate (float) – learning rate

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

  • epochs (int) – number of training epochs, default is 5

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> sc.train_model(emails_train, labels_train)
clx.analytics.stats.rzscore(series, window)

Calculates rolling z-score

Parameters
  • series (cudf.Series) – Series for which to calculate rolling z-score

  • window (int) – Window size

Returns

Series with rolling z-score values

Return type

cudf.Series

Examples

>>> import clx.analytics.stats
>>> import cudf
>>> sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
>>> series = cudf.Series(sequence)
>>> zscores_df = cudf.DataFrame()
>>> zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
>>> zscores_df
            zscore
0           null
1           null
2           null
3           null
4           null
5           null
6    2.374423424
7   -0.645941275
8   -0.683973734
9    0.158832461
10   1.847751909
11   0.880026019
12  -0.950835449
13  -0.360593742
14   0.111407599
15   1.228914145
16  -0.074966331
17  -0.570321249
18   0.327849973
19  -0.934372308
20   2.296828498
21   1.282966989
22  -0.795223674
clx.analytics.periodicity_detection.filter_periodogram(prdg, p_value)

Select important frequencies by filtering periodogram by p-value. Filtered out frequencies are set to zero.

Parameters

  • prdg – periodogram to be filtered

  • p_value – p-value used to filter out unimportant frequencies

Returns

CuPy array representing periodogram

Return type

cupy.core.core.ndarray

clx.analytics.periodicity_detection.to_periodogram(signal)

Returns periodogram of signal for finding frequencies that have high energy.

Parameters

signal (cudf.Series) – signal (time domain)

Returns

CuPy array representing periodogram

Return type

cupy.core.core.ndarray

clx.analytics.periodicity_detection.to_time_domain(prdg)

Convert the signal back to time domain.

Parameters

prdg (cupy.core.core.ndarray) – periodogram (frequency domain)

Returns

CuPy array representing reconstructed signal

Return type

cupy.core.core.ndarray
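
The three functions above compose into a simple pipeline: compute a periodogram, keep only the significant frequencies, and reconstruct the signal. A hedged end-to-end sketch (the signal values and p-value are illustrative):

>>> import cudf
>>> from clx.analytics import periodicity_detection as pdd
>>> signal = cudf.Series([2, 5, 4, 7, 2, 5, 4, 7, 2, 5, 4, 7])
>>> prdg = pdd.to_periodogram(signal)
>>> filtered = pdd.filter_periodogram(prdg, p_value=0.05)
>>> reconstructed = pdd.to_time_domain(filtered)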

DNS Extractor

clx.dns.dns_extractor.extract_hostnames(url_series)

This function extracts hostnames from the given urls.

Parameters

url_series (cudf.Series) – Urls that are to be handled.

Returns

Hostnames extracted from the urls.

Return type

cudf.Series

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.extract_hostnames(input_df["url"])
0       www.google.com
1            gmail.com
2           github.com
3    pandas.pydata.org
Name: 0, dtype: object
clx.dns.dns_extractor.generate_tld_cols(hostname_split_df, hostnames, col_len)

This function generates tld columns.

Parameters
  • hostname_split_df (cudf.DataFrame) – Hostname splits.

  • hostnames (cudf.DataFrame) – Hostnames.

  • col_len – Hostname splits dataframe columns length.

Returns

Tld columns with all combination.

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> hostname_splits = dns.get_hostname_split_df(hostnames)
>>> print(hostname_splits)
     2       1       0
0  com  google     www
1  org  pydata  pandas
>>> col_len = len(hostname_splits.columns) - 1
>>> dns.generate_tld_cols(hostname_splits, hostnames, col_len)
     2       1       0 tld2        tld1               tld0
0  com  google     www  com  google.com     www.google.com
1  org  pydata  pandas  org  pydata.org  pandas.pydata.org
clx.dns.dns_extractor.parse_url(url_series, req_cols=None)

This function extracts subdomain, domain and suffix for a given url.

Parameters
  • url_series (cudf.Series) – Urls that are to be handled.

  • req_cols (set(strings)) – Columns requested to extract such as (domain, subdomain, suffix and hostname).

Returns

Extracted information of requested columns.

Return type

cudf.DataFrame

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>>
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.parse_url(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com
2         github.com  github    com
3  pandas.pydata.org  pydata    org    pandas
>>> dns.parse_url(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org

Heuristics

clx.heuristics.ports.major_ports(addr_col, port_col, min_conns=1, eph_min=10000)

Find major ports for each address. This is done by computing the mean number of connections across all ports for each address and then filtering out all ports that don’t cross this threshold. Also adds a column for the IANA service name corresponding to each port.

Parameters
  • addr_col (cudf.Series) – Column of addresses as strings

  • port_col (cudf.Series) – Column of corresponding port numbers as ints

  • min_conns (int) – Filter out ip:port rows that don’t have at least this number of connections (default: 1)

  • eph_min (int) – Ports greater than or equal to this will be labeled as an ephemeral service (default: 10000)

Returns

DataFrame with columns for address, port, IANA service corresponding to port, and number of connections

Return type

cudf.DataFrame

Examples

>>> import clx.heuristics.ports as ports
>>> import cudf
>>> input_addr_col = cudf.Series(["10.0.75.1","10.0.75.1","10.0.75.1","10.0.75.255","10.110.104.107", "10.110.104.107"])
>>> input_port_col = cudf.Series([137,137,7680,137,7680, 7680])
>>> ports.major_ports(input_addr_col, input_port_col, min_conns=2, eph_min=7000)
            addr  port     service  conns
0      10.0.75.1   137  netbios-ns      2
1 10.110.104.107  7680   ephemeral      2

OSI (Open Source Integration)

class clx.osi.farsight.FarsightLookupClient(server, apikey, limit=None, http_proxy=None, https_proxy=None)

Wrapper class to query DNSDB records in various ways, for example by IP or domain name.

Parameters
  • server – Farsight server

  • apikey – API key

  • limit – limit

  • http_proxy – HTTP proxy

  • https_proxy – HTTPS proxy

Methods

query_rdata_ip(rdata_ip[, before, after])

Query to find DNSDB records matching a specific IP address with given time range.

query_rdata_name(rdata_name[, rrtype, …])

Query matches only a single DNSDB record of given owner name and time ranges.

query_rrset(oname[, rrtype, bailiwick, …])

Batch version of querying DNSDB by given domain name and time ranges.

query_rdata_ip(rdata_ip, before=None, after=None)

Query to find DNSDB records matching a specific IP address with given time range.

Parameters
  • rdata_ip (str) – The VALUE is one of an IPv4 or IPv6 single address, with a prefix length, or with an address range. If a prefix is provided, the delimiter between the network address and prefix length is a single comma (“,”) character rather than the usual slash (“/”) character to avoid clashing with the HTTP URI path name separator.

  • before (UNIX timestamp) – Output results seen before this time.

  • after (UNIX timestamp) – Output results seen after this time.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.farsight import FarsightLookupClient
>>> client = FarsightLookupClient("https://localhost", "your-api-key", limit=1)
>>> client.query_rdata_ip("100.0.0.1")
{"status_code": 200,...}
>>> client.query_rdata_ip("100.0.0.1", before=1428433465, after=1538014110)
{"status_code": 200,...}
query_rdata_name(rdata_name, rrtype=None, before=None, after=None)

Query matches only a single DNSDB record of given owner name and time ranges.

Parameters
  • rdata_name (str) – DNS domain name.

  • rrtype (str) – The resource record type of the resource record, either using the standard DNS type mnemonic, or an RFC 3597 generic type, i.e. the string TYPE immediately followed by the decimal RRtype number.

  • before (UNIX timestamp) – Output results seen before this time.

  • after (UNIX timestamp) – Output results seen after this time.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.farsight import FarsightLookupClient
>>> client = FarsightLookupClient("https://localhost", "your-api-key", limit=1)
>>> client.query_rdata_name("www.farsightsecurity.com")
{"status_code": 200,...}
>>> client.query_rdata_name("www.farsightsecurity.com", rrtype="PTR", before=1386638408, after=1561176503)
{"status_code": 200,...}
query_rrset(oname, rrtype=None, bailiwick=None, before=None, after=None)

Batch version of querying DNSDB by given domain name and time ranges.

Parameters
  • oname (str) – DNS domain name.

  • rrtype (str) – The resource record type of the resource record, either using the standard DNS type mnemonic, or an RFC 3597 generic type, i.e. the string TYPE immediately followed by the decimal RRtype number.

  • bailiwick (str) – The “bailiwick” of an RRset in DNSDB observed via passive DNS replication is the closest enclosing zone delegated to a nameserver which served the RRset.

  • before (UNIX timestamp) – Output results seen before this time.

  • after (UNIX timestamp) – Output results seen after this time.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.farsight import FarsightLookupClient
>>> client = FarsightLookupClient("https://localhost", "your-api-key")
>>> client.query_rrset("www.dnsdb.info")
{"status_code": 200,...}
>>> client.query_rrset("www.dnsdb.info", rrtype="CNAME", bailiwick="dnsdb.info.", before=1374184718, after=1564909243,)
{"status_code": 200,...}
class clx.osi.virus_total.VirusTotalClient(api_key=None, proxies=None)

Wrapper class to query VirusTotal database.

Parameters
  • api_key – API key

  • proxies – proxies

Attributes
api_key
proxies
vt_endpoint_dict

Methods

domain_report(domain)

Retrieve report using domain.

file_report(*resource)

Retrieve file scan reports.

file_rescan(*resource)

This function rescans given files.

file_scan(file)

This function allows you to send a file for scanning with VirusTotal.

ipaddress_report(ip)

Retrieve report using ip address.

put_comment(resource, comment)

Post comment for a file or URL.

scan_big_file(files)

Scan files larger than 32MB.

url_report(*resource)

Retrieve URL scan reports.

url_scan(*url)

Retrieve URL scan reports.

domain_report(domain)

Retrieve report using domain.

Parameters

domain (str) – A domain name

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.domain_report("027.ru")
{'status_code': 200, 'json_resp': {'BitDefender category': 'parked', 'undetected_downloaded_samples'...}}
file_report(*resource)

Retrieve file scan reports.

Parameters

*resource (str) – The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report. You may also specify a scan_id returned by the /file/scan endpoint.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.file_report(["99017f6eebbac24f351415dd410d522d"])
{'status_code': 200, 'json_resp': {'scans': {'Bkav': {'detected': True, 'version': '1.3.0.9899', 'result': 'W32.AIDetectVM.malware1'...}}
file_rescan(*resource)

This function rescans given files.

Parameters

*resource (str) – The resource argument can be the MD5, SHA-1 or SHA-256 of the file you want to re-scan.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.file_rescan('70c0942965354dbb132c05458866b96709e37f44')
{'status_code': 200, 'json_resp': {'scan_id': ...}}
file_scan(file)

This function allows you to send a file for scanning with VirusTotal. Before performing submissions it is advisable to retrieve the latest report on the file. The file size limit is 32MB; to submit files up to 200MB in size it is mandatory to use the scan_big_file feature.

Parameters

file (str) – File to be scanned

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.file_scan('test.sh')
{'status_code': 200, 'json_resp': {'scan_id': '0204e88255a0bd7807547e9186621f0478a6bb2c43e795fb5e6934e5cda0e1f6-1605914572', 'sha1': '70c0942965354dbb132c05458866b96709e37f44'...}
ipaddress_report(ip)

Retrieve report using IP address.

Parameters

ip (str) – An IP address

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.ipaddress_report("90.156.201.27")
{'status_code': 200, 'json_resp': {'asn': 25532, 'undetected_urls...}}
put_comment(resource, comment)

Post comment for a file or URL.

Parameters

resource (str) – Either an md5/sha1/sha256 hash of the file you want to review or the URL itself that you want to comment on.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.put_comment("75efd85cf6f8a962fe016787a7f57206ea9263086ee496fc62e3fc56734d4b53", "This is a test comment")
{'status_code': 200, 'json_resp': {'response_code': 0, 'verbose_msg': 'Duplicate comment'}}
scan_big_file(files)

Scan files larger than 32MB.

Parameters

files (str) – File to be scanned

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.scan_big_file('test.sh')
{'status_code': 200, 'json_resp': {'scan_id': '0204e88255a0bd7807547e9186621f0478a6bb2c43e795fb5e6934e5cda0e1f6-1605914572', 'sha1': '70c0942965354dbb132c05458866b96709e37f44'...}
url_report(*resource)

Retrieve URL scan reports.

Parameters

*resource (str) – The resource argument must be the URL to retrieve the most recent report.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.url_report(["virustotal.com"])
{'status_code': 200, 'json_resp': {'scan_id': 'a354494a73382ea0b4bc47f4c9e8d6c578027cd4598196dc88f05a22b5817293-1605914280'...}
url_scan(*url)

Retrieve URL scan reports.

Parameters

*url (str) – A URL for which you want to retrieve the most recent report. You may also specify a scan_id (sha256-timestamp as returned by the URL submission API) to access a specific report.

Returns

Response

Return type

dict

Examples

>>> from clx.osi.virus_total import VirusTotalClient
>>> client = VirusTotalClient(api_key='your-api-key')
>>> client.url_scan(["virustotal.com"])
{'status_code': 200, 'json_resp': {'permalink': 'https://www.virustotal.com/gui/url/...}}
class clx.osi.whois.WhoIsLookupClient(sep=',', datetime_format='%m-%d-%Y %H:%M:%S')

Wrapper class to query WhoIs API.

Parameters
  • sep – Delimiter to concat nested list values from the WhoIs response.

  • datetime_format – Format to convert WhoIs response datetime objects.

Methods

whois(domains[, arr2str])

Function to access parsed WhoIs data for a given domain.

datetime_arr_keys = ['creation_date', 'updated_date', 'expiration_date']

whois(domains, arr2str=True)

Function to access parsed WhoIs data for a given domain.

Parameters
  • domains (list) – Domains to perform whois lookup.

  • arr2str (boolean) – Convert WhoIs lookup response object to list of strings.

Returns

WhoIs information with respect to given domains.

Return type

list/obj

Examples

>>> from clx.osi.whois import WhoIsLookupClient
>>> domains = ["nvidia.com"]
>>> client = WhoIsLookupClient()
>>> client.whois(domains)
[{'domain_name': 'NVIDIA.COM', 'registrar': 'Safenames Ltd', 'whois_server': 'whois.safenames.net'...}]
class clx.osi.slashnext.SlashNextClient(api_key, snx_ir_workspace, base_url='https://oti.slashnext.cloud/api')
Attributes
conn

Methods

api_quota()

Find information about your API quota, like current usage, quota left etc.

download_html(scanid)

Downloads a web page HTML against a previous URL scan request.

download_screenshot(scanid[, resolution])

Downloads a screenshot of a web page against a previous URL scan request.

download_text(scanid)

Downloads the text of a web page against a previous URL scan request.

host_report(host)

Queries the SlashNext cloud database and retrieves a detailed report.

host_reputation(host)

Queries the SlashNext cloud database and retrieves the reputation of a host.

host_urls(host[, limit])

Queries the SlashNext cloud database and retrieves a list of all URLs.

scan_report(scanid[, extended_info])

Retrieve URL scan results against a previous scan request.

url_scan(url[, extended_info])

Perform a real-time URL reputation scan with SlashNext cloud-based SEER threat detection engine.

url_scan_sync(url[, extended_info, timeout])

Perform a real-time URL scan with SlashNext cloud-based SEER threat detection engine in a blocking mode.

verify_connection()

Verify SlashNext cloud database connection.

api_quota()

Find information about your API quota, like current usage, quota left etc.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.api_quota()
>>> type(response_list[0])
<class 'dict'>
download_html(scanid)

Downloads a web page HTML against a previous URL scan request.

Parameters

scanid (str) – Scan ID of the scan for which to get the report. Can be retrieved from the “slashnext-url-scan” action or “slashnext-url-scan-sync” action.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.download_html('2-ba57-755a7458c8a3')
>>> type(response_list[0])
<class 'dict'>
download_screenshot(scanid, resolution='high')

Downloads a screenshot of a web page against a previous URL scan request.

Parameters
  • scanid (str) – Scan ID of the scan for which to get the report. Can be retrieved from the “slashnext-url-scan” action or “slashnext-url-scan-sync” action.

  • resolution (str) – Resolution of the web page screenshot. Can be “high” or “medium”. Default is “high”.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.download_screenshot('2-ba57-755a7458c8a3')
>>> type(response_list[0])
<class 'dict'>
download_text(scanid)

Downloads the text of a web page against a previous URL scan request.

Parameters

scanid (str) – Scan ID of the scan for which to get the report. Can be retrieved from the “slashnext-url-scan” action or “slashnext-url-scan-sync” action.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.download_text('2-ba57-755a7458c8a3')
>>> type(response_list[0])
<class 'dict'>
host_report(host)

Queries the SlashNext cloud database and retrieves a detailed report.

Parameters

host (str) – The host to look up in the SlashNext Threat Intelligence database. Can be either a domain name or an IPv4 address.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.host_report('google.com')
>>> type(response_list[0])
<class 'dict'>
host_reputation(host)

Queries the SlashNext cloud database and retrieves the reputation of a host.

Parameters

host (str) – The host to look up in the SlashNext Threat Intelligence database. Can be either a domain name or an IPv4 address.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.host_reputation('google.com')
>>> type(response_list[0])
<class 'dict'>
host_urls(host, limit=10)

Queries the SlashNext cloud database and retrieves a list of all URLs.

Parameters
  • host (str) – The host to look up in the SlashNext Threat Intelligence database, for which to return a list of associated URLs. Can be either a domain name or an IPv4 address.

  • limit (int) – The maximum number of URL records to fetch. Default is “10”.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.host_urls('google.com', limit=1)
>>> type(response_list[0])
<class 'dict'>
scan_report(scanid, extended_info=True)

Retrieve URL scan results against a previous scan request.

Parameters
  • scanid (str) – Scan ID of the scan for which to get the report. Can be retrieved from the “slashnext-url-scan” action or “slashnext-url-scan-sync” action.

  • extended_info (boolean) – Whether to download forensics data, such as screenshot, HTML, and rendered text.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.scan_report('2-ba57-755a7458c8a3', extended_info=False)
>>> type(response_list[0])
<class 'dict'>
url_scan(url, extended_info=True)

Perform a real-time URL reputation scan with SlashNext cloud-based SEER threat detection engine.

Parameters
  • url (str) – The URL that needs to be scanned.

  • extended_info (boolean) – Whether to download forensics data, such as screenshot, HTML, and rendered text.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.url_scan('http://ajeetenterprises.in/js/kbrad/drive/index.php', extended_info=False)
>>> type(response_list[0])
<class 'dict'>
url_scan_sync(url, extended_info=True, timeout=60)

Perform a real-time URL scan with SlashNext cloud-based SEER threat detection engine in a blocking mode.

Parameters
  • url (str) – The URL that needs to be scanned.

  • extended_info (boolean) – Whether to download forensics data, such as screenshot, HTML, and rendered text.

  • timeout (int) – A timeout value in seconds. If no timeout value is specified, a default of 60 seconds is used.

Returns

Query response as list.

Return type

list

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> response_list = slashnext.url_scan_sync('http://ajeetenterprises.in/js/kbrad/drive/index.php', extended_info=False, timeout=10)
>>> type(response_list[0])
<class 'dict'>
verify_connection()

Verify SlashNext cloud database connection.

Examples

>>> from clx.osi.slashnext import SlashNextClient
>>> api_key = 'slashnext_cloud_apikey'
>>> snx_ir_workspace_dir = 'snx_ir_workspace'
>>> slashnext = SlashNextClient(api_key, snx_ir_workspace_dir)
>>> slashnext.verify_connection()
Successfully connected to SlashNext cloud.
'success'

Parsers

class clx.parsers.event_parser.EventParser(columns, event_name)

This is an abstract class for all event log parsers.

Parameters
  • columns (set(string)) – Event column names.

  • event_name (string) – Event name

Attributes
columns

List of columns that are being processed.

event_name

Event name defines the type of logs that are being processed.

Methods

filter_by_pattern(df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

parse(dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality.

parse_raw_event(dataframe, raw_column, …)

Processes parsing of a specific type of raw event records received as a dataframe.

property columns

List of columns that are being processed.

Returns

Event column names.

Return type

set(string)

property event_name

Event name defines the type of logs that are being processed.

Returns

Event name

Return type

string

filter_by_pattern(df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

Parameters
  • df (cudf.DataFrame) – Raw events to be filtered.

  • column (string) – Name of the column containing the raw data.

  • pattern (string) – Regex pattern to retrieve events that are required.

Returns

filtered dataframe.

Return type

cudf.DataFrame
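
A hedged sketch using a concrete subclass (the dataframe, column name, and pattern are illustrative):

>>> import cudf
>>> from clx.parsers.windows_event_parser import WindowsEventParser
>>> wep = WindowsEventParser()
>>> df = cudf.DataFrame({"raw": ["eventcode=4624 logon", "eventcode=4625 failed logon"]})
>>> filtered_df = wep.filter_by_pattern(df, "raw", "4624")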

abstract parse(dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality. Subclasses are required to implement and execute any parsing pre-processing steps.

parse_raw_event(dataframe, raw_column, event_regex)

Processes parsing of a specific type of raw event records received as a dataframe.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

  • event_regex (dict) – Required regular expressions for a given event type.

Returns

parsed information.

Return type

cudf.DataFrame

class clx.parsers.splunk_notable_parser.SplunkNotableParser

This class parses Splunk notable logs.

Methods

parse(dataframe, raw_column)

Parses the Splunk notable raw events.

parse(dataframe, raw_column)

Parses the Splunk notable raw events.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

parsed information.

Return type

cudf.DataFrame
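
A minimal hedged sketch (raw_df is assumed to be a cudf DataFrame whose "raw" column holds Splunk notable events):

>>> from clx.parsers.splunk_notable_parser import SplunkNotableParser
>>> snp = SplunkNotableParser()
>>> parsed_df = snp.parse(raw_df, "raw")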

class clx.parsers.windows_event_parser.WindowsEventParser(interested_eventcodes=None)

This class parses Windows event logs.

Parameters

interested_eventcodes (set(int)) – This parameter provides the flexibility to parse only the event codes of interest.

Methods

clean_raw_data(dataframe, raw_column)

Lower casing and replacing escape characters.

get_columns()

Get columns of windows event codes.

parse(dataframe, raw_column)

Parses the Windows raw event.

clean_raw_data(dataframe, raw_column)

Lower casing and replacing escape characters.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Clean raw information.

Return type

cudf.DataFrame

get_columns()

Get columns of windows event codes.

Returns

Columns for all configured eventcodes, or only for the interested eventcodes if specified.

Return type

set(string)

parse(dataframe, raw_column)

Parses the Windows raw event.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Parsed information.

Return type

cudf.DataFrame
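
A minimal hedged sketch (raw_df and the eventcode are illustrative):

>>> from clx.parsers.windows_event_parser import WindowsEventParser
>>> wep = WindowsEventParser(interested_eventcodes={4624})
>>> parsed_df = wep.parse(raw_df, "raw")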

clx.parsers.zeek.parse_log_file(filepath)

Parse Zeek log file and return cuDF dataframe. Uses header comments to get column names/types and configure parser.

Parameters

filepath (string) – filepath for Zeek log file

Returns

Zeek log dataframe

Return type

cudf.DataFrame
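
A minimal hedged sketch (the log path is illustrative):

>>> from clx.parsers import zeek
>>> conn_df = zeek.parse_log_file("conn.log")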

Utils

class clx.utils.data.dataloader.DataLoader(dataset, batchsize=1000)

Wrapper class used to return dataframe partitions based on batchsize.

Attributes
dataset
dataset_len

Methods

get_chunks()

A generator function that yields each chunk of the original input dataframe based on batchsize.

get_chunks()

A generator function that yields each chunk of the original input dataframe based on batchsize.

Returns

Partitioned dataframe.

Return type

cudf.DataFrame
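
A short hedged sketch of iterating over chunks (the dataframe contents are illustrative):

>>> import cudf
>>> from clx.utils.data.dataset import Dataset
>>> from clx.utils.data.dataloader import DataLoader
>>> df = cudf.DataFrame({"domain": ["nvidia.com", "dgadomain"], "label": [0, 1]})
>>> loader = DataLoader(Dataset(df), batchsize=1)
>>> for chunk in loader.get_chunks():
...     print(len(chunk))
1
1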

class clx.utils.data.dataset.Dataset(df)
Attributes
data

Returns dataframe

length

Returns dataframe length

property data

Returns dataframe

property length

Returns dataframe length

clx.utils.data.utils

Workflow

class clx.workflow.workflow.Workflow(name, source=None, destination=None)
Attributes
destination

Dictionary of configuration parameters for the data destination (writer)

name

Name of the workflow for logging purposes.

source

Dictionary of configuration parameters for the data source (reader)

Methods

benchmark()

Decorator used to capture a benchmark for a given function

run_workflow()

Run workflow.

set_destination(destination)

Set destination.

set_source(source)

Set source.

stop_workflow()

Close workflow.

workflow(dataframe)

The pipeline function performs data enrichment on the input data.

benchmark()

Decorator used to capture a benchmark for a given function

property destination

Dictionary of configuration parameters for the data destination (writer)

property name

Name of the workflow for logging purposes.

run_workflow()

Run workflow. Reader (source) fetches data. Workflow implementation is executed. Workflow output is written to destination.

set_destination(destination)

Set destination.

Parameters

destination – dict of configuration parameters for the destination (writer)

set_source(source)

Set source.

Parameters

source – dict of configuration parameters for data source (reader)

property source

Dictionary of configuration parameters for the data source (reader)

stop_workflow()

Close workflow. This includes calling close() method on reader (source) and writer (destination)

abstract workflow(dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. This function will return a GPU dataframe with enriched data.
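
As a hedged illustration of the subclassing contract described above (the class name, column names, and the source_config/dest_config dicts are assumed for the example):

>>> from clx.workflow.workflow import Workflow
>>> class LowercaseWorkflow(Workflow):
...     def workflow(self, dataframe):
...         # enrich: add a lowercased copy of the raw column
...         dataframe["enriched"] = dataframe["raw"].str.lower()
...         return dataframe
>>> wf = LowercaseWorkflow(name="lowercase-wf", source=source_config, destination=dest_config)
>>> wf.run_workflow()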

class clx.workflow.splunk_alert_workflow.SplunkAlertWorkflow(name, source=None, destination=None, interval='day', threshold=2.5, window=7, raw_data_col_name='_raw')
Attributes
interval

Interval, either day or hour, by which the z-score will be calculated.

raw_data_col_name

Dataframe column name containing raw splunk alert data

threshold

Threshold by which to flag z score.

window

Window by which to calculate rolling z score

Methods

workflow(dataframe)

The pipeline function performs data enrichment on the input data.

property interval

Interval, either day or hour, by which the z-score will be calculated.

property raw_data_col_name

Dataframe column name containing raw splunk alert data

property threshold

Threshold by which to flag the z-score. Scores greater than threshold or less than -threshold will be flagged.

property window

Window by which to calculate rolling z score

workflow(dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. This function will return a GPU dataframe with enriched data.

I/O

class clx.io.reader.kafka_reader.KafkaReader(batch_size, consumer, time_window=30)

Reads from Kafka based on config object.

Parameters
  • batch_size – batch size

  • consumer – Kafka consumer

  • time_window – Max window of time that queued events will wait to be pushed to workflow

Attributes
consumer
has_data
time_window

Methods

close()

Close Kafka reader

fetch_data()

Fetch data from Kafka based on provided config object

close()

Close Kafka reader

fetch_data()

Fetch data from Kafka based on provided config object

class clx.io.reader.dask_fs_reader.DaskFileSystemReader(config)

Uses Dask to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path, and dask reader optional keyword args

Methods

close()

Close dask reader

fetch_data()

Fetch data using dask based on provided config object

close()

Close dask reader

fetch_data()

Fetch data using dask based on provided config object

class clx.io.reader.fs_reader.FileSystemReader(config)

Uses cudf to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path (or output_path), and cudf reader optional keyword args

Methods

close()

Close cudf reader

fetch_data()

Fetch data using cudf based on provided config object

close()

Close cudf reader

fetch_data()

Fetch data using cudf based on provided config object
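
A hedged sketch of constructing a reader from a config dict (the exact key set and values, including "type": "fs", are assumptions based on the parameter description above):

>>> from clx.io.reader.fs_reader import FileSystemReader
>>> config = {
...     "type": "fs",
...     "input_format": "csv",
...     "input_path": "/path/to/input.csv"
... }
>>> reader = FileSystemReader(config)
>>> df = reader.fetch_data()
>>> reader.close()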

class clx.io.writer.kafka_writer.KafkaWriter(kafka_topic, batch_size, delimiter, producer)

Publish to Kafka topic based on config object.

Parameters
  • kafka_topic – Kafka topic

  • batch_size – batch size

  • delimiter – delimiter

  • producer – producer

Attributes
delimiter
producer

Methods

close()

Close Kafka writer

write_data(df)

Publish messages to Kafka topic.

close()

Close Kafka writer

write_data(df)

Publish messages to Kafka topic.

Parameters

df – dataframe to publish

class clx.io.writer.fs_writer.FileSystemWriter(config)

Uses cudf to write to file system based on config object.

Parameters

config – dictionary object of config values for type, output_format, output_path, and cudf writer optional keyword args

Methods

close()

Close cudf writer

write_data(df)

Write data to file system using cudf based on provided config object

close()

Close cudf writer

write_data(df)

Write data to file system using cudf based on provided config object
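
A hedged sketch mirroring the reader example above (config keys and values are assumptions based on the parameter description; df is a cudf DataFrame to persist):

>>> from clx.io.writer.fs_writer import FileSystemWriter
>>> config = {
...     "type": "fs",
...     "output_format": "csv",
...     "output_path": "/path/to/output.csv"
... }
>>> writer = FileSystemWriter(config)
>>> writer.write_data(df)
>>> writer.close()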