API Reference

IP

clx.ip.hostmask(ips, prefixlen=16)

Compute a column of hostmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

hostmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.hostmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    0.0.255.255
1    0.0.255.255
Name: hostmask, dtype: object
clx.ip.int_to_ip(values)

Convert integer column to IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – Integers to be converted

Returns

IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.int_to_ip(cudf.Series([3232235521, 167772161]))
0    192.168.0.1
1       10.0.0.1
dtype: object
clx.ip.ip_to_int(values)

Convert string column of IP addresses to integer values. Addresses must be IPv4. IPv6 not yet supported.

Parameters

values (cudf.Series) – IP addresses to be converted

Returns

Integer representations of IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.ip_to_int(cudf.Series(["192.168.0.1","10.0.0.1"]))
0    3232235521
1     167772161
dtype: int64
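The integer form is the standard big-endian interpretation of the dotted quad. A CPU-side analogue (not the cudf implementation) using Python's stdlib ipaddress module confirms both directions of the conversion:

```python
import ipaddress

# An IPv4 address is a 32-bit integer; ipaddress exposes both directions.
assert int(ipaddress.IPv4Address("192.168.0.1")) == 3232235521
assert str(ipaddress.IPv4Address(167772161)) == "10.0.0.1"
```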
clx.ip.is_global(ips)

Indicates whether each address is global. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_global(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    False
1    True
dtype: bool
clx.ip.is_ip(ips)

Indicates whether each value is a valid IP address string. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_ip(cudf.Series(["192.168.0.1","10.123.0"]))
0     True
1    False
dtype: bool
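The per-row validation has the same semantics as a CPU-side check with the stdlib ipaddress module (an illustrative analogue; clx performs this on the GPU):

```python
import ipaddress

def is_ipv4(s):
    # True only for a well-formed dotted-quad IPv4 string.
    try:
        ipaddress.IPv4Address(s)
        return True
    except ValueError:
        return False

print([is_ipv4(s) for s in ["192.168.0.1", "10.123.0"]])  # [True, False]
```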

clx.ip.is_link_local(ips)

Indicates whether each address is link-local. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_link_local(cudf.Series(["127.0.0.1","169.254.123.123"]))
0    False
1    True
dtype: bool
clx.ip.is_loopback(ips)

Indicates whether each address is loopback. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_loopback(cudf.Series(["127.0.0.1","10.0.0.1"]))
0     True
1    False
dtype: bool
clx.ip.is_multicast(ips)

Indicates whether each address is multicast. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_multicast(cudf.Series(["127.0.0.1","224.0.0.0"]))
0    False
1    True
dtype: bool
clx.ip.is_private(ips)

Indicates whether each address is private. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_private(cudf.Series(["127.0.0.1","207.46.13.151"]))
0    True
1    False
dtype: bool
clx.ip.is_reserved(ips)

Indicates whether each address is reserved. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_reserved(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
clx.ip.is_unspecified(ips)

Indicates whether each address is unspecified. Addresses must be IPv4. IPv6 not yet supported.

Parameters

ips – IP addresses

Returns

booleans

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.is_unspecified(cudf.Series(["127.0.0.1","10.0.0.1"]))
0    False
1    False
dtype: bool
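The classification predicates above (is_global, is_loopback, is_multicast, is_private, is_reserved, is_unspecified) mirror the per-address properties of Python's stdlib ipaddress module, which can be used to sanity-check individual values on the CPU:

```python
import ipaddress

a = ipaddress.ip_address("127.0.0.1")
# 127.0.0.1 is loopback and private, but not multicast or global.
print(a.is_loopback, a.is_private, a.is_multicast, a.is_global)
# True True False False
```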
clx.ip.mask(ips, masks)

Apply a mask to a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • masks (cudf.Series) – The host or subnet masks to be applied

Returns

masked IP addresses

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> input_ips = cudf.Series(["192.168.0.1","10.0.0.1"])
>>> input_masks = cudf.Series(["255.255.0.0", "255.255.0.0"])
>>> clx.ip.mask(input_ips, input_masks)
0    192.168.0.0
1       10.0.0.0
Name: mask, dtype: object
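Row-wise masking is a bitwise AND of the address and mask. The stdlib equivalent for a single address (illustrative only) is:

```python
import ipaddress

addr = int(ipaddress.IPv4Address("192.168.0.1"))
mask = int(ipaddress.IPv4Address("255.255.0.0"))
print(ipaddress.IPv4Address(addr & mask))  # 192.168.0.0
```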
clx.ip.netmask(ips, prefixlen=16)

Compute a column of netmasks for a column of IP addresses. Addresses must be IPv4. IPv6 not yet supported.

Parameters
  • ips – IP addresses

  • prefixlen (int) – Length of the network prefix, in bits, for IPv4 addresses

Returns

netmasks

Return type

cudf.Series

Examples

>>> import clx.ip
>>> import cudf
>>> clx.ip.netmask(cudf.Series(["192.168.0.1","10.0.0.1"]), prefixlen=16)
0    255.255.0.0
1    255.255.0.0
Name: net_mask, dtype: object
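For a given prefix length, the netmask and hostmask are bitwise complements. Python's stdlib shows both for prefixlen=16 (a CPU-side analogue of the two functions above):

```python
import ipaddress

net = ipaddress.ip_network("192.168.0.1/16", strict=False)
print(net.netmask, net.hostmask)  # 255.255.0.0 0.0.255.255
```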

Analytics

class clx.analytics.dga_detector.DGADetector(lr=0.001)

This class provides functionality to build, train, and evaluate an RNNClassifier model that distinguishes legitimate domain names from DGA-generated ones.

Methods

evaluate_model(detector_dataset)

This function evaluates the trained model to verify its accuracy.

init_model([char_vocab, hidden_size, …])

This function instantiates RNNClassifier model to train.

predict(domains)

This function accepts a cudf Series of domains, classifies each domain name as benign or malicious, and returns the predicted label for each entry as a cudf Series.

train_model(detector_dataset)

This function is used for training RNNClassifier model with a given training dataset.

evaluate_model(detector_dataset)

This function evaluates the trained model to verify its accuracy.

Parameters

detector_dataset (DetectorDataset) – Instance holds preprocessed data.

Returns

Model accuracy

Return type

decimal

Examples

>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.evaluate_model(detector_dataset)
Evaluating trained model ...
Test set: Accuracy: 3/4 (0.75)
init_model(char_vocab=128, hidden_size=100, n_domain_type=2, n_layers=3)

This function instantiates the RNNClassifier model to be trained, and also optimizes it to scale and run in parallel.

Parameters
  • char_vocab (int) – Vocabulary size; defaults to 128 for the ASCII character set.

  • hidden_size (int) – Hidden size of the network.

  • n_domain_type (int) – Number of domain types.

  • n_layers (int) – Number of network layers.

predict(domains)

This function accepts a cudf Series of domains, classifies each domain name as benign or malicious, and returns the predicted label for each entry as a cudf Series.

Parameters

domains (cudf.Series) – List of domains.

Returns

Predicted results with respect to given domains.

Return type

cudf.Series

Examples

>>> dd.predict(['nvidia.com', 'dgadomain'])
0    0
1    1
Name: is_dga, dtype: int64
train_model(detector_dataset)

This function is used for training the RNNClassifier model with a given training dataset. It returns the total loss, which can be used to gauge prediction accuracy.

Parameters

detector_dataset (DetectorDataset) – Instance holds preprocessed data.

Returns

Total loss

Return type

float

Examples

>>> from clx.analytics.dga_detector import DGADetector
>>> partitioned_dfs = ... # partitioned_dfs = [df1, df2, ...] represents training dataset
>>> dd = DGADetector()
>>> dd.init_model()
>>> dd.train_model(detector_dataset)
1.5728906989097595
class clx.analytics.phishing_detector.PhishingDetector

Phishing detection using BERT. This class provides methods for training/loading BERT models, evaluation and prediction.

Methods

evaluate_model(emails, labels[, …])

Evaluate trained BERT model

init_model([model_or_path])

Load a pretrained BERT model.

predict(emails[, max_num_sentences, …])

Predict the class with the trained model

save_model([save_to_path])

Save trained model

train_model(emails, labels[, …])

Train the classifier

evaluate_model(emails, labels, max_num_sentences=1000000, max_num_chars=100000000, max_rows_tensor=1000000, max_seq_len=128, batch_size=32)

Evaluate trained BERT model

Parameters
  • emails (cudf.Dataframe) – dataframe where each row contains one column holding email text

  • labels (cudf.Series) – series holding labels for each row in email dataframe

  • max_num_sentences (int) – maximum number of sentences to be encoded by tokenizer in one batch

  • max_num_chars (int) – maximum number of characters passed to tokenizer

  • max_rows_tensor (int) – maximum number of rows in a tokenizer output tensor

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.evaluate_model(emails_test, labels_test)
init_model(model_or_path='bert-base-uncased')

Load a pretrained BERT model. Default is bert-base-uncased.

Parameters

model_or_path – directory path to model, default is bert-base-uncased

Examples

>>> from clx.analytics.phishing_detector import PhishingDetector
>>> phish_detect = PhishingDetector()
>>> phish_detect.init_model()  # bert-base-uncased
>>> phish_detect.init_model(model_path)
predict(emails, max_num_sentences=1000000, max_num_chars=100000000, max_rows_tensor=1000000, max_seq_len=128, batch_size=32)

Predict the class with the trained model

Parameters
  • emails (cudf.DataFrame) – dataframe where each row contains one column holding email text

  • max_num_sentences (int) – maximum number of sentences to be encoded by tokenizer in one batch

  • max_num_chars (int) – maximum number of characters passed to tokenizer

  • max_rows_tensor (int) – maximum number of rows in a tokenizer output tensor

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

Returns

predictions: predicted labels (0 or 1) for each email

Return type

cudf.Series

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
>>> predictions = phish_detect.predict(new_emails_df)
save_model(save_to_path='.')

Save trained model

Parameters

save_to_path (str) – directory path to save model, default is current directory

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
>>> phish_detect.save_model()
train_model(emails, labels, max_num_sentences=1000000, max_num_chars=100000000, max_rows_tensor=1000000, learning_rate=3e-05, max_seq_len=128, batch_size=32, epochs=5)

Train the classifier

Parameters
  • emails (cudf.DataFrame) – dataframe where each row contains one column holding email text

  • labels (cudf.Series) – series holding labels for each row in email dataframe

  • max_num_sentences (int) – maximum number of sentences to be encoded by tokenizer in one batch

  • max_num_chars (int) – maximum number of characters passed to tokenizer

  • max_rows_tensor (int) – maximum number of rows in a tokenizer output tensor

  • learning_rate (float) – learning rate

  • max_seq_len (int) – Limits the length of the sequence returned by tokenizer. If tokenized sentence is shorter than max_seq_len, output will be padded with 0s. If the tokenized sentence is longer than max_seq_len it will be truncated to max_seq_len.

  • batch_size (int) – batch size

  • epochs (int) – number of training epochs, default is 5

Examples

>>> from cuml.preprocessing.model_selection import train_test_split
>>> emails_train, emails_test, labels_train, labels_test = train_test_split(train_emails_df, 'label', train_size=0.8)
>>> phish_detect.train_model(emails_train, labels_train)
class clx.analytics.model.rnn_classifier.RNNClassifier(input_size, hidden_size, output_size, n_layers=1, bidirectional=True)

Methods

forward

clx.analytics.stats.rzscore(series, window)

Calculates rolling z-score

Parameters
  • series (cudf.Series) – Series for which to calculate rolling z-score

  • window (int) – Window size

Returns

Series with rolling z-score values

Return type

cudf.Series

Examples

>>> import clx.analytics.stats
>>> import cudf
>>> sequence = [3,4,5,6,1,10,34,2,1,11,45,34,2,9,19,43,24,13,23,10,98,84,10]
>>> series = cudf.Series(sequence)
>>> zscores_df = cudf.DataFrame()
>>> zscores_df['zscore'] = clx.analytics.stats.rzscore(series, 7)
>>> zscores_df
            zscore
0           null
1           null
2           null
3           null
4           null
5           null
6    2.374423424
7   -0.645941275
8   -0.683973734
9    0.158832461
10   1.847751909
11   0.880026019
12  -0.950835449
13  -0.360593742
14   0.111407599
15   1.228914145
16  -0.074966331
17  -0.570321249
18   0.327849973
19  -0.934372308
20   2.296828498
21   1.282966989
22  -0.795223674
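The values in this example can be reproduced in pure Python. Note that the output implies the population standard deviation (ddof=0) is used within each window; that is an inference from the example, not a documented guarantee:

```python
import statistics

def rolling_zscore(seq, window):
    # z-score of the last element of each full window:
    # (x - window mean) / window population std (ddof=0)
    out = [None] * (window - 1)
    for i in range(window - 1, len(seq)):
        w = seq[i - window + 1 : i + 1]
        out.append((seq[i] - statistics.fmean(w)) / statistics.pstdev(w))
    return out

seq = [3, 4, 5, 6, 1, 10, 34, 2, 1, 11, 45, 34, 2, 9, 19, 43, 24, 13, 23, 10, 98, 84, 10]
z = rolling_zscore(seq, 7)
print(z[6], z[7])  # matches the 2.374… and -0.645… values shown above
```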
clx.analytics.tokenizer.tokenize_df(input_df, hash_file='default', max_sequence_length=64, stride=48, do_lower=True, do_truncate=False, max_num_sentences=100, max_num_chars=100000, max_rows_tensor=500)

Run CUDA BERT wordpiece tokenizer on cuDF dataframe. Encodes words to token ids using vocabulary from a pretrained tokenizer.

Parameters
  • input_df (cudf.DataFrame) – input dataframe, each row represents one sentence to be encoded

  • hash_file (str) – path to hash file containing vocabulary and ids from a pretrained tokenizer

  • max_sequence_length (int) – Limits the length of the sequence returned. If the tokenized sentence is shorter than max_sequence_length, output will be padded with 0s. If the tokenized sentence is longer than max_sequence_length and do_truncate is set to false, there will be multiple returned sequences containing the overflowing token ids.

  • stride (int) – If do_truncate is set to false and the tokenized sentence is larger than max_sequence_length, the sequences containing the overflowing token ids can contain duplicated token ids from the main sequence. If max_sequence_length is equal to stride there are no duplicated id tokens. If stride is 80% of max_sequence_length, 20% of the first sequence chunk will be repeated on the second sequence chunk and so on until the entire sentence is encoded.

  • do_lower (bool) – If set to true, original text will be lowercased before encoding.

  • do_truncate (bool) – If set to true, sentences will be truncated and padded to max_sequence_length. Each input sentence will result in exactly one output sequence. If set to false, there will be multiple output sequences when the max_sequence_length is smaller than the tokenized sentence.

  • max_num_sentences (int) – max num sentences to be encoded in one batch

  • max_num_chars (int) – max num characters in dataframe

  • max_rows_tensor (int) – max num of rows in an output tensor

Returns

tokens: token ids encoded from sentences padded with 0s to max_sequence_length

Return type

torch.Tensor

Returns

attention_masks: binary tensor indicating the position of the padded indices so that the model does not attend to them

Return type

torch.Tensor

Returns

metadata: for each row of the output tensors, the meta_data contains the index id of the original sentence encoded, and the first and last index of the token ids that are non-padded and non-overlapping

Return type

torch.Tensor

Examples

>>> from clx.analytics import tokenizer
>>> import cudf
>>> df = cudf.read_csv("input.txt")
>>> tokens, masks, metadata = tokenizer.tokenize_df(df)
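The stride/overflow behavior described above can be sketched in pure Python (an illustrative chunker, not the CUDA implementation): when do_truncate is false, each new sequence starts stride tokens after the previous one, so max_sequence_length - stride tokens repeat between consecutive chunks.

```python
def chunk_with_stride(token_ids, max_seq_len, stride):
    # Split an over-long token list into overlapping chunks.
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start : start + max_seq_len])
        if start + max_seq_len >= len(token_ids):
            break
        start += stride
    return chunks

# With max_seq_len=4 and stride=3, consecutive chunks share 1 token.
print(chunk_with_stride(list(range(10)), 4, 3))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```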
clx.analytics.tokenizer.tokenize_file(input_file, hash_file='default', max_sequence_length=64, stride=48, do_lower=True, do_truncate=False, max_num_sentences=100, max_num_chars=100000, max_rows_tensor=500)

Run CUDA BERT wordpiece tokenizer on file. Encodes words to token ids using vocabulary from a pretrained tokenizer.

Parameters
  • input_file (str) – path to input file, each line represents one sentence to be encoded

  • hash_file (str) – path to hash file containing vocabulary and ids from a pretrained tokenizer

  • max_sequence_length (int) – Limits the length of the sequence returned. If the tokenized sentence is shorter than max_sequence_length, output will be padded with 0s. If the tokenized sentence is longer than max_sequence_length and do_truncate is set to false, there will be multiple returned sequences containing the overflowing token ids.

  • stride (int) – If do_truncate is set to false and the tokenized sentence is larger than max_sequence_length, the sequences containing the overflowing token ids can contain duplicated token ids from the main sequence. If max_sequence_length is equal to stride there are no duplicated id tokens. If stride is 80% of max_sequence_length, 20% of the first sequence chunk will be repeated on the second sequence chunk and so on until the entire sentence is encoded.

  • do_lower (bool) – If set to true, original text will be lowercased before encoding.

  • do_truncate (bool) – If set to true, sentences will be truncated and padded to max_sequence_length. Each input sentence will result in exactly one output sequence. If set to false, there will be multiple output sequences when the max_sequence_length is smaller than the tokenized sentence.

  • max_num_sentences (int) – max num sentences to be encoded in one batch

  • max_num_chars (int) – max num characters in file

  • max_rows_tensor (int) – max num of rows in an output tensor

Returns

tokens: token ids encoded from sentences padded with 0s to max_sequence_length

Return type

torch.Tensor

Returns

attention_masks: binary tensor indicating the position of the padded indices so that the model does not attend to them

Return type

torch.Tensor

Returns

metadata: for each row of the output tensors, the meta_data contains the index id of the original sentence encoded, and the first and last index of the token ids that are non-padded and non-overlapping

Return type

torch.Tensor

Examples

>>> from clx.analytics import tokenizer
>>> tokens, masks, metadata = tokenizer.tokenize_file("input.txt")

DNS Extractor

clx.dns.dns_extractor.extract_hostnames(url_series)

This function extracts hostnames from the given URLs.

Parameters

url_series (cudf.Series) – URLs to be parsed.

Returns

Hostnames extracted from the urls.

Return type

cudf.Series

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.extract_hostnames(input_df["url"])
0       www.google.com
1            gmail.com
2           github.com
3    pandas.pydata.org
Name: 0, dtype: object
clx.dns.dns_extractor.generate_tld_cols(hostname_split_df, hostnames, col_len)

This function generates tld columns.

Parameters
  • hostname_split_df (cudf.DataFrame) – Hostname splits.

  • hostnames (cudf.DataFrame) – Hostnames.

  • col_len – Hostname splits dataframe columns length.

Returns

TLD columns with all combinations.

Return type

cudf.DataFrame

Examples

>>> import cudf
>>> from clx.dns import dns_extractor as dns
>>> hostnames = cudf.Series(["www.google.com", "pandas.pydata.org"])
>>> hostname_splits = dns.get_hostname_split_df(hostnames)
>>> print(hostname_splits)
     2       1       0
0  com  google     www
1  org  pydata  pandas
>>> col_len = len(hostname_splits.columns) - 1
>>> dns.generate_tld_cols(hostname_splits, hostnames, col_len)
     2       1       0 tld2        tld1               tld0
0  com  google     www  com  google.com     www.google.com
1  org  pydata  pandas  org  pydata.org  pandas.pydata.org
clx.dns.dns_extractor.parse_url(url_series, req_cols=None)

This function extracts the subdomain, domain, and suffix for the given URLs.

Parameters
  • url_series (cudf.Series) – URLs to be parsed.

  • req_cols (set(strings)) – Columns requested to extract, such as domain, subdomain, suffix, and hostname.

Returns

Extracted information of requested columns.

Return type

cudf.DataFrame

Examples

>>> from cudf import DataFrame
>>> from clx.dns import dns_extractor as dns
>>>
>>> input_df = DataFrame(
...     {
...         "url": [
...             "http://www.google.com",
...             "gmail.com",
...             "github.com",
...             "https://pandas.pydata.org",
...         ]
...     }
... )
>>> dns.parse_url(input_df["url"])
            hostname  domain suffix subdomain
0     www.google.com  google    com       www
1          gmail.com   gmail    com
2         github.com  github    com
3  pandas.pydata.org  pydata    org    pandas
>>> dns.parse_url(input_df["url"], req_cols={'domain', 'suffix'})
   domain suffix
0  google    com
1   gmail    com
2  github    com
3  pydata    org

Heuristics

clx.heuristics.ports.major_ports(addr_col, port_col, min_conns=1, eph_min=10000)

Find the major ports for each address. This is done by computing the mean number of connections across all ports for each address, then filtering out the ports that fall below that threshold. Also adds a column with the IANA service name corresponding to each port.

Parameters
  • addr_col (cudf.Series) – Column of addresses as strings

  • port_col (cudf.Series) – Column of corresponding port numbers as ints

  • min_conns (int) – Filter out ip:port rows that don’t have at least this number of connections (default: 1)

  • eph_min (int) – Ports greater than or equal to this will be labeled as an ephemeral service (default: 10000)

Returns

DataFrame with columns for address, port, IANA service corresponding to port, and number of connections

Return type

cudf.DataFrame

Examples

>>> import clx.heuristics.ports as ports
>>> import cudf
>>> input_addr_col = cudf.Series(["10.0.75.1","10.0.75.1","10.0.75.1","10.0.75.255","10.110.104.107", "10.110.104.107"])
>>> input_port_col = cudf.Series([137,137,7680,137,7680, 7680])
>>> ports.major_ports(input_addr_col, input_port_col, min_conns=2, eph_min=7000)
            addr  port     service  conns
0      10.0.75.1   137  netbios-ns      2
1 10.110.104.107  7680   ephemeral      2
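The selection logic can be sketched in pure Python. This is illustrative only: the IANA service-name lookup is reduced to a placeholder plus the ephemeral check, and the ordering of the two filters (min_conns before the per-address mean) is an assumption.

```python
from collections import Counter

def major_ports_sketch(addrs, ports, min_conns=1, eph_min=10000):
    counts = Counter(zip(addrs, ports))          # connections per (addr, port)
    kept = {k: n for k, n in counts.items() if n >= min_conns}
    by_addr = {}
    for (addr, port), n in kept.items():
        by_addr.setdefault(addr, []).append((port, n))
    rows = []
    for addr, pairs in sorted(by_addr.items()):
        mean = sum(n for _, n in pairs) / len(pairs)  # mean conns per port
        for port, n in pairs:
            if n >= mean:                        # keep only "major" ports
                service = "ephemeral" if port >= eph_min else "(IANA lookup)"
                rows.append((addr, port, service, n))
    return rows

addrs = ["10.0.75.1"] * 3 + ["10.0.75.255"] + ["10.110.104.107"] * 2
ports = [137, 137, 7680, 137, 7680, 7680]
print(major_ports_sketch(addrs, ports, min_conns=2, eph_min=7000))
# [('10.0.75.1', 137, '(IANA lookup)', 2), ('10.110.104.107', 7680, 'ephemeral', 2)]
```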

OSI (Open Source Integration)

class clx.osi.farsight.FarsightLookupClient(server, apikey, limit=None, http_proxy=None, https_proxy=None)

This class provides functionality to query DNSDB records in various ways, for example by IP address or domain name.

Parameters
  • server – Farsight server

  • header – HTTP headers

  • apikey – API key

  • limit – limit

  • http_proxy – HTTP proxy

  • https_proxy – HTTPS proxy

Methods

query_rdata_ip(rdata_ip[, before, after])

Query DNSDB records matching a specific IP address within the given time range.

query_rdata_name(rdata_name[, rrtype, …])

Query matching a single DNSDB record for the given oname and time range.

query_rrset(oname[, rrtype, bailiwick, …])

Batch version of querying DNSDB by domain name and time range.

query_rdata_ip(rdata_ip, before=None, after=None)

Query DNSDB records matching a specific IP address within the given time range.

query_rdata_name(rdata_name, rrtype=None, before=None, after=None)

Query matching a single DNSDB record for the given oname and time range.

query_rrset(oname, rrtype=None, bailiwick=None, before=None, after=None)

Batch version of querying DNSDB by domain name and time range.

class clx.osi.virus_total.VirusTotalClient(api_key=None, proxies=None)
Attributes
api_key
proxies
vt_endpoint_dict

Methods

domain_report(domain)

Retrieve report using domain.

file_report(*resource)

The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report.

file_rescan(*resource)

This function rescans the given files.

file_scan(file)

This function allows you to send a file for scanning with VirusTotal.

ipaddress_report(ip)

Retrieve report using ip address.

put_comment(resource, comment)

Post comment for a file or URL

scan_big_file(files)

Scanning files larger than 32MB

url_report(*resource)

The resource argument must be the URL to retrieve the most recent report.

url_scan(*url)

This function runs a VirusTotal scan on the provided URL.

domain_report(domain)

Retrieve report using domain.

file_report(*resource)

The resource argument can be the MD5, SHA-1 or SHA-256 of a file for which you want to retrieve the most recent antivirus report. You may also specify a scan_id returned by the /file/scan endpoint.

file_rescan(*resource)

This function rescans the given files. The resource argument can be the MD5, SHA-1 or SHA-256 of the file you want to re-scan.

file_scan(file)

This function allows you to send a file for scanning with VirusTotal. Before submitting, it is advisable to retrieve the latest report on the file. The file size limit is 32MB; in order to submit files up to 200MB in size, you must request a special upload URL using the /file/scan/upload_url endpoint.

ipaddress_report(ip)

Retrieve report using ip address.

put_comment(resource, comment)

Post comment for a file or URL

scan_big_file(files)

Scanning files larger than 32MB

url_report(*resource)

The resource argument must be the URL to retrieve the most recent report.

url_scan(*url)

This function runs a VirusTotal scan on the provided URL.

class clx.osi.whois.WhoIsLookupClient(sep=',', datetime_format='%m-%d-%Y %H:%M:%S')

Methods

whois(domains[, arr2str])

Function to access parsed WHOIS data for a given domain.

whois(domains, arr2str=True)

Function to access parsed WHOIS data for a given domain.

Parsers

class clx.parsers.event_parser.EventParser(columns, event_name)

This is an abstract class for all event log parsers.

Parameters
  • columns (set(string)) – Event column names.

  • event_name (string) – Event name

Attributes
columns

List of columns that are being processed.

event_name

Event name defines the type of logs being processed.

Methods

filter_by_pattern(df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

parse(dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality.

parse_raw_event(dataframe, raw_column, …)

Processes parsing of a specific type of raw event records received as a dataframe.

property columns

List of columns that are being processed.

Returns

Event column names.

Return type

set(string)

property event_name

Event name defines the type of logs being processed.

Returns

Event name

Return type

string

filter_by_pattern(df, column, pattern)

Retrieve only the events that satisfy the given regex pattern.

Parameters
  • df (cudf.DataFrame) – Raw events to be filtered.

  • column (string) – Name of the column containing the raw data.

  • pattern (string) – Regex pattern to retrieve events that are required.

Returns

filtered dataframe.

Return type

cudf.DataFrame

abstract parse(dataframe, raw_column)

Abstract method ‘parse’ triggers the parsing functionality. Subclasses are required to implement and execute any parsing pre-processing steps.

parse_raw_event(dataframe, raw_column, event_regex)

Processes parsing of a specific type of raw event records received as a dataframe.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

  • event_regex (dict) – Required regular expressions for a given event type.

Returns

parsed information.

Return type

cudf.DataFrame
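An illustrative CPU-side sketch of the event_regex idea (hypothetical column names and patterns, not the clx implementation): each output column maps to a regular expression with one capture group applied to the raw event text.

```python
import re

event_regex = {                      # hypothetical patterns
    "eventcode": r"EventCode=(\d+)",
    "account_name": r"Account Name:\s+(\S+)",
}
raw = "EventCode=4624 ... Account Name:  admin"
parsed = {col: (m.group(1) if (m := re.search(rx, raw)) else None)
          for col, rx in event_regex.items()}
print(parsed)  # {'eventcode': '4624', 'account_name': 'admin'}
```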

class clx.parsers.splunk_notable_parser.SplunkNotableParser

This class parses Splunk notable logs.

Methods

parse(dataframe, raw_column)

Parses the Splunk notable raw events.

parse(dataframe, raw_column)

Parses the Splunk notable raw events.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

parsed information.

Return type

cudf.DataFrame

class clx.parsers.windows_event_parser.WindowsEventParser(interested_eventcodes=None)

This class parses Windows event logs.

Parameters

interested_eventcodes (set(int)) – This parameter allows parsing only the eventcodes of interest.

Methods

clean_raw_data(dataframe, raw_column)

Lower casing and replacing escape characters.

get_columns()

Get columns of windows event codes.

parse(dataframe, raw_column)

Parses the Windows raw event.

clean_raw_data(dataframe, raw_column)

Lower casing and replacing escape characters.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Clean raw information.

Return type

cudf.DataFrame

get_columns()

Get columns of windows event codes.

Returns

Columns for the specified eventcodes of interest, or for all configured eventcodes if none were specified.

Return type

set(string)

parse(dataframe, raw_column)

Parses the Windows raw event.

Parameters
  • dataframe (cudf.DataFrame) – Raw events to be parsed.

  • raw_column (string) – Name of the column containing the raw data.

Returns

Parsed information.

Return type

cudf.DataFrame

clx.parsers.zeek.parse_log_file(filepath)

Parse Zeek log file and return cuDF dataframe. Uses header comments to get column names/types and configure parser.

Parameters

filepath (string) – filepath for Zeek log file

Returns

Zeek log dataframe

Return type

cudf.DataFrame
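Zeek logs carry their schema in header comments. A minimal sketch of reading them (standard Zeek TSV format assumed; the real parser also uses the #types line to set column dtypes):

```python
log_text = (
    "#separator \\x09\n"
    "#fields\tts\tid.orig_h\tid.resp_h\n"
    "#types\ttime\taddr\taddr\n"
    "1331904608.080000\t192.168.1.1\t10.0.0.1\n"
)

fields, rows = [], []
for line in log_text.splitlines():
    if line.startswith("#fields"):
        fields = line.split("\t")[1:]         # column names from the header
    elif line and not line.startswith("#"):
        rows.append(dict(zip(fields, line.split("\t"))))

print(fields)                 # ['ts', 'id.orig_h', 'id.resp_h']
print(rows[0]["id.orig_h"])   # 192.168.1.1
```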

Workflow

class clx.workflow.workflow.Workflow(name, source=None, destination=None)
Attributes
destination

Dictionary of configuration parameters for the data destination (writer)

name

Name of the workflow for logging purposes.

source

Dictionary of configuration parameters for the data source (reader)

Methods

benchmark()

Decorator used to capture a benchmark for a given function

run_workflow()

Run workflow.

set_destination(destination)

Set destination.

set_source(source)

Set source.

stop_workflow()

Close workflow.

workflow(dataframe)

The pipeline function performs data enrichment on the input data.

benchmark()

Decorator used to capture a benchmark for a given function

property destination

Dictionary of configuration parameters for the data destination (writer)

property name

Name of the workflow for logging purposes.

run_workflow()

Run workflow. Reader (source) fetches data. Workflow implementation is executed. Workflow output is written to destination.

set_destination(destination)

Set destination.

Parameters

destination – dict of configuration parameters for the destination (writer)

set_source(source)

Set source.

Parameters

source – dict of configuration parameters for data source (reader)

property source

Dictionary of configuration parameters for the data source (reader)

stop_workflow()

Close workflow. This includes calling the close() method on the reader (source) and writer (destination).

abstract workflow(dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. This function will return a GPU dataframe with the enriched data.
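A minimal CPU-side analogue of the pattern (illustrative only; the real Workflow reads and writes via the I/O readers/writers documented below and operates on GPU dataframes): a subclass implements workflow(), and run_workflow() moves data from source through workflow() to destination.

```python
from abc import ABC, abstractmethod

class MiniWorkflow(ABC):
    # Simplified stand-in: run_workflow() fetches from a source, applies
    # the subclass-defined workflow(), and writes to a destination.
    def __init__(self, source, destination):
        self.source = source              # here: a list of records
        self.destination = destination    # here: a list to append to

    @abstractmethod
    def workflow(self, data):
        ...

    def run_workflow(self):
        self.destination.extend(self.workflow(self.source))

class UppercaseWorkflow(MiniWorkflow):
    def workflow(self, data):
        return [s.upper() for s in data]

out = []
UppercaseWorkflow(["alert a", "alert b"], out).run_workflow()
print(out)  # ['ALERT A', 'ALERT B']
```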

class clx.workflow.splunk_alert_workflow.SplunkAlertWorkflow(name, source=None, destination=None, interval='day', threshold=2.5, window=7, raw_data_col_name='_raw')
Attributes
interval

Interval, 'day' or 'hour', over which the z-score will be calculated

raw_data_col_name

Dataframe column name containing raw splunk alert data

threshold

Threshold by which to flag z score.

window

Window by which to calculate rolling z score

Methods

workflow(dataframe)

The pipeline function performs data enrichment on the input data.

property interval

Interval, 'day' or 'hour', over which the z-score will be calculated

property raw_data_col_name

Dataframe column name containing raw splunk alert data

property threshold

Threshold at which to flag the z-score. Scores greater than threshold or less than -threshold will be flagged.

property window

Window by which to calculate rolling z score

workflow(dataframe)

The pipeline function performs data enrichment on the input data. Subclasses must define this function. This function will return a GPU dataframe with the enriched data.

I/O

class clx.io.reader.kafka_reader.KafkaReader(batch_size, consumer, time_window=30)

Reads from Kafka based on config object.

Parameters
  • batch_size – batch size

  • consumer – Kafka consumer

  • time_window – Max window of time that queued events will wait to be pushed to workflow

Attributes
consumer
has_data
time_window

Methods

close()

Close Kafka reader

fetch_data()

Fetch data from Kafka based on provided config object

close()

Close Kafka reader

fetch_data()

Fetch data from Kafka based on provided config object

class clx.io.reader.dask_fs_reader.DaskFileSystemReader(config)

Uses Dask to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path, and dask reader optional keyword args

Methods

close()

Close dask reader

fetch_data()

Fetch data using dask based on provided config object

close()

Close dask reader

fetch_data()

Fetch data using dask based on provided config object

class clx.io.reader.fs_reader.FileSystemReader(config)

Uses cudf to read from file system based on config object.

Parameters

config – dictionary object of config values for type, input_format, input_path, and cudf reader optional keyword args
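A plausible config dict, assuming the key names listed above; the exact value of "type" and the extra cudf reader keyword args shown are assumptions, not documented values:

```python
config = {
    "type": "fs",                # assumption: identifies the reader type
    "input_format": "csv",
    "input_path": "alerts.csv",
    # remaining keys are assumed to be forwarded to the cudf csv reader
    "names": ["raw"],
    "delimiter": ",",
}
print(sorted(config))
```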

Methods

close()

Close cudf reader

fetch_data()

Fetch data using cudf based on provided config object

close()

Close cudf reader

fetch_data()

Fetch data using cudf based on provided config object

class clx.io.writer.kafka_writer.KafkaWriter(kafka_topic, batch_size, delimiter, producer)

Publish to Kafka topic based on config object.

Parameters
  • kafka_topic – Kafka topic

  • batch_size – batch size

  • delimiter – delimiter

  • producer – producer

Attributes
delimiter
producer

Methods

close()

Close Kafka writer

write_data(df)

Publish messages to a Kafka topic

close()

Close Kafka writer

write_data(df)

Publish messages to a Kafka topic

Parameters

df – dataframe to publish

class clx.io.writer.fs_writer.FileSystemWriter(config)

Uses cudf to write to file system based on config object.

Parameters

config – dictionary object of config values for type, output_format, output_path, and cudf writer optional keyword args

Methods

close()

Close cudf writer

write_data(df)

Write data to file system using cudf based on provided config object

close()

Close cudf writer

write_data(df)

Write data to file system using cudf based on provided config object