
How to do it...

In the following steps, we show three different methods for selecting the most informative N-grams. The recipe assumes that binary_file_to_Ngram_counts(file, N) and all the other helper functions from the previous recipe are available:

  1. Begin by specifying the folders containing our samples and our value of N, and importing the modules needed to enumerate files:
from os import listdir
from os.path import isfile, join

directories = ["Benign PE Samples", "Malicious PE Samples"]
N = 2

  2. Next, we count all the N-grams from all the files:
import collections

Ngram_counts_all_files = collections.Counter()
for dataset_path in directories:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        Ngram_counts_all_files += binary_file_to_Ngram_counts(file_path, N)
  3. We collect the K1=1000 most frequent N-grams into a list:
K1 = 1000
K1_most_frequent_Ngrams = Ngram_counts_all_files.most_common(K1)
K1_most_frequent_Ngrams_list = [x[0] for x in K1_most_frequent_Ngrams]
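As a toy illustration of this step (using made-up two-byte sequences rather than real PE samples), summing Counters aggregates counts across files, and most_common returns N-grams in descending order of count:

```python
import collections

# Hypothetical 2-gram counts from two tiny "files" (illustrative only)
counts_a = collections.Counter({(0x00, 0x00): 5, (0xFF, 0x15): 2})
counts_b = collections.Counter({(0x00, 0x00): 3, (0x8B, 0xEC): 4})

# Summing Counters aggregates counts across files, as in the recipe
total = counts_a + counts_b

# most_common(K1) yields (ngram, count) pairs, most frequent first
top_2 = [ngram for ngram, _ in total.most_common(2)]
print(top_2)  # the (0x00, 0x00) gram comes first with a count of 8
```

This ordering matters later: the position of each N-gram in the list fixes the position of its count in every feature vector.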
  4. A helper function, featurize_sample, takes a sample and outputs the number of appearances of the most common N-grams in its byte sequence:
def featurize_sample(sample, K1_most_frequent_Ngrams_list):
    """Takes a sample and produces a feature vector.
    The features are the counts of the K1 N-grams we've selected.
    """
    K1 = len(K1_most_frequent_Ngrams_list)
    feature_vector = K1 * [0]
    file_Ngrams = binary_file_to_Ngram_counts(sample, N)
    for i in range(K1):
        feature_vector[i] = file_Ngrams[K1_most_frequent_Ngrams_list[i]]
    return feature_vector
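This works because a Counter returns 0 for keys it has never seen, so N-grams absent from a file simply contribute a 0 feature, and every sample maps onto the same fixed-length vector. A file-free sketch of the same logic, using a precomputed Counter in place of binary_file_to_Ngram_counts (the names and counts here are illustrative):

```python
import collections

def featurize_counts(file_Ngrams, selected_Ngrams):
    """Map a Counter of N-gram counts onto a fixed-length feature vector."""
    # Counter.__getitem__ returns 0 for missing keys, so N-grams absent
    # from the sample yield a 0 feature rather than raising KeyError
    return [file_Ngrams[ngram] for ngram in selected_Ngrams]

selected = [(0x00, 0x00), (0x8B, 0xEC), (0xFF, 0x15)]
sample_counts = collections.Counter({(0x00, 0x00): 7, (0xFF, 0x15): 1})
print(featurize_counts(sample_counts, selected))  # [7, 0, 1]
```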
  5. We iterate through our directories, using the preceding featurize_sample function to featurize our samples, and create the corresponding labels:
directories_with_labels = [("Benign PE Samples", 0), ("Malicious PE Samples", 1)]
X = []
y = []
for dataset_path, label in directories_with_labels:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        X.append(featurize_sample(file_path, K1_most_frequent_Ngrams_list))
        y.append(label)

  6. We import the libraries we will use for feature selection and specify how many features we would like to narrow down to:
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2

K2 = 10
  7. We perform three types of feature selection on our N-grams:
  • Frequency—selects the K2 most frequent N-grams. Since most_common returns N-grams in descending order of frequency, the columns of X are already sorted, and the top K2 are simply the first K2 columns:
X = np.asarray(X)
X_top_K2_freq = X[:, :K2]
  • Mutual information—selects the N-grams ranked highest by the mutual information algorithm:
mi_selector = SelectKBest(mutual_info_classif, k=K2)
X_top_K2_mi = mi_selector.fit_transform(X, y)
  • Chi-squared—selects the N-grams ranked highest by the chi-squared test:
chi2_selector = SelectKBest(chi2, k=K2)
X_top_K2_ch2 = chi2_selector.fit_transform(X, y)
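To see which of the K1 N-grams a selector actually kept, get_support(indices=True) maps the retained columns back to their N-gram names. A minimal sketch with synthetic, non-negative count data (the feature names and counts are invented for illustration; chi2 requires non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic count features for 6 samples: 3 benign (label 0), 3 malicious (label 1)
X_demo = np.array([
    [9, 0, 1, 2],
    [8, 1, 0, 3],
    [7, 0, 1, 2],
    [1, 9, 8, 2],
    [0, 8, 9, 3],
    [1, 7, 9, 2],
])
y_demo = [0, 0, 0, 1, 1, 1]
names = ["00 00", "8b ec", "ff 15", "e8 00"]  # illustrative 2-gram labels

selector = SelectKBest(chi2, k=2)
selector.fit(X_demo, y_demo)

# get_support(indices=True) gives the column indices the selector retained,
# which we map back to the corresponding N-gram names
kept = [names[i] for i in selector.get_support(indices=True)]
print(kept)
```

The same pattern applied to mi_selector or chi2_selector above, with K1_most_frequent_Ngrams_list in place of names, reveals which byte sequences the selection ranked as most discriminative.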