- Machine Learning for Cybersecurity Cookbook
- Emmanuel Tsukerman
How to do it...
In the following steps, we show three different methods for selecting the most informative N-grams. The recipe assumes that binary_file_to_Ngram_counts(file, N) and all other helper functions from the previous recipe have been included:
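The helper from the previous recipe reads a binary file and returns a counter of its byte N-grams. If you don't have it at hand, a minimal sketch might look like the following (an illustrative simplification, not necessarily the original implementation):
import collections

def binary_file_to_Ngram_counts(file, N):
    """Count the N-grams in the byte sequence of a binary file (sketch)."""
    with open(file, "rb") as f:
        data = f.read()
    # Slide a window of N bytes across the file; each window is one N-gram.
    return collections.Counter(
        tuple(data[i:i + N]) for i in range(len(data) - N + 1)
    )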
- Begin by importing the modules we need to enumerate files and count N-grams, specifying the folders containing our samples, and setting our N:
import collections

import numpy as np
from os import listdir
from os.path import isfile, join

directories = ["Benign PE Samples", "Malicious PE Samples"]
N = 2
- Next, we count all the N-grams from all the files:
Ngram_counts_all_files = collections.Counter()
for dataset_path in directories:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        Ngram_counts_all_files += binary_file_to_Ngram_counts(file_path, N)
- We collect the K1=1000 most frequent N-grams into a list:
K1 = 1000
K1_most_frequent_Ngrams = Ngram_counts_all_files.most_common(K1)
K1_most_frequent_Ngrams_list = [x[0] for x in K1_most_frequent_Ngrams]
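As a quick sanity check, note that most_common returns (N-gram, count) pairs in descending order of frequency, so you can peek at the top entries:
for Ngram, count in K1_most_frequent_Ngrams[:5]:
    print(Ngram, count)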
- A helper method, featurize_sample, will be used to take a sample and output the number of appearances of the most common N-grams in its byte sequence:
def featurize_sample(sample, K1_most_frequent_Ngrams_list):
    """Takes a sample and produces a feature vector.
    The features are the counts of the K1 N-grams we've selected.
    """
    K1 = len(K1_most_frequent_Ngrams_list)
    feature_vector = K1 * [0]
    file_Ngrams = binary_file_to_Ngram_counts(sample, N)
    for i in range(K1):
        feature_vector[i] = file_Ngrams[K1_most_frequent_Ngrams_list[i]]
    return feature_vector
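For instance, featurizing a single sample (the file name here is hypothetical) yields a vector of K1 counts:
sample_path = join("Benign PE Samples", "some_sample.exe")  # hypothetical file
vector = featurize_sample(sample_path, K1_most_frequent_Ngrams_list)
print(len(vector))  # 1000, one count per selected N-gram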
- We iterate through our directories and use the preceding featurize_sample function to featurize our samples. We also create a list of labels:
directories_with_labels = [("Benign PE Samples", 0), ("Malicious PE Samples", 1)]
X = []
y = []
for dataset_path, label in directories_with_labels:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        X.append(featurize_sample(file_path, K1_most_frequent_Ngrams_list))
        y.append(label)
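At this point, X holds one K1-length count vector per sample and y holds the matching labels. A quick consistency check:
assert len(X) == len(y)
print("Featurized", len(X), "samples with", K1, "features each")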
- We import the libraries we will be using for feature selection and specify how many features we would like to narrow down to:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
K2 = 10
- We perform three types of feature selections for our N-grams:
- Frequency: selects the most frequent N-grams. Because the columns of X are already ordered from most to least frequent N-gram, taking the first K2 columns selects exactly the K2 most frequent:
X = np.asarray(X)
X_top_K2_freq = X[:, :K2]
- Mutual information: selects the N-grams ranked highest by the mutual information algorithm:
mi_selector = SelectKBest(mutual_info_classif, k=K2)
X_top_K2_mi = mi_selector.fit_transform(X, y)
- Chi-squared: selects the N-grams ranked highest by the chi-squared algorithm:
chi2_selector = SelectKBest(chi2, k=K2)
X_top_K2_ch2 = chi2_selector.fit_transform(X, y)
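To see which N-grams each selector actually kept, you can map the selector's boolean support mask back onto our list of frequent N-grams (get_support is part of scikit-learn's SelectKBest API):
# Each mask marks the K2 features the selector retained.
mi_support = mi_selector.get_support()
chi2_support = chi2_selector.get_support()
mi_Ngrams = [Ngram for Ngram, kept in zip(K1_most_frequent_Ngrams_list, mi_support) if kept]
chi2_Ngrams = [Ngram for Ngram, kept in zip(K1_most_frequent_Ngrams_list, chi2_support) if kept]
print("Mutual information selected:", mi_Ngrams)
print("Chi-squared selected:", chi2_Ngrams)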