
How to do it...

In the following steps, we show three different methods for selecting the most informative N-grams. The recipe assumes that binary_file_to_Ngram_counts(file, N) and all the other helper functions from the previous recipe are available:

  1. Begin by specifying the folders containing our samples and our value of N, and importing the modules needed to enumerate files:
from os import listdir
from os.path import isfile, join

directories = ["Benign PE Samples", "Malicious PE Samples"]
N = 2

  2. Next, we count all the N-grams from all the files:
import collections

Ngram_counts_all_files = collections.Counter()
for dataset_path in directories:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        Ngram_counts_all_files += binary_file_to_Ngram_counts(file_path, N)
  3. We collect the K1=1000 most frequent N-grams into a list:
K1 = 1000
K1_most_frequent_Ngrams = Ngram_counts_all_files.most_common(K1)
K1_most_frequent_Ngrams_list = [x[0] for x in K1_most_frequent_Ngrams]
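As a toy illustration of this step (using made-up two-byte sequences rather than real PE samples), summing Counters aggregates counts across files, and most_common returns N-grams in descending order of count:

```python
import collections

# Hypothetical 2-gram counts from two tiny "files" (illustrative only)
counts_a = collections.Counter({(0x00, 0x00): 5, (0xFF, 0x15): 2})
counts_b = collections.Counter({(0x00, 0x00): 3, (0x8B, 0xEC): 4})

# Summing Counters aggregates counts across files, as in the recipe
total = counts_a + counts_b

# most_common(K1) yields (ngram, count) pairs, most frequent first
top_2 = [ngram for ngram, _ in total.most_common(2)]
print(top_2)  # the (0x00, 0x00) gram comes first with a count of 8
```

This ordering matters later: the position of each N-gram in the list fixes the position of its count in every feature vector.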
  4. A helper function, featurize_sample, takes a sample and outputs the number of appearances of the most common N-grams in its byte sequence:
def featurize_sample(sample, K1_most_frequent_Ngrams_list):
    """Takes a sample and produces a feature vector.
    The features are the counts of the K1 N-grams we've selected.
    """
    K1 = len(K1_most_frequent_Ngrams_list)
    feature_vector = K1 * [0]
    file_Ngrams = binary_file_to_Ngram_counts(sample, N)
    for i in range(K1):
        feature_vector[i] = file_Ngrams[K1_most_frequent_Ngrams_list[i]]
    return feature_vector
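This works because a Counter returns 0 for keys it has never seen, so N-grams absent from a file simply contribute a 0 feature, and every sample maps onto the same fixed-length vector. A file-free sketch of the same logic, using a precomputed Counter in place of binary_file_to_Ngram_counts (the names and counts here are illustrative):

```python
import collections

def featurize_counts(file_Ngrams, selected_Ngrams):
    """Map a Counter of N-gram counts onto a fixed-length feature vector."""
    # Counter.__getitem__ returns 0 for missing keys, so N-grams absent
    # from the sample yield a 0 feature rather than raising KeyError
    return [file_Ngrams[ngram] for ngram in selected_Ngrams]

selected = [(0x00, 0x00), (0x8B, 0xEC), (0xFF, 0x15)]
sample_counts = collections.Counter({(0x00, 0x00): 7, (0xFF, 0x15): 1})
print(featurize_counts(sample_counts, selected))  # [7, 0, 1]
```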
  5. We iterate through our directories, using the preceding featurize_sample function to featurize our samples, and create the corresponding labels:
directories_with_labels = [("Benign PE Samples", 0), ("Malicious PE Samples", 1)]
X = []
y = []
for dataset_path, label in directories_with_labels:
    all_samples = [f for f in listdir(dataset_path) if isfile(join(dataset_path, f))]
    for sample in all_samples:
        file_path = join(dataset_path, sample)
        X.append(featurize_sample(file_path, K1_most_frequent_Ngrams_list))
        y.append(label)

  6. We import the libraries we will use for feature selection and specify how many features we would like to narrow down to:
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2

K2 = 10
  7. We perform three types of feature selection on our N-grams:
  • Frequency—selects the K2 most frequent N-grams. Since most_common returns N-grams in descending order of frequency, the columns of X are already sorted, and the top K2 are simply the first K2 columns:
X = np.asarray(X)
X_top_K2_freq = X[:, :K2]
  • Mutual information—selects the N-grams ranked highest by the mutual information algorithm:
mi_selector = SelectKBest(mutual_info_classif, k=K2)
X_top_K2_mi = mi_selector.fit_transform(X, y)
  • Chi-squared—selects the N-grams ranked highest by the chi-squared test:
chi2_selector = SelectKBest(chi2, k=K2)
X_top_K2_ch2 = chi2_selector.fit_transform(X, y)
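To see which of the K1 N-grams a selector actually kept, get_support(indices=True) maps the retained columns back to their N-gram names. A minimal sketch with synthetic, non-negative count data (the feature names and counts are invented for illustration; chi2 requires non-negative features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic count features for 6 samples: 3 benign (label 0), 3 malicious (label 1)
X_demo = np.array([
    [9, 0, 1, 2],
    [8, 1, 0, 3],
    [7, 0, 1, 2],
    [1, 9, 8, 2],
    [0, 8, 9, 3],
    [1, 7, 9, 2],
])
y_demo = [0, 0, 0, 1, 1, 1]
names = ["00 00", "8b ec", "ff 15", "e8 00"]  # illustrative 2-gram labels

selector = SelectKBest(chi2, k=2)
selector.fit(X_demo, y_demo)

# get_support(indices=True) gives the column indices the selector retained,
# which we map back to the corresponding N-gram names
kept = [names[i] for i in selector.get_support(indices=True)]
print(kept)
```

The same pattern applied to mi_selector or chi2_selector above, with K1_most_frequent_Ngrams_list in place of names, reveals which byte sequences the selection ranked as most discriminative.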