- Machine Learning for Cybersecurity Cookbook
- Emmanuel Tsukerman
- 2021-06-24 12:29:09
How it works…
Unlike the previous recipe, in which we analyzed a single file's N-grams, in this recipe we look at a large collection of files to understand which N-grams are the most informative features. We start by specifying the folders containing our samples, choosing our value of N, and importing some modules to enumerate files (step 1). We then count all N-grams across all files in our dataset (step 2), which lets us find the globally most frequent N-grams. Of these, we keep only the K1=1000 most frequent ones (step 3). Next, we introduce a helper function, featurizeSample, which takes a sample and outputs the number of times each of the K1 most common N-grams appears in its byte sequence (step 4). We then iterate through our directories of files, using featurizeSample to featurize each sample and recording its label as malicious or benign (step 5). The labels matter because an N-gram is informative only to the extent that it helps discriminate between the malicious and benign classes.
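Steps 2 through 5 can be sketched roughly as follows. This is a minimal illustration rather than the recipe's own code: the samples are in-memory byte strings instead of files read from folders, the names `byte_ngrams`, `count_ngrams`, and `featurize_sample` are hypothetical stand-ins, and the values of `N` and `K1` are shrunk so the toy data stays readable.

```python
from collections import Counter

N = 2    # N-gram length (illustrative value)
K1 = 3   # number of most frequent N-grams to keep (the text uses 1000)


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def count_ngrams(samples, n):
    """Count N-grams across all samples (step 2)."""
    total = Counter()
    for data in samples:
        total.update(byte_ngrams(data, n))
    return total


def featurize_sample(data, top_ngrams, n):
    """Return how often each of the top N-grams occurs in one sample (step 4)."""
    counts = Counter(byte_ngrams(data, n))
    return [counts[gram] for gram in top_ngrams]


# Toy stand-ins for the contents of the benign and malicious folders.
benign = [b"abababab", b"abcabc"]
malicious = [b"xyxyxyab", b"xyzxyz"]
samples = benign + malicious
labels = [0] * len(benign) + [1] * len(malicious)  # 0 = benign, 1 = malicious

# Step 3: keep only the K1 globally most frequent N-grams.
total_counts = count_ngrams(samples, N)
top_ngrams = [gram for gram, _ in total_counts.most_common(K1)]

# Step 5: featurize every sample; labels were recorded alongside.
X = [featurize_sample(s, top_ngrams, N) for s in samples]
```

Each row of `X` lines up with the entry of `labels` at the same index, which is what the later feature selection step relies on.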
We import the SelectKBest class, which selects the best features according to a score function, along with two score functions, mutual information and chi-squared (step 6). Finally, we apply three different feature selection schemes to choose the best N-grams and transform our features accordingly (step 7). In the first method, we simply select the K2 most frequent N-grams. This method is often recommended in the literature and is the easiest of the three because it requires neither labels nor extensive computation. In the second method, we use mutual information to narrow the features down to K2, while in the third, we use chi-squared to do so.
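The three selection schemes might look something like the sketch below, which uses scikit-learn's `SelectKBest`, `chi2`, and `mutual_info_classif`. The count matrix is made up for illustration, its columns are assumed to be ordered from most to least frequent N-gram, and `K2` is tiny. Treating the counts as discrete features in the mutual information score is also an assumption made here to keep the result deterministic; it is not necessarily what the recipe does.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

K2 = 2  # number of N-gram features to keep (illustrative value)

# Hypothetical N-gram count matrix: one row per sample, one column per N-gram.
# Columns are assumed to be sorted by decreasing global frequency.
X = np.array([
    [9, 5, 0, 4],
    [8, 6, 0, 5],
    [9, 5, 1, 4],
    [8, 0, 7, 5],
    [9, 1, 6, 4],
    [8, 0, 8, 5],
])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = benign, 1 = malicious

# Method 1: keep the K2 most frequent N-grams; no labels needed.
X_top_freq = X[:, :K2]

# Method 2: keep the K2 N-grams with the highest mutual information.
# Assumption: counts are scored as discrete values.
mi_score = partial(mutual_info_classif, discrete_features=True)
mi_selector = SelectKBest(mi_score, k=K2)
X_top_mi = mi_selector.fit_transform(X, y)

# Method 3: keep the K2 N-grams with the highest chi-squared statistic.
chi2_selector = SelectKBest(chi2, k=K2)
X_top_chi2 = chi2_selector.fit_transform(X, y)

# Indices of the columns each label-aware method retained.
mi_columns = mi_selector.get_support(indices=True)
chi2_columns = chi2_selector.get_support(indices=True)
```

In this toy matrix the two most frequent columns barely differ between classes, so the frequency-based method keeps a column that the label-aware methods discard. Real data may or may not behave this way.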