
How it works...

Both the research literature and industry practice have found that the most frequent N-grams tend to be the most informative ones for a malware classification algorithm. For this reason, in this recipe, we write functions to extract them from a file. We start by importing some helpful libraries for our extraction of N-grams (step 1). In particular, we import the collections module and the ngrams function from nltk. The collections module allows us to convert a list of N-grams into a frequency count of those N-grams, while the ngrams function allows us to take an ordered list of bytes and obtain a list of N-grams. We specify the file we would like to analyze and write a function that reads all of the bytes of a given file (steps 2 and 3). We define a few more convenience functions before we begin the extraction. In particular, we write a function that takes a file's sequence of bytes and outputs a list of its N-grams (step 4), and a function that takes a file and outputs the counts of its N-grams (step 5). We are now ready to pass in a file and extract its N-grams. We do so to extract the counts of the 4-grams of our file (step 6) and then display the 10 most common of them, along with their counts (step 7). We see that some of the N-gram sequences, such as (0,0,0,0) and (255,255,255,255), may not be very informative. For this reason, we will utilize feature selection methods to cut out the less informative N-grams in our next recipe.
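The steps described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's exact code: the function names (read_file, byte_sequence_to_ngrams, extract_ngram_counts) and the sample byte string are assumptions made for this example, and a zip-based sliding window stands in for the ngrams function from nltk so that the sketch runs with the standard library alone.

```python
import collections
import os
import tempfile


def read_file(file_path):
    """Return the file's contents as a list of integer byte values (0-255)."""
    with open(file_path, "rb") as binary_file:
        return list(binary_file.read())


def byte_sequence_to_ngrams(byte_sequence, n):
    """Return every run of n consecutive bytes as a tuple, in order.

    Stand-in for nltk's ngrams: zipping n shifted views of the
    sequence yields the same sliding window of tuples.
    """
    return list(zip(*(byte_sequence[i:] for i in range(n))))


def extract_ngram_counts(file_path, n):
    """Return a Counter mapping each N-gram in the file to its frequency."""
    return collections.Counter(byte_sequence_to_ngrams(read_file(file_path), n))


# Hypothetical sample data standing in for a real binary under analysis.
sample_bytes = bytes([0] * 8 + [255] * 6 + [1, 2, 3, 4])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

# Count the 4-grams, then keep the 10 most common along with their counts.
ngram_counts = extract_ngram_counts(sample_path, 4)
top_ngrams = ngram_counts.most_common(10)
os.remove(sample_path)

for ngram, count in top_ngrams:
    print(ngram, count)
```

Even on this tiny made-up input, the padding-like sequences (0, 0, 0, 0) and (255, 255, 255, 255) come out on top, which illustrates why raw frequency alone is usually followed by a feature selection step.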
