
How it works...

Both the literature and industry practice suggest that the most frequent N-grams tend to be the most informative ones for a malware classification algorithm. For this reason, in this recipe, we write functions to extract them from a file. We start by importing some helpful libraries for our extraction of N-grams (step 1). In particular, we import the collections module and the ngrams function from nltk. The collections module allows us to convert a list of N-grams into a frequency count of those N-grams, while the ngrams function allows us to take an ordered list of bytes and obtain a list of N-grams. We specify the file we would like to analyze and write a function that reads all of the bytes of a given file (steps 2 and 3). We define a few more convenience functions before we begin the extraction. In particular, we write a function that takes a file's sequence of bytes and outputs a list of its N-grams (step 4), and a function that takes a file and outputs the counts of its N-grams (step 5). We are now ready to pass in a file and extract its N-grams. We do so to extract the counts of the 4-grams of our file (step 6) and then display the 10 most common of them, along with their counts (step 7). We see that some of the N-gram sequences, such as (0,0,0,0) and (255,255,255,255), may not be very informative, since they most likely reflect runs of padding bytes rather than program logic. For this reason, we will utilize feature selection methods to cut out the less informative N-grams in our next recipe.
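The flow described above can be condensed into a minimal sketch. It is not the recipe's exact code: the function names are illustrative assumptions, and a plain sliding window stands in for the ngrams function from nltk so that the example runs with the standard library alone. The sample bytes are made up for demonstration and do not come from a real binary.

```python
import collections


def read_file(file_path):
    """Return the contents of a file as a list of byte values (0-255)."""
    with open(file_path, "rb") as binary_file:
        return list(binary_file.read())


def byte_sequence_to_ngrams(byte_sequence, n):
    """Slide a window of length n over the bytes and return the N-grams as tuples.

    Assumption: this stands in for nltk's ngrams function used in the recipe.
    """
    return [
        tuple(byte_sequence[i:i + n])
        for i in range(len(byte_sequence) - n + 1)
    ]


def extract_ngram_counts(file_path, n):
    """Read a file and return a Counter mapping each N-gram to its frequency."""
    return collections.Counter(byte_sequence_to_ngrams(read_file(file_path), n))


# Demonstration on made-up in-memory bytes rather than a real file:
# six 0x00 bytes, five 0xFF bytes, then four distinct bytes.
sample_bytes = list(bytes([0] * 6 + [255] * 5 + [1, 2, 3, 4]))
sample_counts = collections.Counter(byte_sequence_to_ngrams(sample_bytes, 4))

# The two most common 4-grams are the padding-like runs.
print(sample_counts.most_common(2))
# [((0, 0, 0, 0), 3), ((255, 255, 255, 255), 2)]
```

For a real sample, calling extract_ngram_counts(path, 4).most_common(10) would correspond to steps 6 and 7. Even on this toy input, the runs of identical bytes dominate the counts, which is why frequency alone is a rough signal and a feature selection step follows.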
