
Extracting N-grams

In standard quantitative analysis of text, N-grams are sequences of N tokens (for example, words or characters). For instance, given the text The quick brown fox jumped over the lazy dog, if our tokens are words, then the 1-grams are the, quick, brown, fox, jumped, over, the, lazy, and dog. The 2-grams are the quick, quick brown, brown fox, and so on. The 3-grams are the quick brown, quick brown fox, brown fox jumped, and so on. Just as the local statistics of a text allowed us to build a Markov chain for statistical prediction and text generation from a corpus, N-grams allow us to model the local statistical properties of our corpus. Our ultimate goal is to utilize the counts of N-grams to help us predict whether a sample is malicious or benign. In this recipe, we demonstrate how to extract N-gram counts from a sample.
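Before turning to the recipe itself, here is a minimal Python sketch of the idea of N-gram counting on the example sentence above. It is not the recipe's exact code; the whitespace tokenization and the helper name extract_ngram_counts are assumptions made purely for illustration.

```python
import collections
from typing import List

def extract_ngram_counts(tokens: List[str], n: int) -> collections.Counter:
    """Count every contiguous sequence of n tokens (an N-gram) in a token list.

    This is an illustrative sketch: tokens here are words obtained by a
    simple whitespace split, but the same sliding-window idea applies to
    character or byte tokens drawn from a binary sample.
    """
    # zip over n shifted views of the list yields each window of n tokens
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return collections.Counter(ngrams)

if __name__ == "__main__":
    text = "The quick brown fox jumped over the lazy dog"
    tokens = text.lower().split()
    # The three most frequent 2-grams in the example sentence
    print(extract_ngram_counts(tokens, 2).most_common(3))
```

Counting with collections.Counter keeps the most common N-grams readily accessible, which matters later when only the top N-grams of a corpus are kept as features.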
