
Natural language processing using a hashing vectorizer and tf-idf with scikit-learn

We often find in data science that the objects we wish to analyze are textual. For example, they might be tweets, articles, or network logs. Since our algorithms require numerical inputs, we must find a way to convert such text into numerical features. To this end, we utilize a sequence of techniques.

A token is a unit of text. For example, we may specify that our tokens are words, sentences, or characters. A count vectorizer takes textual input and outputs a vector consisting of the counts of the textual tokens. A hashing vectorizer is a variation on the count vectorizer that sets out to be faster and more scalable, at the cost of interpretability and hashing collisions. Though it can be useful, just having the counts of the words appearing in a document corpus can be misleading, because unimportant words, such as the and a (known as stop words), occur very frequently and hence carry little informative content. To offset this, we often give words different weights. The main technique for doing so is tf-idf, which stands for term frequency-inverse document frequency. The main idea is that we account for the number of times a term occurs, but discount it by the number of documents it occurs in.
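As a rough sketch of how these pieces fit together in scikit-learn, the snippet below hashes word tokens into a fixed-size vector and then re-weights them with tf-idf. The toy corpus and the n_features value are illustrative choices, not part of the recipe's dataset:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical corpus of short, log-like documents (illustrative only)
corpus = [
    "user admin logged in from 10.0.0.1",
    "failed login attempt for user admin",
    "the server restarted after the update",
]

# Step 1: hash word tokens into a fixed-size sparse vector of counts.
# HashingVectorizer is fast and stateless, but feature names cannot be
# recovered and distinct tokens may collide in the same bucket.
hash_vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
counts = hash_vectorizer.fit_transform(corpus)

# Step 2: re-weight the raw term frequencies with tf-idf so that terms
# appearing in many documents (for example, stop words) are discounted.
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)

print(features.shape)  # (3, 1024) sparse matrix ready for a downstream model
```

The resulting sparse matrix can be fed directly to most scikit-learn estimators; the trade-off is that, unlike a plain count vectorizer, you cannot map a column back to the word that produced it.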

In cybersecurity, text data is omnipresent; event logs, conversational transcripts, and lists of function names are just a few examples. Consequently, it is essential to be able to work with such data, something you'll learn in this recipe.
