- Machine Learning for Cybersecurity Cookbook
- Emmanuel Tsukerman
- 2021-06-24 12:29:00
Natural language processing using a hashing vectorizer and tf-idf with scikit-learn
We often find in data science that the objects we wish to analyze are textual. For example, they might be tweets, articles, or network logs. Since our algorithms require numerical inputs, we must find a way to convert such text into numerical features. To this end, we utilize a sequence of techniques.
A token is a unit of text. For example, we may specify that our tokens are words, sentences, or characters. A count vectorizer takes textual input and outputs a vector of token counts. A hashing vectorizer is a variation on the count vectorizer that is faster and more scalable, at the cost of reduced interpretability and the possibility of hash collisions.
Although useful, raw counts of the words appearing in a document corpus can be misleading. The reason is that unimportant words, such as "the" and "a" (known as stop words), often occur with high frequency yet carry little informative content. To offset this, we give words different weights. The main technique for doing so is tf-idf, which stands for Term-Frequency, Inverse-Document-Frequency. The main idea is that we account for the number of times a term occurs, but discount it by the number of documents it occurs in, so that terms appearing in nearly every document receive less weight.
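The three techniques described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the recipe's own code: the toy `corpus` of log-like lines and the choice of `n_features=2**10` are assumptions made for the example. The vectorizers use scikit-learn's default word-level tokenization, and `TfidfTransformer` uses its default smoothed weighting, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
)

# Hypothetical toy corpus of log-like lines (an assumption for illustration).
corpus = [
    "failed login for user admin",
    "failed login for user root",
    "user admin logged out",
]

# Count vectorizer: learns a vocabulary, then emits one column per token
# holding the raw count of that token in each document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

# Hashing vectorizer: stores no vocabulary. Each token is hashed into one of
# n_features columns, so it scales well but columns are not interpretable
# and distinct tokens may collide. alternate_sign=False and norm=None keep
# the output as plain non-negative counts.
hash_vec = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
hashed = hash_vec.transform(corpus)

# tf-idf: reweight the hashed counts so that terms occurring in many
# documents are discounted. Rows are L2-normalized by default.
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(hashed)

# For interpretability, compute idf weights on the count matrix, where
# each column still maps back to a known token.
idf = TfidfTransformer().fit(counts).idf_
idf_by_token = {token: idf[col] for token, col in count_vec.vocabulary_.items()}

print(counts.shape)    # documents x vocabulary size
print(hashed.shape)    # documents x n_features
print(weighted.shape)  # same shape as the hashed matrix
print(idf_by_token["user"], idf_by_token["root"])
```

In this toy corpus, "user" appears in every document while "root" appears in only one, so "user" receives the smaller idf weight. The hashed matrix cannot be inspected this way, which is the interpretability cost mentioned above.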
In cybersecurity, text data is omnipresent: event logs, conversational transcripts, and lists of function names are just a few examples. Consequently, it is essential to be able to work with such data, which is what you'll learn to do in this recipe.