- Machine Learning for Cybersecurity Cookbook
- Emmanuel Tsukerman
- 2021-06-24 12:29:00
How to do it…
In the next steps, we will convert a corpus of text data into numerical form, amenable to machine learning algorithms:
- First, import a textual dataset:
with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()
- Next, count the words and word pairs (unigrams and bigrams) in the text using the hashing vectorizer, and then weight the counts using tf-idf:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
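# Hash unigrams and bigrams into a fixed-width sparse feature matrix
my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
# Reweight the hashed counts by inverse document frequency
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)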
- The end result is a sparse matrix with each row being a vector representing one of the texts:
X_train_tf
<180830 x 1048576 sparse matrix of type '<class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>
print(X_train_tf)
The output lists each stored entry as a (row, column) index pair followed by its tf-idf weight.

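To make the two steps above more concrete, here is a minimal pure-Python sketch of the hashing trick followed by tf-idf weighting on a toy corpus. It is an illustration under stated assumptions, not scikit-learn's implementation: the names `ngrams`, `hashed_counts`, `tfidf`, and `toy_corpus` are hypothetical, MD5 stands in for the signed MurmurHash3 that `HashingVectorizer` uses, and the feature width is 16 rather than the default 2**20 = 1048576 columns seen in the matrix above. The idf term follows the smoothed form that `TfidfTransformer` documents as its default, ln((1 + n) / (1 + df)) + 1, and each row is then L2-normalized.

```python
import hashlib
import math
from collections import Counter

N_FEATURES = 16  # tiny on purpose; HashingVectorizer defaults to 2**20 = 1048576


def ngrams(text, ngram_range=(1, 2)):
    """Yield word n-grams (unigrams and bigrams by default) from lowercased text."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def hashed_counts(text, n_features=N_FEATURES):
    """Map each n-gram to a column index by hashing and count hits per column."""
    counts = Counter()
    for gram in ngrams(text):
        # Assumption: MD5 is chosen only because it is stable across runs;
        # scikit-learn uses a signed MurmurHash3 instead.
        digest = hashlib.md5(gram.encode("utf8")).hexdigest()
        counts[int(digest, 16) % n_features] += 1
    return counts


def tfidf(rows):
    """Reweight hashed counts by smoothed idf, then L2-normalize each row."""
    n_docs = len(rows)
    # Document frequency: in how many rows does each column appear?
    doc_freq = Counter(col for row in rows for col in row)
    weighted = []
    for row in rows:
        vec = {
            col: tf * (math.log((1 + n_docs) / (1 + doc_freq[col])) + 1)
            for col, tf in row.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        weighted.append({col: v / norm for col, v in vec.items()} if norm else vec)
    return weighted


toy_corpus = ["hello channel", "hello again channel", "anyone here"]
rows = [hashed_counts(doc) for doc in toy_corpus]
for doc, vec in zip(toy_corpus, tfidf(rows)):
    print(doc, "->", sorted(vec.items()))
```

Because column indices come from a hash rather than a learned vocabulary, no vocabulary has to be stored and the matrix width is fixed in advance. The trade-off is that distinct n-grams can collide in the same column, which is far more likely with 16 columns than with 2**20.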