Book title: Machine Learning for Cybersecurity Cookbook | Author: Emmanuel Tsukerman | Chapter word count: 112 | Last updated: 2021-06-24 12:29:00
How to do it…
In the next steps, we will convert a corpus of text data into numerical form, amenable to machine learning algorithms:
- First, import a textual dataset:
# Read the chat logs, one list entry per line of the file.
with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()
- Next, count the words in the text using the hash vectorizer and then perform weighting using tf-idf:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hash unigrams and bigrams into a fixed-width feature space (2**20 columns by default).
my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
# Learn idf weights from the hashed features, then apply them.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
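To make the hashing step less opaque, here is a minimal sketch of the idea behind it: each token is hashed, and the hash is reduced to a column index in a fixed-width space. The helper name hashed_column and the sample tokens are hypothetical, and scikit-learn's internal implementation may differ in its details (for example, it can also flip the sign of a feature based on the hash).

```python
from sklearn.utils import murmurhash3_32

# 2**20 = 1,048,576 columns, the default n_features of HashingVectorizer.
N_FEATURES = 2 ** 20


def hashed_column(token: str, n_features: int = N_FEATURES) -> int:
    """Map a token to a column index by hashing (hypothetical helper)."""
    h = murmurhash3_32(token, seed=0)  # signed 32-bit MurmurHash3 value
    return abs(h) % n_features


# Hypothetical tokens: two unigrams and one bigram, as ngram_range=(1, 2) would yield.
tokens = ["ddos", "target", "ddos target"]
columns = {tok: hashed_column(tok) for tok in tokens}
print(columns)
```

Because no vocabulary is stored, the mapping needs no fitting and uses constant memory. The trade-off is that distinct tokens can collide in the same column, and a column index cannot be mapped back to its token.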
- The end result is a sparse matrix with each row being a vector representing one of the texts:
X_train_tf
<180830x1048576 sparse matrix of type '<class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>
print(X_train_tf)
The output lists each stored entry as a (row, column) index pair followed by its weight.

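The full matrix is too large to inspect by eye, so the same steps can be tried on a toy corpus. The sketch below assumes three made-up chat lines (they are not taken from anonops_short.txt) and checks two properties of the result: the column count comes from the hasher rather than from the data, and each row is scaled to unit length by TfidfTransformer's default L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical stand-in for the chat logs; each entry plays the role of one line.
toy_logs = [
    "hello everyone\n",
    "anyone here tonight\n",
    "hello again everyone\n",
]

# Same settings as in the recipe: unigrams and bigrams, default 2**20 columns.
vectorizer = HashingVectorizer(input="content", ngram_range=(1, 2))
hashed = vectorizer.fit_transform(toy_logs)

# Fit idf weights on the hashed features and apply them.
tfidf = TfidfTransformer(use_idf=True).fit(hashed)
X_toy = tfidf.transform(hashed)

print(X_toy.shape)  # (3, 1048576): one row per text, one column per hash bucket
print(X_toy.nnz)    # number of stored (non-zero) entries

# Euclidean length of each row; the default norm="l2" should make these 1.0.
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1)).ravel())
print(row_norms)
```

Only a handful of the roughly one million columns are non-zero in each row, which is why the result is stored in a sparse format.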