
How to do it…

In the next steps, we will convert a corpus of text data into numerical form, amenable to machine learning algorithms:

  1. First, import a textual dataset:
with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()
  2. Next, count the words and word pairs in the text using the hashing vectorizer, and then weight the counts using tf-idf:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hash unigrams and bigrams into a fixed-size feature space
my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
# Learn idf weights from the hashed counts, then apply them
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
  3. The end result is a sparse matrix with each row being a vector representing one of the texts:
X_train_tf

<180830 x 1048576 sparse matrix of type '<class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>

print(X_train_tf)

The output lists each stored element of the matrix as a (row, column) index pair followed by its tf-idf weight.

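To see the same steps on a small scale, the following is a minimal sketch that runs the recipe's two transformers on a made-up three-line corpus. The name toy_corpus and its contents are assumptions for illustration and are not part of the anonops dataset. Because HashingVectorizer is stateless, the sketch calls transform directly rather than fit_transform.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical three-line corpus standing in for anonops_short.txt
toy_corpus = [
    "join the channel now",
    "the channel is quiet",
    "nobody is here now",
]

# Same settings as the recipe. n_features is left at its default of
# 2**20 = 1048576, which matches the column count of the matrix above.
vectorizer = HashingVectorizer(input="content", ngram_range=(1, 2))
hashed = vectorizer.transform(toy_corpus)  # stateless, so no fit is required

# Reweight the hashed features by inverse document frequency;
# each row is L2-normalized by default.
tfidf = TfidfTransformer(use_idf=True).fit_transform(hashed)

print(tfidf.shape)  # one row per text, 2**20 columns
print(tfidf.nnz)    # number of stored (nonzero) entries
print(tfidf[0])     # stored entries for the first text
```

The column count depends only on n_features and not on the vocabulary, which is why the matrix in the recipe is 1048576 columns wide regardless of how many distinct words the chat logs contain.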