
How to do it…

In the following steps, we will convert a corpus of text data into a numerical form that is amenable to machine learning algorithms:

  1. First, load the textual dataset:
with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()
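
If you want to verify what was loaded, a quick optional check such as the following can help; the exact line count and contents depend on your copy of anonops_short.txt:

# Optional sanity check: readlines() returns a list of strings, one per chat line.
print(type(anonops_chat_logs))    # <class 'list'>
print(len(anonops_chat_logs))     # number of chat lines in the corpus
print(anonops_chat_logs[0][:80])  # preview of the first line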
  2. Next, count the words in the text using the hashing vectorizer, and then perform tf-idf weighting:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
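
As a side note, scikit-learn also provides TfidfVectorizer, which combines the counting and tf-idf weighting steps. The following sketch is an alternative to the two-step approach above, not part of the recipe; unlike the stateless HashingVectorizer, it builds an in-memory vocabulary, which uses more memory on large corpora but preserves the mapping from features back to words:

from sklearn.feature_extraction.text import TfidfVectorizer

# Alternative sketch: one-step counting + tf-idf with an explicit vocabulary.
alt_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_tf_alt = alt_vectorizer.fit_transform(anonops_chat_logs)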
  3. The end result is a sparse matrix, with each row being a vector that represents one of the texts:
X_train_tf

<180830x1048576 sparse matrix of type '<class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>

print(X_train_tf)

Printing the matrix displays its non-zero entries as (row, column) index pairs, each followed by the corresponding tf-idf weight.
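
The 180,830 rows correspond to the lines of the chat log, while the 1,048,576 columns are the 2**20 hash buckets that HashingVectorizer uses by default, so the column indices are hashed feature positions rather than readable words. As an optional sketch, you can inspect the non-zero weights of a single document vector like this:

# Optional: examine the non-zero tf-idf weights of the first chat line.
first_row = X_train_tf[0]   # a 1 x 1048576 sparse row vector
print(first_row.indices)    # hashed feature (column) indices
print(first_row.data)       # corresponding tf-idf weights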
