
Summary

In this chapter, you learned about various types of data and ways to deal with unstructured text data. Text data is usually untidy and needs to be cleaned and pre-processed. The main pre-processing steps are tokenization, stemming, lemmatization, and stop-word removal. After pre-processing, features are extracted from the text using methods such as bag of words (BoW) and TF-IDF, which convert unstructured text data into structured numeric data. Feature engineering then creates new features from existing ones. In the last part of the chapter, we explored ways of visualizing text data, such as word clouds.
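As a compact recap, the pipeline summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the chapter's own code: the toy corpus, the tiny stop-word list, and the helper names (`preprocess`, `bag_of_words`, `tf_idf`) are all hypothetical, and the TF-IDF weighting uses the simple `count * log(N / df)` variant, which differs from the smoothed formulas some libraries apply. Stemming and lemmatization are left out because they normally rely on an external library.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]
STOP_WORDS = {"the", "on", "and", "are", "a", "an", "is"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(token_lists):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(vocab, bow_vectors):
    """Weight raw counts by idf = log(N / df), one simple textbook variant."""
    n_docs = len(bow_vectors)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in bow_vectors if vec[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return [[count * w for count, w in zip(vec, idf)] for vec in bow_vectors]


token_lists = [preprocess(doc) for doc in DOCS]
vocab, bow = bag_of_words(token_lists)
weights = tf_idf(vocab, bow)

for doc, vec in zip(DOCS, weights):
    print(doc, [round(w, 3) for w in vec])
```

Because this sketch skips stemming and lemmatization, "cat" and "cats" remain separate features; merging such variants is exactly what those two pre-processing steps are for.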

In the next chapter, you will learn how to develop machine learning models that classify texts using the features you learned to extract in this chapter. The chapter also introduces different sampling techniques and model evaluation parameters.
