Summary

In this chapter, you learned about the various types of data and ways to handle unstructured text data. Text data is usually very noisy and needs to be cleaned and preprocessed; preprocessing mainly consists of tokenization, stemming, lemmatization, and stop-word removal. After preprocessing, features are extracted from the text using methods such as bag-of-words (BoW) and TF-IDF, which convert unstructured text into structured numeric data. New features can also be derived from existing ones, a technique called feature engineering. In the last part of the chapter, we explored ways of visualizing text data, such as word clouds.
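As a compact recap of the pipeline summarized above, the following sketch runs tokenization, stop-word removal, BoW counting, and TF-IDF weighting on a toy corpus using only the Python standard library. The corpus, the stop-word list, and the function names are hypothetical and chosen purely for illustration. Stemming and lemmatization are left out for brevity, and the weighting uses one common variant, idf = log(N / df); libraries often apply smoothed or normalized versions, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]
STOP_WORDS = {"the", "on", "and", "are", "a", "an", "is"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = log(N / df), one common TF-IDF variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[v[j] * idf[j] for j in range(n_terms)] for v in vectors]


vocab, bow = bag_of_words(DOCS)
weights = tf_idf(bow)
```

Because no stemming or lemmatization is applied here, "cat" and "cats" end up as separate vocabulary entries, which is exactly the kind of redundancy those preprocessing steps are meant to reduce. Terms that occur in more documents receive a lower weight than terms confined to a single document.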

In the next chapter, you will learn how to develop machine learning models that classify texts using the feature extraction methods covered in this chapter. Different sampling techniques and model evaluation parameters will also be introduced.
