Introduction

In the previous chapter, we learned about the concepts of Natural Language Processing (NLP) and text analytics, and we briefly looked at various pre-processing steps. In this chapter, we will learn how to deal with text data, most of which is unstructured. Unstructured data does not fit naturally into a tabular format, so it must be converted into numeric features, because most machine learning algorithms can work only with numbers. We will put more emphasis on steps such as tokenization, stemming, lemmatization, and stop-word removal. You will also learn about two popular methods for feature extraction, bag of words and Term Frequency-Inverse Document Frequency (TF-IDF), as well as various methods for creating new features from existing ones. Finally, you will become familiar with how text data can be visualized.
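
To make the idea of converting text into numeric features concrete, here is a minimal sketch of a bag-of-words representation using only the Python standard library. The helper names `tokenize` and `bag_of_words` and the two sample sentences are illustrative assumptions, not taken from the book, and the tokenizer is deliberately naive (whitespace splitting plus punctuation stripping).

```python
from collections import Counter

PUNCTUATION = ".,!?;:"

def tokenize(text):
    """Naive tokenizer (an assumption for illustration): split on
    whitespace, strip surrounding punctuation, and lowercase."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]

def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    # The vocabulary is every distinct token across all documents.
    vocab = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Each vector position holds the count of one vocabulary term.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["The cat sat.", "The dog sat on the cat."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each document becomes a row of numbers with one column per vocabulary term, which is the kind of tabular, numeric input that most machine learning algorithms expect.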
