Exploring the 20 Newsgroups Dataset with Text Analysis Techniques

In the previous chapter, we went through a number of fundamental machine learning concepts, learning them in a fun way through analogies such as studying for exams and designing a driving schedule. Starting from this chapter, as the second step of our learning journey, we will explore several important machine learning algorithms and techniques in detail. Beyond analogies, we will work through and solve real-world examples, which makes the journey more interesting. We will start with a natural language processing problem: exploring newsgroups data. We will gain hands-on experience in working with text data, especially how to convert words and phrases into machine-readable values and how to clean up words with little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised learning manner.

We will go into detail for each of the following topics:

  • What NLP is and its applications
  • NLP basics
  • Touring Python NLP libraries
  • Tokenization
  • Part-of-speech tagging
  • Named entity recognition
  • Stemming and lemmatization
  • Getting and exploring the newsgroups data
  • Data visualization using seaborn and matplotlib
  • The bag of words (BoW) model and token count vectorization
  • Text preprocessing
  • Stop words removal
  • Dimensionality reduction
  • t-SNE
  • t-SNE for text visualization