Exploring the 20 Newsgroups Dataset with Text Analysis Techniques

We went through a number of fundamental machine learning concepts in the previous chapter, learning them through fun analogies such as studying for exams and designing a driving schedule. Starting with this chapter, as the second step of our learning journey, we will explore several important machine learning algorithms and techniques in detail. Beyond analogies, we will work through real-world examples, which makes the journey more interesting. We will start with a natural language processing problem: exploring newsgroups data. We will gain hands-on experience in working with text data, especially how to convert words and phrases into machine-readable values and how to clean up words that carry little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised learning manner.

We will go into detail for each of the following topics:

  • What NLP is and its applications
  • NLP basics
  • Touring Python NLP libraries
  • Tokenization
  • Part-of-speech tagging
  • Named entity recognition
  • Stemming and lemmatization
  • Getting and exploring the newsgroups data
  • Data visualization using seaborn and matplotlib
  • The bag of words (BoW) model and token count vectorization
  • Text preprocessing
  • Stop words removal
  • Dimensionality reduction
  • t-SNE
  • t-SNE for text visualization
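
As a small preview of what converting words into machine-readable values and dropping words with little meaning can look like, here is a minimal sketch that uses only the Python standard library. It is not the implementation used later in the chapter: the `tokenize` and `bag_of_words` helpers, the tiny `STOP_WORDS` set, and the two sample sentences are all hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop word list (illustration only;
# real stop word lists are much longer).
STOP_WORDS = {"the", "a", "is", "on", "and", "of", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, count_vectors) for a list of documents."""
    # Tokenize each document and drop the stop words.
    token_lists = [
        [tok for tok in tokenize(doc) if tok not in stop_words]
        for doc in documents
    ]
    # The vocabulary is the sorted set of all remaining tokens.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    # Each document becomes a vector of token counts, one entry per term.
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat and the cat ran.",
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'ran', 'sat']
print(vectors)     # [[1, 0, 0, 1, 0, 1], [2, 1, 1, 0, 1, 0]]
```

Each document ends up as a row of counts over a shared vocabulary, which is the kind of numeric representation a machine learning algorithm can consume. Word order is discarded, which is where the "bag" in bag of words comes from.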