
Exploring the 20 Newsgroups Dataset with Text Analysis Techniques

In the previous chapter, we covered a number of fundamental machine learning concepts, using analogies such as studying for exams and designing a driving schedule to make them approachable. This chapter is the second step of our learning journey: we will look in detail at several important machine learning algorithms and techniques. Beyond analogies, we will work through real-world examples, starting with a natural language processing problem: exploring newsgroups data. We will gain hands-on experience with text data, in particular how to convert words and phrases into machine-readable values and how to remove words that carry little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised manner.

We will go into detail for each of the following topics:

  • What NLP is and its applications
  • NLP basics
  • Touring Python NLP libraries
  • Tokenization
  • Part-of-speech tagging
  • Named entity recognition
  • Stemming and lemmatization
  • Getting and exploring the newsgroups data
  • Data visualization using seaborn and matplotlib
  • The bag-of-words (BoW) model and token count vectorization
  • Text preprocessing
  • Stop words removal
  • Dimensionality reduction
  • t-SNE
  • t-SNE for text visualization
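
To give an early feel for what "converting words into machine-readable values" and "removing words with little meaning" can look like, here is a minimal sketch of bag-of-words counting with stop word removal. It uses only the Python standard library and is not the approach taken later in the chapter; the names `tokenize`, `bag_of_words`, and `STOP_WORDS`, the tiny stop word list, and the two sample sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"the", "a", "is", "on", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, count_matrix) for a list of documents.

    Each row of count_matrix holds, for one document, how many times
    each vocabulary word occurs after stop words are dropped.
    """
    token_lists = [
        [t for t in tokenize(doc) if t not in stop_words]
        for doc in documents
    ]
    # Sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up sample documents.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat and the cat ran.",
]
vocab, counts = bag_of_words(docs)
print(vocab)   # ['cat', 'chased', 'dog', 'mat', 'ran', 'sat']
print(counts)  # [[1, 0, 0, 1, 0, 1], [2, 1, 1, 0, 1, 0]]
```

Each document becomes a row of word counts, and word order is discarded, which is the defining simplification of the bag-of-words model. Real stop word lists are far longer than the one assumed here.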