Exploring the 20 Newsgroups Dataset with Text Analysis Techniques
We covered several fundamental machine learning concepts in the previous chapter, learning them through fun analogies such as studying for exams and designing a driving schedule. Starting with this chapter, the second step of our learning journey, we will explore several important machine learning algorithms and techniques in detail. Beyond analogies, we will work through real-world examples, which makes the journey more interesting. We will start with a natural language processing problem: exploring newsgroups data. We will gain hands-on experience in working with text data, especially in converting words and phrases into machine-readable values and in cleaning up words that carry little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised manner.
We will go into detail for each of the following topics:
What NLP is and where it is applied
NLP basics
Touring Python NLP libraries
Tokenization
Part-of-speech tagging
Named entity recognition
Stemming and lemmatization
Getting and exploring the newsgroups data
Data visualization using seaborn and matplotlib
The bag-of-words (BoW) model and token count vectorization