- Natural Language Processing with Java and LingPipe Cookbook
- Breck Baldwin, Krishna Dayanidhi
- 2021-08-05 17:12:47
What this book covers
Chapter 1, Simple Classifiers, explains that a huge percentage of NLP problems are actually classification problems. This chapter covers very simple but powerful classifiers based on character sequences and then brings in evaluation techniques such as cross-validation and metrics such as precision, recall, and the always-BS-resisting confusion matrix. You get to train classifiers yourself on data that you download from Twitter. The chapter ends with a simple sentiment example.
Chapter 2, Finding and Working with Words, is exactly as boring as it sounds, but there are some high points. The last recipe will show you how to tokenize languages such as Chinese, Japanese, and Vietnamese, which don't use whitespace to delimit words. We will also show you how to wrap Lucene tokenizers, which cover all kinds of fun languages such as Arabic. Almost everything later in the book relies on tokenization.
Chapter 3, Advanced Classifiers, introduces the star of modern NLP systems: logistic regression classifiers. Twenty years of hard-won experience lurks in this chapter. We will address the life cycle of building classifiers: how to create training data, how to cheat when creating training data by using active learning, and how to tune the classifiers and make them work faster.
Chapter 4, Tagging Words and Tokens, explains that language is about words. This chapter focuses on ways of applying categories to tokens, which in turn drives many of the high-end uses of LingPipe such as entity detection (people/places/orgs in text), part-of-speech tagging, and more. It starts with tag clouds, which have been described as the "mullet of the Internet", and ends with a foundational recipe for conditional random fields (CRFs), which can provide state-of-the-art performance for entity-detection tasks. In between, we will address confidence-tagged words, which are likely to be a very important dimension of more sophisticated systems.
Chapter 5, Finding Spans in Text – Chunking, shows that text is not words alone. It is collections of words, usually in spans. This chapter will advance from word tagging to span tagging, which brings in capabilities such as finding sentences, named entities, and basal noun phrases (NPs) and verb phrases (VPs). The full power of CRFs is addressed with discussions on feature extraction and tuning. Dictionary approaches are discussed, as are ways of combining chunkings.
Chapter 6, String Comparison and Clustering, focuses on comparing texts with one another, independent of a trained classifier. The technologies range from the hugely practical spellchecking to the hopeful but often frustrating Latent Dirichlet Allocation (LDA) clustering approach. Less presumptive technologies such as single-link and complete-link clustering have driven major commercial successes for us. Don't ignore this chapter.
Chapter 7, Finding Coreference Between Concepts/People, lays out the future, but unfortunately, you won't get the ultimate recipe, just our best efforts so far. This is one of the bleeding edges of industrial and academic NLP efforts, and it has tremendous potential. That potential is why we include our efforts: to help grease the way toward seeing this technology in use.