
Introduction

The Natural Language Toolkit (NLTK) is a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has been adopted in industry for research and development because of its usefulness and breadth of coverage. NLTK is often used for rapid prototyping of text-processing programs and can even be used in production applications. Demos of selected NLTK functionality and production-ready APIs are available at http://text-processing.com.

This chapter covers the basics of tokenizing text and using WordNet. Tokenization is the process of breaking a piece of text into smaller pieces, such as sentences and words, and it is an essential first step for the recipes in later chapters. WordNet is a dictionary designed for programmatic access by natural language processing systems. It has many different use cases, including:

  • Looking up the definition of a word
  • Finding synonyms and antonyms
  • Exploring word relations and similarity
  • Word sense disambiguation for words that have multiple uses and definitions

NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is just a body of text, and corpus readers are designed to make accessing a corpus much easier than direct file access. We'll be using WordNet again in later chapters, so it's important to familiarize yourself with the basics first.
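
To make the idea of tokenization concrete before the recipes begin, here is a minimal sketch that uses only Python's standard `re` module. It is not how NLTK's tokenizers work; the helper names `split_sentences` and `split_words` and the sample text are hypothetical and exist only for illustration.

```python
import re


def split_sentences(text):
    """Naively split text wherever '.', '!' or '?' is followed by whitespace.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Return runs of word characters and individual punctuation marks.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


# Sample text chosen for this sketch.
text = "Hello World. It's good to see you. Thanks for buying this book."

sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, "->", tokens)
```

A rule this simple breaks easily: an abbreviation such as "Dr." would end a sentence too early, and the contraction "It's" is split into three tokens. Cases like these are why purpose-built tokenizers are preferable to a hand-written regular expression.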
