Introduction

An important part of building NLP systems is working with the appropriate unit of processing. This chapter addresses the abstraction layer associated with the word level of processing. Producing these word-level units is called tokenization, which amounts to grouping adjacent characters into meaningful chunks in support of classification, entity finding, and the rest of NLP.
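To make the idea of grouping adjacent characters into chunks concrete, here is a minimal sketch in plain Java. It does not use LingPipe and is not how LingPipe's tokenizers are implemented; the class name `SimpleTokenizer`, the method `tokenize`, and the rule it applies (a token is either a run of letters and digits or a single non-whitespace symbol) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of LingPipe.
class SimpleTokenizer {

    // Assumed token rule: a run of letters/digits, or one non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Scans the text left to right and collects each match as a token.
    // Whitespace matches neither alternative, so it is skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

Under this rule, `SimpleTokenizer.tokenize("LingPipe's tokenizers are fast.")` yields the tokens `LingPipe`, `'`, `s`, `tokenizers`, `are`, `fast`, and `.`. Splitting the possessive into three tokens is a consequence of the simple rule chosen here; a different tokenizer could reasonably keep `LingPipe's` together, which is why the choice of tokenizer matters for downstream processing.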

LingPipe provides tokenizers for a broad range of needs, not all of which are covered in this book. Look at the Javadoc for tokenizers that do stemming, Soundex (tokens based on what English words sound like), and more.
