Introduction

Many interesting analysis techniques can be used on a large corpus of words. Whether it be examining the structure of a sentence or the content of a book, these recipes will introduce us to some useful tools.

When manipulating strings for data analysis, two of the most common operations are substring search and edit distance computation. Since numbers are often found in a corpus of text, this chapter will start by showing how to represent numbers in an arbitrary base as a string. We will then cover a couple of string-searching algorithms before focusing on extracting text to study not only the words themselves but also how the words are used together.

Many practical applications can be built from the simple set of tools provided in this chapter. For example, in the last recipe, we will demonstrate a way to correct spelling mistakes. How we use these algorithms is entirely up to our creativity, but having them at our disposal is an excellent start.
