
Introduction

Many interesting analysis techniques can be used on a large corpus of words. Whether it be examining the structure of a sentence or the content of a book, these recipes will introduce us to some useful tools.

When manipulating strings for data analysis, substring search and edit distance computation are among the most commonly used operations. Since numbers are often found in a corpus of text, this chapter will start by showing how to represent numbers in an arbitrary base as a string. We will then cover a couple of string-searching algorithms before focusing on extracting text to study not only the words themselves but also how the words are used together.
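As a rough preview of the first topic, the following is a minimal sketch of rendering an integer in an arbitrary base as a string. It is written in Python purely for illustration and is not the chapter's own recipe; the names `DIGITS` and `to_base` are hypothetical, and the digit alphabet (0-9 followed by a-z, so bases 2 through 36) is an assumption.

```python
# Assumed digit alphabet: supports bases 2 through 36.
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base(n: int, base: int) -> str:
    """Return the representation of integer n in the given base as a string."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError(f"base must be between 2 and {len(DIGITS)}")
    if n == 0:
        return "0"

    sign = "-" if n < 0 else ""
    n = abs(n)

    # Repeatedly divide by the base; the remainders are the digits,
    # produced from least significant to most significant.
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(DIGITS[remainder])

    return sign + "".join(reversed(digits))


if __name__ == "__main__":
    print(to_base(255, 16))
    print(to_base(10, 2))
```

Python's built-in `int(text, base)` performs the inverse conversion for bases 2 through 36, which makes it a convenient way to check the result by round trip.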

Many practical applications can be constructed given the simple set of tools provided in this section. For example, in the last recipe, we will demonstrate a way to correct spelling mistakes. How we use these algorithms is entirely up to our creativity, but at least having them at our disposal is an excellent start.
