
Introduction

Data is everywhere, logging is cheap, and analysis is inevitable. This chapter rests on one fundamental idea: gathering useful data. After building a large collection of usable text, which we call the corpus, we must learn to represent its content in code. We focus first on obtaining data and then on enumerating ways of representing it.

Gathering data is arguably as important as analyzing it when we want to extrapolate results and form valid, generalizable claims. It is a scientific pursuit; therefore, great care must and will be taken to ensure unbiased and representative sampling. We recommend following along closely in this chapter because the remainder of the book depends on having a source of data to work with. Without data, there isn't much to analyze, so we should carefully observe the techniques laid out to build our own formidable corpus.

The first recipe enumerates various sources to start gathering data online. The next few recipes deal with using local data of different file formats. We then learn how to download data from the Internet using our Haskell code. Finally, we finish this chapter with a couple of recipes on using databases in Haskell.
