Introduction

The conclusions drawn from data analysis are only as robust as the data itself. After obtaining raw text, the natural next step is to validate and clean it carefully. Even a slight bias can compromise the integrity of the results. We must therefore inspect the data thoroughly and run sanity checks on it before we begin to interpret it. This section is a starting point for cleaning data in Haskell.

Real-world data often contains impurities that must be addressed before it can be processed. For example, extraneous whitespace or punctuation can clutter data and make it difficult to parse. Duplicate and conflicting records are another common side effect of reading real-world data. Sometimes it is simply reassuring to confirm, through sanity checks, that the data makes sense. Examples of sanity checks include matching against regular expressions and detecting outliers by establishing a measure of distance. In this chapter, we will cover each of these topics.
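To make these ideas concrete, here is a minimal sketch of three of the cleaning steps mentioned above: trimming whitespace, removing duplicates, and flagging outliers by distance. It is not the chapter's own code. The names `trim`, `cleanEntries`, `mean`, and `outliers` are hypothetical helpers chosen for illustration, and using the distance from the mean measured in standard deviations is just one assumed choice of distance measure.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, nub)

-- Strip leading and trailing whitespace from a string.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Trim every entry, drop entries that end up empty,
-- and remove exact duplicates (keeping first occurrences).
cleanEntries :: [String] -> [String]
cleanEntries = nub . filter (not . null) . map trim

-- Arithmetic mean; assumes a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Keep the values whose absolute distance from the mean
-- exceeds k population standard deviations (an assumed measure).
outliers :: Double -> [Double] -> [Double]
outliers k xs = filter (\x -> abs (x - m) > k * sd) xs
  where
    m  = mean xs
    sd = sqrt (mean [(x - m) ^ (2 :: Int) | x <- xs])

main :: IO ()
main = do
  print (cleanEntries ["  alice ", "bob", "alice", "   ", "bob\n"])
  print (outliers 2 [10, 11, 9, 10, 12, 10, 95])
```

With the sample inputs in `main`, the first call should leave only `"alice"` and `"bob"`, and the second should flag `95` as the single value lying more than two standard deviations from the mean. Note that `nub` is quadratic in the length of the list, so for large inputs a `Data.Set` or `Data.Map` based approach would likely be a better fit.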
