
Introduction

The conclusions drawn from data analysis are only as robust as the data they rest on. After obtaining raw text, the natural next step is to validate and clean it carefully. Even a slight bias can compromise the integrity of the results. We must therefore inspect the data thoroughly and run sanity checks on it before we try to interpret it. This section should be the starting point for cleaning data in Haskell.

Real-world data often contains impurities that need to be addressed before it can be processed. For example, extraneous whitespace or punctuation can clutter data and make it difficult to parse. Duplicate and conflicting records are another common, unintended consequence of reading real-world data. Sanity checks add further reassurance that the data makes sense; examples include matching values against regular expressions and detecting outliers by establishing a measure of distance. In this chapter, we will cover each of these topics.
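As a rough illustration of the first two kinds of cleanup mentioned above (stray whitespace and punctuation, then duplicates), here is a minimal sketch that uses only functions from `Data.Char` and `Data.List` in `base`. The names `trim`, `stripPunctuation`, and `cleanRecords` are hypothetical helpers chosen for this example, and the sample input is made up; this is not necessarily how the rest of the chapter approaches these tasks.

```haskell
import Data.Char (isPunctuation, isSpace)
import Data.List (dropWhileEnd, nub)

-- Hypothetical helper: remove leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Hypothetical helper: drop every punctuation character.
stripPunctuation :: String -> String
stripPunctuation = filter (not . isPunctuation)

-- Strip punctuation and trim each record, discard empty records,
-- then remove duplicates (nub keeps the first occurrence of each value).
cleanRecords :: [String] -> [String]
cleanRecords = nub . filter (not . null) . map (trim . stripPunctuation)

main :: IO ()
main = mapM_ putStrLn (cleanRecords ["  apple, ", "apple", "", "banana!  ", " banana"])
```

Note that `nub` is quadratic in the length of the list, which is fine for a small sample but may be too slow for large inputs; a set-based approach would likely be preferable there.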
