
Contextual data issues

Many of the data issues mentioned previously can be detected, and even corrected, automatically. They may originally have been caused by user entry errors, by corruption in transmission or storage, or by differing definitions or understandings of similar entities across data sources. In data science, however, there is a further consideration: whether the data fits the context and purpose of the analysis.

During data cleaning, a data scientist attempts to identify patterns within the data, based on a hypothesis or assumption about the context of the data and its intended purpose. In other words, any data that the data scientist judges to be either obviously disconnected from that assumption or objective, or obviously inaccurate, is then addressed. This process relies on the data scientist's judgment and ability to determine which points are valid and which are not.
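
As a minimal sketch of this kind of judgment-driven check, the Python snippet below separates records that fall outside an assumed plausible range and holds them for review instead of silently dropping them. The field name `age`, the range 0 to 120, the sample records, and the helper `flag_out_of_context` are all hypothetical choices made for illustration; a real rule would come from the actual context and purpose of the data.

```python
def flag_out_of_context(records, field, low, high):
    """Split records into (kept, flagged) using an assumed valid range.

    The range [low, high] encodes the analyst's hypothesis about the data;
    it is a judgment call, not a property of the data itself.
    """
    kept, flagged = [], []
    for rec in records:
        value = rec.get(field)
        if value is not None and low <= value <= high:
            kept.append(rec)
        else:
            # Missing or out-of-range values are held for review, not deleted.
            flagged.append(rec)
    return kept, flagged


# Hypothetical sample data: ages recorded in a survey.
survey = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 212},
    {"id": 3, "age": None},
    {"id": 4, "age": 58},
]

kept, flagged = flag_out_of_context(survey, "age", 0, 120)
print([r["id"] for r in kept])     # ids consistent with the assumption
print([r["id"] for r in flagged])  # ids that need a human decision
```

Returning the flagged records rather than discarding them keeps the decision visible: if the assumed range later turns out to be too strict, the affected points can still be recovered.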

When relying on human judgment, there is always a chance that valid data points, not sufficiently accounted for in the data scientist's hypothesis or assumption, are overlooked or incorrectly addressed. Therefore, it is common practice to maintain appropriately labeled versions of your cleansed data, so that earlier cleaning decisions can be revisited.
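
One way to keep labeled versions is sketched below in Python, under stated assumptions: versions are held in an in-memory dictionary, and the helper `save_version`, the labels `v0_raw` and `v1_drop_over_100`, the sample values, and the cutoff of 100 are all hypothetical. In practice the same idea might be applied with files, database tables, or a data versioning tool.

```python
import copy
from datetime import date


def save_version(versions, label, data, note):
    """Store an independent copy of `data` under a descriptive label."""
    if label in versions:
        # Refuse to overwrite: each label should identify exactly one state.
        raise ValueError(f"version label already used: {label}")
    versions[label] = {
        "data": copy.deepcopy(data),  # later edits to `data` won't leak in
        "note": note,                 # records the assumption behind the step
        "saved_on": date.today().isoformat(),
    }
    return versions


versions = {}

# Hypothetical raw measurements, kept exactly as received.
raw = [12.0, 15.5, 980.0, 14.2]
save_version(versions, "v0_raw", raw, "as received, no changes")

# Cleaning step based on an assumed cutoff; the note documents the assumption.
cleaned = [x for x in raw if x < 100]
save_version(versions, "v1_drop_over_100", cleaned,
             "removed values >= 100 (assumed to be entry errors)")

for label, entry in versions.items():
    print(label, len(entry["data"]), entry["note"])
```

Because the raw version is never modified and each cleaning step carries a note, a point that was removed under one assumption can be restored if that assumption is later revised.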