
Data cleansing

Data cleansing is the process of identifying and fixing corrupt or inaccurate records in a record set, table, or database. It also involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the affected data. Data entry and acquisition are inherently prone to errors, both simple and complex. A great deal of effort goes into this frontend process, but the fact remains that errors are common in large datasets. With respect to big data management, data cleansing is very important, for the following reasons:

  • The main data is usually spread across different legacy systems, including spreadsheets, text files, and web pages
  • By ensuring that the data is as accurate as possible, an organization can maintain good relationships with its customers and improve its own efficiency
  • Correct and complete data provides better insights into the process that the data concerns

Libraries such as pandas (Python) and dplyr (R) can help with this process. In addition, dedicated data preparation tools are available, including Trifacta, OpenRefine (which is free and open source), and Paxata.
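As a rough illustration of the identify, replace, modify, and delete steps described in this section, here is a minimal pandas sketch. The column names (`name`, `age`), the valid age range of 0 to 120, and the choice to fill missing ages with the median are all assumptions made for this example; real cleansing rules depend on the dataset.

```python
import pandas as pd


def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleansing routine for a table with 'name' and 'age' columns."""
    cleaned = df.copy()

    # Modify: normalize stray whitespace and inconsistent capitalization.
    cleaned["name"] = cleaned["name"].str.strip().str.title()

    # Identify: turn unparseable ages into NaN, then treat values outside
    # an assumed plausible range (0 to 120) as missing as well.
    age = pd.to_numeric(cleaned["age"], errors="coerce")
    cleaned["age"] = age.where(age.between(0, 120))

    # Delete: drop records with no name, then drop exact duplicates.
    cleaned = cleaned.dropna(subset=["name"]).drop_duplicates()

    # Replace: fill the remaining missing ages with the median age
    # (one possible imputation strategy, assumed here for simplicity).
    cleaned = cleaned.assign(age=cleaned["age"].fillna(cleaned["age"].median()))

    return cleaned.reset_index(drop=True)


# Small made-up record set containing typical entry errors.
raw = pd.DataFrame(
    {
        "name": [" alice ", "Bob", "Bob", None, "carol"],
        "age": ["34", "41", "41", "29", "-5"],
    }
)

cleaned = clean_records(raw)
print(cleaned)
```

Working on a copy keeps the raw data intact, so the cleaned result can always be compared against the original records.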
