
Data cleansing

Data cleansing is the process of identifying and fixing corrupt or inaccurate records in a record set, table, or database. It involves finding the incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the affected data. Data entry and acquisition are inherently prone to errors, both simple and complex. Much effort goes into this frontend process, yet errors remain common in large datasets. With respect to big data management, data cleansing is very important, for the following reasons:

  • The main data is usually spread across different legacy systems, including spreadsheets, text files, and web pages
  • By ensuring that the data is as accurate as possible, an organization can maintain good relationships with its customers, which in turn improves the organization's efficiency
  • Correct and complete data provides better insights into the process that the data concerns

There are libraries for Python (pandas) and R (dplyr) that can help with this process. In addition, there are dedicated tools available, both commercial and open source, including Trifacta, OpenRefine, Paxata, and so on.
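As a rough illustration of the replace, modify, and delete steps described above, here is a minimal sketch using pandas. The table, its column names, the `clean` helper, and the 0 to 120 age range are all hypothetical choices made for this example; real cleansing rules depend on the dataset at hand.

```python
import pandas as pd

# Hypothetical customer records with typical defects: a duplicated row,
# a missing name, an impossible age, and inconsistent text formatting.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "name": [" Alice ", "bob", "bob", None, "Dana"],
        "age": [34, 29, 29, 41, -5],
        "city": ["Paris", "paris", "paris", "Berlin", "Rome"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few illustrative cleansing rules (assumed, not universal)."""
    out = df.copy()

    # Modify: normalize whitespace and capitalization in text columns.
    out["name"] = out["name"].str.strip().str.title()
    out["city"] = out["city"].str.strip().str.title()

    # Delete: drop rows that are exact duplicates of an earlier row.
    out = out.drop_duplicates()

    # Delete: drop incomplete records that have no name.
    out = out.dropna(subset=["name"])

    # Replace: treat ages outside an assumed plausible range as missing,
    # then fill the gaps with the median of the remaining valid ages.
    age = out["age"].where(out["age"].between(0, 120))
    out = out.assign(age=age.fillna(age.median()))

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

In this sketch the duplicated row and the nameless row are removed, while the invalid age is replaced rather than dropped. Whether to delete a record or impute a value is a judgment call that depends on how the data will be used.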
