
Understanding basic data cleaning

The importance of having clean (and therefore reliable) data in any statistical project cannot be overstated. Dirty data, even when handled with sound statistical practice, is unreliable and can lead to results that suggest incorrect courses of action, or that may even cause harm or financial loss. It is often said that a data scientist spends nearly 90 percent of his or her time cleaning data and only 10 percent on actually modeling the data and deriving results from it.

So, just what is data cleaning?

Data cleaning, also referred to as data cleansing or data scrubbing, involves both detecting and addressing errors, omissions, and inconsistencies within a population of data.

This may be done interactively with data wrangling tools, or in batch mode through scripting. We will use R in this book because it is well suited to data science: it works with even very complex datasets, provides a wide range of modeling functions for handling the data, and can generate visualizations to represent the data and support theories and assumptions in just a few lines of code.
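For instance, here is a minimal sketch in R (using the built-in airquality dataset, chosen purely for illustration because it ships with missing values) that summarizes a data frame, counts its missing values, and draws a quick histogram:

# Load a built-in dataset that contains missing Ozone and Solar.R values
data(airquality)

# Summarize every column; the NA counts hint at where cleaning is needed
summary(airquality)

# Count the missing values in each column
colSums(is.na(airquality))

# A one-line visualization to eyeball the distribution of a suspect column
hist(airquality$Ozone, main = "Ozone readings", xlab = "Ozone (ppb)")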

During cleansing, you first use logic to examine and evaluate your data pool and establish a level of quality for the data. Data quality can be affected by the way the data is entered, stored, and managed. Cleansing can involve correcting, replacing, or simply removing individual data points or entire records.
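As a small, hedged example of these three actions (continuing with the airquality dataset; the wind-speed cap and the median fill are illustrative choices, not recommendations):

df <- airquality

# Correct: cap an implausibly high wind reading at an assumed threshold
df$Wind[df$Wind > 25] <- 25

# Replace: fill missing Ozone readings with the column median
df$Ozone[is.na(df$Ozone)] <- median(df$Ozone, na.rm = TRUE)

# Remove: drop any record that still contains a missing value
df <- na.omit(df)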

Cleansing should not be confused with validation; the two differ. Validation is a pass-or-fail process that usually occurs as the data is captured (at the time of entry), rather than an operation performed later on the data in preparation for an intended purpose.

As a data developer, you should not be new to the concept of data cleaning or to the importance of improving data quality. A data developer will also agree that addressing data quality requires routine, regular review and evaluation of the data, and in fact most organizations have enterprise tools and/or processes (or at least policies) in place to routinely preprocess and cleanse their enterprise data.

There is quite a list of both free and paid tools to sample, if you are interested, including iManageData, Data Manager, DataPreparator, Trifacta Wrangler, and so on. From a statistical perspective, the top choices would be R, Python, and Julia.

Before you can address specific issues within your data, you need to detect them. Detecting them requires that you determine what would qualify as an issue or error, given the context of your objective (more on this later in this section).
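To make the idea of context-dependent detection concrete, the sketch below defines two illustrative rules against the airquality data (a missing Ozone reading, and a temperature outside an assumed plausible range) and flags the rows that violate them:

# Detection rules are driven by the objective; these two are assumptions
bad_missing <- is.na(airquality$Ozone)
bad_range <- !is.na(airquality$Temp) & (airquality$Temp < 0 | airquality$Temp > 120)

# Which records need attention, before deciding how (or whether) to fix them
which(bad_missing | bad_range)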
