- Practical Data Analysis Cookbook
- Tomasz Drabas
Removing duplicates
We can safely assume that all the data that lands on our desks is dirty (until proven otherwise). It is a good habit to check that everything in our data is in order. The first thing I always check for is duplicated rows.
Getting ready
To follow this recipe, you need to have OpenRefine and virtually any Internet browser installed on your computer.
We assume that you followed the previous recipes, so your data is already loaded into OpenRefine and the data types are representative of what the columns hold. No other prerequisites are required.
How to do it…
First, we assume that, within a seven-day window of property sales, a row is a duplicate if the same address appears twice (or more) in the dataset. It is quite unlikely that the same house is sold twice (or more) within such a short period of time. Therefore, we first Blank down the observations that repeat:

This keeps only the first occurrence of a given set of observations and blanks the rest (see the fourth row in the following screenshot):
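If you prefer to script this step, the same behavior can be sketched in pandas. This is not part of the OpenRefine workflow; the column name `Address` and the sample data are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sales data; 'Address' is the column we treat as the key.
sales = pd.DataFrame({
    'Address': ['1 Elm St', '2 Oak Ave', '2 Oak Ave', '3 Pine Rd'],
    'Price':   [250000, 310000, 310000, 199000],
})

# Equivalent of OpenRefine's "Blank down": keep the first value of each
# run of consecutive repeats and blank out the rest.
sales['Address'] = sales['Address'].where(
    sales['Address'] != sales['Address'].shift(), '')
```

Here the second `'2 Oak Ave'` row ends up with an empty `Address`, mirroring the blanked cell in the screenshot. Note that, like Blank down, this only catches *consecutive* repeats, so the data should be sorted by address first.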

We can now create a Facet by blank that allows us to quickly select the blanked rows:

Creating such a facet allows us to quickly select all the rows that are blank and remove them from the dataset:

Our dataset now has no duplicate records.
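The facet-and-remove step can likewise be sketched in pandas, assuming the blanked-down data from the previous step (the column name `Address` and the sample rows are illustrative, not from the book):

```python
import pandas as pd

# Data after the "Blank down" step: the duplicate address is now blank.
sales = pd.DataFrame({
    'Address': ['1 Elm St', '2 Oak Ave', '', '3 Pine Rd'],
    'Price':   [250000, 310000, 310000, 199000],
})

# Equivalent of faceting by blank and removing the matching rows:
# keep only rows whose Address is non-blank.
deduped = sales[sales['Address'] != ''].reset_index(drop=True)
```

After this step `deduped` holds one row per address, matching the deduplicated dataset in OpenRefine.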