Removing duplicates

We can safely assume that all the data that lands on our desks is dirty (until proven otherwise). It is a good habit to check whether everything with our data is in order. The first thing I always check for is the duplication of rows.

Getting ready

To follow this recipe, you need to have OpenRefine and virtually any Internet browser installed on your computer.

We assume that you followed the previous recipes, so your data is already loaded into OpenRefine and the column data types reflect what the columns hold. No other prerequisites are required.

How to do it…

We assume that, within the seven days of property sales, a row is a duplicate if the same address appears more than once in the dataset. It is quite unlikely that the same house would be sold twice within such a short period of time. Therefore, we first Blank down the observations that repeat.

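This keeps only the first occurrence of each run of repeated values and blanks the rest (see the fourth row in the following screenshot). Because Blank down compares a cell with the one directly above it, duplicate addresses need to sit in adjacent rows, for example after sorting by address.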
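The effect of Blank down can be sketched outside OpenRefine in a few lines of Python. This is only an illustration of the idea, not OpenRefine's own implementation; the function name `blank_down` and the sample addresses are hypothetical.

```python
def blank_down(values):
    """Blank (set to None) every cell that repeats the value directly above it."""
    result = []
    previous = None  # value of the cell in the row above
    for value in values:
        if value is not None and value == previous:
            result.append(None)   # repeated value: blank it
        else:
            result.append(value)  # first occurrence (or blank): keep as is
            previous = value
    return result


# Hypothetical address column, already sorted so duplicates are adjacent
addresses = ["1 Main St", "2 Oak Ave", "2 Oak Ave", "3 Elm Rd"]
blanked = blank_down(addresses)
print(blanked)  # ['1 Main St', '2 Oak Ave', None, '3 Elm Rd']
```

Only the second "2 Oak Ave" is blanked; if the two copies were not adjacent, this sketch would leave both in place.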

Tip

The Fill down option has the opposite effect: it fills each blank cell with the value from the row above and leaves non-blank cells unchanged.
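For comparison, the Fill down behavior described in this tip could look as follows in Python. Again, this is a minimal sketch under assumptions: `fill_down` and the sample values are hypothetical names and data, and blanks are represented as `None`.

```python
def fill_down(values):
    """Replace each blank (None) cell with the closest non-blank value above it."""
    result = []
    last_seen = None  # most recent non-blank value
    for value in values:
        if value is None:
            result.append(last_seen)  # blank cell: copy the value from above
        else:
            result.append(value)      # non-blank cell: leave unchanged
            last_seen = value
    return result


# Hypothetical column with blanked cells
filled = fill_down(["2 Oak Ave", None, "3 Elm Rd", None])
print(filled)  # ['2 Oak Ave', '2 Oak Ave', '3 Elm Rd', '3 Elm Rd']
```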

We can now create a Facet by blank, which lets us quickly select the blanked rows.

With the facet in place, we can display only the rows where the address is blank and remove them from the dataset.

Our dataset now has no duplicate records.
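The last two steps, selecting the blanked rows and removing them, amount to a simple filter. The sketch below shows the idea in plain Python; the rows, the column names `address` and `price`, and the helper `is_blank` are hypothetical and stand in for the real dataset.

```python
def is_blank(cell):
    """Treat None and empty or whitespace-only strings as blank."""
    return cell is None or str(cell).strip() == ""


# Hypothetical rows after the address column has been blanked down
rows = [
    {"address": "1 Main St", "price": 250000},
    {"address": "2 Oak Ave", "price": 310000},
    {"address": None, "price": 310000},  # blanked duplicate
    {"address": "3 Elm Rd", "price": 199000},
]

# "Facet by blank": split the rows by whether the address is blank
blank_rows = [row for row in rows if is_blank(row["address"])]

# "Remove matching rows": keep only the rows with a non-blank address
deduplicated = [row for row in rows if not is_blank(row["address"])]

print(len(blank_rows), len(deduplicated))  # 1 3
```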
