Removing duplicates

We can safely assume that all the data that lands on our desks is dirty (until proven otherwise). It is a good habit to check that our data is in order. The first thing I always check for is duplicated rows.

Getting ready

To follow this recipe, you need to have OpenRefine and virtually any Internet browser installed on your computer.

We assume that you followed the previous recipes, that your data is already loaded into OpenRefine, and that the data types now reflect what the columns hold. No other prerequisites are required.

How to do it…

We assume that, within the seven days of property sales in our dataset, a row is a duplicate if its address appears more than once: it is quite unlikely that the same house would be sold twice (or more) within such a short period of time. Therefore, we first Blank down the observations that repeat:

This keeps only the first occurrence of each set of repeated observations and blanks the rest (see the fourth row in the following screenshot):

Tip

The Fill down option has the opposite effect: it fills each blank cell with the value from the row above and leaves non-blank cells unchanged.
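For readers who want to see the logic outside of the OpenRefine interface, here is a minimal sketch of both operations in plain Python. It is not OpenRefine's own code: the function names `blank_down` and `fill_down` and the sample addresses are made up for illustration, and blank cells are represented by `None`. The sketch assumes that a cell is only compared with the cell directly above it, so repeated addresses need to sit next to each other (for example, after sorting by address) for the duplicates to be caught.

```python
from typing import List, Optional


def blank_down(values: List[Optional[str]]) -> List[Optional[str]]:
    """Blank every cell that repeats the value of the cell directly above it."""
    result: List[Optional[str]] = []
    previous: Optional[str] = None  # original value of the cell above
    for value in values:
        if value is not None and value == previous:
            result.append(None)   # a repeat: blank it
        else:
            result.append(value)  # first occurrence (or already blank): keep it
        previous = value
    return result


def fill_down(values: List[Optional[str]]) -> List[Optional[str]]:
    """Fill every blank cell with the closest non-blank value above it."""
    result: List[Optional[str]] = []
    last_seen: Optional[str] = None
    for value in values:
        if value is None:
            result.append(last_seen)  # blank: copy the value from above
        else:
            result.append(value)      # non-blank: leave unchanged
            last_seen = value
    return result


# Hypothetical, already sorted address column
addresses = [
    "12 Oak St",
    "12 Oak St",
    "7 Elm Ave",
    "9 Pine Rd",
    "9 Pine Rd",
    "9 Pine Rd",
]

blanked = blank_down(addresses)
restored = fill_down(blanked)

print(blanked)
print(restored == addresses)
```

On a sorted column with no blanks to begin with, filling down the blanked column gives back the original values, which is a handy way to undo an accidental Blank down.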

We can now create a Facet by blank to quickly select the blanked rows:

With the facet in place, we select all the rows flagged as blank and remove them from the dataset:

Our dataset now has no duplicate records.
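The last two steps (faceting by blank and removing the matching rows) can be sketched in plain Python as follows. Again, this only illustrates the idea and is not how OpenRefine implements it: the helper `facet_by_blank`, the column names, and the sample rows are hypothetical, and a cell is treated as blank when it is `None` or an empty string.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def facet_by_blank(rows: List[Row], column: str) -> Dict[bool, List[Row]]:
    """Group rows by whether the given column is blank (True) or not (False)."""
    facet: Dict[bool, List[Row]] = {True: [], False: []}
    for row in rows:
        value = row.get(column)
        is_blank = value is None or value == ""
        facet[is_blank].append(row)
    return facet


# Hypothetical rows after the address column has been blanked down
rows: List[Row] = [
    {"address": "12 Oak St", "price": "250000"},
    {"address": None, "price": "250000"},
    {"address": "7 Elm Ave", "price": "310000"},
    {"address": "9 Pine Rd", "price": "198000"},
    {"address": None, "price": "198000"},
]

facet = facet_by_blank(rows, "address")

# Removing the rows in the "blank" group leaves one row per address
deduplicated = facet[False]

print(len(facet[True]), "rows flagged as blank")
print([row["address"] for row in deduplicated])
```

Keeping the two groups separate, instead of filtering in one pass, mirrors the facet workflow: we can inspect how many rows are about to be dropped before we remove them.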