
Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a vector with one logical value per observation. Each value is either TRUE or FALSE, where TRUE indicates that the row repeats an earlier row. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE TRUE
587 3

which(dupes)
[1] 588 589
If you want to see the actual rows, and even put them into a tibble dataframe, the janitor package has the get_dupes() function. The code for that is simply: df_dupes <- janitor::get_dupes(gettysburg).
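As a base R alternative for inspecting the flagged rows, here is a minimal sketch on a small made-up data frame. The toy object and its unit and men columns are assumptions for illustration only and are not the gettysburg data. It shows that duplicated() flags only the second and later copies of a row, and that combining it with a fromLast = TRUE pass also picks up the first occurrences:

```r
# Toy data frame (hypothetical, not the gettysburg data)
toy <- data.frame(
  unit = c("A", "B", "A", "C", "B"),
  men  = c(100, 200, 100, 300, 200),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second and later copies of a row
dupes <- duplicated(toy)
which(dupes)   # row indices of the repeated copies

# Flag every member of a duplicated group, first occurrence included
all_copies <- dupes | duplicated(toy, fromLast = TRUE)
toy[all_copies, ]
```

Scanning from both ends is handy when you want to compare the copies side by side before deciding which one to drop.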

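To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use. Here we specify .keep_all = TRUE so that all of the features are returned in the new tibble. Note that .keep_all defaults to FALSE; the argument only changes the result when specific columns are passed to distinct(), because with no columns named, rows are compared across every feature and all features are returned either way: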

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
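To make the effect of .keep_all concrete, the following sketch runs distinct() three ways on a small made-up tibble. The toy object and its unit, men, and guns columns are hypothetical and stand in for the real data; the example assumes dplyr is installed:

```r
library(dplyr)

# Toy tibble (hypothetical, not the gettysburg data)
toy <- tibble(
  unit = c("A", "B", "A", "C"),
  men  = c(100, 200, 100, 300),
  guns = c(4, 6, 4, 8)
)

# No columns named: rows are compared across all columns,
# and all columns are returned
all_cols <- distinct(toy)

# A column named: only that column is returned by default
unit_only <- distinct(toy, unit)

# .keep_all = TRUE keeps the other columns, using the first row per unit
unit_full <- distinct(toy, unit, .keep_all = TRUE)

dim(all_cols)
names(unit_only)
names(unit_full)
```

In this sketch, .keep_all matters only for the calls that name unit; the first call returns every column regardless.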

Notice that, in the Global Environment, the tibble now has 587 observations of 26 variables/features.

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.
