Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a logical vector with one value per observation. Each value is either TRUE or FALSE, where TRUE indicates that the row duplicates an earlier one. Then, we'll create a table of those values and their counts and identify which of the rows are duplicates:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE  TRUE
  587     3

which(dupes)
[1] 588 589
If you want to see the actual rows and even put them into a tibble, the janitor package has the get_dupes() function. The code for that is simply: df_dupes <- janitor::get_dupes(gettysburg).
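
As a rough illustration of what duplicated() flags, here is a minimal base R sketch on a toy data frame. The toy object and its values are made up for this example and are not taken from the gettysburg data. It shows that duplicated() marks only the second and later copies of a row, and one way to pull every copy, including the first occurrence:

```r
# Hypothetical toy data: rows 1, 3 and 5 are identical
toy <- data.frame(
  unit = c("A", "B", "A", "C", "A"),
  men  = c(100, 250, 100, 300, 100)
)

# duplicated() marks the second and later copies only
dupes <- duplicated(toy)
which(dupes)     # indices of the repeated rows
toy[dupes, ]     # the repeated rows themselves

# Flag every copy, including the first occurrence
all_copies <- dupes | duplicated(toy, fromLast = TRUE)
toy[all_copies, ]
```

Subsetting with the logical vector is usually enough for a quick look; the fromLast = TRUE pass is only needed when the original row should be shown alongside its copies.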

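To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure all of the features are returned in the new tibble. Note that .keep_all defaults to FALSE, although it only changes the result when specific columns are passed to distinct(); with no columns named, as here, whole rows are compared and every column is returned either way: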

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
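
To make the role of .keep_all more concrete, the following sketch runs distinct() three ways on a small, made-up tibble (the toy object and its columns are assumptions for illustration only). It assumes the dplyr package is installed:

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are identical; row 4 shares only 'unit'
toy <- tibble(
  unit = c("A", "B", "A", "A"),
  men  = c(100, 250, 100, 120)
)

# No columns named: whole rows are compared and all columns are returned
whole_rows <- distinct(toy)

# A column named: only that column is returned by default
unit_only <- distinct(toy, unit)

# .keep_all = TRUE keeps the other columns, taking the first row per unit
unit_all <- distinct(toy, unit, .keep_all = TRUE)

dim(whole_rows)
dim(unit_only)
dim(unit_all)
```

Comparing the three dim() results shows where .keep_all matters: it only affects the call that names a column.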

Notice that, in the Global Environment, the tibble now has dimensions of 587 observations and 26 variables/features.
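
The same check can be made from the console rather than the Global Environment pane. The sketch below uses a made-up data frame as a stand-in and base R's unique() in place of dplyr::distinct(), so the names and values are assumptions for illustration:

```r
# Hypothetical stand-in for a data frame with one repeated row
toy <- data.frame(id = c(1, 2, 2, 3), value = c(10, 20, 20, 30))

# Base R counterpart to dplyr::distinct() for whole-row duplicates
deduped <- unique(toy)

dim(deduped)            # number of rows, then number of columns
anyDuplicated(deduped)  # 0 means no duplicate rows remain
```

Running dim() and anyDuplicated() on the real tibble after deduplication should confirm the row count and that no repeated rows are left.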

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.
