
Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a logical vector with one value per observation. Each value is either TRUE or FALSE, where TRUE indicates that the row duplicates an earlier row. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE TRUE
587 3

which(dupes)
[1] 588 589
If you want to see the actual rows, and even put them into a tibble, the janitor package has the get_dupes() function. The code for that is simply: df_dupes <- janitor::get_dupes(gettysburg).
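To make the behavior of duplicated() concrete, here is a minimal sketch on a small, made-up data frame. The toy object and its values are hypothetical stand-ins, not the gettysburg data. It shows that only the repeated occurrences are flagged, never the first one, and how base R subsetting can pull out the flagged rows.

```r
# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  unit     = c("A", "B", "C", "B", "A"),
  strength = c(100, 250, 300, 250, 100),
  stringsAsFactors = FALSE
)

# TRUE marks a row that repeats an earlier row;
# the first occurrence of each row stays FALSE
toy_dupes <- duplicated(toy)

# Counts of unique versus duplicated rows
table(toy_dupes)

# Row positions of the duplicates
dupe_rows <- which(toy_dupes)
dupe_rows

# The duplicated rows themselves, using base R subsetting
toy[toy_dupes, ]
```

Because a logical vector can index rows directly, toy[!toy_dupes, ] would give the de-duplicated data in base R as well.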

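To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure we return all of the features in the new tibble. Note that .keep_all defaults to FALSE, although it only affects the result when specific columns are passed to distinct(); with no columns named, as here, every column is compared and returned either way: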

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)

Notice that, in the Global Environment, the tibble now has 587 observations of 26 variables/features.
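The following sketch, again on hypothetical toy data rather than gettysburg, illustrates where the .keep_all argument of distinct() changes the result: only when specific columns are named. It assumes the dplyr package is installed, and it ends with a quick base R check that no duplicates remain.

```r
# Hypothetical toy data: rows 1 and 3 are exact duplicates,
# rows 2 and 4 share a unit but differ in strength
toy <- data.frame(
  unit     = c("A", "B", "A", "B"),
  strength = c(100, 250, 100, 260),
  stringsAsFactors = FALSE
)

# No columns named: all columns are compared and all are returned
all_cols <- dplyr::distinct(toy)

# One column named: only that column is returned by default
unit_only <- dplyr::distinct(toy, unit)

# One column named with .keep_all = TRUE: the other columns are kept,
# taking values from the first row of each distinct unit
unit_keep <- dplyr::distinct(toy, unit, .keep_all = TRUE)

dim(all_cols)
dim(unit_only)
dim(unit_keep)

# anyDuplicated() returns 0 when no duplicated rows remain
anyDuplicated(all_cols)
```

Calling dim() or anyDuplicated() after de-duplication is a cheap way to confirm the result without relying on the Global Environment pane.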

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.
