
Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a logical vector with one value per observation. Each value is either TRUE or FALSE, where TRUE indicates that the row is a duplicate of an earlier row. Then, we'll create a table of those values and their counts and identify which of the rows are duplicates:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE TRUE
587 3

which(dupes)
[1] 588 589
If you want to see the actual rows and even put them into a tibble dataframe, the janitor package has the get_dupes() function. The code for that would simply be: df_dupes <- janitor::get_dupes(gettysburg).
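To make the behavior of duplicated() concrete, here is a minimal base R sketch. The toy data frame and its column names (unit, men) are hypothetical and are not taken from the gettysburg data. It shows that only the second and later copies of a row are flagged, and one possible way to pull out every row in a duplicated group, first occurrences included.

```r
# Hypothetical toy data, not the gettysburg dataset
toy <- data.frame(
  unit = c("A", "B", "A", "C", "B", "A"),
  men  = c(100, 200, 100, 300, 200, 100)
)

# TRUE marks the second and later copies of a row; the first copy stays FALSE
dupes <- duplicated(toy)

table(dupes)    # counts of FALSE and TRUE
which(dupes)    # row indices of the flagged copies

# Combine a forward and a backward scan to show every row that belongs
# to a duplicated group, including the first occurrence
all_copies <- toy[dupes | duplicated(toy, fromLast = TRUE), ]
all_copies
```

Because a logical vector can be passed to which() directly, there is no need to compare it against the string "TRUE".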

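To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use. Because no columns are named in the call, distinct() compares entire rows and returns every feature in the new tibble. Specifying .keep_all = TRUE is a harmless safeguard here; it only changes the result when you pass specific columns to distinct(). Note that .keep_all defaults to FALSE: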

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
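The effect of .keep_all is easiest to see on a small example. The sketch below assumes dplyr is installed and uses a hypothetical tibble (the columns unit, men, and guns are made up for illustration) to compare three calls: distinct() on whole rows, distinct() on one named column, and the same call with .keep_all = TRUE.

```r
library(dplyr)

# Hypothetical toy data, not the gettysburg dataset
toy <- tibble(
  unit = c("A", "B", "A", "A", "C"),
  men  = c(100, 200, 100, 100, 300),
  guns = c(4, 6, 4, 5, 2)
)

# No columns named: whole rows are compared and all columns are returned
whole_rows <- distinct(toy)

# One column named, .keep_all left at FALSE: only that column comes back
unit_only <- distinct(toy, unit)

# One column named, .keep_all = TRUE: the first row for each unit is kept
# along with all of its other columns
unit_all_cols <- distinct(toy, unit, .keep_all = TRUE)

dim(whole_rows)
dim(unit_only)
dim(unit_all_cols)
```

Only the third row is an exact copy of an earlier row, so the whole-row call drops a single observation, while the calls keyed on unit collapse the data to one row per unit.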

Notice that, in the Global Environment, the tibble now has dimensions of 587 observations of 26 variables/features.
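The same check can be made from the console rather than by reading the Global Environment. The following is a small sketch on a hypothetical toy data frame, using base R's unique() in place of dplyr::distinct() so that it runs without extra packages; anyDuplicated() returns 0 when no duplicate rows remain.

```r
# Hypothetical toy data, not the gettysburg dataset
toy <- data.frame(
  unit = c("A", "B", "A", "C"),
  men  = c(100, 200, 100, 300)
)

# Base R stand-in for dplyr::distinct() on whole rows
deduped <- unique(toy)

dim(deduped)                    # rows, then columns
remaining <- anyDuplicated(deduped)
remaining                       # 0 means no duplicate rows are left
```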

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.
