- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:33
Handling duplicate observations
The easiest way to get started is to use the base R duplicated() function, which returns a vector of logical values with one element per observation. A value of TRUE indicates that the row is a duplicate of an earlier row; FALSE marks every other row, including the first occurrence of a repeated one. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:
dupes <- duplicated(gettysburg)
table(dupes)
dupes
FALSE TRUE
587 3
which(dupes)
[1] 588 589
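To make the behavior of duplicated() concrete, here is a minimal sketch on a small made-up data frame. The toy object and its columns are hypothetical and are not part of the gettysburg data. It shows that only the second and later copies of a row are flagged, and that the fromLast argument can be combined with the default call to list every copy, including the first:

```r
# Hypothetical toy data, used only for illustration (not the gettysburg tibble)
toy <- data.frame(
  id    = c(1, 2, 2, 3, 3, 3),
  state = c("PA", "NY", "NY", "VA", "VA", "VA")
)

# TRUE marks a row that repeats an earlier row; first occurrences stay FALSE
dupes <- duplicated(toy)
table(dupes)

# Row positions of the flagged duplicates
dupe_rows <- which(dupes)

# Flag every copy of a repeated row, including the first occurrence,
# by also scanning from the bottom of the data frame
all_copies <- dupes | duplicated(toy, fromLast = TRUE)
toy[all_copies, ]
```

Inspecting all copies side by side before dropping anything is a cheap way to confirm that the rows really are identical rather than near matches.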
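To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure we return all of the features in the new tibble. Note that .keep_all defaults to FALSE. Strictly speaking, the argument only matters when specific columns are passed to distinct(); with none named, every column is compared and kept, so it is harmless here: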
gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
Notice that, in the Global Environment, the tibble now has dimensions of 587 observations and 26 variables/features.
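The following sketch, which assumes dplyr is installed and again uses a hypothetical toy data frame rather than the gettysburg tibble, illustrates when .keep_all changes the result of distinct(). With no columns named, whole rows are compared; once a column is named, .keep_all decides whether the remaining columns are returned:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  id       = c(1, 2, 2, 3, 4),
  state    = c("PA", "NY", "NY", "VA", "VA"),
  strength = c(100, 200, 200, 300, 350)
)

# No columns named: entire rows are compared, so only the exact repeat is dropped
deduped <- dplyr::distinct(toy)

# A column is named: only that column is returned by default
by_state <- dplyr::distinct(toy, state)

# Same comparison, but .keep_all = TRUE keeps every column,
# taking the first row found for each distinct state
by_state_all <- dplyr::distinct(toy, state, .keep_all = TRUE)

dim(deduped)
dim(by_state)
dim(by_state_all)
```

For whole-row deduplication, the base R expression toy[!duplicated(toy), ] keeps the same rows as dplyr::distinct(toy), which can serve as a quick cross-check.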
With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.