- Mastering Machine Learning with R
- Cory Lesmeister
- 2021-07-02 13:46:19
Handling duplicate observations
The easiest way to get started is to use the base R duplicated() function to create a logical vector with one value per observation. Each value is either TRUE or FALSE, where TRUE indicates that the row duplicates an earlier row. Then, we'll create a table of those values and their counts and identify which of the rows are duplicates:
dupes <- duplicated(gettysburg)
table(dupes)
dupes
FALSE TRUE
587 3
which(dupes)
[1] 588 589
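To see how duplicated() assigns its flags, here is a minimal sketch on a small made-up data frame. The toy object and its unit and strength columns are hypothetical and are not part of the gettysburg data. It illustrates that only the repeat occurrences are marked TRUE, while the first occurrence of each row stays FALSE:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  unit     = c("A", "B", "A", "C", "B", "A"),
  strength = c(100, 200, 100, 300, 200, 100)
)

# TRUE marks a row that repeats an earlier row; first occurrences stay FALSE
dupes <- duplicated(toy)

# Counts of non-duplicate versus duplicate rows
dupe_counts <- table(dupes)

# Row positions of the duplicates; a logical vector can be passed to which() directly
dupe_rows <- which(dupes)
print(dupe_rows)  # 3 5 6

# Base R way of keeping only the first occurrence of each row
toy_unique <- toy[!dupes, ]
print(nrow(toy_unique))  # 3
```

Rows 3, 5, and 6 are flagged because each repeats an earlier row, so the number of TRUE values equals the number of rows that would be dropped.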
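To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use. We specify .keep_all = TRUE to make sure all of the features are returned in the new tibble. Note that .keep_all defaults to FALSE, although it only changes the result when specific columns are passed to distinct(); with no columns named, as here, entire rows are compared and every feature is returned either way: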
gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
Notice that, in the Global Environment, the tibble now has dimensions of 587 observations and 26 variables/features.
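The effect of .keep_all is easiest to see on a small example. The following is a sketch under stated assumptions: the toy tibble and its unit and strength columns are invented for illustration, and only dplyr's distinct() and tibble() are used:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  unit     = c("A", "B", "A", "C", "B"),
  strength = c(100, 200, 100, 300, 250)
)

# No columns named: whole rows are compared and every column is returned
all_cols <- distinct(toy)
print(dim(all_cols))   # 4 2

# A column named, .keep_all left at its default of FALSE: only that column is returned
unit_only <- distinct(toy, unit)
print(dim(unit_only))  # 3 1

# A column named with .keep_all = TRUE: first row per unit, all columns kept
unit_keep <- distinct(toy, unit, .keep_all = TRUE)
print(dim(unit_keep))  # 3 2
```

Checking dim() after the call is a quick way to confirm, from code rather than the Global Environment pane, that the expected number of rows was removed and no features were lost.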
With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.