- Advanced Machine Learning with R
- Cory Lesmeister Dr. Sunil Kumar Chinnamgari
- 2021-06-24 14:24:33
Handling duplicate observations
The easiest way to get started is to use the base R duplicated() function to create a vector of logical values, one per observation. Each value is either TRUE or FALSE, where TRUE marks a row that repeats an earlier row. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:
# Flag each row that repeats an earlier row
dupes <- duplicated(gettysburg)
# Count the unique (FALSE) and duplicate (TRUE) rows
table(dupes)
dupes
FALSE TRUE
587 3
which(dupes)
[1] 588 589
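To make the behavior of duplicated() concrete, here is a minimal sketch on a small made-up data frame. The toy data and its column names are assumptions for illustration and are not part of the gettysburg data. It shows that only the second and later copies of a row are flagged, and how fromLast = TRUE can be combined with the default to see every member of a duplicated group:

```r
# Hypothetical toy data: rows 3 and 5 repeat row 1
toy <- data.frame(
  unit = c("A", "B", "A", "C", "A"),
  strength = c(100, 250, 100, 80, 100)
)

# duplicated() marks the second and later copies of a row, not the first
dupes <- duplicated(toy)
table(dupes)

# which() on a logical vector returns the positions of the TRUE values
dupe_rows <- which(dupes)
dupe_rows

# Flag every member of a duplicated group, including the first occurrence
all_copies <- dupes | duplicated(toy, fromLast = TRUE)
toy[all_copies, ]
```

Inspecting all copies side by side can help confirm that the flagged rows really are exact repeats before anything is dropped.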
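To rid ourselves of these duplicate rows, we put the distinct() function from the dplyr package to good use, specifying .keep_all = TRUE to make sure all of the features are returned in the new tibble. Note that .keep_all defaults to FALSE. The argument only changes the result when specific columns are passed to distinct(); with no columns named, as here, uniqueness is judged on entire rows and every column is retained either way: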
gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)
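The effect of .keep_all is easiest to see when a column is named. The following sketch assumes dplyr is installed and uses a made-up tibble (the names toy, unit, and strength are illustrative only):

```r
library(dplyr)

# Hypothetical toy data: unit "A" appears three times, twice with identical values
toy <- tibble(
  unit = c("A", "B", "A", "C", "A"),
  strength = c(100, 250, 100, 80, 120)
)

# No columns named: uniqueness is judged on whole rows, all columns are returned
whole_rows <- distinct(toy)

# Column named, .keep_all left at its default of FALSE: only `unit` is returned
unit_only <- distinct(toy, unit)

# Column named with .keep_all = TRUE: first row per unit, all columns kept
first_per_unit <- distinct(toy, unit, .keep_all = TRUE)

whole_rows
unit_only
first_per_unit
```

When the goal is to drop exact duplicate rows, as in the text, the whole-row form is the one that applies.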
Notice that, in the Global Environment, the tibble now has dimensions of 587 observations and 26 variables/features.
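The same check can be made from the console rather than the Global Environment pane. This is a minimal sketch on made-up data, using base R's unique() as a stand-in for dplyr::distinct() so that it runs without extra packages:

```r
# Hypothetical toy data with one repeated row
toy <- data.frame(
  unit = c("A", "B", "A"),
  strength = c(100, 250, 100)
)

# Base R stand-in for dplyr::distinct(): drop exact duplicate rows
deduped <- unique(toy)

# Rows and columns remaining after de-duplication
dim(deduped)

# anyDuplicated() returns 0 when no duplicate rows remain
remaining <- anyDuplicated(deduped)
remaining

# Number of rows removed
removed <- nrow(toy) - nrow(deduped)
removed
```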
With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.