Using tibble and dplyr for data manipulation

A tibble is a relatively recent development. It is essentially a more user-friendly version of a data frame. For example, when you view a data.frame in R, it prints rows until it reaches the max.print limit, at which point you'll get the following message:

[ reached getOption("max.print") -- omitted 99000 rows ]

A tibble, on the other hand, shows only the first 10 rows by default and displays only as many columns as fit the width of your console.
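The difference in printing behavior can be seen with a small experiment. The following is a minimal sketch, assuming the tibble package is already installed; the objects wide_df and wide_tbl and the random data are made up for illustration and are not part of the state dataset.

```r
library(tibble)

# Hypothetical demo data: 10,000 rows and 30 columns of random numbers
set.seed(1)
wide_df <- as.data.frame(matrix(rnorm(10000 * 30), ncol = 30))
wide_tbl <- as_tibble(wide_df)

# Printing wide_df would scroll through the console until max.print is hit;
# printing the tibble shows only a short preview plus a summary of what was left out
print(wide_tbl)

# To see more rows or every column, ask for them explicitly
print(wide_tbl, n = 20, width = Inf)
```

The n and width arguments only affect how the tibble is displayed; the underlying data is unchanged.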

To use tibble and related functionality, install and load the tidyverse package as follows:

install.packages("tidyverse") 
library("tidyverse") 

Loading the package with library("tidyverse") prints a startup message that lists the attached core packages and any function name conflicts.

Let us create a tibble from the state data that we have used thus far:

tstate <- as_tibble(state.x77) 
tstate$Region <- state.region 
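One thing to be aware of: state.x77 is a matrix that stores the state names as row names, and as_tibble drops row names by default, so tstate no longer records which row belongs to which state. If the names are needed, the rownames argument of as_tibble can keep them as a regular column. The sketch below assumes a reasonably recent tibble version; the object name tstate_named and the column name State are chosen here for illustration only.

```r
library(tibble)

# Keep the row names of the state.x77 matrix as a column called "State"
tstate_named <- as_tibble(state.x77, rownames = "State")
tstate_named$Region <- state.region

# The first column now holds the state name
head(tstate_named, 3)
```

The rest of this section continues with tstate, which does not carry the state names.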

Before getting into the details of dplyr, it helps to become familiar with a commonly used notation in R called the pipe, which is written as %>%. The operator is a relatively recent addition; it is provided by the magrittr package and is made available when you load dplyr.

Pipes allow the developer to pass the output of one function as the input to the next function, forming a chain of successive steps. For instance, suppose we wanted to find the Region with the highest mean income in our state dataset.

One way to do this would be to aggregate by Region and then find the Region corresponding to the highest value, as follows:

# Drop the Region column (column 9) and average the remaining columns within each region
step1 <- aggregate(tstate[, -9], by = list(tstate$Region), FUN = mean, na.rm = TRUE)
step1

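This returns one row per region containing the mean of every numeric column. Next, select the row with the highest mean income: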

step2 <- step1[step1$Income==max(step1$Income),] 
step2 

This can, however, be greatly simplified using the %>% pipe operator, as follows:

tstate %>%
  group_by(Region) %>%
  summarise(Income = mean(Income)) %>%
  filter(Income == max(Income))

# # A tibble: 1 x 2
#   Region   Income
#   <fctr>    <dbl>
# 1   West 4702.615
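To make the mechanics of the pipe concrete, the chain above can be compared with the equivalent nested call. Each %>% passes the value on its left as the first argument of the function on its right, so the two forms below should give the same result. This is a minimal sketch that rebuilds tstate so it can run on its own; the names piped and nested are introduced here for illustration.

```r
library(tibble)
library(dplyr)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Piped form: read from left to right, one step at a time
piped <- tstate %>%
  group_by(Region) %>%
  summarise(Income = mean(Income)) %>%
  filter(Income == max(Income))

# Nested form: the same calls, read from the inside out
nested <- filter(
  summarise(group_by(tstate, Region), Income = mean(Income)),
  Income == max(Income)
)

# Compare the two results
all.equal(as.data.frame(piped), as.data.frame(nested))
```

The piped version lists the steps in the order they are executed, which is the main reason it is easier to read than the nested one.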

It is also possible to summarize all of the columns at once using summarise_all and then find the row corresponding to the maximum income, as in the prior example. The function is passed directly here; older code wraps it as funs(mean), but funs() has been deprecated since dplyr 0.8.0:

tstate %>%
  group_by(Region) %>%
  summarise_all(mean) %>%
  filter(Income == max(Income))

The result is a single row for the West region, containing the mean of every column.
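In newer versions of dplyr, the scoped verbs such as summarise_all have been superseded by across(). The following sketch shows what the same summary might look like in that style, assuming dplyr 1.0.0 or later is installed; the name top_region is introduced here for illustration.

```r
library(tibble)
library(dplyr)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# across(everything(), mean) applies mean to every non-grouping column
top_region <- tstate %>%
  group_by(Region) %>%
  summarise(across(everything(), mean)) %>%
  filter(Income == max(Income))

top_region
```

Both styles should return the same row; across() has the advantage that it can be combined with other summaries inside a single summarise call.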