
Using tibble and dplyr for data manipulation

tibble is a relatively recent development in R. A tibble is essentially a more user-friendly version of a data frame. For example, when you print a data.frame in R, it attempts to print every row until it reaches the max.print limit, at which point you'll get a message such as the following:

[ reached getOption("max.print") -- omitted 99000 rows ]

A tibble, on the other hand, prints only the first 10 rows by default and shows only as many columns as fit in the visible width of the console.

To use tibble and related functionality, install and load the tidyverse package as follows:

install.packages("tidyverse") 
library("tidyverse") 

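Calling library("tidyverse") prints a startup message that lists the attached core packages, including tibble and dplyr, along with any function name conflicts.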

Let us create a tibble from the state data that we have used thus far:

tstate <- as_tibble(state.x77) 
tstate$Region <- state.region 
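As a minimal sketch of the printing difference described earlier, the following prints the same state data both ways and counts the lines produced. The name dstate is made up for illustration, and only the tibble package is assumed to be installed:

```r
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# Same data as a plain data frame, for comparison (hypothetical name)
dstate <- as.data.frame(tstate)

# capture.output() collects the printed lines so they can be counted
df_lines  <- capture.output(print(dstate))
tbl_lines <- capture.output(print(tstate))

length(df_lines)   # all 50 rows, possibly split into several column blocks
length(tbl_lines)  # header lines, the first 10 rows, and a short footer
```

The exact counts depend on the console width and the installed tibble version, so only the relative sizes should be relied on.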

Before getting into the details of dplyr, it helps to become familiar with a commonly used notation in R called the pipe, which is written as %>%. The operator comes from the magrittr package and is made available when the tidyverse is loaded.

Pipes allow the developer to pass the output of one function as the input to the next function, and so on in succession. For instance, suppose we wanted to find the Region with the highest income in our state dataset.

One way to find the region with the maximum income is to aggregate by Region and then select the row with the highest value, as follows:

step1 <- aggregate(tstate[, -9], by = list(Region = tstate$Region), mean, na.rm = TRUE)
step1 

step1 now holds one row per region with the mean of every column. Next, select the row with the highest mean income:

step2 <- step1[step1$Income==max(step1$Income),] 
step2 

This can, however, be greatly simplified using the %>% pipe operator, as follows:

tstate %>% group_by(Region) %>% summarise(Income = mean(Income)) %>% filter(Income == max(Income)) 
 
# # A tibble: 1 x 2 
# Region   Income 
# <fctr>    <dbl> 
#   1   West 4702.615 
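The pipe itself is only a change of notation: x %>% f(y) is evaluated as f(x, y). The following is a small sketch of that equivalence on the Income column of state.x77; the variable names incomes, nested, and piped are chosen here for illustration:

```r
library(magrittr)  # provides %>%; also available after library(tidyverse)

incomes <- state.x77[, "Income"]

# Nested call: read from the inside out
nested <- round(mean(incomes), 1)

# Piped call: read from left to right; each result becomes
# the first argument of the next function
piped <- incomes %>% mean() %>% round(1)

# Both forms perform the same computation
identical(nested, piped)
```

The piped form tends to pay off as the number of steps grows, because the steps appear in the order in which they run.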

It is also possible to summarize all of the columns at once using summarise_all and then find the row corresponding to the maximum income, as in the prior example. The function is passed directly here because the funs() wrapper is deprecated in later dplyr releases:

tstate %>% group_by(Region) %>% summarise_all(mean) %>% filter(Income == max(Income))

The result is a single row for the West region, containing the mean of every column.
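In dplyr 1.0.0 and later, the same summary can also be written with across(), which supersedes the summarise_all family. The following is a sketch under that version assumption, not the approach used in the text above; the name west is made up for illustration:

```r
library(dplyr)
library(tibble)

tstate <- as_tibble(state.x77)
tstate$Region <- state.region

# everything() selects all non-grouping columns inside a grouped summarise
west <- tstate %>%
  group_by(Region) %>%
  summarise(across(everything(), mean)) %>%
  filter(Income == max(Income))

west
```

Here west holds one row with the Region column followed by the eight column means.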
