
Performance issues

Over 40% of the R code base is written in C and a little over 20% is still in Fortran (the rest is in C++, Java, and R), which makes some common computational tasks very costly. Microsoft (and, before it, Revolution Analytics) rewrote some of the most frequently used functions from old Fortran in C/C++ to address these performance issues.

Many package authors did very similar things. For example, Matt Dowle, the main author of the data.table R package, made several performance improvements to speed up the most common data wrangling steps.

When the same operation is run on the same dataset with different packages, such as dplyr, plyr, data.table, and sqldf, the results are identical but the computation times differ considerably.

The following R sample applies a simple grouping function to an object of roughly 80 MiB and shows how much the computation time differs. The dplyr and data.table packages stand out, with performance more than 25 times better than plyr and sqldf. data.table in particular is extremely efficient, mainly because of Matt's sustained effort to optimize the package's code for performance:

set.seed(6546) 
nobs <- 1e+07 
df <- data.frame("group" = as.factor(sample(1:1e+05, nobs, replace = TRUE)), "variable" = rpois(nobs, 100)) 
 
# Calculate mean of variable within each group using plyr - ddply  
library(plyr) 
system.time(grpmean <- ddply( 
  df,  
  .(group),  
  summarize,  
  grpmean = mean(variable))) 
 
 
# Calculate mean of variable within each group using dplyr
detach("package:plyr", unload=TRUE) 
library(dplyr) 
 
system.time( 
  grpmean2 <- df %>%  
              group_by(group) %>% 
              summarise(group_mean = mean(variable))) 
 
# Calculate mean of variable within each group using data.table
library(data.table) 
system.time( 
  grpmean3 <- data.table(df)[
    , .(group_mean = mean(variable))  # j: named mean per group (i is left empty)
    , by = group])
 
# Calculate mean of variable within each group using sqldf
library(sqldf)
system.time(grpmean4 <- sqldf("SELECT [group], avg(variable) AS group_mean FROM df GROUP BY [group]"))
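
A timing comparison is only meaningful if the approaches return the same group means. The following is a minimal sketch of such a check, assuming a much smaller dataset and using only base R (tapply and aggregate) so that it runs without any extra packages; the names small_df, means_tapply, means_aggregate, and same_result are illustrative and do not come from the benchmark above:

```r
# Smaller, hypothetical dataset with the same shape as the benchmark data
set.seed(6546)
nobs <- 1e+05
small_df <- data.frame(
  group = as.factor(sample(1:1e+03, nobs, replace = TRUE)),
  variable = rpois(nobs, 100)
)

# Two base R ways to compute the mean of variable within each group
t_tapply <- system.time(
  means_tapply <- tapply(small_df$variable, small_df$group, mean)
)
t_aggregate <- system.time(
  means_aggregate <- aggregate(variable ~ group, data = small_df, FUN = mean)
)

# Align the two results by group label before comparing the values
same_result <- isTRUE(all.equal(
  unname(as.numeric(means_tapply[as.character(means_aggregate$group)])),
  means_aggregate$variable
))

print(same_result)

# Elapsed seconds for each approach (values depend on the machine)
print(c(tapply = t_tapply[["elapsed"]], aggregate = t_aggregate[["elapsed"]]))
```

The same pattern (align by group, then compare with all.equal) could be applied to grpmean, grpmean2, grpmean3, and grpmean4 from the benchmark, bearing in mind that each package names its result column differently.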

The Microsoft RevoScaleR package, on the other hand, is also optimized and can outperform all of these packages on large datasets. It illustrates how Microsoft has made R ready for large datasets in order to address these performance issues.
