Performance issues

Over 40% of the R codebase is written in C, and a little over 20% is still in Fortran (the rest is in C++, Java, and R), which makes some common computational tasks very costly. Microsoft (and, before it, Revolution Analytics) rewrote some of the most frequently used functions from old Fortran into C/C++ to address these performance issues.

Many package authors have done much the same. For example, Matt Dowle, the main author of the data.table R package, made several performance improvements to speed up the most common data-wrangling steps.

When similar operations are run on the same dataset using different packages, such as dplyr, plyr, data.table, and sqldf, the results are the same but the computation times differ noticeably.

The following R sample creates an object of roughly 80 MiB and applies a simple grouping function to show how much the computation time differs. The dplyr and data.table packages stand out, running more than 25 times faster than plyr and sqldf. data.table in particular is extremely efficient, mainly because of Matt's drive to optimize the package's code for performance:

set.seed(6546) 
nobs <- 1e+07 
df <- data.frame("group" = as.factor(sample(1:1e+05, nobs, replace = TRUE)), "variable" = rpois(nobs, 100)) 
 
# Calculate mean of variable within each group using plyr - ddply  
library(plyr) 
system.time(grpmean <- ddply( 
  df,  
  .(group),  
  summarize,  
  grpmean = mean(variable))) 
 
 
# Calculate mean of variable within each group using dplyr
detach("package:plyr", unload=TRUE) 
library(dplyr) 
 
system.time( 
  grpmean2 <- df %>%  
              group_by(group) %>% 
              summarise(group_mean = mean(variable))) 
 
# Calculate mean of variable within each group using data.table
library(data.table)
system.time(
  grpmean3 <- data.table(df)[
    ,                  # i: left empty, so all rows are used
    mean(variable),    # j: expression evaluated once per group
    by = group])       # by: grouping column
 
# Calculate mean of variable within each group using sqldf
library(sqldf) 
system.time(grpmean4 <- sqldf("SELECT avg(variable), [group] from df GROUP BY [group]")) 
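
The comparison above rests on two things: each approach is timed with system.time, and all approaches return the same group means. The following is a minimal sketch of that pattern using base R only, so it runs without any extra packages. The helper time_all, the three base R implementations, and the smaller dataset size are assumptions made for illustration; they are not part of the original sample and say nothing about the timings of the packages discussed here.

```r
set.seed(6546)
nobs <- 1e5  # assumption: much smaller than in the text, so the sketch runs quickly
df <- data.frame(
  group    = as.factor(sample(1:1000, nobs, replace = TRUE)),
  variable = rpois(nobs, 100)
)

# Three base R ways to compute the mean of variable within each group.
# Each returns a numeric vector named by group level.
implementations <- list(
  tapply = function(d) {
    res <- tapply(d$variable, d$group, mean)
    setNames(as.vector(res), names(res))
  },
  split_sapply = function(d) {
    sapply(split(d$variable, d$group), mean)
  },
  rowsum = function(d) {
    sums   <- rowsum(d$variable, d$group)  # per-group sums, one row per group
    counts <- table(d$group)               # per-group row counts
    setNames(sums[, 1] / as.vector(counts[rownames(sums)]), rownames(sums))
  }
)

# Hypothetical helper: run every implementation on the same data,
# recording the elapsed time and keeping the result for comparison.
time_all <- function(fns, data) {
  results <- list()
  timings <- numeric(0)
  for (name in names(fns)) {
    timings[name] <- system.time(
      results[[name]] <- fns[[name]](data)
    )[["elapsed"]]
  }
  list(results = results, timings = timings)
}

out <- time_all(implementations, df)

# Compare every result against the first one, matching groups by name.
reference <- out$results[[1]]
agree <- vapply(
  out$results,
  function(r) isTRUE(all.equal(r[names(reference)], reference)),
  logical(1)
)

print(out$timings)  # elapsed seconds per implementation
print(agree)        # TRUE where the result matches the reference
```

Matching the results by group name rather than by position keeps the check valid even when an implementation returns the groups in a different order.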

The Microsoft RevoScaleR package is optimized as well and can outperform all of these packages on large datasets. This illustrates how Microsoft has made R ready for large datasets in order to address the performance issues.