- SQL Server 2017 Machine Learning Services with R
- Tomaž Kaštrun, Julie Koesmarno
- 2021-06-24 19:03:44
Performance issues
More than 40% of R's code base is written in C, and slightly more than 20% is still in Fortran (the rest is in C++, Java, and R), which makes some common computational tasks very costly. Microsoft (and, before it, Revolution Analytics) rewrote some of the most frequently used functions from old Fortran into C/C++ to address these performance issues.
Many package authors did something very similar. For example, Matt Dowle, the main author of the data.table R package, made several performance improvements to speed up the most common data wrangling steps.
When similar operations are run on the same dataset using different packages, such as dplyr, plyr, data.table, and sqldf, the results are the same but the execution times differ noticeably.
The following R sample creates an object of roughly 80 MiB and applies a simple grouping function to show how much the computation time differs. The dplyr and data.table packages stand out, performing more than 25 times faster than plyr and sqldf. data.table in particular is extremely efficient, mainly because of Matt's strong drive to optimize the package's code for performance:
set.seed(6546)
nobs <- 1e+07
df <- data.frame("group" = as.factor(sample(1:1e+05, nobs, replace = TRUE)),
                 "variable" = rpois(nobs, 100))

# Calculate mean of variable within each group using plyr - ddply
library(plyr)
system.time(grpmean <- ddply(
  df,
  .(group),
  summarize,
  grpmean = mean(variable)))

# Calculate mean of variable within each group using dplyr
# (plyr is detached first so that its functions do not mask dplyr's)
detach("package:plyr", unload=TRUE)
library(dplyr)
system.time(
  grpmean2 <- df %>%
    group_by(group) %>%
    summarise(group_mean = mean(variable)))

# Calculate mean of variable within each group using data.table
library(data.table)
system.time(
  grpmean3 <- data.table(df)[
    #i
    ,mean(variable)
    ,by=(group)]
)

# Calculate mean of variable within each group using sqldf
library(sqldf)
system.time(grpmean4 <- sqldf("SELECT avg(variable), [group] from df GROUP BY [group]"))
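The statement that different approaches return the same results at different speeds can be checked on a smaller sample. The following is a minimal sketch that uses only base R (tapply and aggregate, which are not part of the comparison above); the helper name check_group_means and the reduced sizes are assumptions made for illustration, so the timings are not comparable with the figures quoted in the text.

```r
# Hypothetical helper: compares two base R ways of computing group means.
# Sizes are reduced from the original example so the check runs quickly.
check_group_means <- function(nobs = 1e5, ngroups = 1e3, seed = 6546) {
  set.seed(seed)
  df <- data.frame(
    group    = as.factor(sample(seq_len(ngroups), nobs, replace = TRUE)),
    variable = rpois(nobs, 100)
  )

  # Approach 1: tapply returns a named array, one mean per factor level
  t1 <- system.time(m1 <- tapply(df$variable, df$group, mean))

  # Approach 2: aggregate returns a data frame with one row per group
  t2 <- system.time(m2 <- aggregate(variable ~ group, data = df, FUN = mean))

  # Align both results by group label before comparing them
  m1_vec <- as.numeric(m1[as.character(m2$group)])

  list(
    same_result = isTRUE(all.equal(m1_vec, m2$variable)),
    elapsed     = c(tapply = t1[["elapsed"]], aggregate = t2[["elapsed"]]),
    n_groups    = nrow(m2)
  )
}

res <- check_group_means()
print(res$same_result)  # TRUE when both approaches agree
print(res$elapsed)      # elapsed seconds per approach; machine dependent
```

The same pattern (time each call with system.time, then compare the outputs after aligning them by group) can be applied to the grpmean, grpmean2, grpmean3, and grpmean4 objects from the sample above.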
The Microsoft RevoScaleR package, on the other hand, is optimized as well and can outperform all of these packages on large datasets. This illustrates how Microsoft has made R ready for large datasets to address the performance issues.
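As a rough illustration of the same grouped mean in RevoScaleR, the sketch below uses rxSummary with a formula of the form variable ~ group. It assumes an environment where RevoScaleR is available (it ships with Microsoft R and SQL Server Machine Learning Services, not with CRAN), so the call is guarded; the reduced data size and the object name grpmean5 are assumptions for illustration, and no timing result is claimed here.

```r
# Sketch only: RevoScaleR is not on CRAN, so the call is guarded and the
# script still runs (and reports a message) where the package is missing.
set.seed(6546)
nobs <- 1e5  # assumption: reduced from the original 1e+07 for a quick run
df <- data.frame(
  group    = as.factor(sample(1:1e3, nobs, replace = TRUE)),
  variable = rpois(nobs, 100)
)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Mean of variable within each level of the factor group
  timing <- system.time(
    grpmean5 <- RevoScaleR::rxSummary(variable ~ group, data = df,
                                      summaryStats = "Mean")
  )
  print(timing)
  print(grpmean5)
} else {
  message("RevoScaleR is not installed; skipping the rxSummary example.")
}
```

To compare it fairly with the other packages, the same df and the same nobs value should be used for every call.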