Introduction to big data analysis in R

Big data refers to situations in which the volume, velocity, or variety of data exceeds our computational capacity to store, process, and analyze it. Big data analysis has to deal not only with large datasets but also with computationally intensive analyses, simulations, and models with many parameters.

Leveraging large data samples can provide significant advantages in the field of quantitative finance; we can relax the assumptions of linearity and normality, generate better prediction models, or identify low-frequency events.

However, the analysis of large datasets raises two challenges. First, most of the tools of quantitative analysis have limited capacity to handle massive data, and even simple calculations and data-management tasks can be challenging to perform. Second, even without the capacity limit, computation on large datasets may be extremely time consuming.

Although R is a powerful and robust program with a rich set of statistical algorithms and capabilities, one of its biggest shortcomings is its limited ability to scale to large data sizes. The reason for this is that R requires the data it operates on to be loaded into memory first. On a 32-bit operating system and architecture, however, a process can only address approximately 4 GB of memory. If a dataset approaches the RAM limit of the computer, it can become impossible to work with on a standard computer with a standard algorithm. Sometimes even small datasets can cause serious computation problems in R, as R has to hold in memory the biggest object created during the analysis process.
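To get a feel for this constraint, base R's `object.size()` reports how much memory an object consumes. The sketch below (using only base R functions) illustrates why numeric data quickly eats into a few gigabytes of RAM; the sizes shown are approximate.

```r
# Each double-precision number occupies 8 bytes, plus a small object header
x <- rnorm(1e6)                        # one million random doubles
print(object.size(x), units = "MB")    # roughly 8 MB

# A single copy made during analysis doubles the footprint;
# ~500 million doubles alone would exhaust a 4 GB address space
```

Because many R operations create temporary copies, the effective ceiling is often well below the nominal memory limit.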

R, however, has a few packages to bridge the gap to provide efficient support for big data analysis. In this section, we will introduce two particular packages that can be useful tools to create, store, access, and manipulate massive data.

First, we will introduce the bigmemory package, which is a widely used option for large-scale statistical computing. The package and its sister packages (biganalytics, bigtabulate, and bigalgebra) address two challenges in handling and analyzing massive datasets: data management and statistical analysis. The tools can create massive matrices that do not fit into the R runtime environment and support their manipulation and exploration.
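As a minimal sketch, assuming the bigmemory and biganalytics packages are installed, a file-backed `big.matrix` keeps its data on disk while exposing a matrix-like interface; the file names here are arbitrary examples.

```r
library(bigmemory)
library(biganalytics)

# Create a file-backed big.matrix: the data lives on disk, not in RAM
x <- filebacked.big.matrix(
  nrow = 1e6, ncol = 3,
  type = "double",
  backingfile = "example.bin",     # hypothetical file names
  descriptorfile = "example.desc"
)

# Standard matrix-style indexing works on big.matrix objects
x[, 1] <- rnorm(1e6)

# biganalytics provides summary statistics that operate on big.matrix objects
colmean(x)
```

The descriptor file allows the same backing file to be re-attached in later sessions (or by other R processes) with `attach.big.matrix()`, so the matrix does not need to be rebuilt each time.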

An alternative to the bigmemory package is the ff package. This package allows R users to handle large vectors and matrices and work with several large data files simultaneously. The big advantage of ff objects is that they behave like ordinary R vectors. However, the data is not stored in memory; it resides on disk.
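A short sketch, assuming the ff package is installed, shows how an on-disk ff vector is created and indexed just like a regular R vector:

```r
library(ff)

# Create an ff vector of one million doubles; the values are stored
# in a temporary file on disk rather than in RAM
v <- ff(vmode = "double", length = 1e6)

# Ordinary vector indexing reads and writes chunks from disk transparently
v[1:5] <- c(1, 2, 3, 4, 5)
sum(v[1:5])
```

Only the chunks being accessed are pulled into memory, which is what lets ff objects grow far beyond the available RAM.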

In this section, we will showcase how these packages can help R users overcome the limitations of R and cope with very large datasets. Although the datasets we use here are modest in size, they effectively show the power of big data packages.