Introduction to big data analysis in R
Big data refers to situations in which the volume, velocity, or variety of data exceeds our computational capacity to store, process, and analyze it. Big data analysis has to deal not only with large datasets but also with computationally intensive analyses, simulations, and models with many parameters.
Leveraging large data samples can provide significant advantages in quantitative finance: we can relax the assumptions of linearity and normality, build better prediction models, and identify low-frequency events.
However, the analysis of large datasets raises two challenges. First, most tools of quantitative analysis have limited capacity to handle massive data, so even simple calculations and data-management tasks can be challenging to perform. Second, even without capacity limits, computation on large datasets may be extremely time-consuming.
Although R is a powerful and robust environment with a rich set of statistical algorithms and capabilities, one of its biggest shortcomings is its limited ability to scale to large data sizes. The reason is that R requires the data it operates on to be loaded into memory first. On a 32-bit system, however, the operating system and architecture can only address approximately 4 GB of memory. If a dataset approaches the RAM limit of the computer, it can become literally impossible to work with on a standard computer using standard algorithms. Sometimes even relatively small datasets cause serious computation problems, as R must also hold the largest intermediate object created during the analysis.
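To make this constraint concrete, here is a minimal sketch (the object and its size are illustrative, not from the text) of how quickly in-memory objects consume RAM:

```r
# One million double-precision values, at 8 bytes each
x <- rnorm(1e6)
print(object.size(x), units = "MB")   # roughly 8 MB

# A 4 GB address space is therefore exhausted by ~500 million doubles,
# before accounting for the temporary copies R creates during analysis
gc()   # report current memory usage and trigger garbage collection
```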
R, however, has a few packages that bridge this gap and provide efficient support for big data analysis. In this section, we will introduce two particular packages that are useful tools for creating, storing, accessing, and manipulating massive data.
First, we will introduce the bigmemory package, a widely used option for large-scale statistical computing. The package and its sister packages (biganalytics, bigtabulate, and bigalgebra) address two challenges in handling and analyzing massive datasets: data management and statistical analysis. These tools implement massive matrices that do not fit into the R runtime environment's memory and support their manipulation and exploration.
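As a minimal illustration of this approach (the file names and dimensions below are our own assumptions, not from the text), a file-backed big.matrix can be created and summarized without ever holding the full data in RAM:

```r
library(bigmemory)
library(biganalytics)

# Create a file-backed matrix of 1e6 rows x 3 columns; the data lives
# on disk in "returns.bin", with its metadata in "returns.desc"
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "returns.bin",
                           descriptorfile = "returns.desc")

# Fill one column at a time to avoid building one huge R object
set.seed(42)
for (j in 1:3) x[, j] <- rnorm(1e6)

# Column means computed out of core by biganalytics
colmean(x)
```

Because the matrix is backed by a file, the same object can also be attached from another R session via its descriptor file, which is what makes shared, large-scale computing practical.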
An alternative to the bigmemory package is the ff package. This package allows R users to handle large vectors and matrices and to work with several large data files simultaneously. The big advantage of ff objects is that they behave like ordinary R vectors; the data, however, is not stored in memory but resides on disk.
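A minimal sketch of this behavior (the vector name and values are illustrative) shows that an ff vector is indexed and assigned exactly like a normal R vector, while only the touched chunks are pulled into memory:

```r
library(ff)

# Allocate a disk-resident vector of 10 million doubles
prices <- ff(vmode = "double", length = 1e7)

# Read and write with ordinary vector syntax
prices[1:5] <- c(100.2, 100.4, 99.9, 101.1, 100.7)
prices[1:5]

# Bring a slice into RAM for an in-memory computation
mean(prices[1:1e5])
```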
In this section, we will showcase how these packages help R users overcome R's limitations in coping with very large datasets. Although the datasets we use here are modest in size, they effectively show the power of the big data packages.