Introduction to big data analysis in R
Big data refers to situations in which the volume, velocity, or variety of data exceeds our computational capacity to process, store, and analyze it. Big data analysis has to deal not only with large datasets, but also with computationally intensive analyses, simulations, and models with many parameters.
Leveraging large data samples can provide significant advantages in quantitative finance: we can relax the assumptions of linearity and normality, build better prediction models, and identify low-frequency events.
However, the analysis of large datasets raises two challenges. First, most of the tools of quantitative analysis have limited capacity to handle massive data, and even simple calculations and data-management tasks can be challenging to perform. Second, even without the capacity limit, computation on large datasets may be extremely time consuming.
Although R is a powerful and robust program with a rich set of statistical algorithms and capabilities, one of its biggest shortcomings is its limited ability to scale to large data sizes. The reason is that R requires the data it operates on to be loaded into memory first. However, a 32-bit operating system and architecture can only address approximately 4 GB of memory. If a dataset approaches the computer's RAM limit, it can become literally impossible to work with using a standard algorithm on a standard computer. Sometimes, even small datasets can cause serious computation problems in R, as R has to keep in memory the largest object created during the analysis process.
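As a rough illustration of this memory pressure, the short sketch below shows how quickly ordinary in-memory objects grow; the vector length is arbitrary, and the printed sizes are approximate:

```r
# A rough illustration of in-memory object sizes in R
x <- rnorm(1e7)                      # 10 million doubles: roughly 80 MB
print(object.size(x), units = "MB")

y <- x * 2                           # the transformed copy needs another ~80 MB,
print(object.size(y), units = "MB")  # so the total footprint has doubled

gc()                                 # report memory usage and free unused space
```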
R, however, has a few packages that bridge this gap and provide efficient support for big data analysis. In this section, we will introduce two such packages, which are useful tools for creating, storing, accessing, and manipulating massive data.
First, we will introduce the bigmemory package, which is a widely used option for large-scale statistical computing. The package and its sister packages (biganalytics, bigtabulate, and bigalgebra) address two challenges in handling and analyzing massive datasets: data management and statistical analysis. These tools implement massive matrices that do not fit in the R runtime environment and support their manipulation and exploration.
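The following minimal sketch illustrates a typical bigmemory workflow; the file names demo.bin and demo.desc are placeholders of our own choosing:

```r
library(bigmemory)
library(biganalytics)

# Create a file-backed big.matrix; the data lives on disk, not in RAM
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double",
                backingfile = "demo.bin", descriptorfile = "demo.desc")

# Fill a column; big.matrix supports the usual matrix indexing
x[, 1] <- rnorm(1e6)

# Column means computed without copying the matrix into R's memory
colmean(x)

# In another R session, the same data can be reattached via its descriptor file
y <- attach.big.matrix("demo.desc")
```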
An alternative to the bigmemory package is the ff package. This package allows R users to handle large vectors and matrices and to work with several large data files simultaneously. The big advantage of ff objects is that they behave like ordinary R vectors; however, the data is not stored in memory, but resides on disk.
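A minimal sketch of working with ff objects follows; the file name prices.csv is a placeholder for an arbitrary input file:

```r
library(ff)

# Create an ff vector: it behaves like an ordinary R vector,
# but its data is memory-mapped from a file on disk
v <- ff(vmode = "double", length = 1e7)
v[1:5] <- rnorm(5)
v[1:5]

# Read a CSV into an ffdf, the disk-based analogue of a data.frame
df <- read.csv.ffdf(file = "prices.csv", header = TRUE)
dim(df)
```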
In this section, we will showcase how these packages can help R users overcome the limitations of R and cope with very large datasets. Although the datasets we use here are modest in size, they effectively show the power of the big data packages.