- Artificial Intelligence for Big Data
- Anand Deshpande Manish Kumar
- 2021-06-25 21:57:06
Batch processing
Traditionally, the data processing pipeline in data warehousing systems consisted of extracting, transforming, and loading the data (ETL) for analysis and action. With the new paradigm of file-based distributed computing, the sequence has shifted: the data is now extracted and loaded first, and then transformed repeatedly for analysis (ELTTT).
In batch processing, the data is collected from various sources into staging areas and then loaded and transformed at defined frequencies and on defined schedules. In most batch processing use cases, there is no critical need to process the data in real time or near real time. For example, a monthly report on student attendance is generated by a batch process at the end of each calendar month. This process extracts the data from the source systems, loads it, and transforms it into various views and reports. One of the most popular batch processing frameworks is Apache Hadoop, a highly scalable, distributed, parallel processing framework. The primary building block of Hadoop is the Hadoop Distributed File System.
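The monthly attendance report can be sketched as a small extract, load, and transform job. This is a minimal illustration in plain Python rather than Hadoop code; the record layout, the `RAW_EVENTS` sample data, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw attendance events from the source systems:
# (student_id, day, present)
RAW_EVENTS = [
    ("s1", date(2021, 5, 3), True),
    ("s1", date(2021, 5, 4), False),
    ("s2", date(2021, 5, 3), True),
    ("s2", date(2021, 6, 1), True),  # falls into the next batch window
]


def extract(events, year, month):
    """Extract: select the records that fall inside the batch window."""
    return [e for e in events if e[1].year == year and e[1].month == month]


def load(records, staging):
    """Load: store the raw records unchanged in the staging area."""
    staging.extend(records)
    return staging


def transform_attendance_rate(staging):
    """Transform (view 1): fraction of recorded days each student was present."""
    present, total = defaultdict(int), defaultdict(int)
    for student, _day, was_present in staging:
        total[student] += 1
        present[student] += int(was_present)
    return {s: present[s] / total[s] for s in total}


def transform_absences(staging):
    """Transform (view 2): absence dates per student."""
    absences = defaultdict(list)
    for student, day, was_present in staging:
        if not was_present:
            absences[student].append(day)
    return dict(absences)


def run_monthly_batch(events, year, month):
    """Run one batch: extract and load once, then transform several times."""
    staging = load(extract(events, year, month), [])
    return transform_attendance_rate(staging), transform_absences(staging)


rates, absences = run_monthly_batch(RAW_EVENTS, 2021, 5)
print(rates)
print(absences)
```

The raw records are loaded once and then transformed twice, which mirrors the ELTTT ordering: new views can be added later without extracting the data again.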
As the name suggests, the Hadoop Distributed File System (HDFS) is a wrapper filesystem that stores data (structured, unstructured, or semi-structured) in a distributed manner on the data nodes of a Hadoop cluster. Instead of moving the data to the processing logic, Hadoop sends the processing logic to the nodes that hold the data. Once an individual node has performed its computation, the results are consolidated by the master process. In this paradigm of data-compute localization, Hadoop relies heavily on intermediate I/O operations on hard disk drives. As a result, Hadoop can process extremely large volumes of data reliably, at the cost of processing time. This makes the framework well suited to extracting value from Big Data in batch mode.
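The idea of sending the computation to the data and consolidating the results on a master can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the Hadoop API: the `DATA_NODES` dictionary stands in for blocks stored on separate machines, the function names are hypothetical, and the intermediate disk I/O that Hadoop performs is left out.

```python
from collections import Counter

# Hypothetical data nodes: each one holds its own block of the dataset.
DATA_NODES = {
    "node-1": ["big data batch", "batch processing"],
    "node-2": ["data nodes store data"],
}


def count_words(lines):
    """The compute shipped to each node: a word count over its local block."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def run_on_nodes(nodes, task):
    """Apply the task where the data lives; only small results leave a node."""
    return {name: task(block) for name, block in nodes.items()}


def consolidate(partial_results):
    """Master step: merge the per-node results into a single answer."""
    total = Counter()
    for counts in partial_results.values():
        total.update(counts)
    return total


partials = run_on_nodes(DATA_NODES, count_words)
totals = consolidate(partials)
print(totals.most_common(2))
```

Only the per-node counters travel to the master, never the raw lines, which is the point of data-compute localization: the partial results are typically far smaller than the data they summarize.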