The Big Data infrastructure
Technologies providing the capability to store, process, and analyze data are the core of any Big Data stack. The era of tables and records ran for a very long time: once the standard relational data store took over from file-based sequential storage, enterprises were able to harness storage and compute power very effectively. That journey ended, however, when we ran into the five Vs of Big Data (volume, velocity, variety, veracity, and value).
By the end of that era, the once-robust RDBMS was struggling to survive as a cost-effective tool for data storage and processing. Scaling a traditional RDBMS to the compute power required to process huge volumes of data at low latency came at a very high price. This led to the emergence of new technologies that were low cost, low latency, and highly scalable, many of them open source. Today, we deal with Hadoop clusters of thousands of nodes, churning through thousands of terabytes of data.
The key technologies of the Hadoop ecosystem are as follows:
- Hadoop: The yellow elephant that took the data storage and computation arena by storm. It is designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. Hadoop works by distributing the data in chunks over all the nodes in the cluster and then processing those chunks concurrently on all the nodes. Its two key moving components are mappers and reducers (see the word-count sketch after this list).
- NoSQL: Short for "not only SQL", these stores depart from the traditional relational model and structured query language. They are built to handle huge volumes of multi-structured data; widely known examples are HBase and Cassandra. Unlike traditional database systems, they generally have no single point of failure and scale horizontally (see the HBase sketch after this list).
- MPP (short for Massively Parallel Processing) databases: Computational platforms able to process data at a very fast rate. They work by segmenting the data into chunks across the nodes of a cluster and then processing those chunks in parallel, much like Hadoop. They differ from Hadoop in that they run not on low-end commodity machines but on high-memory, specialized hardware, and they expose SQL-like interfaces for interacting with and retrieving data. They generally process data faster because they use in-memory processing: unlike Hadoop, which operates at the disk level, MPP databases load the data into memory and operate on the collective memory of all nodes in the cluster (see the query sketch after this list).
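
To make the mapper/reducer split concrete, here is the classic word-count job written against the Hadoop MapReduce API. It is a minimal sketch: the input and output HDFS paths are passed as command-line arguments, and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on every node holding a chunk of the input,
  // emitting (word, 1) for each word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts emitted for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mappers run wherever the data chunks live, and the framework shuffles their output so that each reducer sees every count for one word.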
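For the NoSQL bullet, the sketch below writes and reads one row using the HBase Java client. The table name `users`, column family `info`, and row key are assumptions for illustration; a real cluster's connection settings come from `hbase-site.xml` on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one row keyed by a user ID; columns live under a column family.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Ada"));
      table.put(put);

      // Read the same row back by its key.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(name));
    }
  }
}
```

Note that access is by row key rather than by ad hoc SQL query, which is typical of the multi-structured, schema-light model these stores expose.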
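Finally, a sketch of the SQL-like interface an MPP database exposes, using Greenplum as one example since it speaks the PostgreSQL wire protocol and so works with the standard PostgreSQL JDBC driver. The host, credentials, and the `page_views` table are hypothetical; `DISTRIBUTED BY` is the Greenplum clause that tells the cluster how to segment rows across nodes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MppQuery {
  public static void main(String[] args) throws Exception {
    // Illustrative connection details for an MPP master node.
    String url = "jdbc:postgresql://mpp-master:5432/analytics";
    try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
         Statement stmt = conn.createStatement()) {

      // DISTRIBUTED BY segments rows across the cluster's nodes,
      // mirroring the chunking described above (Greenplum syntax).
      stmt.execute("CREATE TABLE page_views ("
          + "user_id BIGINT, url TEXT, viewed_at TIMESTAMP) "
          + "DISTRIBUTED BY (user_id)");

      // Each segment aggregates its own chunk in memory;
      // the master merges the partial results.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT user_id, COUNT(*) AS views "
          + "FROM page_views GROUP BY user_id "
          + "ORDER BY views DESC LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getLong("user_id") + " -> " + rs.getLong("views"));
        }
      }
    }
  }
}
```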