The Big Data infrastructure
Technologies providing the capability to store, process, and analyze data are the core of any Big Data stack. Relational data stores replaced file-based sequential storage long ago, and the era of tables and records ran for a very long time. Enterprises harnessed that storage and compute power very well, but the journey eventually ended when we ran into the five Vs (volume, velocity, variety, veracity, and value).
At the end of its era, the so-far robust RDBMS was struggling to survive as a cost-effective tool for data storage and processing. Scaling a traditional RDBMS to the compute power needed to process huge amounts of data at low latency came at a very high price. This led to the emergence of new technologies that were low cost, low latency, highly scalable, and often open source. Today, we deal with Hadoop clusters of thousands of nodes, churning through thousands of terabytes of data.
The key technologies of the Hadoop ecosystem are as follows:
- Hadoop: The yellow elephant that took the data storage and computation arena by storm. It is designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. Hadoop works by distributing the data in chunks over all the nodes in the cluster and then processing the data concurrently on all the nodes. The two key moving components in Hadoop are mappers and reducers (see the MapReduce sketch after this list).
- NoSQL: An abbreviation commonly read as "not only SQL", covering data stores that depart from the traditional relational model and its structured query language. These are tools for processing huge volumes of multi-structured data; widely known examples are HBase and Cassandra. Unlike traditional database systems, they generally have no single point of failure and scale horizontally (see the HBase sketch after this list).
- MPP (short for Massively Parallel Processing) databases: These are computational platforms able to process data at a very fast rate. They work by segmenting the data into chunks across the nodes of the cluster and then processing those chunks in parallel. They are similar to Hadoop in terms of data segmentation and concurrent processing on each node, but differ in that they run not on low-end commodity machines but on high-memory, specialized hardware. They expose SQL-like interfaces for the interaction and retrieval of data, and they generally process data faster because they use in-memory processing. This means that, unlike Hadoop, which operates at disk level, MPP databases load the data into memory and operate on the collective memory of all nodes in the cluster (see the SQL sketch after this list).
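To make the mapper/reducer split concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a sketch, not code from the text: the class name `WordCount` and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on every node holding a chunk of the input,
  // emitting (word, 1) for each token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts emitted for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mappers run wherever the data chunks live and emit intermediate key-value pairs; the framework shuffles all pairs sharing a key to one reducer, which aggregates them, which is exactly the distribute-then-process pattern described above.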
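Likewise, the following sketch shows a single write and read against HBase using its standard Java client. The table name `user` and the column coordinates (`info:email`, row key `u42`) are hypothetical, chosen only to illustrate the key-value access pattern.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("user"))) {

      // Write one cell: row key "u42", column family "info", qualifier "email".
      Put put = new Put(Bytes.toBytes("u42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Read the same cell back by row key.
      Result result = table.get(new Get(Bytes.toBytes("u42")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```

Note that access is by row key and column, not by SQL query: the client never declares a schema for the value, which is what makes the store suitable for multi-structured data.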
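Finally, because MPP databases expose SQL-like interfaces, a client typically talks to them through plain JDBC. The sketch below assumes a Greenplum-style, PostgreSQL-compatible endpoint; the host, database, credentials, and the `events` table are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MppQuery {
  public static void main(String[] args) throws Exception {
    // Greenplum speaks the PostgreSQL wire protocol, so the standard
    // PostgreSQL JDBC driver works; connection details are hypothetical.
    String url = "jdbc:postgresql://mpp-master.example.com:5432/analytics";
    try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
         PreparedStatement stmt = conn.prepareStatement(
             // The cluster segments "events" across its nodes and computes
             // each segment's partial counts in parallel, in memory, before
             // merging them on the master.
             "SELECT event_type, COUNT(*) AS n "
                 + "FROM events GROUP BY event_type ORDER BY n DESC")) {
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.printf("%s: %d%n", rs.getString("event_type"), rs.getLong("n"));
        }
      }
    }
  }
}
```

From the client's point of view this is an ordinary SQL query; the parallel segmentation and in-memory aggregation described above happen entirely inside the cluster.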