
The Big Data infrastructure

Technologies that provide the capability to store, process, and analyze data are the core of any Big Data stack. The era of tables and records ran for a very long time after the standard relational data store took over from file-based sequential storage. We were able to harness that storage and compute power very well for enterprises, but the journey eventually ended when we ran into the five Vs.

At the end of that era, we could see our so-far robust RDBMS struggling to survive in a cost-effective manner as a tool for data storage and processing. Scaling a traditional RDBMS to the compute power needed to process huge amounts of data with low latency came at a very high price. This led to the emergence of new technologies that were low cost, low latency, highly scalable, and often open source. Today, we deal with Hadoop clusters of thousands of nodes, churning through thousands of terabytes of data.

The key technologies of the Hadoop ecosystem are as follows:

  • Hadoop: The yellow elephant that took the data storage and computation arena by storm. It's designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. Hadoop works by distributing the data in chunks over all the nodes in the cluster and then processing the data concurrently on all the nodes. Two key moving components in Hadoop are mappers and reducers; a minimal word-count job illustrating both is sketched after this list.
  • NoSQL: This is an abbreviation for No-SQL, that is, data stores that do not rely on the traditional structured query language or the relational model. They are basically tools for processing huge volumes of multi-structured data; widely known ones are HBase and Cassandra. Unlike traditional database systems, they generally have no single point of failure and scale out horizontally; a small HBase read/write example follows this list.
  • MPP (short for Massively Parallel Processing) databases: These are computational platforms that are able to process data at a very fast rate. They work by segmenting the data into chunks across the different nodes in the cluster and then processing the data in parallel. They are similar to Hadoop in terms of data segmentation and concurrent processing at each node. They differ from Hadoop in that they don't execute on low-end commodity machines, but on high-memory, specialized hardware. They have SQL-like interfaces for the interaction and retrieval of data, and they generally end up processing data faster because they use in-memory processing. This means that, unlike Hadoop, which operates at disk level, MPP databases load the data into memory and operate upon the collective memory of all nodes in the cluster; a JDBC-based query sketch follows this list.
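To make the mapper/reducer division concrete, here is a minimal word-count job written against the classic Hadoop MapReduce Java API. It is a sketch of the programming model rather than a production job; the input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on each node against its local chunk (split) of the input,
  // emitting a (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives every count emitted for a given word and sums them.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```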
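On the NoSQL side, the following sketch uses the HBase Java client to write one row and read it back. The table name (events), column family (d), qualifier (temp), and row key are hypothetical; the point is the key-based, column-family access pattern rather than any SQL-style query.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (ZooKeeper quorum, etc.) from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {

      // Write one row: row key plus a single cell under the (hypothetical) column family "d".
      Put put = new Put(Bytes.toBytes("sensor-42#2016-01-01T00:00:00"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);

      // Read the same row back by its key.
      Result result = table.get(new Get(Bytes.toBytes("sensor-42#2016-01-01T00:00:00")));
      byte[] temp = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
      System.out.println("temp = " + Bytes.toString(temp));
    }
  }
}
```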
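Because MPP databases expose SQL-like interfaces, applications usually reach them through a standard JDBC driver. The sketch below assumes a hypothetical cluster endpoint, table, and credentials; the actual driver class and URL format depend on the product in use (Greenplum, Redshift, Teradata, and so on).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MppQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection string for an MPP master/coordinator node.
    String url = "jdbc:postgresql://mpp-master.example.com:5432/analytics";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
         Statement stmt = conn.createStatement();
         // The engine fans this aggregation out to every segment/node and
         // merges the partial results, largely in memory.
         ResultSet rs = stmt.executeQuery(
             "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
      }
    }
  }
}
```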