官术网_书友最值得收藏!

MapReduce abstractions

This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following:

SELECT level, count(*) FROM table GROUP BY level

Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL offers convenient ubiquity, and it is known by almost everyone.

However, SQL is declarative and expresses the logic of a computation without describing its control flow. So, there are use cases that will be unusual to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream.

Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Code can be inserted anywhere in the data flow and is appealing because it is easy to read and learn.

However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms.

In Pig, the same example can be implemented as follows:

LogLine    = load 'file.logs' as (level, message);
LevelGroup = group LogLine by level;
Result     = foreach LevelGroup generate group, COUNT(LogLine);
store Result into 'Results.txt';

Both Pig and Hive support extra functionality through loadable user-defined functions (UDF) implemented in Java classes.

Cascading is implemented in Java and designed to be expressive and extensible. It is based on the design pattern of pipelines that many other technologies follow. The pipeline is inspired from the original chain of responsibility design pattern and allows ordered lists of actions to be executed. It provides a Java-based API for data-processing flows.

Developers with functional programming backgrounds quickly introduced new domain specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, which are implemented in programming languages such as Scala, Clojure, and Python.

主站蜘蛛池模板: 东港市| 上蔡县| 新营市| 逊克县| 大洼县| 嘉义县| 许昌市| 中卫市| 郑州市| 深州市| 罗平县| 上饶县| 焉耆| 湘潭市| 青海省| 铅山县| 西畴县| 葵青区| 肇源县| 文安县| 扶余县| 宁陵县| 通山县| 宜宾县| 沧源| 新巴尔虎左旗| 花莲市| 共和县| 洪湖市| 灵璧县| 莒南县| 勃利县| 阿勒泰市| 通海县| 海晏县| 波密县| 廉江市| 雷波县| 逊克县| 乌什县| 定日县|