官术网_书友最值得收藏!

MapReduce abstractions

This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following:

SELECT level, count(*) FROM table GROUP BY level

Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL offers convenient ubiquity, and it is known by almost everyone.

However, SQL is declarative and expresses the logic of a computation without describing its control flow. So, there are use cases that will be unusual to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream.

Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Code can be inserted anywhere in the data flow and is appealing because it is easy to read and learn.

However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms.

In Pig, the same example can be implemented as follows:

LogLine    = load 'file.logs' as (level, message);
LevelGroup = group LogLine by level;
Result     = foreach LevelGroup generate group, COUNT(LogLine);
store Result into 'Results.txt';

Both Pig and Hive support extra functionality through loadable user-defined functions (UDF) implemented in Java classes.

Cascading is implemented in Java and designed to be expressive and extensible. It is based on the design pattern of pipelines that many other technologies follow. The pipeline is inspired from the original chain of responsibility design pattern and allows ordered lists of actions to be executed. It provides a Java-based API for data-processing flows.

Developers with functional programming backgrounds quickly introduced new domain specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, which are implemented in programming languages such as Scala, Clojure, and Python.

主站蜘蛛池模板: 铜陵市| 铁岭县| 朝阳区| 剑阁县| 万宁市| 井研县| 弋阳县| 赤壁市| 南京市| 黑龙江省| 阿合奇县| 靖边县| 花莲县| 栾城县| 宣化县| 和顺县| 顺平县| 雷山县| 徐水县| 绿春县| 瑞昌市| 抚宁县| 丰都县| 迁安市| 鱼台县| 讷河市| 华容县| 武功县| 五台县| 宿迁市| 大石桥市| 青田县| 渑池县| 鄂州市| 平顶山市| 吴旗县| 青冈县| 邵阳县| 洱源县| 布尔津县| 吉水县|