官术网_书友最值得收藏!

MapReduce abstractions

This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following:

SELECT level, count(*) FROM table GROUP BY level

Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL offers convenient ubiquity, and it is known by almost everyone.

However, SQL is declarative and expresses the logic of a computation without describing its control flow. So, there are use cases that will be unusual to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream.

Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Code can be inserted anywhere in the data flow and is appealing because it is easy to read and learn.

However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms.

In Pig, the same example can be implemented as follows:

LogLine    = load 'file.logs' as (level, message);
LevelGroup = group LogLine by level;
Result     = foreach LevelGroup generate group, COUNT(LogLine);
store Result into 'Results.txt';

Both Pig and Hive support extra functionality through loadable user-defined functions (UDF) implemented in Java classes.

Cascading is implemented in Java and designed to be expressive and extensible. It is based on the design pattern of pipelines that many other technologies follow. The pipeline is inspired from the original chain of responsibility design pattern and allows ordered lists of actions to be executed. It provides a Java-based API for data-processing flows.

Developers with functional programming backgrounds quickly introduced new domain specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, which are implemented in programming languages such as Scala, Clojure, and Python.

主站蜘蛛池模板: 新津县| 巴南区| 左贡县| 西城区| 巫溪县| 临潭县| 淮阳县| 加查县| 黔东| 安义县| 奉节县| 晋宁县| 庄浪县| 任丘市| 中卫市| 凤翔县| 青川县| 张家界市| 信阳市| 淄博市| 辽中县| 木兰县| 喀喇沁旗| 根河市| 太仆寺旗| 论坛| 富宁县| 遂平县| 宝清县| 广南县| 原阳县| 竹北市| 郴州市| 寻甸| 晋州市| 屯昌县| 福海县| 阜南县| 泾阳县| 肥乡县| 鹤山市|