- Programming MapReduce with Scalding
- Antonios Chalkiopoulos
- 2021-12-08 12:44:21
MapReduce abstractions
This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following:
SELECT level, count(*) FROM table GROUP BY level
Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL is convenient because it is ubiquitous and familiar to most developers and analysts.
However, SQL is declarative: it expresses the logic of a computation without describing its control flow. As a result, some use cases are awkward to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream.
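To make the idea of map and reduce phases concrete, here is a minimal plain-Java sketch of how the count-by-level query could be decomposed. It does not show what Hive actually generates and uses no Hadoop classes; the class name `LevelCountSketch`, the method names, and the assumed `<LEVEL> <message>` line format are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LevelCountSketch {

    // Map phase: emit a (level, 1) pair for every log line.
    // Assumption: each line has the form "<LEVEL> <message>".
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String level = line.split(" ", 2)[0];
            pairs.add(new SimpleEntry<>(level, 1));
        }
        return pairs;
    }

    // Reduce phase: sum the emitted values per key,
    // which plays the role of GROUP BY level with count(*).
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ERROR disk full",
            "INFO job started",
            "ERROR timeout",
            "WARN slow response");
        // Prints {ERROR=2, INFO=1, WARN=1}
        System.out.println(reducePhase(mapPhase(lines)));
    }
}
```

In a real cluster the pairs emitted by the map phase would be shuffled across machines by key before being reduced; here both phases simply run in one process.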
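As an illustration of the split-and-process pattern that SQL lacks, the following plain-Java sketch divides log lines into two substreams and applies a different operation to each. The class name `SplitStreamsSketch`, the method names, and the `ERROR` prefix convention are assumptions made for the example only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SplitStreamsSketch {

    // Split the input into two substreams by a predicate:
    // key true holds error lines, key false holds everything else.
    static Map<Boolean, List<String>> split(List<String> lines) {
        return lines.stream()
            .collect(Collectors.partitioningBy(line -> line.startsWith("ERROR")));
    }

    // Operation applied only to the error substream: keep the message text.
    static List<String> errorMessages(List<String> errors) {
        return errors.stream()
            .map(line -> line.substring("ERROR".length()).trim())
            .collect(Collectors.toList());
    }

    // A different operation applied only to the other substream: count the lines.
    static long countOthers(List<String> others) {
        return others.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ERROR disk full",
            "INFO job started",
            "ERROR timeout",
            "WARN slow response");
        Map<Boolean, List<String>> parts = split(lines);
        System.out.println(errorMessages(parts.get(true)));  // [disk full, timeout]
        System.out.println(countOthers(parts.get(false)));   // 2
    }
}
```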
Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Custom code can be inserted anywhere in the data flow, and the language is appealing because it is easy to read and learn.
However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms.
In Pig, the same example can be implemented as follows:
LogLine = load 'file.logs' as (level, message);
LevelGroup = group LogLine by level;
Result = foreach LevelGroup generate group, COUNT(LogLine);
store Result into 'Results.txt';
Both Pig and Hive support extra functionality through loadable user-defined functions (UDFs) implemented as Java classes.
Cascading is implemented in Java and designed to be expressive and extensible. It provides a Java-based API for data-processing flows and is based on the pipeline design pattern that many other technologies follow. The pipeline is inspired by the original chain-of-responsibility design pattern and allows ordered lists of actions to be executed.
Developers with functional programming backgrounds quickly introduced new domain-specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, written in Scala, Clojure, and Python, respectively.
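The pipeline idea behind Cascading, an ordered list of actions executed one after another, can be sketched in a few lines of plain Java. This is not Cascading's API: the class `PipelineSketch` and its `then` and `run` methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class PipelineSketch {

    // Ordered list of stages; each stage consumes the previous stage's output.
    private final List<Function<List<String>, List<String>>> stages = new ArrayList<>();

    // Append a stage and return this pipeline so that calls can be chained.
    PipelineSketch then(Function<List<String>, List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Run the stages in the order in which they were added.
    List<String> run(List<String> input) {
        List<String> data = input;
        for (Function<List<String>, List<String>> stage : stages) {
            data = stage.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch()
            // Stage 1: drop empty lines.
            .then(lines -> lines.stream()
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList()))
            // Stage 2: keep only the log level (the first token of each line).
            .then(lines -> lines.stream()
                .map(line -> line.split(" ", 2)[0])
                .collect(Collectors.toList()))
            // Stage 3: remove duplicate levels, preserving first-seen order.
            .then(lines -> lines.stream()
                .distinct()
                .collect(Collectors.toList()));

        List<String> input = Arrays.asList(
            "ERROR disk full", "", "INFO job started", "ERROR timeout");
        System.out.println(pipeline.run(input));  // [ERROR, INFO]
    }
}
```

Because each stage is an ordinary function, stages can be reordered, reused, or tested in isolation, which is part of what makes the pattern a convenient base for higher-level languages.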