官术网_书友最值得收藏!

The MapReduce pattern

The MapReduce pattern is an example of a common case where concurrency is needed. The following diagram shows a word frequency counter; given a huge text stream of trillions of words, we need to see how many times every word occurs in the text. The algorithm is super simple: we keep the running count of each word in a hash table, with the word as the key and the counter as the value. The hash table allows us to quickly look up the next word and increment the associated value (counter). 

Given the size of the input text, one single node does not have the memory for the entire hash table! Concurrency provides a solution, using the MapReduce pattern, as shown in the following diagram:

The solution is divide and conquer: we maintain a distributed hash table and run the same algorithm, suitably adapted for a cluster.

The master node reads—parses—the text and pushes it to a set of slave processing nodes. The idea is to distribute the text in such a way that one word is processed by one slave node only. For example, given three processing nodes, as shown in the preceding diagram, we would divide rangewise: push nodes starting with the characters {a..j} to node 1, {k..r} to node 2 and the rest—{s..z}—onto node 3. This is the map part (distribution of work).

Once the stream is exhausted, each slave node sends its frequency result back to the master, which prints the result. 

The slave nodes are all doing the same processing concurrently. Note that the algorithm would run faster if we add, more slave nodes; that is, if we scale it horizontally.  

主站蜘蛛池模板: 南丰县| 阿拉善盟| 唐海县| 玛沁县| 邢台市| 西盟| 高青县| 镇赉县| 天津市| 新田县| 明水县| 门头沟区| 白朗县| 龙里县| 武平县| 高邮市| 玉树县| 郯城县| 时尚| 扎鲁特旗| 满城县| 开江县| 阜平县| 沂源县| 江孜县| 安达市| 盐津县| 宁城县| 平和县| 澄江县| 阜新市| 寿光市| 兴义市| 阿拉善盟| 安陆市| 巴彦淖尔市| 大邑县| 甘德县| 水富县| 湖南省| 肥东县|