
Parallel collections 

Say that I am describing some new and exciting algorithm to you, and I start telling you how it exploits hash tables. We typically think of such data structures as residing entirely in memory, locked if required, and worked on by a single thread.

For example, take a list of numbers that we want to sum. This operation could be parallelized across multiple cores by using threads, as sketched below.
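Here is a minimal sketch of that idea: the list is split in half and each half is summed on its own thread. The object and variable names (ThreadedSum, leftSum, rightSum) are illustrative only; joining both threads before reading the partial sums is what makes the final read safe.

// A minimal sketch: summing a list on two threads by splitting it in half.
object ThreadedSum {
  def main(args: Array[String]): Unit = {
    val numbers = (1 to 1000000).toList
    val (left, right) = numbers.splitAt(numbers.length / 2)

    var leftSum = 0L
    var rightSum = 0L

    // Each thread sums its own sublist; no shared mutable state is touched by both.
    val t1 = new Thread(() => { leftSum = left.map(_.toLong).sum })
    val t2 = new Thread(() => { rightSum = right.map(_.toLong).sum })
    t1.start(); t2.start()
    // join() guarantees the writes made by each thread are visible afterwards.
    t1.join(); t2.join()

    println(leftSum + rightSum)  // same result as numbers.map(_.toLong).sum
  }
}

This works, but we had to split the data, manage the threads, and combine the results by hand.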

Now, we want to stay away from explicit locking. An abstraction that works concurrently on our list would be nice. It would split the list, apply the function to each sublist, and collate the results at the end, as shown in the following diagram. This is the typical MapReduce paradigm in action:

The preceding diagram shows a Scala collection that has been parallelized in order to use concurrency internally.  
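A minimal sketch of the same sum using Scala's parallel collections is shown here. Note that on Scala 2.13 and later, parallel collections live in the separate scala-parallel-collections module and need the CollectionConverters import; on 2.12 and earlier, .par is available out of the box without it.

// A minimal sketch: the same sum, using a parallel collection.
import scala.collection.parallel.CollectionConverters._

object ParSum {
  def main(args: Array[String]): Unit = {
    val numbers = (1 to 1000000).toList
    // .par splits the collection into chunks, sums each chunk on the
    // pool's worker threads, and combines the partial sums -
    // no explicit threads or locks in our code.
    val total = numbers.par.map(_.toLong).sum
    println(total)
  }
}

The splitting, scheduling, and combining that we wrote by hand in the threaded version is now handled by the collection itself.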

What if the data structure is so large that it cannot all fit in the memory of a single machine? We could split the collection across a cluster of machines instead.

The Apache Spark framework does this for us. Spark's Resilient Distributed Dataset (RDD) is a partitioned collection that spreads the data across the machines of a cluster, and can therefore work on huge collections, typically for analytical processing.
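As a hedged sketch only, the same sum with Spark's RDD API might look like the following; the application name and the local master URL are placeholder values for a single-machine run, and on a real cluster the master and data source would differ.

// A sketch: summing numbers with a Spark RDD (local run, placeholder settings).
import org.apache.spark.sql.SparkSession

object RddSum {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sum")   // illustrative name
      .master("local[*]")   // run locally on all cores for this sketch
      .getOrCreate()

    // parallelize() partitions the data; sum() computes per-partition sums
    // on the executors and combines them on the driver.
    val rdd = spark.sparkContext.parallelize(1L to 1000000L)
    println(rdd.sum())

    spark.stop()
  }
}

Conceptually this is the same split-apply-combine pattern as the parallel collection, except that the partitions live on different machines rather than on different cores.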
