官术网_书友最值得收藏!

Broadcast variables and accumulators

Another core feature of Spark is the ability to create two special types of variables--broadcast variables and accumulators.

A broadcast variable is a read-only variable that is created from the driver program object and made available to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as distributed systems. Spark makes creating broadcast variables as simple as calling a method on SparkContext, as follows:

val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))

A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:

sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++  
x).collect

This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant rom our new RDD appended to the broadcastAList that is our broadcast variable:

...
res1: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b,
c, d, e, 2), List(a, b, c, d, e, 3))

Notice the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection.

We will often use when we wish to apply further processing to our results locally within the driver program.

Note that collect should generally only be used in cases where we really want to return the full result set to the driver and perform further processing. If we try to call collect on a very large dataset, we might run out of memory on the driver and crash our program.
It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, such as during iterations in many machine learning models, collecting results to the driver is necessary.

On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcasted List, with the new element appended to it (that is, there is now "1", "2", or "3" at the end):

An accumulator is also a variable that is broadcasted to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this, that is, in particular, the addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method.

For more details on broadcast variables and accumulators, refer to the Shared Variables section of the Spark Programming Guide at http://spark.apache.org/docs/latest/programming-guide.html#shared-variables.
主站蜘蛛池模板: 岚皋县| 武隆县| 介休市| 游戏| 鄱阳县| 曲周县| 阜南县| 河北省| 信阳市| 白河县| 贡嘎县| 泉州市| 汕尾市| 马关县| 河间市| 阿拉善右旗| 新蔡县| 通城县| 大悟县| 蒙自县| 大邑县| 成都市| 高唐县| 休宁县| 牡丹江市| 宜君县| 湟源县| 昭觉县| 尼玛县| 蒙山县| 新乡县| 新乡县| 资溪县| 淮南市| 牙克石市| 延长县| 杭州市| 思茅市| 分宜县| 夏河县| 江川县|