官术网_书友最值得收藏!

Broadcast variables and accumulators

Another core feature of Spark is the ability to create two special types of variables--broadcast variables and accumulators.

A broadcast variable is a read-only variable that is created from the driver program object and made available to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as distributed systems. Spark makes creating broadcast variables as simple as calling a method on SparkContext, as follows:

val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))

A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:

sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++  
x).collect

This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant rom our new RDD appended to the broadcastAList that is our broadcast variable:

...
res1: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b,
c, d, e, 2), List(a, b, c, d, e, 3))

Notice the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection.

We will often use when we wish to apply further processing to our results locally within the driver program.

Note that collect should generally only be used in cases where we really want to return the full result set to the driver and perform further processing. If we try to call collect on a very large dataset, we might run out of memory on the driver and crash our program.
It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, such as during iterations in many machine learning models, collecting results to the driver is necessary.

On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcasted List, with the new element appended to it (that is, there is now "1", "2", or "3" at the end):

An accumulator is also a variable that is broadcasted to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this, that is, in particular, the addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method.

For more details on broadcast variables and accumulators, refer to the Shared Variables section of the Spark Programming Guide at http://spark.apache.org/docs/latest/programming-guide.html#shared-variables.
主站蜘蛛池模板: 五大连池市| 且末县| 红原县| 新巴尔虎左旗| 云阳县| 玉溪市| 云梦县| 长武县| 漯河市| 孝昌县| 绥江县| 通道| 阜康市| 巴林右旗| 边坝县| 革吉县| 丘北县| 巨鹿县| 建瓯市| 高安市| 曲周县| 莎车县| 牙克石市| 高平市| 包头市| 内丘县| 防城港市| 运城市| 穆棱市| 龙川县| 雷波县| 武定县| 镶黄旗| 潞城市| 阿克陶县| 盐亭县| 佛教| 开封县| 广宗县| 布尔津县| 台东市|