- Machine Learning with Spark(Second Edition)
- Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
Broadcast variables and accumulators
Another core feature of Spark is the ability to create two special types of variables--broadcast variables and accumulators.
A broadcast variable is a read-only variable that is created from the driver program object and made available to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms that ship the same lookup data to every task. Spark makes creating broadcast variables as simple as calling a method on SparkContext, as follows:
val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:
sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++ x).collect
This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable:
...
res1: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d, e, 2), List(a, b, c, d, e, 3))
Notice the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection. We will often use collect when we wish to apply further processing to our results locally within the driver program.
It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, such as during iterations in many machine learning models, collecting results to the driver is necessary.
On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcast List, with the new element appended to it (that is, there is now "1", "2", or "3" at the end).
An accumulator is also a variable that is shared with the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this: in particular, the addition must be an associative and commutative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method.
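As a brief sketch of how this looks in practice (this example is ours, not from the book; it assumes the Spark 2.x longAccumulator API, and the variable names are illustrative), we can count empty lines in an RDD from within a distributed action:

```scala
// Create a named Long accumulator from the SparkContext (Spark 2.x API).
val blankLines = sc.longAccumulator("blank-line-counter")

val lines = sc.parallelize(List("a", "", "b", "", ""))

// foreach is an action; each task adds to its own local copy of the
// accumulator, and Spark merges the per-task values on the driver.
lines.foreach { line =>
  if (line.isEmpty) blankLines.add(1)
}

// Only the driver can read the merged global value.
println(blankLines.value) // 3
```

Note that the addition happens inside the tasks on the workers, but reading blankLines.value is only meaningful back on the driver, which matches the access rules described above.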