Caching RDDs

One of the most powerful features of Spark is the ability to cache data in memory across a cluster. This is achieved through the use of the cache method on an RDD:

rddFromTextFile.cache
res0: rddFromTextFile.type = MapPartitionsRDD[1] at textFile at <console>:27

Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action is called on the RDD that initiates a computation, the data is read from its source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor.
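The difference between the first and subsequent actions can be sketched as follows (a minimal sketch, assuming a spark-shell session where `sc` is the SparkContext and `data/records.txt` is a hypothetical input file):

```scala
// Assumes a running spark-shell, where `sc` is the SparkContext.
val rddFromTextFile = sc.textFile("data/records.txt")

// Mark the RDD for in-memory storage; nothing is read yet, as cache is lazy.
rddFromTextFile.cache

// First action: reads the file from its source and populates the cache.
val firstCount = rddFromTextFile.count

// Subsequent actions: served directly from memory, avoiding the disk read.
val secondCount = rddFromTextFile.count
```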

If we now call an action such as count or sum on our cached RDD, the RDD is loaded into memory:

val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count

Spark also allows more fine-grained control over caching behavior. You can use the persist method to specify which storage level Spark uses to cache data. More information on RDD caching can be found here:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
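For example, persist takes a StorageLevel argument (a sketch, assuming the same spark-shell session and the rddFromTextFile RDD from above):

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY is equivalent to calling cache: deserialized objects in memory.
rddFromTextFile.persist(StorageLevel.MEMORY_ONLY)

// To switch levels, unpersist first: an RDD's storage level
// cannot be changed once it has been set.
rddFromTextFile.unpersist()

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of recomputing them on each access.
rddFromTextFile.persist(StorageLevel.MEMORY_AND_DISK)
```

MEMORY_AND_DISK trades some read speed for resilience when the dataset is larger than the cluster's available memory.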