Machine Learning with Spark (Second Edition)
Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath
Caching RDDs
One of the most powerful features of Spark is the ability to cache data in memory across a cluster. This is achieved through the use of the cache method on an RDD:
rddFromTextFile.cache
res0: rddFromTextFile.type = MapPartitionsRDD[1] at textFile at <console>:27
Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action is called on the RDD that initiates a computation, the data is read from its source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor.
If we now call the count or sum function on our cached RDD, the RDD is loaded into memory:
val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count
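A rough way to observe the effect described above is to time the same action twice. The helper below is a minimal sketch (not from the book) and assumes a fresh spark-shell session in which rddFromTextFile has been marked for caching but no action has run yet, so the first count still pays the cost of reading from the source:

// Illustrative timing helper; exact numbers depend on the cluster and data size.
def time[T](body: => T): T = {
  val start = System.nanoTime
  val result = body
  println(f"elapsed: ${(System.nanoTime - start) / 1e6}%.1f ms")
  result
}

time { rddFromTextFile.count }  // first action: reads from the source and populates the cache
time { rddFromTextFile.count }  // second action: served from memory, typically much faster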
For more detail on RDD persistence and the available storage levels, see http://spark.apache.org/docs/latest/programmingguide.html#rdd-persistence.
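The cache method is shorthand for persisting with the default MEMORY_ONLY level; other levels can be set explicitly through persist. The snippet below is a brief sketch of that API and assumes the same shell session as above (an RDD's storage level can only be assigned once, hence the initial unpersist before choosing a different level):

import org.apache.spark.storage.StorageLevel

// Drop the existing in-memory copy so a new storage level can be assigned.
rddFromTextFile.unpersist()

// MEMORY_AND_DISK keeps partitions in memory and spills those that do not fit to disk.
rddFromTextFile.persist(StorageLevel.MEMORY_AND_DISK)

// Inspect the storage level currently assigned to the RDD.
println(rddFromTextFile.getStorageLevel)

// Release the cached data once it is no longer needed.
rddFromTextFile.unpersist()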