- Mastering Machine Learning with Spark 2.x
- Alex Tellez, Max Pumperla, Michal Malohlava
Data caching
Many machine learning algorithms are iterative in nature and thus require multiple passes over the data. However, all data stored in a Spark RDD is by default transient, since an RDD stores only the transformations to be executed, not the actual data. That means each action recomputes the data over and over by re-executing the transformations recorded in the RDD.
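As a minimal sketch of this behavior (the input path and the transformation chain are assumptions made purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("caching-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical transformation chain; the input path is an assumption
// made purely for illustration.
val features = sc.textFile("data/input.txt")
  .map(_.split(","))
  .map(fields => fields.map(_.toDouble))

// Nothing is cached yet, so each action below re-reads the file and
// re-runs both map() transformations by replaying the RDD lineage.
val rowCount = features.count()   // first full pass over the data
val firstRow = features.first()   // replays the lineage again
```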
Hence, Spark provides a way to persist the data in case we need to iterate over it. Spark also exposes several StorageLevels that allow storing data with various options:
- NONE: No caching at all
- MEMORY_ONLY: Caches RDD data only in memory
- DISK_ONLY: Writes cached RDD data to disk only, keeping nothing in memory
- MEMORY_AND_DISK: Caches RDD data in memory; partitions that do not fit in memory are spilled to disk
- OFF_HEAP: Use external memory storage which is not part of JVM heap
Furthermore, Spark gives users the ability to cache data in two flavors: raw (for example, MEMORY_ONLY) and serialized (for example, MEMORY_ONLY_SER). The latter uses large memory buffers to store the serialized content of the RDD directly. Which one to use is very task and resource dependent. A good rule of thumb is that if the dataset you are working with is smaller than 10 gigabytes, raw caching is preferred to serialized caching. However, once you cross that 10-gigabyte soft threshold, raw caching imposes a greater memory footprint than serialized caching.
Spark can be forced to cache by calling the cache() method on an RDD, or directly by calling the persist method with the desired storage level - persist(StorageLevel.MEMORY_ONLY_SER). It is useful to know that an RDD allows its storage level to be set only once.
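Continuing the hypothetical features RDD from the earlier sketch, the API looks like this; note the unpersist() call required before switching to a different level:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
features.cache()
features.count()                     // materializes the in-memory copy

// The storage level can be set only once; switching to the serialized
// flavor first requires dropping the cached data via unpersist().
features.unpersist()
features.persist(StorageLevel.MEMORY_ONLY_SER)

// Subsequent actions now read the (serialized) cached blocks instead of
// recomputing the whole lineage.
val rowCountAgain = features.count()
```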
Deciding what to cache and how to cache it is part of the art of tuning Spark; the golden rule is to use caching when we need to access RDD data several times, and to choose the storage level based on the application's preferred trade-off between speed and storage. A great blog post that goes into far more detail than what is given here is available at:
http://sujee.net/2015/01/22/understanding-spark-caching/#.VpU1nJMrLdc
Cached RDDs can also be inspected from the H2O Flow UI by evaluating a cell containing getRDDs:
