- Mastering Apache Spark 2.x (Second Edition)
- Romeo Kienzler
Memory
To avoid OOM (Out of Memory) errors for tasks on your Apache Spark cluster, consider the following questions when tuning:
- Consider the level of physical memory available on your Spark worker nodes. Can it be increased? Check the memory consumption of operating system processes during high workloads to get an idea of how much memory is actually free, and make sure the workers have enough.
- Consider data partitioning. Can you increase the number of partitions? As a rule of thumb, you should have at least as many partitions as there are CPU cores available on the cluster. Use the repartition function of the RDD API.
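The rule of thumb above can be sketched as a small helper. Note that `suggest_partitions` is a hypothetical function for illustration, not a Spark API; only `repartition` and `sc.defaultParallelism`, mentioned in the comment, are real Spark names.

```python
# Illustrative heuristic (hypothetical helper, not part of Spark):
# choose a partition count of at least one, and typically a small
# multiple, of the available CPU cores.
def suggest_partitions(total_cores: int, multiplier: int = 2) -> int:
    """Rule of thumb: a few partitions per CPU core in the cluster."""
    return max(total_cores * multiplier, total_cores)

# With Spark this could then be applied as, for example:
#   rdd = rdd.repartition(suggest_partitions(sc.defaultParallelism))
print(suggest_partitions(16))  # 32
```

A multiplier above 1 gives the scheduler room to balance work when some partitions are larger than others.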
- Can you adjust the fraction of JVM memory used for storage and caching of RDDs? Task execution competes with cached data for the same memory pool. Use the Storage page of the Apache Spark web UI to check whether this fraction is set to an optimal value, then tune the following properties:
- spark.memory.fraction
- spark.memory.storageFraction
- spark.memory.offHeap.enabled=true
- spark.memory.offHeap.size
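As an illustration, these properties can be set in spark-defaults.conf. The values below are example starting points (the first two are the Spark 2.x defaults), not recommendations for every workload:

```
# Fraction of JVM heap used for execution and storage (default 0.6)
spark.memory.fraction          0.6
# Share of that fraction protected from eviction by execution (default 0.5)
spark.memory.storageFraction   0.5
# Optionally move cached data off the JVM heap to reduce GC pressure
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      2g
```

The same properties can also be passed per job via `--conf` on spark-submit.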
In addition, the following two measures can improve performance:
- Consider using Parquet as a storage format; it is far more storage-efficient than CSV or JSON
- Consider using the DataFrame/Dataset API instead of the RDD API, as it may result in more efficient execution (more about this in the next three chapters)
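To get a feel for why a columnar binary format such as Parquet is smaller than row-oriented text, here is a minimal sketch using only the standard library. It is an analogy, not actual Parquet encoding, and it ignores Parquet's additional compression and encoding tricks:

```python
import json
import struct

rows = [{"id": i, "value": i * 0.5} for i in range(1000)]

# Row-oriented text encoding: one JSON document per row (as in JSON Lines)
json_size = sum(len(json.dumps(r)) for r in rows)

# Column-oriented binary encoding: pack each column contiguously,
# roughly the layout a columnar format uses before compression
ids = struct.pack(f"{len(rows)}q", *(r["id"] for r in rows))
values = struct.pack(f"{len(rows)}d", *(r["value"] for r in rows))
binary_size = len(ids) + len(values)

print(json_size, binary_size)  # the columnar binary encoding is smaller
```

On top of this, real columnar formats compress each column separately, which works well because values within a column are similar; that is where most of Parquet's additional savings come from.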