
Memory

To avoid OOM (Out of Memory) errors in the tasks running on your Apache Spark cluster, consider the following questions when tuning:

  • Consider the amount of physical memory available on your Spark worker nodes. Can it be increased? Check the memory consumption of operating system processes during high workloads to get an idea of how much memory is actually free, and make sure the workers have enough.
  • Consider data partitioning. Can you increase the number of partitions? As a rule of thumb, you should have at least as many partitions as there are CPU cores available on the cluster. Use the repartition function of the RDD API.
  • Can you modify the fraction of JVM memory that Spark uses for storage and caching of RDDs? Task execution competes with cached data for the same memory. Use the Storage tab of the Apache Spark web UI to check whether this fraction is set to an optimal value, then adjust the following properties:
  • spark.memory.fraction
  • spark.memory.storageFraction
  • spark.memory.offHeap.enabled=true
  • spark.memory.offHeap.size
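The tuning knobs above can be sketched in a standalone Spark application as follows. This is a minimal sketch, not a recommendation: the property values (0.6, 0.5, 2g), the partition count 200, and the input path are placeholders that must be adapted to your cluster and workload.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune them against the Storage tab of the web UI.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")          // heap share for execution + storage
  .set("spark.memory.storageFraction", "0.5")   // storage share within that fraction
  .set("spark.memory.offHeap.enabled", "true")  // allow off-heap allocation
  .set("spark.memory.offHeap.size", "2g")       // off-heap pool size

val spark = SparkSession.builder.config(conf).getOrCreate()

// Repartition so that there are at least as many partitions as CPU cores;
// 200 and the HDFS path are hypothetical.
val lines = spark.sparkContext.textFile("hdfs:///data/input").repartition(200)
```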

In addition, the following two changes can improve performance:

  • Consider using Parquet as the storage format, as it is much more storage-efficient than CSV or JSON
  • Consider using the DataFrame/Dataset API instead of the RDD API, as it may result in more efficient execution (more on this in the next three chapters)
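Both suggestions can be sketched together: convert a CSV file to Parquet once, then run subsequent jobs with the DataFrame API on the Parquet copy. The paths and the column name eventType below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// One-off conversion: read the CSV, write it back as Parquet
// (columnar and compressed, so far more compact than CSV or JSON).
val df = spark.read.option("header", "true").csv("/data/events.csv")
df.write.parquet("/data/events.parquet")

// Later jobs query the Parquet copy through the DataFrame API,
// letting Spark's optimizer plan the execution instead of
// running hand-written RDD transformations.
val events = spark.read.parquet("/data/events.parquet")
events.groupBy("eventType").count().show()
```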