
Benefits of using Spark ML as compared to existing libraries

The AMPLab at UC Berkeley evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as through benchmarks of user applications.

  • Algorithms used: logistic regression and k-means
  • Use cases: the first iteration and subsequent iterations

All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. HDFS was used for storage, with 256 MB blocks. Refer to the following graph:

The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of logistic regression:

The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of the k-means clustering algorithm.

The overall results show the following:

  • Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
  • Applications written by users perform and scale well. In one case, Spark sped up an analytics report that had been running on Hadoop by 40 times.
  • When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
  • Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds.
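
The gap between the first and subsequent iterations comes from how often the input is read and deserialized. The following is a minimal sketch in plain Python, not Spark code, that illustrates the idea under stated assumptions: the toy dataset `RAW_LINES` and the functions `parse`, `train_reparse`, and `train_cached` are hypothetical names introduced only for this illustration. One trainer re-parses the text input on every iteration, as a chain of independent MapReduce jobs would. The other parses once and reuses the in-memory objects, which is roughly what caching an RDD achieves.

```python
import math

# Hypothetical toy dataset: each text record is "label,x1,x2".
RAW_LINES = ["1,2.0,1.0", "0,-1.5,-0.5", "1,1.0,2.5", "0,-2.0,-1.0"]


def parse(line):
    """Deserialize one text record into (label, features)."""
    parts = line.split(",")
    return float(parts[0]), [float(v) for v in parts[1:]]


def gradient_step(points, weights, lr=0.1):
    """One batch gradient-descent step for logistic regression."""
    grad = [0.0] * len(weights)
    for label, features in points:
        z = sum(w * x for w, x in zip(weights, features))
        pred = 1.0 / (1.0 + math.exp(-z))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return [w - lr * g for w, g in zip(weights, grad)]


def train_reparse(lines, iterations):
    """Re-read and re-parse the input on every iteration."""
    weights = [0.0, 0.0]
    parsed = 0  # number of records deserialized so far
    for _ in range(iterations):
        points = [parse(line) for line in lines]
        parsed += len(points)
        weights = gradient_step(points, weights)
    return weights, parsed


def train_cached(lines, iterations):
    """Parse once, keep the parsed objects in memory, and reuse them."""
    points = [parse(line) for line in lines]
    parsed = len(points)
    weights = [0.0, 0.0]
    for _ in range(iterations):
        weights = gradient_step(points, weights)
    return weights, parsed


if __name__ == "__main__":
    w_reparse, n_reparse = train_reparse(RAW_LINES, 10)
    w_cached, n_cached = train_cached(RAW_LINES, 10)
    print("records parsed without cache:", n_reparse)
    print("records parsed with cache:", n_cached)
    print("same model:", w_reparse == w_cached)
```

Both trainers arrive at the same weights, but the cached version deserializes each record once rather than once per iteration. On a real cluster the saved work also includes disk and network I/O, which this sketch does not model.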

Spark versus Hadoop in the sort benchmark: In 2014, the Databricks team entered the Sort Benchmark (http://sortbenchmark.org/), sorting a 100 TB dataset. The Hadoop entry had run in a dedicated data center, whereas the Spark cluster of over 200 nodes ran on EC2. Spark used HDFS for distributed storage.

Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph:
