
Benefits of using Spark ML as compared to existing libraries

The AMPLab at UC Berkeley evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as through benchmarks of user applications.

  • Algorithms used: logistic regression and k-means
  • Use cases: first iteration and subsequent iterations

All the tests used m1.xlarge EC2 nodes with 4 cores and 15 GB of RAM. HDFS was used for storage, with 256 MB blocks. Refer to the following graph:

The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of logistic regression.

The preceding graph compares the performance of Hadoop and Spark on the first and subsequent iterations of the k-means clustering algorithm.

The overall results show the following:

  • Spark outperforms Hadoop by up to 20 times in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.
  • Applications written by users perform and scale well. In one case, Spark sped up an analytics report that had been running on Hadoop by 40 times.
  • When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.
  • Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds.
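
The first result above attributes the speedup to keeping deserialized data in memory across iterations. The following is a minimal single-machine sketch of that idea in plain Python, not Spark code: it runs a few gradient-descent steps of logistic regression either re-parsing the raw text records on every iteration (loosely analogous to re-reading input in each Hadoop job) or parsing them once and reusing the in-memory objects (loosely analogous to a cached RDD). The records, the `CountingParser` class, and the `train` function are hypothetical and exist only for illustration.

```python
import math

# Hypothetical raw records in "label,x1,x2" form; the values are made up.
RAW_LINES = ["1,2.0,1.0", "0,-1.0,-1.5", "1,1.5,0.5", "0,-2.0,-0.5"]


class CountingParser:
    """Deserializes text records and counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def parse(self, line):
        self.calls += 1
        fields = [float(v) for v in line.split(",")]
        return fields[0], fields[1:]


def gradient_step(points, weights, learning_rate=0.1):
    """One batch gradient-descent step for logistic regression."""
    grad = [0.0] * len(weights)
    for label, features in points:
        z = sum(w * x for w, x in zip(weights, features))
        prediction = 1.0 / (1.0 + math.exp(-z))
        for j, x in enumerate(features):
            grad[j] += (prediction - label) * x
    return [w - learning_rate * g for w, g in zip(weights, grad)]


def train(iterations, keep_in_memory):
    """Return (weights, number of parse calls) after the given iterations."""
    parser = CountingParser()
    weights = [0.0, 0.0]
    # Parse once up front only when the data is kept in memory.
    cached = [parser.parse(l) for l in RAW_LINES] if keep_in_memory else None
    for _ in range(iterations):
        points = cached if keep_in_memory else [parser.parse(l) for l in RAW_LINES]
        weights = gradient_step(points, weights)
    return weights, parser.calls


if __name__ == "__main__":
    for mode in (True, False):
        weights, parse_calls = train(10, keep_in_memory=mode)
        print(f"keep_in_memory={mode}: parse calls={parse_calls}")
```

Both modes compute the same weights. Only the amount of repeated parsing differs: with 10 iterations over 4 records, the in-memory variant parses 4 records in total, while the other parses 40. This sketch counts parse calls only and does not model disk or network I/O, which the evaluation also lists as a source of the speedup.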

Spark versus Hadoop in a sort benchmark: In 2014, the Databricks team entered the Sort Benchmark (http://sortbenchmark.org/), sorting a 100 TB dataset. The Hadoop entry had run in a dedicated data center, whereas the Spark cluster of just over 200 nodes ran on EC2, with HDFS as the distributed storage layer.

Spark was 3 times faster than Hadoop and used 10 times fewer machines. Refer to the following graph:
