官术网_书友最值得收藏!

Introduction

Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics.

Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of the memory for storage. In MapReduce, memory is primarily used for actual computation. Spark uses memory both to compute and store objects.

Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.

Though Spark is written in Scala, and this book only focuses on recipes in Scala, Spark also supports Java and Python.

Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements.

The following figure shows the Spark ecosystem:

The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop's compute framework that has a robust resource management feature that Spark can seamlessly use.

主站蜘蛛池模板: 荔波县| 丹阳市| 额济纳旗| 宁海县| 加查县| 邮箱| 富平县| 瑞安市| 泽普县| 乾安县| 合肥市| 上饶市| 海南省| 天峨县| 兴宁市| 威海市| 中宁县| 白朗县| 元谋县| 汤阴县| 安远县| 泊头市| 剑河县| 高平市| 玛多县| 甘洛县| 庆云县| 富民县| 苏尼特左旗| 广丰县| 开江县| 阿拉善左旗| 金昌市| 长泰县| 威宁| 万载县| 漳平市| 望城县| 林州市| 景泰县| 江华|