
A First Taste and What’s New in Apache Spark V2

Apache Spark is a distributed, highly scalable in-memory data analytics system that lets you develop applications in Java, Scala, and Python, as well as languages such as R. It has one of the highest contribution and involvement rates among the Apache top-level projects at this time. Other Apache projects, such as Mahout, now use it as a processing engine instead of MapReduce. It is also possible to use a Hive context so that Spark applications can read and write data directly to and from Apache Hive.

Initially, Apache Spark provided four main submodules: SQL, MLlib, GraphX, and Streaming. Each will be explained in its own chapter, but a simple overview is useful here. The modules are interoperable, so data can be passed between them; for instance, streamed data can be passed to SQL, and a temporary table can be created. Since version 1.6.0, MLlib has had a sibling called SparkML with a different API, which we will cover in later chapters.

The following figure explains how this book will address Apache Spark and its modules:

The top two rows show Apache Spark and its submodules. Wherever possible, we will illustrate how each module's functionality can be extended with additional tools.

Since Spark is an in-memory processing system, it cannot exist alone when used at scale: the data must reside somewhere. In practice, it will most likely be used alongside the Hadoop tool set and its associated ecosystem.

Luckily, Hadoop stack providers, such as IBM and Hortonworks, offer an open data platform: a Hadoop stack and cluster manager that integrates with Apache Spark, Hadoop, and most of the current stable tool set, fully based on open source.

Throughout this book, we will use the Hortonworks Data Platform (HDP) Sandbox 2.6.

You can use an alternative configuration, but we find that the open data platform provides most of the tools that we need and automates the configuration, leaving us more time for development.

In the following sections, we will cover each of the components mentioned earlier in more detail before we dive into the material starting in the next chapter:

  • Spark Machine Learning
  • Spark Streaming
  • Spark SQL
  • Spark Graph Processing
  • Extended Ecosystem
  • Updates in Apache Spark
  • Cluster design
  • Cloud-based deployments
  • Performance parameters