- Hands-On Data Analysis with Scala
- Rajesh Gupta
Apache Spark
Apache Spark (https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Spark provides APIs for both batch and stream processing in a distributed computing environment. Spark's API can be broadly divided into the following five categories:
- Core: RDD
- SQL structured: DataFrames and Datasets
- Streaming: Structured streaming and DStreams
- MLlib: Machine learning
- GraphX: Graph processing
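To give a flavour of the first two categories, the following minimal sketch builds an RDD (core API) and a Dataset (structured API) from the same numbers and applies the same transformation to each. The object name `SparkApiTour` and the `local[*]` master are illustrative choices for local experimentation, not code from this book:

```scala
import org.apache.spark.sql.SparkSession

object SparkApiTour {
  def main(args: Array[String]): Unit = {
    // A local SparkSession: the unified entry point since Spark 2.x
    val spark = SparkSession.builder()
      .appName("api-tour")
      .master("local[*]")      // run locally using all available cores
      .getOrCreate()
    import spark.implicits._   // brings in encoders for Dataset creation

    // Core API: an RDD of numbers, transformed and reduced
    val rdd = spark.sparkContext.parallelize(1 to 5)
    println(rdd.map(_ * 2).reduce(_ + _)) // 30

    // Structured API: the same data as a typed Dataset
    val ds = (1 to 5).toDS()
    println(ds.map(_ * 2).reduce(_ + _)) // 30

    spark.stop()
  }
}
```

Both APIs express the same computation, but the structured API additionally carries schema information, which lets Spark's Catalyst optimizer plan the work.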
Apache Spark is a very active open source project. New features are added and performance improvements made on a regular basis. Typically, there is a new minor release of Apache Spark every three months with significant performance and feature improvements. At the time of writing, 2.4.0 is the most recent version of Spark.
The following SBT settings add the Spark SQL module, which transitively pulls in Spark core:
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
Spark 2.4.0 introduced support for Scala 2.12; however, we will use Scala 2.11 to explore Spark's feature set. Spark is covered in more detail in the subsequent chapters.
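As a quick illustration of what the spark-sql dependency enables, the sketch below creates a DataFrame from a case class and queries it with SQL. The `Person` record and `QuickStart` object are hypothetical names introduced here for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this example
case class Person(name: String, age: Int)

object QuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quick-start")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from local data and register it as a SQL view
    val people = Seq(Person("Alice", 34), Person("Bob", 29)).toDF()
    people.createOrReplaceTempView("people")

    // The spark-sql module lets us run SQL directly over DataFrames
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```

Compiled against Scala 2.11.12 and Spark 2.4.x as configured above, this runs locally without any cluster setup.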