- Hands-On Data Analysis with Scala
- Rajesh Gupta
Apache Spark
Apache Spark (https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Spark provides APIs for both batch and stream processing in a distributed computing environment. Spark's API can be broadly divided into the following five categories:
- Core: RDD
- SQL structured: DataFrames and Datasets
- Streaming: Structured streaming and DStreams
- MLlib: Machine learning
- GraphX: Graph processing
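To give a flavour of the first two categories, the following minimal sketch builds an RDD (core API) and a Dataset (structured API) from the same numbers and applies the same transformation to each. The object name `SparkApiTour` and the `local[*]` master are illustrative choices for local experimentation, not code from this book:

```scala
import org.apache.spark.sql.SparkSession

object SparkApiTour {
  def main(args: Array[String]): Unit = {
    // A local SparkSession: the unified entry point since Spark 2.x
    val spark = SparkSession.builder()
      .appName("api-tour")
      .master("local[*]")      // run locally using all available cores
      .getOrCreate()
    import spark.implicits._   // brings in encoders for Dataset creation

    // Core API: an RDD of numbers, transformed and reduced
    val rdd = spark.sparkContext.parallelize(1 to 5)
    println(rdd.map(_ * 2).reduce(_ + _)) // 30

    // Structured API: the same data as a typed Dataset
    val ds = (1 to 5).toDS()
    println(ds.map(_ * 2).reduce(_ + _)) // 30

    spark.stop()
  }
}
```

Both APIs express the same computation, but the structured API additionally carries schema information, which lets Spark's Catalyst optimizer plan the work.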
Apache Spark is a very active open source project. New features are added and performance improvements made on a regular basis. Typically, there is a new minor release of Apache Spark every three months with significant performance and feature improvements. At the time of writing, 2.4.0 is the most recent version of Spark.
The following SBT settings add the Spark SQL module, which transitively pulls in Spark core:
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
Spark 2.4.0 introduced support for Scala 2.12; however, we will use Scala 2.11 to explore Spark's feature set. Spark is covered in more detail in the subsequent chapters.
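As a quick illustration of what the spark-sql dependency enables, the sketch below creates a DataFrame from a case class and queries it with SQL. The `Person` record and `QuickStart` object are hypothetical names introduced here for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this example
case class Person(name: String, age: Int)

object QuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quick-start")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a DataFrame from local data and register it as a SQL view
    val people = Seq(Person("Alice", 34), Person("Bob", 29)).toDF()
    people.createOrReplaceTempView("people")

    // The spark-sql module lets us run SQL directly over DataFrames
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```

Compiled against Scala 2.11.12 and Spark 2.4.x as configured above, this runs locally without any cluster setup.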