Apache Spark

Apache Spark (https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Spark provides APIs for batch as well as stream data processing in a distributed computing environment. Spark's API can be broadly divided into the following five categories:

  • Core: RDD
  • SQL: Structured data processing with DataFrames and Datasets
  • Streaming: Structured streaming and DStreams
  • MLlib: Machine learning
  • GraphX: Graph processing

Apache Spark is a very active open source project, and new features and performance improvements are added regularly. Typically, a new minor release of Apache Spark appears every three months with significant performance and feature improvements. At the time of writing, 2.4.0 is the most recent version of Spark.

The following SBT settings add the spark-sql module, which brings in Spark core as a transitive dependency:

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

Spark version 2.4.0 introduced support for Scala version 2.12; however, we will use Scala version 2.11 to explore Spark's feature set. Spark is covered in more detail in the subsequent chapters.
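
To check that the dependency resolves and to get a first feel for two of the API categories listed earlier, a minimal sketch along the following lines may help. It assumes the spark-sql dependency is on the classpath and runs Spark in local mode. The object name SparkQuickCheck, the helper names, the application name, and the "local[*]" master setting are illustrative choices, not something prescribed by Spark or by this book.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, used only for this illustration.
object SparkQuickCheck {

  // Builds a session that runs inside the current JVM.
  // "local[*]" (use all local cores) and the app name are assumptions
  // suitable for experimentation, not for a cluster deployment.
  def localSession(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("spark-quick-check")
      .getOrCreate()

  // Core API: distribute a small collection as an RDD and sum it.
  def sumWithRdd(spark: SparkSession, xs: Seq[Int]): Int =
    spark.sparkContext.parallelize(xs).reduce(_ + _)

  // SQL API: turn the same collection into a Dataset and count even values.
  def countEvens(spark: SparkSession, xs: Seq[Int]): Long = {
    import spark.implicits._ // provides toDS() and the Int encoder
    xs.toDS().filter((x: Int) => x % 2 == 0).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    try {
      println(sumWithRdd(spark, 1 to 10)) // 55
      println(countEvens(spark, 1 to 10)) // 5
    } finally {
      spark.stop() // release local Spark resources
    }
  }
}
```

Running this with sbt run should print the two results among Spark's log output. Both helpers take the session as a parameter so that the same session can be reused across calls, since creating one is comparatively expensive.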
