
Apache Spark

Apache Spark (https://spark.apache.org/) is a unified analytics engine for large-scale data processing. Spark provides APIs for both batch and stream processing in a distributed computing environment. Spark's API can be broadly divided into the following five categories:

  • Core: RDD
  • SQL: Structured data processing with DataFrames and Datasets
  • Streaming: Structured streaming and DStreams
  • MLlib: Machine learning
  • GraphX: Graph processing
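To make the first two categories concrete, here is a minimal sketch that computes the same per-region totals once with the Core API (RDD) and once with the SQL API (Dataset). The Sale case class, the object name, and the sample data are hypothetical and exist only for this illustration; the sketch assumes the spark-sql module is on the classpath and runs Spark in local mode rather than on a cluster.

```scala
import org.apache.spark.sql.SparkSession

object ApiCategoriesSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Sale(region: String, amount: Double)

  // Core API: an RDD of plain Scala objects, transformed with functions.
  def totalsViaRdd(spark: SparkSession, sales: Seq[Sale]): Map[String, Double] =
    spark.sparkContext
      .parallelize(sales)
      .map(s => (s.region, s.amount))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // SQL API: the same aggregation expressed declaratively on a Dataset.
  def totalsViaDataset(spark: SparkSession, sales: Seq[Sale]): Map[String, Double] = {
    import spark.implicits._
    sales.toDS()
      .groupBy($"region")
      .sum("amount")
      .as[(String, Double)] // tuple columns are mapped by position
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; no cluster is assumed.
    val spark = SparkSession.builder()
      .appName("api-categories-sketch")
      .master("local[*]")
      .getOrCreate()

    val sales = Seq(Sale("north", 10.0), Sale("south", 5.0), Sale("north", 2.5))

    println(totalsViaRdd(spark, sales))
    println(totalsViaDataset(spark, sales))

    spark.stop()
  }
}
```

Both functions should return the same totals. The RDD version spells out how the data is transformed, whereas the Dataset version states what result is wanted and leaves the execution plan to Spark's query optimizer.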

Apache Spark is a very active open source project, with new features and performance improvements added on a regular basis. Typically, there is a new minor release of Apache Spark every three months. At the time of writing, 2.4.0 is the most recent version of Spark.

The following SBT settings pull in the spark-sql module, which transitively includes Spark core; the build uses the 2.4.1 patch release:

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"

Spark 2.4.0 introduced support for Scala 2.12; however, we will use Scala 2.11 to explore Spark's feature set. Spark is covered in more detail in subsequent chapters.
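As a quick sanity check on the build, the following sketch starts a local SparkSession and prints the Spark and Scala versions that were actually resolved. The object name VersionCheck is hypothetical, and local mode is an assumption made so that the example runs without a cluster.

```scala
import org.apache.spark.sql.SparkSession

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; no cluster is assumed.
    val spark = SparkSession.builder()
      .appName("version-check")
      .master("local[*]")
      .getOrCreate()

    // Version of the Spark libraries on the classpath.
    println(s"Spark version: ${spark.version}")
    // Version of the Scala library this program is running against.
    println(s"Scala version: ${scala.util.Properties.versionNumberString}")

    spark.stop()
  }
}
```

With the dependency settings shown earlier, the printed versions should belong to the Spark 2.4 and Scala 2.11 lines; a mismatch usually points to a conflicting scalaVersion or an unexpected transitive dependency.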
