
The SparkSession: your gateway to structured data processing

The SparkSession is the starting point for working with structured data in Apache Spark. It replaces the SQLContext (and HiveContext) used in earlier versions of Apache Spark. It wraps the Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate columnar data with SQL, among other things. It can be used for the following tasks:

  • Executing SQL via the sql method
  • Registering user-defined functions via the udf method
  • Caching
  • Creating DataFrames
  • Creating Datasets
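
The capabilities listed above can be sketched in a few lines. The following is a minimal, self-contained example (the application name and the SQL statements are invented for illustration) that creates a local SparkSession, executes SQL via the sql method, and registers a user-defined function via the udf method:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession for local experimentation; in spark-shell,
// a session named `spark` is already provided for you.
val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")
  .getOrCreate()

// Executing SQL via the sql method
val result = spark.sql("SELECT 1 + 1 AS total")

// Registering a user-defined function via the udf method
spark.udf.register("doubleIt", (x: Int) => x * 2)
val doubled = spark.sql("SELECT doubleIt(21) AS answer")
```

Note that getOrCreate returns an existing session if one is already running, so the same code works both in a standalone application and in an interactive shell.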

The examples in this chapter are written in Scala, as we prefer the language, but you can develop in Python, R, or Java as well. As stated previously, the SparkSession is built on top of the Spark context.

Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or a Dataset by calling the toDF or toDS method:

import spark.implicits._

val rdd = sc.parallelize(List(1, 2, 3))
val df = rdd.toDF  // DataFrame with a single column named "value"
val ds = rdd.toDS  // Dataset[Int]

As you can see, this is very simple, as the conversion methods appear to be available on the RDD object itself.

We are making use of Scala's implicit conversions here because the RDD API wasn't designed with DataFrames or Datasets in mind and therefore lacks the toDF and toDS methods. However, by importing the respective implicits, this behavior is added on the fly. If you want to learn more about Scala implicits, the official Scala language documentation is a good starting point.
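
To illustrate the mechanism itself, here is a small, Spark-independent Scala sketch (the class and method names are invented for this example) showing how an implicit class retrofits a new method onto an existing type, much like Spark's implicits add toDF and toDS to RDDs:

```scala
object ImplicitsDemo {
  // Hypothetical extension: adds a `squared` method to Int.
  // The value class (AnyVal) avoids allocating a wrapper at runtime.
  implicit class RichInt(val i: Int) extends AnyVal {
    def squared: Int = i * i
  }
}

import ImplicitsDemo._

// The compiler rewrites 3.squared to new RichInt(3).squared,
// so the method appears to live on Int itself.
val nine = 3.squared // 9
```

This is exactly why importing spark.implicits._ is required before calling toDF or toDS: without the import, the conversion is not in scope and the compiler reports that the method does not exist.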

Next, we will examine some of the supported file formats available to import and save data.
