
The SparkSession--your gateway to structured data processing

The SparkSession is the entry point for working with structured data in Apache Spark. It replaces the SQLContext (and HiveContext) used in earlier versions of Apache Spark. It wraps the underlying Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate structured data with SQL, among other things. It can be used for the following functions:

  • Executing SQL via the sql method
  • Registering user-defined functions via the udf method
  • Caching
  • Creating DataFrames
  • Creating Datasets
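The first three bullet points can be illustrated with a short sketch. This assumes a SparkSession named spark already exists (as it does, for example, in the spark-shell); the view name numbers and the UDF square are illustrative:

```scala
// Assumes an existing SparkSession bound to the name `spark`
// (the spark-shell provides one automatically).

// Register a user-defined function via the udf method
spark.udf.register("square", (x: Int) => x * x)

// Create a small DataFrame and expose it to SQL as a temporary view
spark.range(1, 4).createOrReplaceTempView("numbers")

// Execute SQL via the sql method
val result = spark.sql("SELECT id, square(id) AS sq FROM numbers")
result.show()

// Cache the view's data in memory for subsequent queries
spark.catalog.cacheTable("numbers")
```

The same DataFrame could of course be queried directly through its API; the sql method is simply the route for users who prefer to express the logic in SQL.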

The examples in this chapter are written in Scala, as we prefer the language, but you can develop in Python, R, and Java as well. As stated previously, the SparkSession wraps the Spark context.
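A minimal sketch of obtaining a session in a standalone application follows; the appName and master values are illustrative, and in the spark-shell a session named spark is created for you:

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; appName and master are example values
val spark = SparkSession.builder()
  .appName("StructuredDataExample")
  .master("local[*]")
  .getOrCreate()

// The underlying Spark context is available from the session
val sc = spark.sparkContext
```

Note that getOrCreate returns an existing session if one is already running in the JVM, which is why it is safe to call in notebooks and shells as well as in standalone programs.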

Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or Dataset by calling the toDF or toDS methods:

import spark.implicits._

val rdd = sc.parallelize(List(1, 2, 3))
val df = rdd.toDF
val ds = rdd.toDS

As you can see, this is very simple as the corresponding methods are on the RDD object itself.

We are making use of Scala implicits here because the RDD API wasn't designed with DataFrames or Datasets in mind and therefore lacks the toDF and toDS methods. However, by importing the respective implicits, this behavior is added on the fly. If you want to learn more about Scala implicits, the official Scala language documentation is recommended.

Next, we will examine some of the supported file formats available to import and save data.
