
The SparkSession: your gateway to structured data processing

The SparkSession is the starting point for working with structured data in Apache Spark. It replaces the SQLContext (and HiveContext) used in earlier versions of Apache Spark. It wraps the Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate columnar data with SQL, among other things. It can be used for the following tasks:

  • Executing SQL via the sql method
  • Registering user-defined functions via the udf method
  • Caching
  • Creating DataFrames
  • Creating Datasets
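
The capabilities listed above can be sketched in a few lines. The following is a minimal, self-contained example (the application name and the SQL statements are invented for illustration) that creates a local SparkSession, executes SQL via the sql method, and registers a user-defined function via the udf method:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession for local experimentation; in spark-shell,
// a session named `spark` is already provided for you.
val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")
  .getOrCreate()

// Executing SQL via the sql method
val result = spark.sql("SELECT 1 + 1 AS total")

// Registering a user-defined function via the udf method
spark.udf.register("doubleIt", (x: Int) => x * 2)
val doubled = spark.sql("SELECT doubleIt(21) AS answer")
```

Note that getOrCreate returns an existing session if one is already running, so the same code works both in a standalone application and in an interactive shell.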

The examples in this chapter are written in Scala, as we prefer the language, but you can develop in Python, R, or Java as well. As stated previously, the SparkSession is built on top of the Spark context.

Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or a Dataset by calling the toDF or toDS method:

import spark.implicits._

val rdd = sc.parallelize(List(1, 2, 3))
val df = rdd.toDF  // DataFrame with a single column named "value"
val ds = rdd.toDS  // Dataset[Int]

As you can see, this is very simple, as the conversion methods appear to be available on the RDD object itself.

We are making use of Scala's implicit conversions here because the RDD API wasn't designed with DataFrames or Datasets in mind and therefore lacks the toDF and toDS methods. However, by importing the respective implicits, this behavior is added on the fly. If you want to learn more about Scala implicits, the official Scala language documentation is a good starting point.
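
To illustrate the mechanism itself, here is a small, Spark-independent Scala sketch (the class and method names are invented for this example) showing how an implicit class retrofits a new method onto an existing type, much like Spark's implicits add toDF and toDS to RDDs:

```scala
object ImplicitsDemo {
  // Hypothetical extension: adds a `squared` method to Int.
  // The value class (AnyVal) avoids allocating a wrapper at runtime.
  implicit class RichInt(val i: Int) extends AnyVal {
    def squared: Int = i * i
  }
}

import ImplicitsDemo._

// The compiler rewrites 3.squared to new RichInt(3).squared,
// so the method appears to live on Int itself.
val nine = 3.squared // 9
```

This is exactly why importing spark.implicits._ is required before calling toDF or toDS: without the import, the conversion is not in scope and the compiler reports that the method does not exist.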

Next, we will examine some of the supported file formats available to import and save data.
