
The SparkSession--your gateway to structured data processing

The SparkSession is the starting point for working with structured data in Apache Spark. It replaces the SQLContext and HiveContext used in previous versions of Apache Spark. It is created from the Spark context and provides the means to load and save data files of different types using DataFrames and Datasets, and to manipulate structured data with SQL, among other things. It can be used for the following tasks:

  • Executing SQL via the sql method
  • Registering user-defined functions via the udf method
  • Caching
  • Creating DataFrames
  • Creating Datasets

The examples in this chapter are written in Scala as we prefer the language, but you can develop in Python, R, and Java as well. As stated previously, the SparkSession is created from the Spark context; in the Spark shell, a preconfigured instance is already available as the variable spark, with the Spark context bound to sc.
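
To make the list above concrete, here is a minimal sketch of creating a SparkSession with the builder API and exercising the sql and udf methods in a standalone application; the application name, the squared function, and the numbers view are illustrative only and do not come from the text:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSessionExample")  // illustrative application name
  .master("local[*]")
  .getOrCreate()

// Register a user-defined function via the udf method
spark.udf.register("squared", (x: Long) => x * x)

// Expose a small Dataset to SQL and query it via the sql method
spark.range(1, 4).createOrReplaceTempView("numbers")
spark.sql("SELECT id, squared(id) AS sq FROM numbers").show()

Since getOrCreate returns the existing session when one is already running, the same code also works unchanged inside the Spark shell.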

Using the SparkSession allows you to implicitly convert RDDs into DataFrames or Datasets. For instance, you can convert an RDD into a DataFrame or a Dataset by calling the toDF or toDS method:

import spark.implicits._   // brings the toDF and toDS methods into scope

val rdd = sc.parallelize(List(1, 2, 3))
val df = rdd.toDF   // DataFrame with a single column named "value"
val ds = rdd.toDS   // Dataset[Int]

As you can see, this is very simple: the corresponding methods appear directly on the RDD object itself.

We are making use of Scala's implicits feature here because the RDD API wasn't designed with DataFrames or Datasets in mind and therefore lacks the toDF and toDS methods. However, by importing the respective implicits, this behavior is added on the fly. If you want to learn more about Scala implicits, the official Scala documentation on implicit classes and conversions is a good starting point.
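
To illustrate the mechanism, the following is a minimal, self-contained sketch of how an implicit class adds a method to an existing type; the StringExtensions object, RichString class, and toWords method are hypothetical names chosen for this example, but spark.implicits._ adds toDF and toDS to RDDs through essentially the same implicit-conversion machinery:

object StringExtensions {
  // Once imported, this implicit class makes toWords available on every String
  implicit class RichString(val s: String) extends AnyVal {
    def toWords: Array[String] = s.split("\\s+")
  }
}

import StringExtensions._
val words = "Spark SQL DataFrames".toWords   // Array(Spark, SQL, DataFrames)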

Next, we will examine some of the supported file formats for importing and saving data.
