
Using Datasets

This API was introduced in Apache Spark 1.6 as an experimental feature and finally became a first-class citizen in Apache Spark 2.0. It is basically a strongly typed version of DataFrames.

DataFrames are kept for backward compatibility and are not going to be deprecated, for two reasons. First, since Apache Spark 2.0, a DataFrame is nothing else but a Dataset whose element type is set to Row. This means that you actually lose strong static typing and fall back to dynamic typing. This is also the second reason why DataFrames are going to stay: dynamically typed languages such as Python or R cannot use Datasets because there is no concept of strong, static types in those languages.
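In the Scala API this relationship is explicit: DataFrame is a type alias for Dataset[Row]. The following minimal sketch (assuming a SparkSession named spark, as available in the spark-shell) illustrates that a DataFrame and a Dataset[Row] are interchangeable and that Row access is only checked at runtime:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// A DataFrame is just a Dataset[Row]; no conversion is needed in either direction
val df: DataFrame = spark.range(3).toDF("id")
val rows: Dataset[Row] = df

// Accessing a Row column by name is resolved at runtime, not at compile time
df.foreach(row => println(row.getAs[Long]("id")))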

So what are Datasets exactly? Let's create one:

import spark.implicits._
case class Person(id: Long, name: String)
val caseClassDS = Seq(Person(1,"Name1"),Person(2,"Name2")).toDS()
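Calling show() on the resulting Dataset prints its contents in tabular form, with the column names taken from the fields of the Person case class. The output should look roughly like this:

caseClassDS.show()
// +---+-----+
// | id| name|
// +---+-----+
// |  1|Name1|
// |  2|Name2|
// +---+-----+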

As you can see, we are defining a case class in order to determine the types of the objects stored in the Dataset. This means that we have a strong, static type that is already known at compile time, so no dynamic type inference takes place. This allows for a lot of further performance optimization and also adds compile-time type safety to your applications. We'll cover the performance optimization aspect in more detail in Chapter 3, The Catalyst Optimizer, and Chapter 4, Project Tungsten.

As you have seen before, DataFrames can be created from an RDD containing Row objects, and these also have a schema. The difference between Datasets and DataFrames is that Row objects are not statically typed, as the schema can be created at runtime by passing StructType objects to the constructor (refer to the last section on DataFrames). As mentioned before, a DataFrame-equivalent Dataset would contain only elements of the Row type. However, we can do better: we can define a static case class matching the schema of our client data stored in client.json:

case class Client(
  age: Long,
  countryCode: String,
  familyName: String,
  id: String,
  name: String
)

Now we can reread our client.json file, but this time we convert it to a Dataset:

val ds = spark.read.json("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter2/client.json").as[Client]

Now we can use it similarly to DataFrames, but under the hood, typed objects are used to represent the data:
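Because the element type is Client, ordinary Scala lambdas can be used and are checked at compile time. The following sketch assumes the Client case class defined above and the implicits import from the beginning of this section:

// Typed transformations: the compiler knows that each element is a Client
val adults = ds.filter(client => client.age >= 18)
val names  = ds.map(client => client.name)

// A typo such as client.agee fails at compile time, whereas the equivalent
// DataFrame expression, df.filter($"agee" >= 18), only fails at runtime
adults.show()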

If we print the schema, we get the same schema as for the DataFrame we used before:
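ds.printSchema()

The output should list the fields of the Client case class; the exact order and nullability shown here are what Spark typically infers from a JSON file (columns sorted alphabetically, all fields nullable):

// root
//  |-- age: long (nullable = true)
//  |-- countryCode: string (nullable = true)
//  |-- familyName: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)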
