
Using Datasets

This API was introduced in Apache Spark 1.6 as an experimental feature and finally became a first-class citizen in Apache Spark 2.0. It is basically a strongly typed version of DataFrames.

DataFrames are kept for backward compatibility and are not going to be deprecated, for two reasons. First, since Apache Spark 2.0, a DataFrame is nothing else but a Dataset whose element type is Row. This means that you lose strong static typing and fall back to dynamic typing. This is also the second reason why DataFrames are going to stay: dynamically typed languages such as Python or R cannot make use of Datasets because there is no concept of strong, static types in those languages.
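To see this relation in practice, here is a minimal sketch (assuming a running spark-shell session where the SparkSession is available as spark): in Spark 2.x, DataFrame is simply a type alias for Dataset[Row], so a DataFrame can be assigned to a Dataset[Row] without any conversion:

import org.apache.spark.sql.{DataFrame, Dataset, Row}
val df: DataFrame = spark.range(3).toDF("id")  // rows are dynamically typed
val dsOfRows: Dataset[Row] = df                // compiles, since DataFrame = Dataset[Row]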

So what are Datasets exactly? Let's create one:

import spark.implicits._
case class Person(id: Long, name: String)
val caseClassDS = Seq(Person(1,"Name1"),Person(2,"Name2")).toDS()
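As a quick illustration of the strong typing, the following sketch (based on the caseClassDS defined above) uses typed lambda expressions; field access is checked by the compiler, so a typo such as _.nme would fail at compile time rather than at runtime:

caseClassDS.filter(_.id > 1).map(_.name).show()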

As you can see, we are defining a case class in order to determine the types of the objects stored in the Dataset. This means that we have a strong, static type here that is already known at compile time, so no dynamic type inference is taking place. This allows for a lot of further performance optimization and also adds compile-time type safety to your applications. We'll cover the performance optimization aspect in more detail in Chapter 3, The Catalyst Optimizer, and Chapter 4, Project Tungsten. As you have seen before, DataFrames can be created from an RDD containing Row objects, and these also have a schema. The difference between Datasets and DataFrames is that Row objects are not statically typed, as the schema can be created at runtime by passing StructType objects to the constructor (refer to the last section on DataFrames). As mentioned before, a DataFrame-equivalent Dataset would contain only elements of the Row type. However, we can do better. We can define a static case class matching the schema of our client data stored in client.json:

case class Client(
  age: Long,
  countryCode: String,
  familyName: String,
  id: String,
  name: String
)

Now we can re-read our client.json file, but this time we convert it to a Dataset:

val ds = spark.read.json("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter2/client.json").as[Client]

Now we can use it similarly to DataFrames, but under the hood, typed objects are used to represent the data:
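For example, the following sketch (assuming the Client case class and the ds Dataset created above) filters and projects the data using typed lambda expressions instead of untyped column expressions:

ds.filter(_.age > 18).map(c => (c.name, c.countryCode)).show()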

If we print the schema, we get the same schema as with the DataFrame used earlier:
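A minimal sketch, again assuming the ds Dataset from above:

ds.printSchema()
// expected to list age (long), countryCode, familyName, id and name (strings),
// just like the schema of the DataFrame read from the same file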
