
Using Datasets

This API was introduced in Apache Spark 1.6 as an experimental feature and finally became a first-class citizen in Apache Spark 2.0. It is basically a strongly typed version of DataFrames.

DataFrames are kept for backward compatibility and are not going to be deprecated, for two reasons. First, since Apache Spark 2.0 a DataFrame is nothing else but a Dataset whose element type is Row. This means that you lose strong static typing and fall back to dynamic typing. This is also the second reason why DataFrames are going to stay: dynamically typed languages such as Python or R cannot make use of Datasets, because they have no concept of strong, static types.
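To make this concrete, the following minimal sketch (assuming the usual spark session provided by the Spark shell) shows that a DataFrame can be assigned to a Dataset[Row] without any conversion, because in Apache Spark 2.x DataFrame is simply a type alias for Dataset[Row]:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In Spark 2.x, the SQL package defines: type DataFrame = Dataset[Row]
val df: DataFrame = spark.range(3).toDF("id")
val rows: Dataset[Row] = df // no conversion needed, the two types are identical
rows.show()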

So what are Datasets exactly? Let's create one:

import spark.implicits._
// Person defines the static element type of the Dataset
case class Person(id: Long, name: String)
val caseClassDS = Seq(Person(1, "Name1"), Person(2, "Name2")).toDS()
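
Since the element type is Person, transformations on caseClassDS are checked against the case class at compile time. The following line is only a small illustration (the filter condition is arbitrary):

// _.id and _.name are resolved against the Person case class at compile time;
// misspelling a field name would be a compilation error, not a runtime failure
caseClassDS.filter(_.id > 1).map(_.name).show()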

As you can see, we are defining a case class in order to determine the types of the objects stored in the Dataset. This means that we have a strong, static type that is already known at compile time, so no dynamic type inference takes place. This allows for a lot of further performance optimization and also adds compile-time type safety to your applications. We'll cover the performance optimization aspect in more detail in Chapter 3, The Catalyst Optimizer, and Chapter 4, Project Tungsten.

As you have seen before, DataFrames can be created from an RDD containing Row objects, and these also have a schema. The difference between Datasets and DataFrames is that Row objects are not statically typed, since the schema can be created at runtime by passing StructType objects to the constructor (refer to the last section on DataFrames). As mentioned before, a DataFrame-equivalent Dataset would contain only elements of the Row type. However, we can do better: we can define a static case class matching the schema of our client data stored in client.json:

case class Client(
  age: Long,
  countryCode: String,
  familyName: String,
  id: String,
  name: String
)

Now we can reread our client.json file, but this time we convert it to a Dataset:

val ds = spark.read
  .json("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter2/client.json")
  .as[Client]

Now we can use it similarly to a DataFrame, but under the hood, typed objects are used to represent the data.
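
As a quick illustration (a minimal sketch; the filter threshold and the selected field are chosen arbitrarily), typed transformations operate directly on Client objects, so field access is checked at compile time:

// The lambdas receive Client instances, so a misspelled field such as
// familyName would not compile
ds.filter(_.age > 18).map(_.familyName).show()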

If we print the schema, we get the same schema as with the formerly used DataFrame.
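
The call below is a minimal sketch; the exact output depends on what Spark infers from client.json, but it should list the five fields of the Client case class:

ds.printSchema()

// Expected output (assuming client.json matches the Client case class):
// root
//  |-- age: long (nullable = true)
//  |-- countryCode: string (nullable = true)
//  |-- familyName: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)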
