- Mastering Apache Spark 2.x (Second Edition)
- Romeo Kienzler
Using Datasets
This API was introduced as experimental in Apache Spark 1.6 and finally became a first-class citizen in Apache Spark 2.0. It is basically a strongly typed version of DataFrames.
DataFrames are kept for backward compatibility and are not going to be deprecated, for two reasons. First, since Apache Spark 2.0 a DataFrame is nothing else but a Dataset where the type is set to Row. This means that you lose strong, static typing and fall back to dynamic typing. This is also the second reason why DataFrames are going to stay: dynamically typed languages such as Python or R are not capable of using Datasets because there is no concept of strong, static types in those languages.
So what are Datasets exactly? Let's create one:
import spark.implicits._
case class Person(id: Long, name: String)
val caseClassDS = Seq(Person(1,"Name1"),Person(2,"Name2")).toDS()
As you can see, we are defining a case class in order to determine the types of the objects stored in the Dataset. This means that we have a strong, static type that is already known at compile time, so no dynamic type inference takes place. This allows for a lot of further performance optimization and also adds compile-time type safety to your applications. We'll cover the performance optimization aspect in more detail in Chapter 3, The Catalyst Optimizer, and Chapter 4, Project Tungsten.

As you have seen before, DataFrames can be created from an RDD containing Row objects, and these also have a schema. The difference between Datasets and DataFrames is that Row objects are not statically typed, since the schema can be created at runtime by passing StructType objects to the constructor (refer to the last section on DataFrames). As mentioned before, a DataFrame-equivalent Dataset would contain only elements of the Row type. However, we can do better: we can define a static case class matching the schema of our client data stored in client.json:
case class Client(
  age: Long,
  countryCode: String,
  familyName: String,
  id: String,
  name: String
)
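For comparison, the runtime-schema route mentioned above would describe the same structure with StructType objects that only exist at runtime. A minimal sketch, assuming the field types match those in client.json:
import org.apache.spark.sql.types._
// Dynamically typed: the schema is a runtime value, not checked by the compiler
val clientSchema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("countryCode", StringType, nullable = true),
  StructField("familyName", StringType, nullable = true),
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))
With the case class, the same information is part of the type and is therefore already available at compile time.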
Now we can reread our client.json file, but this time we convert it to a Dataset:
val ds = spark.read.json("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter2/client.json").as[Client]
Now we can use it similarly to DataFrames, but under the hood, typed objects are used to represent the data:
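A minimal sketch of such typed operations, assuming the fields in client.json match the Client case class defined above:
// filter and map receive Client objects, so field access is checked at compile time
ds.filter(client => client.age >= 18)
  .map(client => client.name)
  .show()
// A typo such as client.nme would fail compilation instead of failing at runtime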

If we print the schema, we get the same as the formerly used DataFrame:
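A minimal sketch; the exact output depends on the contents of client.json:
ds.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- countryCode: string (nullable = true)
//  |-- familyName: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)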
