- Mastering Spark for Data Science
- Andrew Morgan Antoine Amend David George Matthew Hallett
- 2021-07-09 18:49:33
Chapter 3. Input Formats and Schema
The aim of this chapter is to demonstrate how to load data from its raw format into different schemas, enabling a variety of downstream analytics to be run over the same data. When writing analytics, or better still, building libraries of reusable software, you generally have to work with interfaces of fixed input types. Having flexibility in how you transition data between schemas, depending on the purpose, can therefore deliver considerable downstream value, both by widening the types of analysis possible and by allowing existing code to be reused.
Our primary objective is to learn about the data format features that accompany Spark, although we will also delve into the finer points of data management by introducing proven methods that will enhance your data handling and increase your productivity. After all, you will most likely be required to formalize your work at some point, and knowing how to avoid the potential long-term pitfalls is invaluable, both while writing analytics and long afterward.
With this in mind, we will use this chapter to look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark.
In addition, while honing our Spark skills, we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.
We will cover the following topics:
- Dimensional modeling: benefits and weaknesses in relation to Spark
- Focus on the GDELT model
- Lifting the lid on schema-on-read
- Avro object model
- Parquet storage model
Let's start with some best practice.