- Spark Cookbook
- Rishi Yadav
Introduction
Spark provides a unified runtime for big data. HDFS, which is Hadoop's filesystem, is the most used storage platform for Spark as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS and can work with any Hadoop-supported storage.
Hadoop-supported storage means any storage format that works with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records; OutputFormat is responsible for writing the output to storage.
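The split-then-record decomposition that InputFormat performs can be illustrated without a cluster. Below is a minimal conceptual sketch in plain Python, not the actual Hadoop API: the hypothetical names `make_splits`, `read_records`, and `write_output` stand in for an InputFormat, its record reader, and an OutputFormat. The record reader emits (position, line) pairs, loosely mirroring the (byte offset, line text) records that TextInputFormat produces for text files.

```python
# Conceptual sketch of Hadoop's InputFormat/OutputFormat contract.
# All names here are hypothetical illustrations, not the real Hadoop API.

def make_splits(lines, lines_per_split):
    """Play the role of an InputFormat: divide the input into
    'InputSplits', i.e. chunks of whole lines."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]

def read_records(split):
    """Play the role of a record reader: turn one split into
    (position, line) records, one record per line."""
    for position, line in enumerate(split):
        yield position, line

def write_output(records, sink):
    """Play the role of an OutputFormat: write record values to a
    storage sink (here, just a Python list)."""
    for _, line in records:
        sink.append(line)

if __name__ == "__main__":
    data = ["apple", "banana", "cherry", "date", "elderberry"]
    sink = []
    # Each split is read into records independently, as Spark tasks would.
    for split in make_splits(data, lines_per_split=2):
        write_output(read_records(split), sink)
    print(sink)
```

Processing each split independently is what lets Spark schedule one task per split; the round trip above reproduces the input because the splits together cover every record exactly once.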
We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to use any InputFormat interface to load data in Spark. We will also explore loading data stored in Amazon S3, a leading cloud storage platform. We will then explore loading data from Apache Cassandra, a NoSQL database. Finally, we will explore loading data from a relational database.