
Introduction

Spark provides a unified runtime for big data. HDFS, Hadoop's distributed filesystem, is the most widely used storage platform for Spark because it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however; it can work with any Hadoop-supported storage.
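As a quick illustration, the following spark-shell sketch reads the same text data first from the local filesystem and then from HDFS. The file paths and the namenode address are assumptions; adapt them to your environment:

```scala
// spark-shell: sc (the SparkContext) is already available.
// Both paths below are assumptions; only the URI scheme changes.
val localLines = sc.textFile("file:///home/hduser/words.txt")
val hdfsLines  = sc.textFile("hdfs://localhost:9000/user/hduser/words.txt")

println(s"local: ${localLines.count()} lines, hdfs: ${hdfsLines.count()} lines")
```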

Hadoop-supported storage means storage that can work with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records. OutputFormat is responsible for writing to storage.
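To make this concrete, here is a spark-shell sketch that loads a text file through an explicit Hadoop InputFormat (the new-API TextInputFormat, whose records are a byte offset and a line of text); the HDFS path is an assumption:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newAPIHadoopFile takes the path, the InputFormat class, and the key/value classes.
// The path is an assumption; replace it with a file on your cluster.
val records = sc.newAPIHadoopFile(
  "hdfs://localhost:9000/user/hduser/words.txt",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Each record is (byte offset, line of text); keep only the line as a String.
val lines = records.map { case (_, text) => text.toString }
lines.take(5).foreach(println)
```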

We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to load data in Spark using an arbitrary InputFormat. We will also explore loading data stored in Amazon S3, a leading cloud storage platform.
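As a preview of the S3 recipe, the following sketch shows that loading from S3 is again just a matter of the URI scheme; the s3a connector, bucket name, and credential configuration keys used here are assumptions based on a typical Hadoop setup:

```scala
// Supply AWS credentials to the s3a filesystem (read from the environment
// here; hard-coding credentials is not recommended).
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Bucket and object key are assumptions.
val s3Lines = sc.textFile("s3a://my-bucket/data/words.txt")
println(s"S3 object has ${s3Lines.count()} lines")
```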

We will explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.
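For the relational side, a minimal sketch of a JDBC load through the DataFrame API is shown below; it assumes a Spark 2.x spark-shell (where spark is available), a MySQL database named hadoopdb with a person table, and that the matching JDBC driver jar is on the classpath:

```scala
// All connection details below are assumptions for illustration only.
val person = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hadoopdb")
  .option("dbtable", "person")
  .option("user", "hduser")
  .option("password", "secret")
  .load()

person.show()
```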
