
Introduction

Spark provides a unified runtime for big data. HDFS, Hadoop's distributed filesystem, is the most widely used storage platform for Spark, as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. That said, Spark is not limited to HDFS and can work with any Hadoop-supported storage.

Hadoop-supported storage means any storage format that works with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from input data and dividing them further into records. OutputFormat is responsible for writing to storage.
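To illustrate the point, here is a minimal sketch of how Spark can read data through an arbitrary Hadoop InputFormat, using the built-in TextInputFormat, which splits input into (byte offset, line) records. The HDFS path and the local master setting are placeholders for illustration only.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object InputFormatExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InputFormatExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // newAPIHadoopFile lets Spark read from any storage that implements
    // Hadoop's (new API) InputFormat; TextInputFormat yields
    // (byte offset, line of text) pairs.
    val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs://localhost:9000/user/hduser/words") // hypothetical path

    // Keep only the line content, dropping the byte offsets
    val lines = records.map { case (_, text) => text.toString }
    lines.take(5).foreach(println)

    sc.stop()
  }
}
```

Any storage system that ships an InputFormat implementation can be plugged into the same call, which is what makes Spark storage-agnostic.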

We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to use any InputFormat interface to load data in Spark. We will also explore loading data stored in Amazon S3, a leading cloud storage platform; a brief sketch follows below.
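As a preview of these recipes, the sketch below shows that the local filesystem, HDFS, and S3 are all read with the same textFile call; only the URI scheme changes. The paths and bucket name are placeholders, and reading from S3 assumes AWS credentials are already configured for the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("TextFileLoad").setMaster("local[*]"))

    // Same API, different URI schemes (all paths are placeholders)
    val local = sc.textFile("file:///home/hduser/words")
    val hdfs  = sc.textFile("hdfs://localhost:9000/user/hduser/words")
    val s3    = sc.textFile("s3a://my-bucket/words") // needs AWS credentials configured

    println(s"Local line count: ${local.count()}")

    sc.stop()
  }
}
```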

We will explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.
