
Introduction

Spark provides a unified runtime for big data. HDFS, Hadoop's distributed filesystem, is the most widely used storage platform for Spark because it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however; it can work with any Hadoop-supported storage.
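As a quick illustration, the following spark-shell sketch reads the same text data first from the local filesystem and then from HDFS. The file paths and the namenode address are assumptions; adapt them to your environment:

```scala
// spark-shell: sc (the SparkContext) is already available.
// Both paths below are assumptions; only the URI scheme changes.
val localLines = sc.textFile("file:///home/hduser/words.txt")
val hdfsLines  = sc.textFile("hdfs://localhost:9000/user/hduser/words.txt")

println(s"local: ${localLines.count()} lines, hdfs: ${hdfsLines.count()} lines")
```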

Hadoop-supported storage means storage that can work with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records. OutputFormat is responsible for writing to storage.
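To make this concrete, here is a spark-shell sketch that loads a text file through an explicit Hadoop InputFormat (the new-API TextInputFormat, whose records are a byte offset and a line of text); the HDFS path is an assumption:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newAPIHadoopFile takes the path, the InputFormat class, and the key/value classes.
// The path is an assumption; replace it with a file on your cluster.
val records = sc.newAPIHadoopFile(
  "hdfs://localhost:9000/user/hduser/words.txt",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Each record is (byte offset, line of text); keep only the line as a String.
val lines = records.map { case (_, text) => text.toString }
lines.take(5).foreach(println)
```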

We will start with writing to the local filesystem and then move on to loading data from HDFS. In the Loading data from HDFS recipe, we will cover the most common file format: regular text files. In the next recipe, we will cover how to load data in Spark using an arbitrary InputFormat. We will also explore loading data stored in Amazon S3, a leading cloud storage platform.
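As a preview of the S3 recipe, the following sketch shows that loading from S3 is again just a matter of the URI scheme; the s3a connector, bucket name, and credential configuration keys used here are assumptions based on a typical Hadoop setup:

```scala
// Supply AWS credentials to the s3a filesystem (read from the environment
// here; hard-coding credentials is not recommended).
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Bucket and object key are assumptions.
val s3Lines = sc.textFile("s3a://my-bucket/data/words.txt")
println(s"S3 object has ${s3Lines.count()} lines")
```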

We will explore loading data from Apache Cassandra, which is a NoSQL database. Finally, we will explore loading data from a relational database.
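For the relational side, a minimal sketch of a JDBC load through the DataFrame API is shown below; it assumes a Spark 2.x spark-shell (where spark is available), a MySQL database named hadoopdb with a person table, and that the matching JDBC driver jar is on the classpath:

```scala
// All connection details below are assumptions for illustration only.
val person = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hadoopdb")
  .option("dbtable", "person")
  .option("user", "hduser")
  .option("password", "secret")
  .load()

person.show()
```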
