- Learning Apache Spark 2
- Muhammad Asif Abbasi
Commonly supported file systems
Until now we have focused mostly on the functional aspects of Spark and have therefore deferred the discussion of the filesystems it supports. You might have seen a couple of examples around HDFS, but the primary focus has been on local file systems. In production environments, however, it is extremely rare to work on a local filesystem; chances are you will be working with distributed file systems such as HDFS or Amazon S3.
Working with HDFS
Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and companies are already storing massive amounts of data on HDFS by moving it off their traditional database systems and creating data lakes on Hadoop. Spark allows you to read data from HDFS in much the same way that you would read from a typical filesystem; the only difference is that the path points at the NameNode and the HDFS port.
If you are running Spark on YARN inside a Hadoop cluster, you might not even need to specify the NameNode and HDFS port, as any path you pass will default to HDFS.
Most of the methods that we have seen previously can be used with HDFS. The path to be specified for HDFS is as follows:
hdfs://master:port/filepath
As an example, we have the following settings for our Hadoop cluster:
- NameNode: hadoopmaster.packtpub.com
- HDFS Port: 8020
- File Location: /spark/samples/productsales.csv
The path that you need to specify would be as follows:
hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv
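As a minimal sketch (not taken from the book's sample code), and assuming sc is an already created SparkContext, reading this file works just like a local read, only with the HDFS URI:
val sales = sc.textFile("hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv")
println(sales.count())   // force the read to confirm the file is reachable
// On YARN inside the cluster, fs.defaultFS typically points at HDFS,
// so the scheme and NameNode can be dropped:
val salesFromDefaultFs = sc.textFile("/spark/samples/productsales.csv")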
Working with Amazon S3
S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. As of 2013, Amazon S3 was reported to store more than 2 trillion objects. The core principles of S3 include scalability, high availability, low latency, and low cost. Notable users of S3 include Netflix, Reddit, Dropbox, Mojang (creators of Minecraft), Tumblr, and Pinterest.
S3 provides excellent speed when your cluster is inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet. Accessing S3 data is relatively straightforward: you pass a path starting with s3n:// to Spark's file input methods.
However, before reading from S3, you need to either configure your AWS access key ID and secret access key (via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or the equivalent Hadoop configuration properties shown below), or pass them as a part of your path:
- Configuring the parameters:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myaccessKeyID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")
val data = sc.textFile("s3n://bucket/fileLocation")
- Passing the Access Key Id and Secret Key:
val data = sc.textFile("s3n://MyAccessKeyID:MySecretKey@svr/fileloc")
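As a follow-up usage sketch (assuming the credentials have already been configured as above, that spark is a SparkSession, and that the bucket and file names are placeholders), the same S3 path can also be handed to the DataFrame reader that the next paragraph alludes to:
val df = spark.read.option("header", "true").csv("s3n://bucket/fileLocation")
df.show(5)   // preview the first few rows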
Having looked at the most common file systems, let's turn our attention to Spark's ability to interact with common databases and structured sources. We have already highlighted Spark's ability to fetch data from CSV and TSV files and load it into DataFrames. However, it is time to discuss Spark's ability to interact with databases, which will be covered in much more detail in Chapter 4, Spark SQL.