
Commonly supported file systems

Until now we have mostly focused on the functional aspects of Spark and have therefore stayed away from a discussion of the filesystems it supports. You might have seen a couple of examples around HDFS, but the primary focus has been on local filesystems. In production environments, however, it is extremely rare to work against a local filesystem; chances are that you will be working with distributed filesystems such as HDFS and Amazon S3.

Working with HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and many companies already store massive amounts of data on HDFS by moving it off their traditional database systems and creating data lakes on Hadoop. Spark allows you to read data from HDFS in much the same way you would read from a local filesystem, the only difference being that you point to the NameNode host and the HDFS port.

If you are running Spark on YARN inside a Hadoop cluster, you might not even need to specify the NameNode and port, as any path you pass will default to HDFS.
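For instance, on such a cluster the scheme and NameNode host can be dropped entirely. The following one-liner is only a minimal sketch, assuming an existing SparkContext named sc and the sample file path used later in this section:

      // No hdfs:// scheme or NameNode host: the path resolves against the
      // cluster's default filesystem (HDFS) when running on YARN
      val data = sc.textFile("/spark/samples/productsales.csv")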

Most of the methods that we have seen previously can be used with HDFS. The path to be specified for HDFS is as follows:

hdfs://master:port/filepath

As an example, we have the following settings for our Hadoop cluster:

  • NameNode: hadoopmaster.packtpub.com
  • HDFS Port: 8020
  • File Location: /spark/samples/productsales.csv

The path that you need to specify would be as follows:

hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv
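As a quick sketch of what this looks like in practice, the following snippet reads that file into an RDD of text lines and counts them. It assumes an existing SparkContext named sc; the host, port, and file location are the illustrative values above:

      // Read the CSV from HDFS as an RDD of raw text lines
      val sales = sc.textFile(
        "hdfs://hadoopmaster.packtpub.com:8020/spark/samples/productsales.csv")
      // Run a simple action to verify the read, for example counting the records
      println(sales.count())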

Working with Amazon S3

S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. As of 2013, Amazon S3 was reported to store more than 2 trillion objects. The core principles of S3 include scalability, high availability, low latency, and low pricing. Notable users of S3 include Netflix, Reddit, Dropbox, Mojang (the creators of Minecraft), Tumblr, and Pinterest.

S3 provides amazing speed when your cluster is inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet. Accessing S3 data is relatively straightforward: you simply pass a path starting with s3n:// to Spark's file input methods.

However, before reading from S3, you need to provide your AWS access key ID and secret access key, either by setting them as configuration parameters (or through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables), or by passing them as part of your path:

  • Configuring the parameters:
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myAccessKeyID")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")
      val data = sc.textFile("s3n://bucket/fileLocation")
  • Passing the access key ID and secret key in the path:
      val data = sc.textFile("s3n://MyAccessKeyID:MySecretKey@svr/fileloc")
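Putting the first approach together, here is a minimal end-to-end sketch; the bucket name, object path, and credential values are placeholders, and sc is an existing SparkContext:

      // Register the AWS credentials on the Hadoop configuration that Spark uses
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myAccessKeyID")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "mySecretAccessKey")

      // Read the object as an RDD of text lines and run a simple action on it
      val sales = sc.textFile("s3n://my-sample-bucket/spark/samples/productsales.csv")
      println(sales.count())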

Having looked at the most common file systems, let's turn our attention to Spark's ability to interact with common databases and structured sources. We have already highlighted Spark's ability to fetch data from CSV and TSV files and load it into DataFrames; however, it is about time we discussed Spark's ability to interact with databases, which will be covered in much more detail in Chapter 4, Spark SQL.
