
Loading data from a local machine to HDFS

In this recipe, we are going to load data from a local machine's disk to HDFS.

Getting ready

To perform this recipe, you should have a Hadoop cluster up and running.

How to do it...

Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS.

  • Using the copyFromLocal command
    • To copy the file to HDFS, let's first create a directory on HDFS and then copy the file into it. Here are the commands to do this:
      hadoop fs -mkdir /mydir1
      hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1
      
  • Using the put command
    • We will first create the directory, and then put the local file into HDFS (put has a couple of extra capabilities; see the note after these commands):
      hadoop fs -mkdir /mydir2
      hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2
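
Note that although copyFromLocal and put look interchangeable here, put is slightly more general: copyFromLocal restricts the source to the local filesystem, whereas put accepts multiple sources in one call and can also read from standard input. A quick sketch of both (the /mydir3 directory, the NOTICE.txt file, and stdin.txt are just for illustration):

hadoop fs -mkdir /mydir3
hadoop fs -put /usr/local/hadoop/LICENSE.txt /usr/local/hadoop/NOTICE.txt /mydir3
echo "hello hdfs" | hadoop fs -put - /mydir3/stdin.txt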
      

You can validate that the files have been copied to the correct folders by listing the files:

hadoop fs -ls /mydir1
hadoop fs -ls /mydir2
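
If the copies succeeded, each listing will show one file. The output looks something like the following (the replication factor, owner, size, and timestamp shown here are only illustrative):

Found 1 items
-rw-r--r--   3 hadoop supergroup      15429 2016-01-01 10:00 /mydir1/LICENSE.txt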

How it works...

When you use the HDFS copyFromLocal or put command, the following things occur:

  1. First of all, the HDFS client (the command prompt, in this case) contacts the NameNode because it needs to copy the file to HDFS.
  2. The client then breaks the file into blocks of the cluster's configured block size. In Hadoop 2.X, the default block size is 128MB.
  3. Based on the capacity and available space on the DataNodes, the NameNode decides where each block should be stored.
  4. The client then starts copying each block to its designated DataNode; the blocks are copied sequentially, one after another.
  5. Each block is sent to the DataNode in small packets (64KB by default). A checksum accompanies each packet; once a packet arrives, it is verified against its checksum to confirm it matches. The packets are then forwarded to the next DataNode, where the block is replicated.
  6. The HDFS client is responsible for copying the data only to the first DataNode; replication is taken care of by the DataNodes themselves. The data block is thus pipelined from one DataNode to the next.
  7. While the block copying and replication take place, the DataNodes report back to the NameNode, which updates the file's metadata.
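
To see how this actually played out for the file we copied earlier, you can ask the NameNode directly. The fsck utility reports a file's blocks, their replication, and the DataNodes that hold them, and getconf prints the configured block size. The exact output depends entirely on your cluster:

hdfs fsck /mydir1/LICENSE.txt -files -blocks -locations
hdfs getconf -confKey dfs.blocksize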