
Setting up a pseudo Hadoop cluster

In the last section, we managed to run Hadoop in standalone mode. In this section, we will create a pseudo Hadoop cluster on a single node. So, let's try and set up the HDFS daemons on a system in pseudo-distributed mode. When we set up HDFS in pseudo-distributed mode, we install the name node and the data node on the same machine, but before we start the HDFS instances, we need to set the configuration files correctly. We will study the different configuration files in the next chapter. First, open core-site.xml with the following command:

hadoop@base0:/$  vim etc/hadoop/core-site.xml

Now, set the default file system URI using the fs.defaultFS property (called fs.default.name in older Hadoop releases). The core-site file is responsible for storing all of the configuration related to Hadoop Core. Replace the content of the file with the following snippet:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Setting the preceding property simplifies all of your command-line work, as you do not need to provide the file system URI every time you use the HDFS command-line interface (CLI). Port 9000 is the port on which the name node listens for RPC requests, including the heartbeats sent by data nodes (in this case, from the same machine). You can provide your machine's IP address instead of localhost if you want to make your file system accessible from outside. The file should look like the following screenshot:
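
For example, once the daemons are up (we will start them shortly), the following two commands are equivalent; the second spells out the URI that the first now picks up from core-site.xml:

hadoop@base0:/$  ./bin/hdfs dfs -ls /
hadoop@base0:/$  ./bin/hdfs dfs -ls hdfs://localhost:9000/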

Similarly, we now need to set up the hdfs-site.xml file with a replication property. Since we are running in pseudo-distributed mode on a single system, we will set the replication factor to 1, as follows:

hadoop@base0:/$  vim etc/hadoop/hdfs-site.xml

Now add the following code snippet to the file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
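
Later, once HDFS is running and holds some data, you can verify that this setting took effect: the stat command's %r format option prints the replication factor of a file (the path here is just an example):

hadoop@base0:/$  ./bin/hdfs dfs -stat %r /user/hadoop/input/<file-name>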

The hdfs-site file is responsible for storing all configuration related to HDFS, including the name node, secondary name node, and data nodes. When setting up HDFS for the first time, it needs to be formatted. This process creates a file system and the supporting storage structures on the name node (primarily the metadata part of HDFS). Type the following command in your Linux shell to format the name node:

hadoop@base0:/$  bin/hdfs namenode -format

You can now start the HDFS processes by running the following command from Hadoop's home directory:

hadoop@base0:/$  ./sbin/start-dfs.sh
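
Once the script returns, a quick way to confirm that the daemons came up is the jps tool that ships with the JDK; for a pseudo-distributed setup, you should see NameNode, DataNode, and SecondaryNameNode in its output:

hadoop@base0:/$  jps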

The logs can be traced at $HADOOP_HOME/logs/. Now, access http://localhost:9870 from your browser, and you should see the DFS health page, as shown in the following screenshot:
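
If you prefer the command line to the browser, the dfsadmin report prints much of the same health information, including the configured capacity and the list of live data nodes:

hadoop@base0:/$  ./bin/hdfs dfsadmin -report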

As you can see, data node-related information can be found at http://localhost:9864. If you try running the same example again on the node, it will not run; this is because the input path now defaults to HDFS, where the system can no longer find the folder, so it throws InvalidInputException. To run the same example, you need to create an input folder on HDFS first and copy the files into it. So, let's create an input folder on HDFS with the following commands:

hadoop@base0:/$  ./bin/hdfs dfs -mkdir /user
hadoop@base0:/$  ./bin/hdfs dfs -mkdir /user/hadoop
hadoop@base0:/$  ./bin/hdfs dfs -mkdir /user/hadoop/input
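
Alternatively, the -p flag creates any missing parent directories in one step, much like the Linux mkdir -p:

hadoop@base0:/$  ./bin/hdfs dfs -mkdir -p /user/hadoop/input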

Now that the folders have been created, you can copy the content of the input folder on the local machine to HDFS with the following command:

hadoop@base0:/$  ./bin/hdfs dfs -copyFromLocal input/* /user/hadoop/input/
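
The -put command is an equivalent alternative here; copyFromLocal is simply restricted to local file sources:

hadoop@base0:/$  ./bin/hdfs dfs -put input/* /user/hadoop/input/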

Run the following command to check the contents of the input folder:

hadoop@base0:/$  ./bin/hdfs dfs -ls input/

Now run your program with the input folder name and an output folder name; you should be able to see the outcome on HDFS inside /user/hadoop/<output-folder>. You can view the result with the following cat command:

hadoop@base0:/$  ./bin/hdfs dfs -cat <output folder path>/part-r-00000
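
If the job produced several part files, or you simply want the result on the local disk, -getmerge concatenates everything in the output folder into a single local file (output.txt is just an example name):

hadoop@base0:/$  ./bin/hdfs dfs -getmerge <output folder path> output.txt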

Note that the output of your MapReduce program can also be browsed through the name node web UI, as shown in the following screenshot:

Congratulations! You have successfully set up your pseudo-distributed Hadoop installation. We will look at setting up YARN, both for clusters and in pseudo-distributed mode, in Chapter 5, Building Rich YARN Applications. Before we jump into the Hadoop cluster setup, let's first look at planning and sizing a Hadoop cluster.
