First of all, you need to enable passwordless SSH access in both directions between all master nodes (the name node and secondary name node) and the slaves, as described in the previous sections. Similarly, you will need a Java environment on all of the nodes, as most of Hadoop is written in Java.
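As a quick check, if both prerequisites are in place, the following command should print the Java version from a slave node without prompting for a password (base1 is simply one of the example hosts used in this chapter):

hadoop@base0:/$ ssh hadoop@base1 java -version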
When you add nodes to your cluster, you need to copy your configuration and the Hadoop folder to them. The same applies to all components of Hadoop, including HDFS, YARN, MapReduce, and so on.
It is a good idea to have a shared network drive that is accessible from all hosts, as this makes file sharing easier. Alternatively, you can write a simple shell script that copies files to all nodes using SCP. First, create a file (targets.txt) with one host (user@system) per line, as follows:
hadoop@base0
hadoop@base1
hadoop@base2
…..
Now create the following script in a text file and save it as .sh (for example, scpall.sh):
#!/bin/sh
# scpall.sh: copy a file to the same location on every host listed in targets.txt
for dest in $(cat targets.txt); do
  scp "$1" "${dest}:$2"
done
You can call the preceding script with the source file name as the first parameter and the target directory location as the second parameter, as follows (the file and directory names here are only illustrative):
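hadoop@base0:/$ ./scpall.sh myfile.txt /home/hadoop/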
When identifying slave or master nodes, you can use either the IP address or the host name. Host names are more readable, but bear in mind that they require DNS entries in order to resolve to an IP address. If you are not able to add DNS entries (these are usually controlled by an organization's IT team), you can simply add the entries to the /etc/hosts file on each node using a root login. The following screenshot illustrates how this file can be updated; the same file can then be pushed to all hosts through the SCP utility or the shared folder:
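The entries take the following form; the IP addresses shown here are placeholders for the addresses on your own network:

192.168.1.10    base0
192.168.1.11    base1
192.168.1.12    base2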
Now download the Hadoop distribution as discussed earlier. If you are working with multiple slave nodes, you can configure the folder for one slave and then simply copy it to the other slaves using the scpall utility, as the slave configuration is usually identical. When we refer to slaves, we mean the nodes that do not run any master processes, such as the name node, secondary name node, or YARN services.
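For example, you can push the downloaded archive to all hosts with the scpall script before unpacking it on each node; the archive name and target directory below are illustrative and should match the version you downloaded and your own layout:

hadoop@base0:/$ ./scpall.sh hadoop-3.2.1.tar.gz /home/hadoop/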
Let's now proceed with the configuration of important files.
First, edit etc/hadoop/core-site.xml. By default, it contains no properties other than an empty <configuration> tag, so add the following entries to it.
For core-site.xml, input:
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<master-host>:9000</value>
  </property>
</configuration>
Here, <master-host> is the host name of the node where your name node is configured. This configuration must be present on all of the data nodes in Hadoop. Also remember to set the HDFS replication factor as planned by adding its entry to etc/hadoop/hdfs-site.xml.
For hdfs-site.xml, input:
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
The preceding snippet covers the configuration needed to run the HDFS. We will look at important, specific aspects of these configuration files in Chapter 3, Deep Dive into the Hadoop Distributed File System.
Another important configuration file is etc/hadoop/workers, which lists all of the data nodes. Add the data nodes' host names to it and save it, as follows:
base0
base1
base2
..
In this case, we are using base* names for all Hadoop nodes. This configuration has to be applied to all of the nodes participating in the cluster; you may use the scpall.sh script to propagate the changes, as shown in the following example. Once this is done, the configuration is complete.
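The command below assumes that Hadoop is installed under /home/hadoop/hadoop on every node; adjust the path to match your own installation:

hadoop@base0:/$ ./scpall.sh etc/hadoop/workers /home/hadoop/hadoop/etc/hadoop/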
Let's start by formatting the name node first, as follows:
hadoop@base0:/$ bin/hdfs namenode -format
Once formatted, you can start HDFS by running the following command from the Hadoop installation directory:
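The Hadoop distribution ships with a helper script for this, sbin/start-dfs.sh, which starts the name node, the secondary name node, and the data nodes listed in the workers file:

hadoop@base0:/$ sbin/start-dfs.sh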
Open the name node's web interface in a browser (for Hadoop 3.x it is served on port 9870 of the name node host by default); you should see an overview similar to that in the following screenshot. If you go to the Datanodes tab, you should see all DataNodes in the active state:
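If you prefer to verify this from the command line, the dfsadmin report also lists the live data nodes along with their capacity:

hadoop@base0:/$ bin/hdfs dfsadmin -report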