官术网_书友最值得收藏!

Installing a single-node cluster - HDFS components

Usually the term cluster means a group of machines, but in this recipe, we will be installing various Hadoop daemons on a single node. The single machine will act as both the master and slave for the storage and processing layer.

Getting ready

You will need some information before stepping through this recipe.

Although Hadoop can be configured to run as root user, it is a good practice to run it as a non-privileged user. In this recipe, we are using the node name nn1.cluster1.com, preinstalled with CentOS 6.5.

Tip

Create a system user named hadoop and set a password for that user.

Install JDK, which will be used by Hadoop services. The minimum recommended version of JDK is 1.7, but Open JDK can also be used.

How to do it...

  1. Log into the machine/host as root user and install jdk:
    # yum install jdk –y
    or it can also be installed using the command as below
    # rpm –ivh jdk-1.7u45.rpm
    
  2. Once Java is installed, make sure Java is in PATH for execution. This can be done by setting JAVA_HOME and exporting it as an environment variable. The following screenshot shows the content of the directory where Java gets installed:
    # export JAVA_HOME=/usr/java/latest
    
    How to do it...
  3. Now we need to copy the tarball hadoop-2.7.3.tar.gz--which was built in the Build Hadoop section earlier in this chapter—to the home directory of the user root. For this, the user needs to login to the node where Hadoop was built and execute the following command:
    # scp –r hadoop-2.7.3.tar.gz root@nn1.cluster1.com:~/
    
  4. Create a directory named/opt/cluster to be used for Hadoop:
    # mkdir –p /opt/cluster
    
  5. Then untar the hadoop-2.7.3.tar.gz to the preceding created directory:
    # tar –xzvf hadoop-2.7.3.tar.gz -C /opt/Cluster/
    
  6. Create a user named hadoop, if you haven't already, and set the password as hadoop:
    # useradd hadoop
    # echo hadoop | passwd --stdin hadoop
    
  7. As step 6 was done by the root user, the directory and file under /opt/cluster will be owned by the root user. Change the ownership to the Hadoop user:
    # chown -R hadoop:hadoop /opt/cluster/
    
  8. If the user lists the directory structure under /opt/cluster, he will see it as follows:
    How to do it...
  9. The directory structure under /opt/cluster/hadoop-2.7.3 will look like the one shown in the following screenshot:
    How to do it...
  10. The listing shows etc, bin, sbin, and other directories.
  11. The etc/hadoop directory is the one that contains the configuration files for configuring various Hadoop daemons. Some of the key files are core-site.xml, hdfs-site.xml, hadoop-env.xml, and mapred-site.xml among others, which will be explained in the later sections:
    How to do it...
  12. The directories bin and sbin contain executable binaries, which are used to start and stop Hadoop daemons and perform other operations such as filesystem listing, copying, deleting, and so on:
    How to do it...
    How to do it...
  13. To execute a command /opt/cluster/hadoop-2.7.3/bin/hadoop, a complete path to the command needs to be specified. This could be cumbersome, and can be avoided by setting the environment variable HADOOP_HOME.
  14. Similarly, there are other variables that need to be set that point to the binaries and the configuration file locations:
    How to do it...
  15. The environment file is set up system-wide so that any user can use the commands. Once the hadoopenv.sh file is in place, execute the command to export the variables defined in it:
    How to do it...
  16. Change to the Hadoop user using the command su – hadoop:
    How to do it...
  17. Change to the /opt/cluster directory and create a symlink:
    How to do it...
  18. To verify that the preceding changes are in place, the user can execute either the which Hadoop or which java commands, or the user can execute the command hadoop directly without specifying the complete path.
  19. In addition to setting the environment as discussed, the user has to add the JAVA_HOME variable in the hadoop-env.sh file.
  20. The next thing is to set up the Namenode address, which specifies the host:port address on which it will listen. This is done using the file core-site.xml:
    How to do it...
  21. The important thing to keep in mind is the property fs.defaultFS.
  22. The next thing that the user needs to configure is the location where Namenode will store its metadata. This can be any location, but it is recommended that you always have a dedicated disk for it. This is configured in the file hdfs-site.xml:
    How to do it...
  23. The next step is to format the Namenode. This will create an HDFS file system:
    $ hdfs namenode -format
    
  24. Similarly, we have to add the rule for the Datanode directory under hdfs-site.xml. Nothing needs to be done to the core-site.xml file:
    How to do it...
  25. Then the services need to be started for Namenode and Datanode:
    $ hadoop-daemon.sh start namenode
    $ hadoop-daemon.sh start datanode
    
  26. The command jps can be used to check for running daemons:
    How to do it...

How it works...

The master Namenode stores metadata and the slave node Datanode stores the blocks. When the Namenode is formatted, it creates a data structure that contains fsimage, edits, and VERSION. These are very important for the functioning of the cluster.

The parameters dfs.data.dir and dfs.datanode.data.dir are used for the same purpose, but are used across different versions. The older parameters are deprecated in favor of the newer ones, but they will still work. The parameter dfs.name.dir has been deprecated in favor of dfs.namenode.name.dir in Hadoop 2.x. The intention of showing both versions of the parameter is to bring to the user's notice that parameters are evolving and ever changing, and care must be taken by referring to the release notes for each Hadoop version.

There's more...

Setting up ResourceManager and NodeManager

In the preceding recipe, we set up the storage layer—that is, the HDFS for storing data—but what about the processing layer?. The data on HDFS needs to be processed to make a meaningful decision using MapReduce, Tez, Spark, or any other tool. To run the MapReduce, Spark or other processing framework we need to have ResourceManager, NodeManager.

主站蜘蛛池模板: 鹤庆县| 孝昌县| 巴林左旗| 双辽市| 南阳市| 乌拉特后旗| 田阳县| 平和县| 揭东县| 隆德县| 资中县| 长治市| 乌海市| 绥滨县| 平原县| 临澧县| 兴化市| 垫江县| 固阳县| 乐平市| 常德市| 绥棱县| 玉门市| 梅河口市| 巴彦淖尔市| 栖霞市| 噶尔县| 麦盖提县| 扶沟县| 白水县| 富阳市| 长岭县| 汉源县| 石家庄市| 清河县| 宁阳县| 南投县| 阜南县| 隆昌县| 龙山县| 中方县|