Setting up HDFS
HDFS is a block-structured distributed filesystem that is designed to store petabytes of data reliably on clusters of commodity hardware. HDFS supports storing massive amounts of data and provides high-throughput access to that data. HDFS stores file data across multiple nodes with redundancy to ensure fault tolerance and high aggregate bandwidth.
HDFS is the default distributed filesystem used by Hadoop MapReduce computations. Hadoop supports data-locality-aware processing of data stored in HDFS. The HDFS architecture consists mainly of a centralized NameNode, which handles the filesystem metadata, and DataNodes, which store the actual data blocks. HDFS data blocks are typically coarser grained and perform better with large streaming reads.
To set up HDFS, we first need to configure a NameNode and DataNodes, and then specify the DataNodes in the slaves file. When we start the NameNode, the startup script will start the DataNodes.
Tip
Installing HDFS directly using Hadoop release artifacts, as described in this recipe, is recommended only for development, testing, and advanced use cases. For regular production clusters, we recommend using a packaged Hadoop distribution, as described in the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe. Packaged Hadoop distributions make it much easier to install, configure, maintain, and update the components of the Hadoop ecosystem.
Getting ready
You can follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.
- Install JDK 1.6 or above (Oracle JDK 1.7 is preferred) on all machines that will be used to set up the HDFS cluster. Set the JAVA_HOME environment variable to point to the Java installation, as shown in the example after this list.
- Download Hadoop by following the Setting up Hadoop v2 on your local machine recipe.
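For example, on a Linux machine you could add something like the following to your shell profile; the installation path shown here is only an illustration and should point to wherever your JDK is actually installed:
$ export JAVA_HOME=/usr/lib/jvm/java-7-oracle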
How to do it...
Now let's set up HDFS in the distributed mode:
- Set up password-less SSH from the master node, which will be running the NameNode, to the DataNodes. Check that you can log in to localhost and to all other nodes using SSH without a passphrase by running one of the following commands:
$ ssh localhost
$ ssh <IPaddress>
Tip
Configuring password-less SSH
If the command in step 1 returns an error or asks for a password, create SSH keys by executing the following command (you may have to manually enable SSH beforehand depending on your OS):
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Move the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Then add the SSH keys to the ~/.ssh/authorized_keys file in each node by running the following command (if the authorized_keys file does not exist, run the following command. Otherwise, skip to the cat command):
$ touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
Now with the permissions set, add your key to the ~/.ssh/authorized_keys file:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Then you should be able to execute the following command successfully, without providing a password:
$ ssh localhost
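On systems that provide it, the ssh-copy-id utility is a convenient alternative to copying the public key and editing authorized_keys by hand; the username and address below are placeholders for your own DataNode accounts:
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@10.10.0.11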
- In each server, create a directory for storing HDFS data. Let's call that directory {HADOOP_DATA_DIR}. Create two subdirectories inside the data directory as {HADOOP_DATA_DIR}/data and {HADOOP_DATA_DIR}/name. Change the directory permissions to 755 by running the following command for each directory:
$ chmod -R 755 <HADOOP_DATA_DIR>
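For example, assuming /hadoop is chosen as the data directory (an arbitrary path used only for illustration), the directories can be created and secured as follows:
$ export HADOOP_DATA_DIR=/hadoop
$ mkdir -p ${HADOOP_DATA_DIR}/data ${HADOOP_DATA_DIR}/name
$ chmod -R 755 ${HADOOP_DATA_DIR}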
- In the NameNode, add the IP addresses of all the slave nodes, each on a separate line, to the {HADOOP_HOME}/etc/hadoop/slaves file. When we start the NameNode, it will use this slaves file to start the DataNodes.
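For example, a slaves file for a three-DataNode cluster would simply contain one address per line; the IP addresses below are placeholders for your own nodes:
10.10.0.11
10.10.0.12
10.10.0.13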
- Add the following configurations to {HADOOP_HOME}/etc/hadoop/core-site.xml. Before adding the configurations, replace the {NAMENODE} strings with the IP of the master node:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://{NAMENODE}:9000/</value>
  </property>
</configuration>
- Add the following configurations to the {HADOOP_HOME}/etc/hadoop/hdfs-site.xml file. Before adding the configurations, replace {HADOOP_DATA_DIR} with the directory you created in step 2. Replicate the core-site.xml and hdfs-site.xml files we modified in steps 4 and 5 to all the nodes.
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- Path to store namespace and transaction logs -->
    <value>{HADOOP_DATA_DIR}/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <!-- Path to store data blocks in datanode -->
    <value>{HADOOP_DATA_DIR}/data</value>
  </property>
</configuration>
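Optionally, you can also control how many copies HDFS keeps of each block by adding the dfs.replication property to hdfs-site.xml. This property is not part of the configuration above; the default value of 3 is usually appropriate for real clusters, but a single-node setup may want to lower it:
<property>
  <name>dfs.replication</name>
  <!-- Number of block replicas; 1 is sufficient for a single-node test setup -->
  <value>1</value>
</property>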
- From the NameNode, run the following command to format a new filesystem:
$ $HADOOP_HOME/bin/hdfs namenode -format
You will see the following line in the output after the successful completion of the previous command:
… 13/04/09 08:44:51 INFO common.Storage: Storage directory /…/dfs/name has been successfully formatted. ….
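If you are formatting a NameNode on top of an earlier HDFS installation (a situation not covered by this recipe), also clear the old DataNode storage directories before restarting; otherwise the DataNodes may fail to register with the freshly formatted NameNode because of a cluster ID mismatch:
$ rm -rf ${HADOOP_DATA_DIR}/data/*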
- Start the HDFS using the following command:
$ $HADOOP_HOME/sbin/start-dfs.sh
This command will first start a NameNode on the master node. Then it will start the DataNode services on the machines mentioned in the slaves file. Finally, it will start the secondary NameNode.
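As a quick sanity check (an optional step, not part of the original recipe), you can run the jps utility that ships with the JDK on each node to list the running Java daemons; after a successful start, the master should show a NameNode (and typically a SecondaryNameNode) and each slave should show a DataNode:
$ jps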
- HDFS comes with a monitoring web console to verify the installation and to monitor the HDFS cluster. It also lets users explore the contents of the HDFS filesystem. The HDFS monitoring console can be accessed from http://{NAMENODE}:50070/. Visit the monitoring console and verify whether you can see the HDFS startup page. Here, replace {NAMENODE} with the IP address of the node running the HDFS NameNode.
- Alternatively, you can use the following command to get a report about the HDFS status:
$ $HADOOP_HOME/bin/hadoop dfsadmin -report
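In Hadoop v2, the hadoop dfsadmin form is kept for backward compatibility and typically prints a deprecation warning; the same report is available through the hdfs script:
$ $HADOOP_HOME/bin/hdfs dfsadmin -report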
- Finally, shut down the HDFS cluster using the following command:
$ $HADOOP_HOME/sbin/stop-dfs.sh
See also
- In the HDFS command-line file operations recipe, we will explore how to use HDFS to store and manage files.
- The HDFS setup is only a part of the Hadoop installation. The Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2 recipe describes how to set up the rest of Hadoop.
- The Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe explores how to use a packaged Hadoop distribution to install the Hadoop ecosystem in your cluster.