Setting up Hadoop for Elasticsearch
For our exploration of Hadoop and Elasticsearch, we will use an Ubuntu-based host. However, you may opt for any other Linux OS on which to set up Hadoop and Elasticsearch.
If you already have Hadoop set up on your local machine, you may jump directly to the Setting up Elasticsearch section.
Hadoop supports three cluster modes: the stand-alone mode, the pseudo-distributed mode, and the fully-distributed mode. To walk through the examples in this book, we will use the pseudo-distributed mode on a Linux operating system. This mode mirrors the components of a real production environment without the complexity of setting up many nodes. In the pseudo-distributed mode, each component runs in its own JVM process.
Setting up Java
The examples in this book are developed and tested against Oracle Java 1.8. These examples should run fine with other distributions of Java 8 as well.
In order to set up Oracle Java 8, open the terminal and execute the following steps:
- First, add and update the repository for Java 8 with the following command:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
- Next, install Java 8 and configure the environment variables, as shown in the following command:
$ sudo apt-get install oracle-java8-set-default
- Now, verify the installation as follows:
$ java -version
This should show an output similar to the following code; it may vary a bit based on the exact version:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
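Later, the JAVA_HOME variable will need to point to this installation. On Ubuntu, you can usually discover the installation directory by resolving the javac symlink; the path shown below is the typical one for this package and may differ on your system:
$ readlink -f $(which javac)
/usr/lib/jvm/java-8-oracle/bin/javac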
Setting up a dedicated user
To keep our ES-Hadoop environment clean and isolated from the rest of the applications, and to make security and permissions easy to manage, we will set up a dedicated user. Perform the following steps:
- First, add the hadoop group with the following command:
$ sudo addgroup hadoop
- Then, create the eshadoop user in the hadoop group with the following command:
$ sudo adduser --ingroup hadoop eshadoop
- Finally, add the eshadoop user to the sudoers list by adding the user to the sudo group as follows:
$ sudo adduser eshadoop sudo
Now, you need to log back in as the eshadoop user to execute the further steps.
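If you prefer not to log out of your current session, switching users with su works just as well:
$ su - eshadoop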
Installing SSH and setting up the certificate
In order to manage its nodes, Hadoop requires SSH access, so let's install and run SSH. Perform the following steps:
- First, install ssh with the following command:
$ sudo apt-get install ssh
- Then, generate a new SSH key pair using the ssh-keygen utility with the following command:
$ ssh-keygen -t rsa -P '' -C email@example.com
- Now, confirm the key generation by issuing the following command. It should list at least two files, id_rsa and id_rsa.pub. We just created an RSA key pair with an empty passphrase so that Hadoop can interact with its nodes without prompting for one:
$ ls -l ~/.ssh
- To enable the SSH access to your local machine, you need to specify that the newly generated public key is an authorized key to log in using the following command:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Finally, do not forget to test the password-less ssh login using the following command:
$ ssh localhost
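If you are still prompted for a password, the most common cause is overly permissive file modes; OpenSSH ignores authorized_keys unless the key directory and file are locked down. The following optional troubleshooting commands, not part of the original walkthrough, set the usual permissions:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys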
Downloading Hadoop
Download Hadoop and extract it to /usr/local so that it is available to other users as well. Perform the following steps:
- First, download the Hadoop tarball by running the following command:
$ wget http://ftp.wayne.edu/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
- Next, extract the tarball to the /usr/local directory with the following command:
$ sudo tar vxzf hadoop-2.6.0.tar.gz -C /usr/local
- Now, rename the Hadoop directory using the following command:
$ cd /usr/local
$ sudo mv hadoop-2.6.0 hadoop
- Finally, change the owner of all the files to the eshadoop user and the hadoop group with the following command:
$ sudo chown -R eshadoop:hadoop hadoop
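You can verify the ownership change with ls; the output shown here is indicative, and details such as the link count and date will differ on your system:
$ ls -ld /usr/local/hadoop
drwxr-xr-x 9 eshadoop hadoop 4096 ... /usr/local/hadoop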
Setting up environment variables
The next step is to set up environment variables. You can do so by exporting the required variables to the .bashrc file for the user.
Open the .bashrc file using any editor of your choice, then add the following export declarations to set up our environment variables:
#Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

#Set Hadoop-related environment variables
export HADOOP_INSTALL=/usr/local/hadoop

#Add the bin and sbin directories to PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

#Set a few more Hadoop-related environment variables
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Once you have saved the .bashrc file, you can log in again to make your new environment variables visible, or you can source the .bashrc file using the following command:
$ source ~/.bashrc
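To confirm that the variables took effect, a quick check like the following should print the installation path and resolve the hadoop binary; the output reflects the paths we configured above:
$ echo $HADOOP_INSTALL
/usr/local/hadoop
$ which hadoop
/usr/local/hadoop/bin/hadoop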
Configuring Hadoop
Now, we need to set up the JAVA_HOME environment variable in the hadoop-env.sh file that is used by Hadoop. You can find it in $HADOOP_INSTALL/etc/hadoop.
Next, change the JAVA_HOME path to reflect your Java installation directory. On my machine, it looks similar to the following:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
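If you prefer to script this edit, a sed substitution such as the following will do the same thing; this is a convenience sketch that assumes the stock hadoop-env.sh still contains its default export JAVA_HOME line:
$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh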
Now, let's log back in and confirm the configuration using the following command:
$ hadoop version
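If everything is in place, the first line of the output should report the version, similar to the following; the remaining lines show build details that vary between distributions:
Hadoop 2.6.0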
As you know, we will set up our Hadoop environment in a pseudo-distributed mode. In this mode, each Hadoop daemon runs in a separate Java process. The next step is to configure these daemons. So, let's switch to the following folder that contains all the Hadoop configuration files:
$ cd $HADOOP_INSTALL/etc/hadoop
The configuration of core-site.xml sets up the temporary directory for Hadoop and the default filesystem. In our case, the default filesystem refers to the NameNode. Let's change the content of the <configuration> section of core-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/eshadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
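Note that hadoop.tmp.dir points to a directory under the eshadoop home; Hadoop will normally create it on first use, but you can also pre-create it explicitly to rule out permission problems:
$ mkdir -p /home/eshadoop/hdfs/tmp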
Now, we will configure the replication factor for HDFS files. To set the replication to 1, change the content of the <configuration> section of hdfs-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note
We will run Hadoop in the pseudo-distributed mode. In order to do this, we need to configure the YARN resource manager. YARN handles the resource management and scheduling responsibilities in the Hadoop cluster so that the data processing and data storage components can focus on their respective tasks.
Configure yarn-site.xml to set up the auxiliary service name and class, as shown in the following code:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Hadoop provides mapred-site.xml.template, which you can copy to mapred-site.xml, changing the content of its <configuration> section to the following code; this ensures that MapReduce jobs run on YARN rather than in-process locally:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
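To create mapred-site.xml from the template, a copy command such as the following will do, assuming your shell is still in $HADOOP_INSTALL/etc/hadoop:
$ cp mapred-site.xml.template mapred-site.xml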
We have now configured all the Hadoop daemons for HDFS, YARN, and MapReduce. You may already be aware that HDFS relies on a NameNode and DataNodes. The NameNode contains the storage-related metadata, whereas the DataNodes store the real data in the form of blocks. When you set up your Hadoop cluster, you must format the NameNode before you can start using HDFS. We can do so with the following command:
$ hadoop namenode -format
Note
If you are already using HDFS and have data on its DataNodes, do not format the NameNode unless you know what you are doing. When you format the NameNode, you lose all the storage metadata, such as how the blocks are distributed among the DataNodes. This means that although you didn't physically remove the data from the DataNodes, that data becomes inaccessible to you. Therefore, it is always good to also remove the data on the DataNodes when you format the NameNode.
Starting Hadoop daemons
Now we have all the prerequisites set up and all the Hadoop daemons configured. In order to run our first MapReduce job, we need all the required Hadoop daemons running.
Let's start with HDFS using the following command. This command starts the NameNode, SecondaryNameNode, and DataNode daemons:
$ start-dfs.sh
The next step is to start the YARN resource manager using the following command (YARN will start the ResourceManager and NodeManager daemons):
$ start-yarn.sh
If the preceding two commands were successful in starting HDFS and YARN, you should be able to check the running daemons using the jps tool (this tool lists the running JVM processes on your machine):
$ jps
If everything worked successfully, you should see the following services running:
13386 SecondaryNameNode
13059 NameNode
13179 DataNode
17490 Jps
13649 NodeManager
13528 ResourceManager
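As an extra sanity check, not part of the original walkthrough, the daemons also expose web UIs on the default Hadoop 2.x ports; assuming you kept the default configuration, you can browse to them with:
$ # NameNode web UI
$ xdg-open http://localhost:50070
$ # ResourceManager web UI
$ xdg-open http://localhost:8088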