- Big Data Analytics
- Venkat Ankam
Installing Hadoop plus Spark clusters
Before installing Hadoop and Spark, let's understand the versions of Hadoop and Spark. Spark is offered as a service in all three popular Hadoop distributions: Cloudera, Hortonworks, and MapR. As of this writing, the current Hadoop and Spark versions are 2.7.2 and 2.0 respectively. However, Hadoop distributions might ship a lower version of Spark, as the Hadoop and Spark release cycles do not coincide.
For the upcoming chapters' practical exercises, let's use one of the free virtual machines (VMs) from Cloudera, Hortonworks, or MapR, or use an open source version of Apache Spark. These VMs make it easy to get started with Spark and Hadoop. The same exercises can be run on bigger clusters as well.
The prerequisites to use virtual machines on your laptop are as follows:
- RAM of 8 GB and above
- At least two virtual CPUs
- The latest VMWare Player or Oracle VirtualBox must be installed for Windows or Linux OS
- The latest Oracle VirtualBox or VMWare Fusion for Mac
- Virtualization is enabled in BIOS
- Chrome 25+, IE 9+, Safari 6+, or Firefox 18+ is recommended (HDP Sandbox will not run on IE 10)
- PuTTY
- WinSCP
The instructions to download and run Cloudera Distribution for Hadoop (CDH) are as follows:
- Download the latest quickstart CDH VM from http://www.cloudera.com/content/www/en-us/downloads.html. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
- Extract it to a directory (use 7-Zip or WinZip).
- In case of VMWare Player, click on Open a Virtual Machine, and point to the directory where you have extracted the VM. Select the `cloudera-quickstart-vm-5.x.x-x-vmware.vmx` file and click on Open.
- Click on Edit virtual machine settings and then increase the memory to 7 GB (if your laptop has 8 GB RAM) or 8 GB (if your laptop has more than 8 GB RAM). Increase the number of processors to four. Click on OK.
- Click on Play virtual machine.
- Select I copied it and click on OK.
- This should get your VM up and running.
- Cloudera Manager is installed on the VM but is turned off by default. If you would like to use Cloudera Manager, double-click and run Launch Cloudera Manager Express to set up Cloudera Manager. This helps with starting, stopping, and restarting services on the cluster.
- Credentials for the VM are username (`cloudera`) and password (`cloudera`).
If you would like to use the Cloudera Quickstart Docker image, follow the instructions on http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera.
The instructions to download and run Hortonworks Data Platform (HDP) Sandbox are as follows:
- Download the latest HDP Sandbox from http://hortonworks.com/products/hortonworks-sandbox/#install. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
- Follow the instructions from install guides on the same downloads page.
- Open the browser and enter the address shown in the sandbox console, for example, `http://192.168.139.158/`. Click on View Advanced Options to see all the links.
- Access the sandbox with PuTTY as the root user and `hadoop` as the initial password. You need to change the password on the first login. Also, run the `ambari-admin-password-reset` command to reset the Ambari admin password.
- To start using Ambari, open the browser and enter `ipaddressofsandbox:8080` with the admin credentials created in the preceding step. Start the services needed in Ambari.
- To map the hostname to the IP address in Windows, edit `C:\Windows\System32\drivers\etc\hosts` and enter the IP address and hostname separated by a space. You need admin rights to do this.
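The hosts-file format referred to above is simply `<IP address> <hostname>`, one entry per line. As a sketch, the following writes that kind of entry to a temporary file; the IP address and hostname are examples, and on a real Windows machine you would append the same line to `C:\Windows\System32\drivers\etc\hosts` with admin rights:

```shell
# Stand-in for the real hosts file, which needs admin rights to edit.
HOSTS_FILE=$(mktemp)

# Format: <IP address><space><hostname>. Example values only --
# use the IP address your sandbox actually reports.
echo "192.168.139.158 sandbox.hortonworks.com" >> "$HOSTS_FILE"

# Confirm the entry is present
grep "sandbox.hortonworks.com" "$HOSTS_FILE"
```

With the mapping in place, the sandbox UIs can be reached by hostname instead of by raw IP address.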
The instructions to download and run MapR Sandbox are as follows:
- Download the latest sandbox from https://www.mapr.com/products/mapr-sandbox-hadoop/download. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
- Follow the instructions to set up Sandbox at http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop.
- Use PuTTY to log in to the sandbox.
- The root password is `mapr`.
- To launch Hue or the MapR Control System (MCS), navigate to the URL provided by the MapR Sandbox.
- To map the hostname to the IP address in Windows, edit `C:\Windows\System32\drivers\etc\hosts` and enter the IP address and hostname separated by a space.
The instructions to download and run Apache Spark prebuilt binaries, in case you have a preinstalled Hadoop cluster, are given here. The following instructions can also be used to install the latest version of Spark and use it on the preceding VMs:
- Download Spark prebuilt for Hadoop from the following location:

```
wget http://apache.mirrors.tds.net/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
tar xzvf spark-2.0.0-bin-hadoop2.7.tgz
cd spark-2.0.0-bin-hadoop2.7
```
- Add `SPARK_HOME` and `PATH` variables to the profile script, as shown in the following commands, so that these environment variables are set every time you log in:

```
[cloudera@quickstart ~]$ cat /etc/profile.d/spark2.sh
export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin
```
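As a quick sanity check, the same profile snippet can be recreated and sourced in any shell session. The temporary file below stands in for `/etc/profile.d/spark2.sh` (which requires root to write), and the install path is the one assumed throughout this chapter:

```shell
# Recreate the profile script in a temp location (stand-in for
# /etc/profile.d/spark2.sh).
PROFILE=$(mktemp)
cat > "$PROFILE" <<'EOF'
export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin
EOF

# Source it and confirm the variables are set for this session
. "$PROFILE"
echo "$SPARK_HOME"
```

On the real VM, logging out and back in has the same effect, because the login shell sources every script under `/etc/profile.d/`.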
- Let Spark know about the Hadoop configuration directory and Java home by adding the following environment variables to `spark-env.sh`. Copy the template files in the `conf` directory:

```
cp conf/spark-env.sh.template conf/spark-env.sh
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
vi conf/spark-env.sh
export HADOOP_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
```
- Copy `hive-site.xml` to the `conf` directory of Spark:

```
cp /etc/hive/conf/hive-site.xml conf/
```
- Change the log level to `ERROR` in the `spark-2.0.0-bin-hadoop2.7/conf/log4j.properties` file after copying the template file.
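This edit can also be done non-interactively with `sed`. The sketch below assumes the root logger line found in the stock Spark 2.0 template (`log4j.rootCategory=INFO, console`) and uses a mock file in place of the real `log4j.properties`:

```shell
# Mock file standing in for spark-2.0.0-bin-hadoop2.7/conf/log4j.properties
# (created from log4j.properties.template on the real install).
LOG4J=$(mktemp)
echo "log4j.rootCategory=INFO, console" > "$LOG4J"

# Same edit you would make in the real conf file: switch the root
# logger from INFO to ERROR so the shells start quietly.
sed -i 's/^log4j.rootCategory=INFO/log4j.rootCategory=ERROR/' "$LOG4J"
cat "$LOG4J"    # log4j.rootCategory=ERROR, console
```

With the level at `ERROR`, `spark-shell` and `pyspark` suppress the verbose INFO startup logging, which makes interactive sessions much easier to read.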
Tip
Programming language version requirements to run Spark:
- Java: 7+
- Python: 2.6+/3.1+
- R: 3.1+
- Scala: 2.10 for Spark 1.6 and below; 2.11 for Spark 2.0 and above
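If you want to script these minimum-version checks, a small helper based on GNU `sort -V` (version sort) can compare dotted version strings. `version_ge` below is a hypothetical convenience, not part of Spark; you would feed it the versions reported by `java -version`, `python --version`, and so on:

```shell
# version_ge A B: succeeds if version A >= version B.
# Hypothetical helper; relies on sort -V to order dotted versions.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Example checks against the requirements listed above:
version_ge "1.7.0" "1.7" && echo "Java OK"      # Java 7+
version_ge "2.6.6" "2.6" && echo "Python OK"    # Python 2.6+
version_ge "3.1.0" "3.1" && echo "R OK"         # R 3.1+
```

Running a Spark release against an unsupported runtime typically fails at startup, so it is worth checking versions before installing.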
Note that the preceding virtual machines are single node clusters. If you are planning to set up multi-node clusters, follow the guidelines as per the distribution, such as CDH, HDP, or MapR. If you are planning to use a standalone cluster manager, the setup is described in the following chapter.