
Starting Spark daemons

If you are planning to use the standalone cluster manager, you need to start the Spark master and worker daemons, which are the core components of Spark's architecture. Starting and stopping the daemons varies slightly from distribution to distribution. Hadoop distributions such as Cloudera, Hortonworks, and MapR provide Spark as a service, with YARN as the default resource manager. This means that all Spark applications run on the YARN framework by default. To use Spark's standalone resource manager, however, we need to start the Spark master and worker roles. If you are planning to use the YARN resource manager, you don't need to start these daemons. Follow the procedure below for the distribution you are using. Downloading and installation instructions for all of these distributions can be found in Chapter 2, Getting Started with Apache Hadoop and Apache Spark.
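
Which daemons you need depends only on the master URL you submit your application with. The following is a minimal sketch, assuming a pre-built Spark 2.0 package and the bundled SparkPi example (the exact jar name depends on the build you downloaded); masterhostname is a placeholder for your standalone master:

    # Run the SparkPi example on YARN -- no standalone daemons are needed
    ./bin/spark-submit --master yarn --deploy-mode client \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_2.11-2.0.0.jar 10

    # Run the same example on the standalone resource manager -- the master
    # and worker daemons started below must be running
    ./bin/spark-submit --master spark://masterhostname:7077 \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_2.11-2.0.0.jar 10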

Working with CDH

Cloudera Distribution for Hadoop (CDH) is an open source distribution including Hadoop, Spark, and many other projects needed for Big Data Analytics. Cloudera Manager is used for installing and managing the CDH platform. If you are planning to use the YARN resource manager, start the Spark service in Cloudera Manager. To start Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Spark on the CDH platform is configured to work with YARN. Moreover, Spark 2.0 is not available on CDH yet, so download the latest pre-built Spark 2.0 package for Hadoop as explained in Chapter 2, Getting Started with Apache Hadoop and Apache Spark. If you would like to use the Spark 1.6 version, run the /usr/lib/spark/start-all.sh command.
  2. Start the service with the following commands:
    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7/sbin
    sudo ./start-all.sh
    
  3. Check the Spark UI at http://quickstart.cloudera:8080/.
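
To verify that the standalone master is accepting applications, you can point a Spark shell at it. This is a minimal smoke test, assuming the default master port of 7077:

    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7
    ./bin/spark-shell --master spark://quickstart.cloudera:7077

The shell should show up under Running Applications in the master UI at http://quickstart.cloudera:8080/.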

Working with HDP, MapR, and Spark pre-built packages

Hortonworks Data Platform (HDP) and MapR Converged Data Platform distributions also include Hadoop, Spark, and many other projects needed for Big Data Analytics. While HDP uses Apache Ambari for deploying and managing the cluster, MapR uses the MapR Control System (MCS). Spark's pre-built package has no specific manager component for managing Spark. If you are planning to use the YARN resource manager, start the Spark service in Ambari or MCS. To start Spark daemons for using Spark's standalone resource manager, use the following procedure:

  1. Start services with the following commands:
    • HDP: /usr/hdp/current/spark-client/sbin/start-all.sh
    • MapR: /opt/mapr/spark/spark-*/sbin/start-all.sh
    • Spark Package pre-built for Hadoop: ./sbin/start-all.sh

    For a multi-node cluster, start Spark worker roles on all machines with the following command:

    ./sbin/start-slave.sh spark://masterhostname:7077 
    

    Another option is to provide a list of worker hostnames in the conf/slaves file and then run the ./sbin/start-all.sh command to start worker roles on all machines automatically (see the sketch at the end of this section).

  2. Check the logs in the logs directory under the Spark installation. Look at the master web UI at http://masterhostname:8080. If this port is already taken by another service, the next available port is used; for example, in HDP, port 8080 is taken by Ambari, so the standalone master binds to 8081. To find the correct port number, check the logs (see the sketch at the end of this section).

    Note

    All programs in this chapter are executed on CDH 5.8 VM. For other environments, the file paths might change but the concepts are the same in any environment.
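
For reference, the conf/slaves approach mentioned in step 1 might look like the following on a small cluster. The hostnames are hypothetical, and passwordless SSH from the master node to every worker is assumed so that the script can launch the remote daemons:

    # conf/slaves -- one worker hostname per line (hypothetical hosts)
    worker1.example.com
    worker2.example.com
    worker3.example.com

    # From the master node, in the Spark installation directory:
    ./sbin/start-all.sh    # starts the master and a worker on every listed host
    ./sbin/stop-all.sh     # stops them again

Similarly, to find the port the master web UI actually bound to (step 2), you can search the master log. The file name shown here follows the standard pattern, but it varies with the user and hostname:

    grep -i "MasterWebUI" logs/spark-*-org.apache.spark.deploy.master.Master-*.out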
