
Starting Spark daemons

If you are planning to use the standalone cluster manager, you need to start the Spark master and worker daemons, which are the core components of Spark's architecture. Starting and stopping the daemons varies slightly from distribution to distribution. Hadoop distributions such as Cloudera, Hortonworks, and MapR provide Spark as a service, with YARN as the default resource manager. This means that all Spark applications run on the YARN framework by default. To use Spark's standalone resource manager, however, we need to start the Spark master and worker roles. If you are planning to use the YARN resource manager, you don't need to start these daemons. Follow the procedure below for the distribution you are using. Downloading and installation instructions for all of these distributions can be found in Chapter 2, Getting Started with Apache Hadoop and Apache Spark.
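
Which daemons you need depends only on the master URL you submit your application with. The following is a minimal sketch, assuming a pre-built Spark 2.0 package and the bundled SparkPi example (the exact jar name depends on the build you downloaded); masterhostname is a placeholder for your standalone master:

    # Run the SparkPi example on YARN -- no standalone daemons are needed
    ./bin/spark-submit --master yarn --deploy-mode client \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_2.11-2.0.0.jar 10

    # Run the same example on the standalone resource manager -- the master
    # and worker daemons started below must be running
    ./bin/spark-submit --master spark://masterhostname:7077 \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_2.11-2.0.0.jar 10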

Working with CDH

Cloudera Distribution for Hadoop (CDH) is an open source distribution including Hadoop, Spark, and many other projects needed for Big Data Analytics. Cloudera Manager is used for installing and managing the CDH platform. If you are planning to use the YARN resource manager, start the Spark service in Cloudera Manager. To start Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Spark on the CDH platform is configured to work with YARN. Moreover, Spark 2.0 is not available on CDH yet, so download the latest pre-built Spark 2.0 package for Hadoop as explained in Chapter 2, Getting Started with Apache Hadoop and Apache Spark. If you would like to use the Spark 1.6 version, run the /usr/lib/spark/start-all.sh command.
  2. Start the service with the following commands:
    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7/sbin
    sudo ./start-all.sh
    
  3. Check the Spark UI at http://quickstart.cloudera:8080/.
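
To verify that the standalone master is accepting applications, you can point a Spark shell at it. This is a minimal smoke test, assuming the default master port of 7077:

    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7
    ./bin/spark-shell --master spark://quickstart.cloudera:7077

The shell should show up under Running Applications in the master UI at http://quickstart.cloudera:8080/.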

Working with HDP, MapR, and Spark pre-built packages

Hortonworks Data Platform (HDP) and MapR Converged Data Platform distributions also include Hadoop, Spark, and many other projects needed for Big Data Analytics. While HDP uses Apache Ambari for deploying and managing the cluster, MapR uses the MapR Control System (MCS). Spark's pre-built package has no specific manager component for managing Spark. If you are planning to use the YARN resource manager, start the Spark service in Ambari or MCS. To start Spark daemons for using Spark's standalone resource manager, use the following procedure:

  1. Start services with the following commands:
    • HDP: /usr/hdp/current/spark-client/sbin/start-all.sh
    • MapR: /opt/mapr/spark/spark-*/sbin/start-all.sh
    • Spark Package pre-built for Hadoop: ./sbin/start-all.sh

    For a multi-node cluster, start Spark worker roles on all machines with the following command:

    ./sbin/start-slave.sh spark://masterhostname:7077 
    

    Another option is to provide a list of worker hostnames in the conf/slaves file and then run the ./sbin/start-all.sh command to start worker roles on all machines automatically (see the sketch at the end of this section).

  2. Check the logs in the logs directory under the Spark installation. Look at the master web UI at http://masterhostname:8080. If this port is already taken by another service, the next available port is used; for example, in HDP, port 8080 is taken by Ambari, so the standalone master binds to 8081. To find the correct port number, check the logs (see the sketch at the end of this section).

    Note

    All programs in this chapter are executed on CDH 5.8 VM. For other environments, the file paths might change but the concepts are the same in any environment.
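
For reference, the conf/slaves approach mentioned in step 1 might look like the following on a small cluster. The hostnames are hypothetical, and passwordless SSH from the master node to every worker is assumed so that the script can launch the remote daemons:

    # conf/slaves -- one worker hostname per line (hypothetical hosts)
    worker1.example.com
    worker2.example.com
    worker3.example.com

    # From the master node, in the Spark installation directory:
    ./sbin/start-all.sh    # starts the master and a worker on every listed host
    ./sbin/stop-all.sh     # stops them again

Similarly, to find the port the master web UI actually bound to (step 2), you can search the master log. The file name shown here follows the standard pattern, but it varies with the user and hostname:

    grep -i "MasterWebUI" logs/spark-*-org.apache.spark.deploy.master.Master-*.out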
