
Starting Spark daemons

If you are planning to use the standalone cluster manager, you need to start the Spark master and worker daemons, which are the core components of Spark's architecture. Starting and stopping daemons varies slightly from distribution to distribution. Hadoop distributions such as Cloudera, Hortonworks, and MapR provide Spark as a service with YARN as the default resource manager, which means that all Spark applications run on the YARN framework by default. To use Spark's standalone resource manager, however, we need to start the Spark master and worker roles ourselves; if you plan to use the YARN resource manager, you don't need to start these daemons at all. Follow the procedure below for the type of distribution you are using. Downloading and installation instructions for all these distributions can be found in Chapter 2, Getting Started with Apache Hadoop and Apache Spark.
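For context, the choice of resource manager only changes the master URL passed when an application is submitted. The following is a minimal sketch, where my_app.py and masterhostname are placeholders:

    # Run on YARN (the default on CDH, HDP, and MapR); no Spark daemons are needed.
    spark-submit --master yarn my_app.py

    # Run on Spark's standalone resource manager; this requires the master and
    # worker daemons started in the procedures below.
    spark-submit --master spark://masterhostname:7077 my_app.py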

Working with CDH

Cloudera Distribution for Hadoop (CDH) is an open source distribution including Hadoop, Spark, and many other projects needed for Big Data Analytics. Cloudera Manager is used for installing and managing the CDH platform. If you are planning to use the YARN resource manager, start the Spark service in Cloudera Manager. To start Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Spark on the CDH platform is configured to work with YARN. Moreover, Spark 2.0 is not available on CDH yet, so download the latest pre-built Spark 2.0 package for Hadoop as explained in Chapter 2, Getting Started with Apache Hadoop and Apache Spark. If you would like to use the Spark 1.6 version instead, run the /usr/lib/spark/start-all.sh command.
  2. Start the service with the following commands:
    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7/sbin
    sudo ./start-all.sh
    
  3. Check the Spark UI at http://quickstart.cloudera:8080/.
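To confirm that the standalone daemons are actually running, one option is to check the Java processes and submit the bundled SparkPi example to the standalone master. This is a minimal sketch, assuming the default master port 7077 and the examples JAR that ships inside the pre-built 2.0.0 package (the exact JAR name varies with the Spark and Scala versions):

    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7
    # jps should list a Master and a Worker process
    jps
    # Submit the bundled SparkPi example to the standalone master
    ./bin/spark-submit --master spark://quickstart.cloudera:7077 \
      --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_2.11-2.0.0.jar 10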

Working with HDP, MapR, and Spark pre-built packages

Hortonworks Data Platform (HDP) and MapR Converged Data Platform distributions also include Hadoop, Spark, and many other projects needed for Big Data Analytics. While HDP uses Apache Ambari for deploying and managing the cluster, MapR uses the MapR Control System (MCS). Spark's pre-built package has no specific manager component for managing Spark. If you are planning to use the YARN resource manager, start the Spark service in Ambari or MCS. To start Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Start services with the following commands:
    • HDP: /usr/hdp/current/spark-client/sbin/start-all.sh
    • MapR: /opt/mapr/spark/spark-*/sbin/start-all.sh
    • Spark Package pre-built for Hadoop: ./sbin/start-all.sh

    For a multi-node cluster, start Spark worker roles on all machines with the following command:

    ./sbin/start-slave.sh spark://masterhostname:7077 
    

    Another option is to list the hostnames of the workers in the conf/slaves file under the Spark installation directory and then use the ./sbin/start-all.sh command to start worker roles on all machines automatically; a brief sketch of this layout follows the note below.

  2. Check logs located in the logs directory. Look at the master web UI at http://masterhostname:8080. If this port is already taken by another service, the next available port will be used. For example, in HDP, port 8080 is taken by Ambari, so the standalone master will bind to 8081. To find the correct port number, check the logs.

    Note

    All programs in this chapter are executed on CDH 5.8 VM. For other environments, the file paths might change but the concepts are the same in any environment.
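As mentioned in step 1, the conf/slaves approach lets start-all.sh bring up every worker in one step. The following is a minimal sketch, assuming passwordless SSH from the master node to each worker host; the hostnames are placeholders:

    # conf/slaves -- one worker hostname per line (placeholders)
    #   worker1.example.com
    #   worker2.example.com

    # From the Spark installation directory on the master node:
    ./sbin/start-all.sh   # starts the master locally and a worker on each listed host
    ./sbin/stop-all.sh    # stops the master and all listed workers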
