
Using Spark components

Spark provides a command-line interface, that is, a read–eval–print loop (REPL), for each of the programming languages it supports. You can choose the type of REPL from the following list, based on the language of your choice:

  1. Spark shell for Scala: If you want to use Scala for accessing Spark APIs, you can start the Spark Scala shell with the following command:
 spark-shell

          The following screen will be displayed after the execution of the previous command:

     Once the driver (one of Spark's components) is started, you can access all of the Scala and Java APIs in the shell:
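For example, in Spark 2.x the shell creates the spark (SparkSession) and sc (SparkContext) objects for you, so you can try a few statements straight away. The following is a minimal sketch; the values used are arbitrary:

spark.version                          // prints the running Spark version
val nums = sc.parallelize(1 to 100)    // distributes a local collection as an RDD
nums.sum()                             // returns 5050.0
val df = spark.range(5).toDF("id")     // creates a small DataFrame
df.show()                              // prints the five rows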

  2. Spark shell for Python: If your preferred choice of coding is Python, then you can start the Python shell of Spark as follows:
  • Add Python to the Spark path.
  • Open ~/.bash_profile and add the following lines:
nano ~/.bash_profile

export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
  • Save ~/.bash_profile and reload it (or open a new Terminal) so that the changes take effect, and then start the Python shell with the following command:
pyspark

Once the shell has loaded, you can start using Python commands to access the Spark APIs, as shown in the following output:
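For instance, the same kind of exploration can be done at the >>> prompt, where the spark and sc objects are also pre-created. Here is a minimal PySpark sketch; the values used are arbitrary:

spark.version                          # prints the running Spark version
nums = sc.parallelize(range(1, 101))   # distributes a local collection as an RDD
nums.sum()                             # returns 5050
df = spark.range(5).toDF("id")         # creates a small DataFrame
df.show()                              # prints the five rows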

  3. Spark SQL: If you have worked on a relational database management system (RDBMS) like Oracle, MySQL, or Teradata, and you want to apply your SQL programming skills to Spark, you can use the Spark SQL module to write queries for different structured datasets. To start the Spark SQL shell, all you need to do is type the following command into your machine's Terminal:
 spark-sql 

The following screenshot shows the set of executions that take place when you open spark-sql. As you can see, spark-sql uses an underlying metastore database, which is Derby by default. In Chapter 6, Spark SQL, you will find out how to connect spark-sql to the Hive metastore:

You will now have a spark-sql shell connected to the default Derby data store:
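For example, you can type a quick query at the spark-sql> prompt to confirm that the shell and the Derby-backed metastore are working. This is a minimal sketch; only the default database is assumed to exist:

spark-sql> SHOW DATABASES;    -- lists the databases in the metastore (default)
spark-sql> SELECT 1 + 1;      -- returns 2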

  4. Spark Submit: The multi-lingual feature of Spark also allows you to use Java to access the Spark APIs. Since Java (up to version 8) does not provide a REPL, Spark applications written in Java are packaged and executed with the help of the following command:
spark-submit 

The following syntax shows how we can specify the jar containing the application logic, the executors' resource specifications, and the mode of execution for the application (standalone or YARN):

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--executor-memory 20G \
--total-executor-cores 100 \
--conf <key>=<value> \
<application-jar> \
[application-arguments]

Here, we can describe the different options as follows; a complete example invocation follows the list:

  • --class: This is the class containing the main method, and it is the entry point of the application (for example, org.apache.spark.examples.SparkPi).
  • --master: This is the key property that defines the master of your application. Depending on whether you run locally, on a standalone cluster, or on YARN, the master could be local, spark://host:port (for example, spark://192.168.56.101:7077), or yarn. More options are available at https://spark.apache.org/docs/latest/submitting-applications.html#master-urls.
  • --deploy-mode: This determines whether the driver is started on one of the worker nodes in the cluster (cluster) or locally on the machine where the command is executed (client) (default: client).
  • --conf: These are the Spark configuration properties that you want to override for your application, in key=value format.
  • application-jar: This is the path to your application jar. If it resides in HDFS, specify it as an hdfs:// path; if it is a local file, it must be a valid file:// path that is accessible from the driver node.
  • application-arguments: These are the arguments that are passed to the main class of your application.
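
For instance, the SparkPi example class that ships with Spark can be submitted to a local master as follows. This is a sketch: the jar location under examples/jars and the final argument (the number of partitions SparkPi uses) depend on your Spark version and installation:

./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[4] \
--conf spark.driver.memory=1g \
examples/jars/spark-examples_*.jar \
100

The executor-related flags are omitted here because a local master runs the driver and the executor in a single JVM; they become relevant when --master points to a standalone or YARN cluster.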