- Apache Spark Quick Start Guide
- Shrey Mehrotra and Akash Grade
Using Spark components
Spark provides a command-line interface, that is, a read-eval-print loop (REPL), for several programming languages. You can choose the REPL from the following options, based on the language of your choice:
- Spark shell for Scala: If you want to use Scala to access the Spark APIs, you can start the Spark Scala shell with the following command:
spark-shell
The following screen will be displayed after the execution of the previous command:

Once the driver (one of Spark's components) is started, you can access all of the Scala and Java APIs in the shell:
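For example, spark-shell predefines a SparkSession named spark and a SparkContext named sc, so a minimal, illustrative sketch such as the following (the variable names are only examples) can be typed at the scala> prompt:
val nums = sc.parallelize(1 to 100)    // distribute a local range as an RDD
val even = nums.filter(_ % 2 == 0)     // keep only the even numbers
println(even.count())                  // prints 50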
- Spark shell for Python: If your preferred language for coding is Python, then you can start Spark's Python shell with the pyspark command, after the following one-time setup:
- Add the Spark Python libraries to your path.
- Open the ~/.bash_profile and add the following lines:
nano ~/.bash_profile
export PYSPARK_PYTHON=python3
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
- Save the ~/.bash_profile, reload it, and then start the PySpark shell:
source ~/.bash_profile
pyspark
Once the shell has loaded, you can start using Python commands to access the Spark APIs, as shown in the following output:

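For example, pyspark also predefines spark and sc, so a short, illustrative sketch like the following can be run at the >>> prompt:
nums = sc.parallelize(range(1, 101))      # distribute a local range as an RDD
even = nums.filter(lambda n: n % 2 == 0)  # keep only the even numbers
print(even.count())                       # prints 50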
- Spark SQL: If you have worked on a relational database management system (RDBMS) such as Oracle, MySQL, or Teradata, and you want to apply your SQL skills to Spark, you can use the Spark SQL module to write queries against different structured datasets. To start the Spark SQL shell, all you need to do is type the following command into your machine's Terminal:
spark-sql
The following screenshot shows the sequence of steps that takes place when you open spark-sql. As you can see, spark-sql uses an underlying metastore database, which is Derby by default. In Chapter 6, Spark SQL, you will find out how to connect spark-sql to the Hive metastore:

You will then have a spark-sql shell connected to the default Derby data store:

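As a quick, illustrative check, you can run plain SQL statements directly at the spark-sql> prompt; for example, the first statement below lists the databases in the Derby-backed metastore (only default on a fresh installation), and the second runs a simple query that needs no table:
SHOW DATABASES;
SELECT 1 + 1;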
- Spark Submit: Spark's multi-language support also allows you to use Java to access the Spark APIs. Since Java (up to version 8) does not provide a REPL, applications that use the Spark APIs are packaged and executed with the help of the following command:
spark-submit
The following syntax shows how we can specify the jar containing the application logic, the executor resources (memory and total cores), and the mode of execution for the application (standalone or YARN):
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--executor-memory 20G \
--total-executor-cores 100 \
--conf <key>=<value> \
<application-jar> \
[application-arguments]
Here, the different options can be described as follows (a complete example invocation is shown after the list):
- --class: This is the class containing the main method, and it is the entry point of the application (for example, org.apache.spark.examples.SparkPi).
- --master: This is the key property that defines the master of your application. Depending on whether you run locally, on a standalone cluster, or on YARN, the master can be local, spark://host:port (for example, spark://192.168.56.101:7077), or yarn. More options are available at https://spark.apache.org/docs/latest/submitting-applications.html#master-urls.
- --deploy-mode: This determines whether the driver is started on one of the worker nodes in the cluster (cluster) or locally on the machine where the command is executed (client). The default is client.
- --conf: Any Spark configuration that you want to override for your application, specified in key=value format.
- application-jar: This is the path to your application jar. If the jar is stored in HDFS, specify it as an hdfs:// path; if it is on the local filesystem, it must be a file:// path that is valid on the driver node.
- application-arguments: These are any arguments that you need to pass to your application's main class.
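For instance, a sketch of a submission that runs the SparkPi example bundled with Spark against a standalone master might look like the following (the master URL, resource sizes, and the examples jar path depend on your installation, so adjust them accordingly):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://192.168.56.101:7077 \
--deploy-mode client \
--executor-memory 2G \
--total-executor-cores 4 \
--conf spark.eventLog.enabled=false \
./examples/jars/spark-examples_*.jar \
100
Here, 100 is the application argument, which SparkPi uses as the number of partitions for its computation.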