
Experimenting with the Spark shell

The best way to learn Spark is through the Spark shell. Spark ships with two interactive shells, one for Scala and one for Python. Since the GraphX library was most complete in Scala at the time this book was written, we are going to use spark-shell, the Scala shell. Let's launch the Spark shell from the $SPARKHOME/bin directory on the command line:

$SPARKHOME/bin/spark-shell

If you set the current directory (cd) to $SPARKHOME, you can simply launch the shell with:

cd $SPARKHOME
./bin/spark-shell

Note

If you get an error saying something like: Failed to find Spark assembly in spark-1.4.1/assembly/target/scala-2.10. You need to build Spark before running this program, it means that you have downloaded the Spark source code instead of a prebuilt version of Spark. In that case, go back to the project website and choose a prebuilt version of Spark.

If you were successful in launching the Spark shell, you should see a welcome message like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java)

For a sanity check, you can type in some Scala expressions or declarations and have them evaluated. Let's type some commands into the shell now:

scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@52e52233
scala> val myRDD = sc.parallelize(List(1,2,3,4,5))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> sc.textFile("README.md").filter(line => line contains "Spark").count()
res2: Long = 21

Here is what you can tell about the preceding code. First, we displayed the Spark context, bound to the variable sc, which is created automatically when you launch the Spark shell. The Spark context is the point of entry to the Spark API. Second, we created an RDD named myRDD by calling the parallelize function on a list of five numbers. Finally, we loaded the README.md file into an RDD, filtered the lines that contain the word "Spark", and invoked an action on the filtered RDD to count those lines.
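The one-liner above can also be unpacked step by step. This is a minimal sketch to try in the Spark shell (where sc already exists); the count you get depends on the README.md file in your working directory:

```scala
// Transformations such as filter are lazy: nothing is read from disk
// until an action such as count() triggers the actual job.
val readme     = sc.textFile("README.md")                      // RDD[String], one element per line
val sparkLines = readme.filter(line => line.contains("Spark")) // transformation: no work done yet
val n          = sparkLines.count()                            // action: runs the job, returns a Long
println(s"Lines mentioning Spark: $n")
```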

Note

It is expected that you are already familiar with the basic RDD transformations and actions, such as map, reduce, and filter. If that is not the case, I recommend that you learn them first, perhaps by reading the programming guide at https://spark.apache.org/docs/latest/programming-guide.html or introductory books such as Fast Data Processing with Spark from Packt Publishing or Learning Spark from O'Reilly Media.

Don't panic if you did not fully grasp the mechanisms behind RDDs; the following refresher covers the important points. An RDD (Resilient Distributed Dataset) is the core data abstraction in Spark: it represents a large dataset that is partitioned and processed in parallel across a cluster of machines. The Spark API provides a uniform set of operations to transform and reduce the data within an RDD. On top of these abstractions and operations, the GraphX library offers a flexible API that enables us to create graphs and operate on them easily.
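As a quick illustration of the transformation/action distinction, here is a small sketch meant to be typed into the Spark shell, where sc is already defined:

```scala
// Transformations build new RDDs lazily; actions ship a result back to the driver.
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = numbers.map(_ * 2)       // transformation: 2, 4, 6, 8, 10
val bigOnes = doubled.filter(_ > 4)    // transformation: 6, 8, 10
val total   = bigOnes.reduce(_ + _)    // action: 6 + 8 + 10 = 24
```

No computation happens until reduce is called; Spark only then schedules the map and filter steps across the partitions.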

When you ran the preceding commands in the Spark shell, you may have been overwhelmed by the long list of logging statements that start with INFO. There is a way to reduce the amount of information that Spark outputs in the shell.

Tip

You can reduce the level of verbosity of the Spark shell as follows:

  • First, go to the $SPARKHOME/conf folder
  • There, copy the template file log4j.properties.template to a new file called log4j.properties in the same folder
  • In log4j.properties, find the line log4j.rootCategory=INFO, console and replace it with either one of these two lines:
    • log4j.rootCategory=WARN, console
    • log4j.rootCategory=ERROR, console
  • Finally, restart the Spark shell and you should now see fewer logging messages in the shell outputs
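After the edit, the top of your log4j.properties should look roughly like the following. This sketch assumes the stock template shipped with Spark 1.4.1; the appender lines in your distribution may differ slightly:

```
# $SPARKHOME/conf/log4j.properties, copied from log4j.properties.template
# Only the rootCategory line is changed from the template.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```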