Exploring the Spark shell
Spark comes bundled with a REPL shell, which is a wrapper around the Scala shell. Though the Spark shell looks like a command line for simple tasks, in reality many complex queries can also be executed using it. This chapter explores the different development environments in which Spark applications can be developed.
How to do it...
Hadoop MapReduce's word count becomes very simple with the Spark shell. In this recipe, we are going to create a simple one-line text file, upload it to the Hadoop Distributed File System (HDFS), and use Spark to count the occurrences of words. Let's see how:
- Create the words directory by using the following command:
$ mkdir words
- Get into the words directory:
$ cd words
- Create a sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > sh.txt
- Start the Spark shell:
$ spark-shell
- Load the words directory from HDFS as an RDD (if sh.txt has not been uploaded to HDFS yet, see the note after this recipe):
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
- Count the number of lines (the result is 1):
scala> words.count
- Divide the line (or lines) into multiple words:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
- Convert each word to (word,1), that is, output 1 as the value for each occurrence of word as a key:
scala> val wordsMap = wordsFlatMap.map(w => (w, 1))
- Use the reduceByKey method to add up the number of occurrences of each word (the function works on two consecutive values at a time, represented by a and b):
scala> val wordCount = wordsMap.reduceByKey((a, b) => a + b)
- Sort the results:
scala> val wordCountSorted = wordCount.sortByKey(true)
- Print the RDD:
scala> wordCountSorted.collect.foreach(println)
- All of the preceding operations can also be combined into one step, as follows:
scala> sc.textFile("hdfs://localhost:9000/user/hduser/words").flatMap(_.split("\\W+")).map(w => (w, 1)).reduceByKey((a, b) => a + b).sortByKey(true).collect.foreach(println)
This gives us the following output:
(be,2)
(not,1)
(or,1)
(to,2)
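Note that the textFile call in this recipe reads from HDFS, so if sh.txt only exists in the local words directory, it needs to be uploaded first. A minimal sketch, assuming HDFS is running at localhost:9000 and /user/hduser is the home directory:
$ hdfs dfs -mkdir -p /user/hduser/words
$ hdfs dfs -put sh.txt /user/hduser/words/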
Now that you understand the basics, load HDFS with a large amount of text, for example, stories, and see the magic.
If you have the files in a compressed format, you can load them as is in HDFS. Both Hadoop and Spark have codecs for decompression, which they select based on the file extension.
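For example, assuming a gzipped copy of the file exists at the hypothetical path below, it can be loaded exactly like an uncompressed file; Spark picks the gzip codec from the .gz extension:
scala> val compressed = sc.textFile("hdfs://localhost:9000/user/hduser/words/sh.txt.gz")
scala> compressed.count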
When wordsFlatMap was converted into the wordsMap RDD, an implicit conversion came into play. It converts the RDD into a PairRDD, which provides key-based operations such as reduceByKey. In the Spark shell this happens automatically and nothing needs to be done. If you are writing it in Scala code, please add the following import statement:
import org.apache.spark.SparkContext._
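For reference, here is a minimal sketch of the same word count as a standalone Scala application with the import in place; the object name WordCount and the application name are illustrative, and the master is assumed to be supplied by spark-submit:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings in the implicit conversion to PairRDD operations

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs://localhost:9000/user/hduser/words")
      .flatMap(_.split("\\W+"))
      .map(w => (w, 1))
      .reduceByKey((a, b) => a + b) // uses the PairRDD implicits imported above
      .sortByKey(true)
      .collect
      .foreach(println)

    sc.stop()
  }
}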