Word count on RDD

Let's run a word count on stringRdd. Word count is the HelloWorld of the big data world: we count how many times each word occurs in the RDD.

First, we create pairRDD as follows:

scala> val pairRDD = stringRdd.map(s => (s, 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26

pairRDD consists of (word, 1) pairs, where each word is an element of stringRdd and the integer 1 is its initial count.

Now, we will run the reduceByKey operation on this RDD to count the occurrences of each word as follows:

scala> val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
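
To see what reduceByKey computes without a cluster, the same two steps can be sketched on a local Scala collection. This is only an illustration, not how Spark executes the job: the input list is an assumption inferred from the collect output in this section, and groupBy followed by reduce stands in for reduceByKey, which in Spark runs across partitions.

```scala
// Assumed input: the contents of stringRdd are not shown in this section,
// so this list is inferred from the word counts in the collect output.
val words = Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java")

// Same first step as the RDD version: pair each word with the integer 1.
val pairs = words.map(s => (s, 1))

// Local stand-in for reduceByKey: group the pairs by word, then combine
// the counts in each group with the same (x, y) => x + y function.
val wordCounts: Map[String, Int] =
  pairs.groupBy(_._1).map { case (word, group) =>
    (word, group.map(_._2).reduce((x, y) => x + y))
  }

println(wordCounts("Java"))
```

The function passed to reduce here is the same one passed to reduceByKey in the shell; only the grouping mechanism differs.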

Now, let's run collect on it to see the result:

scala> val wordCountList = wordCountRDD.collect
wordCountList: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))
scala> wordCountList
res3: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))

As per the output of wordCountList, every string in stringRdd appears once except Java, which appears twice.
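
Because collect returns an ordinary Scala array, this observation can also be checked with plain collection methods. A minimal sketch, assuming the array holds exactly the values printed above (the name repeated is introduced here for illustration):

```scala
// Copied from the collect output above; in spark-shell this value already exists.
val wordCountList: Array[(String, Int)] =
  Array(("Python", 1), ("JavaScript", 1), ("Java", 2), ("Scala", 1), ("Ruby", 1))

// Keep only the words that occur more than once.
val repeated = wordCountList.filter { case (_, count) => count > 1 }

repeated.foreach { case (word, count) => println(s"$word appears $count times") }
```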

It is shown in the following screenshot:


Apache Spark 2.x for Java Developers
