- Apache Spark 2.x for Java Developers
- Sourav Gulati Sumit Kumar
- 2021-07-02 19:02:01
Word count on RDD
Let's run a word count on stringRDD. Word count is the HelloWorld of the big data world: we count the occurrences of each word in the RDD.
First, we will create pairRDD as follows:
scala> val pairRDD = stringRdd.map(s => (s, 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26
pairRDD consists of (word, 1) pairs, where each word is a string from stringRDD and 1 is an integer used as its initial count.
Now, we will run the reduceByKey operation on this RDD to count the occurrences of each word as follows:
scala> val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
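To make the effect of reduceByKey concrete, here is a minimal sketch of the same two steps on plain Scala collections, with no Spark involved. The object name WordCountSketch, the method countWords, and the input list are assumptions for illustration; the list was chosen only to be consistent with the output shown in this section. On a real RDD, reduceByKey applies the same function but distributes the work across partitions.

```scala
object WordCountSketch {
  // Hypothetical helper mirroring the two RDD steps on a local collection.
  def countWords(words: Seq[String]): Map[String, Int] = {
    // Step 1: pair every word with 1, as map(s => (s, 1)) does on the RDD.
    val pairs = words.map(s => (s, 1))

    // Step 2: group the pairs by key, then fold each group's values with
    // the same function that is passed to reduceByKey: (x, y) => x + y.
    pairs.groupBy(_._1).map { case (word, group) =>
      (word, group.map(_._2).reduce((x, y) => x + y))
    }
  }

  // Assumed input; the actual contents of stringRDD are defined elsewhere.
  val sample: Seq[String] =
    Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java")
}
```

Calling WordCountSketch.countWords(WordCountSketch.sample) yields a map in which Java maps to 2 and every other word maps to 1.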
Now, let's run collect on it to see the result:
scala> val wordCountList = wordCountRDD.collect
wordCountList: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))
scala> wordCountList
res3: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))
As per the output of wordCountList, every string in stringRDD appears once except Java, which appears twice.
It is shown in the following screenshot:
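The collected result is an ordinary local Array, so it can be inspected with standard Scala collection methods. The sketch below copies the array from the collect output above and checks which words occur more than once; the object WordCountCheck and the helpers repeated and sortedByCount are hypothetical names introduced for illustration. Sorting is included because the element order of a collected, shuffled RDD should not be relied on.

```scala
object WordCountCheck {
  // Result array copied from the collect output shown in this section.
  val wordCountList: Array[(String, Int)] =
    Array(("Python", 1), ("JavaScript", 1), ("Java", 2), ("Scala", 1), ("Ruby", 1))

  // Hypothetical helper: the words whose count is greater than one.
  def repeated(counts: Array[(String, Int)]): Array[String] =
    counts.collect { case (word, count) if count > 1 => word }

  // Hypothetical helper: sort by descending count, ties broken alphabetically.
  def sortedByCount(counts: Array[(String, Int)]): Array[(String, Int)] =
    counts.sortBy { case (word, count) => (-count, word) }
}
```

Here WordCountCheck.repeated(WordCountCheck.wordCountList) contains only Java, which matches the observation that Java is the one word appearing twice.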
