
Word count on RDD

Let's run a word count on stringRDD. Word count is the "Hello World" of the big data world: we count the number of occurrences of each word in the RDD.

First, we create pairRDD as follows:

scala> val pairRDD = stringRdd.map(s => (s, 1))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:26

Each element of pairRDD is a pair of a word and the integer 1, where the word is one of the strings in stringRDD.

Now, we run the reduceByKey operation on this RDD to sum the ones for each word, which gives the number of occurrences of each word:

scala> val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
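To make the per-key merge concrete, here is a minimal sketch in plain Scala (no Spark required) that mimics what reduceByKey computes on a local collection. The helper name reduceByKeyLocal and the order of the input words are assumptions made for illustration; the words themselves are inferred from the output shown in this section. Spark's actual reduceByKey works across partitions and involves a shuffle, which this sketch does not model.

```scala
// Plain Scala sketch of the reduceByKey logic; this is not Spark's implementation.
object ReduceByKeySketch {
  // Hypothetical helper: groups the pairs by key, then folds the values
  // of each group with `f`, mirroring the per-key merge of reduceByKey.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    // Assumed input: the strings are inferred from the collect output; their order is a guess.
    val words = Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java")
    // Same first step as pairRDD: pair every word with the integer 1.
    val pairs = words.map(s => (s, 1))
    // Same merge function as the REPL example: add the ones for each word.
    val counts = reduceByKeyLocal(pairs)((x, y) => x + y)
    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```

The merge function only ever sees two values for the same key, which is why a simple (x, y) => x + y is enough to produce a count.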

Now, let's run collect on wordCountRDD to see the result:

scala> val wordCountList = wordCountRDD.collect
wordCountList: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))
scala> wordCountList
res3: Array[(String, Int)] = Array((Python,1), (JavaScript,1), (Java,2), (Scala,1), (Ruby,1))

As the output of wordCountList shows, every string in stringRDD appears once except Java, which appears twice.

It is shown in the following screenshot:
