Counting the number of words in a file
Let's read the file people.txt placed in $SPARK_HOME/examples/src/main/resources:

scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
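For reference, people.txt ships with the Spark distribution and holds three records, one person per line (as the collect output further below confirms):

Michael, 29
Andy, 30
Justin, 19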
The next step is to flatten the contents of the file; that is, we will create an RDD by splitting each line on ", " and flattening all the resulting words into a single list, as follows:
scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26
The contents of the flattenFile RDD look as follows:
scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
Now, we can count all the words in this RDD as follows:
scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
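Since this book targets Java developers, it is worth seeing the same steps outside the shell. The following is a minimal standalone Java sketch of the session above (the class name, local master URL, and the hard-coded file path are illustrative, not from the original text):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCount {
    public static void main(String[] args) {
        // Illustrative app name and local master; adjust for your cluster
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt");

        // Split each line on ", " and flatten the tokens into a single RDD.
        // In Spark 2.x, the Java flatMap lambda must return an Iterator.
        JavaRDD<String> flattenFile = file.flatMap(s -> Arrays.asList(s.split(", ")).iterator());

        long count = flattenFile.count(); // action: triggers execution of the DAG
        System.out.println("Number of words: " + count);

        sc.close();
    }
}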
Whenever an action such as count is called, Spark creates a directed acyclic graph (DAG) to capture the lineage dependencies of each RDD involved. Spark provides the debug method toDebugString to print such lineage dependencies of an RDD:
scala> flattenFile.toDebugString
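A representative lineage for this RDD looks like the following (the res number, RDD IDs, and parallelism level will vary with your session):

res3: String =
(2) MapPartitionsRDD[5] at flatMap at <console>:26 []
 |  /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  /usr/local/spark/examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:24 []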
The indentation represents a shuffle boundary, while the number in parentheses indicates the parallelism level at each stage.
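To check the parallelism level directly rather than reading it off the DAG, you can query the RDD itself (getNumPartitions is available from Spark 1.6 onward); for a small file read in local mode, textFile typically yields two partitions:

scala> flattenFile.getNumPartitions
res4: Int = 2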
In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.