- Apache Spark 2.x for Java Developers
- Sourav Gulati, Sumit Kumar
- 2021-07-02 19:02:01
Counting the number of words in a file
Let's read the file people.txt placed in $SPARK_HOME/examples/src/main/resources:

scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
The next step is to flatten the contents of the file. That is, we create a new RDD by splitting each line on ", " (a comma followed by a space) and flattening the resulting words into a single collection, as follows:
scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26
The contents of the flattenFile RDD look as follows:
scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
Now, we can count all the words in this RDD as follows:
scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
[Screenshot: the Spark shell session showing the count result]
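To make the flatten-and-count steps concrete without a running cluster, the following is a minimal sketch that mimics them on an ordinary Scala collection. It assumes the three input lines implied by the collect output above; the object name WordCountSketch and its helper methods are hypothetical and are not part of Spark. It also contrasts map with flatMap, since the difference between the two is what "flattening" refers to.

```scala
object WordCountSketch {
  // Assumption: these three lines mirror the contents of people.txt,
  // as implied by the collect output shown in the shell session.
  val lines: Seq[String] = Seq("Michael, 29", "Andy, 30", "Justin, 19")

  // map keeps one element per input line, so the result is a
  // sequence of arrays (one array of pieces per line).
  def splitPerLine(input: Seq[String]): Seq[Array[String]] =
    input.map(_.split(", "))

  // flatMap splits each line and then flattens all the pieces
  // into a single sequence of words.
  def flatten(input: Seq[String]): Seq[String] =
    input.flatMap(_.split(", "))

  def main(args: Array[String]): Unit = {
    val nested = splitPerLine(lines)
    val words = flatten(lines)
    println(s"map yields ${nested.size} arrays")     // 3, one per line
    println(s"flatMap yields ${words.size} words")   // 6, matching count
    println(words.mkString(", "))
  }
}
```

The same two calls on an RDD behave analogously, except that Spark evaluates them lazily and only runs the work when an action such as count or collect is invoked.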

Whenever an action such as count is called, Spark builds a directed acyclic graph (DAG) from the lineage dependencies of each RDD. Spark provides a debug method, toDebugString, to show the lineage dependencies of an RDD:
scala> flattenFile.toDebugString
[Screenshot: output of flattenFile.toDebugString in the Spark shell]

Each change in indentation marks a shuffle boundary, while the number in parentheses indicates the parallelism level (the number of partitions) at that stage.
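The lineage of flattenFile contains no shuffle, so the indentation rule is easier to see on an RDD that does require one. The following is a rough sketch, assuming Spark 2.x is on the classpath and a local master with two threads. The object name LineageSketch, its helper methods, and the in-memory input lines are illustrative assumptions rather than part of the original example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LineageSketch {
  // Narrow transformation only: each output partition depends on a
  // single input partition, so no shuffle is needed.
  def words(lines: RDD[String]): RDD[String] =
    lines.flatMap(_.split(", "))

  // reduceByKey regroups records by key across partitions,
  // which introduces a shuffle into the lineage.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    words(lines).map(word => (word, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val spark = SparkSession.builder()
      .appName("LineageSketch")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Assumption: in-memory stand-in for people.txt, split into 2 partitions.
    val lines = sc.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"), numSlices = 2)

    println(words(lines).toDebugString)      // lineage without a shuffle
    println(wordCounts(lines).toDebugString) // lineage with a shuffle
    spark.stop()
  }
}
```

When this is run, the first lineage should be printed at a single indentation level, while the second should include a ShuffledRDD entry with its parent RDDs indented beneath it. The exact RDD ids and source locations in the output depend on the environment.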
In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.