
Counting the number of words in a file

Let's read the file people.txt located in $SPARK_HOME/examples/src/main/resources.

The textFile() method can be used to read the file as follows:
scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24

The next step is to flatten the contents of the file; that is, we will create an RDD by splitting each line on the delimiter ", " and flattening all the resulting words into a single list, as follows:

scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26

The contents of the flattenFile RDD look as follows:

scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

Now, we can count all the words in this RDD as follows:

scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
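
The same logic can also be packaged as a standalone application rather than typed into the shell. The following is a minimal sketch, assuming Spark is on the classpath; the object name, application name, and local master URL are illustrative choices, not part of the original example:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running locally; point this at your cluster's master instead
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Same steps as in the shell: read the file, split each line on ", ", and count the words
    val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
    val flattenFile = file.flatMap(s => s.split(", "))
    println(s"Total words: ${flattenFile.count()}")

    sc.stop()
  }
}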


Whenever an action such as count is called, Spark creates a directed acyclic graph (DAG) to depict the lineage dependencies of each RDD. Spark provides the toDebugString method to show such lineage dependencies of an RDD:

scala> flattenFile.toDebugString

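A representative output looks as follows (the RDD IDs match the transcripts above; the partition count, shown in parentheses, depends on your configuration):

(2) MapPartitionsRDD[5] at flatMap at <console>:26 []
 |  /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  /usr/local/spark/examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:24 []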

The indentation marks shuffle boundaries, while the number in parentheses indicates the parallelism level (the number of partitions) at each stage.
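
The lineage shown above contains no shuffle, so every RDD sits at the same indentation level. To see a shuffle boundary, chain a shuffling transformation such as reduceByKey; the following command and its output are a sketch, with the RDD IDs, console line numbers, and partition counts varying by session:

scala> flattenFile.map(word => (word, 1)).reduceByKey(_ + _).toDebugString

(2) ShuffledRDD[7] at reduceByKey at <console>:29 []
 +-(2) MapPartitionsRDD[6] at map at <console>:29 []
    |  MapPartitionsRDD[5] at flatMap at <console>:26 []
    |  /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24 []
    |  /usr/local/spark/examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:24 []

The +- prefix marks the start of a new stage: the RDDs indented below it are computed before the shuffle, and the ShuffledRDD is computed after it.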

In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.
