
Counting the number of words in a file

Let's read the file people.txt located in $SPARK_HOME/examples/src/main/resources.

The textFile() method can be used to read the file as follows:
scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
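
For reference, the people.txt file that ships with the Spark distribution contains three comma-separated records:

Michael, 29
Andy, 30
Justin, 19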

The next step is to flatten the contents of the file; that is, we will create an RDD by splitting each line on ", " (a comma followed by a space) and flattening all of the resulting words into a single list, as follows:

scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26

The contents of the flattenFile RDD look as follows:

scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

Now, we can count all the words in this RDD as follows:

scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
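
Note that count returns the total number of elements in the RDD, not per-word frequencies. If you instead wanted the frequency of each distinct word, a minimal sketch (not part of the original example) using the standard map and reduceByKey transformations would look like this:

scala> val wordCounts = flattenFile.map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.collect

Here, each word is mapped to a (word, 1) pair and the counts are summed per key; the order of the collected pairs is not guaranteed.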


Whenever an action such as count is called, Spark creates a directed acyclic graph (DAG) to depict the lineage dependencies of each RDD. Spark provides the toDebugString method to show such lineage dependencies of an RDD:

scala> flattenFile.toDebugString

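The output looks similar to the following (the exact RDD IDs, file path, and partition count depend on your environment):

(2) MapPartitionsRDD[5] at flatMap at <console>:26 []
 |  /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  /usr/local/spark/examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:24 []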

The indentations represent a shuffle, while the number in the parentheses indicates the parallelism level at each stage.
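
If you want to check that parallelism level programmatically, the RDD API exposes getNumPartitions (shown here purely as an illustration; the value returned depends on how the file was read):

scala> flattenFile.getNumPartitions

The returned Int matches the number shown in parentheses at the top of the lineage output.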

In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.
