Counting the number of words in a file

Let's read the file people.txt, which is located in $SPARK_HOME/examples/src/main/resources.
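
The file ships with the Spark distribution and contains three comma-separated records, one per line:

Michael, 29
Andy, 30
Justin, 19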

The textFile() method can be used to read the file as follows:
scala> val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
file: org.apache.spark.rdd.RDD[String] = /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
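
Each element of this RDD is one line of the file. As a quick sanity check (representative REPL output, given the three-line file shown above), counting the lines returns 3:

scala> file.count
res0: Long = 3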

The next step is to flatten the contents of the file: we will create an RDD by splitting each line on ", " (a comma followed by a space) and flattening all the words into a single list, as follows:

scala> val flattenFile = file.flatMap(s => s.split(", "))
flattenFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:26

The contents of the flattenFile RDD look as follows:

scala> flattenFile.collect
res5: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

Now, we can count all the words in this RDD as follows:

scala> val count = flattenFile.count
count: Long = 6
scala> count
res2: Long = 6
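
If you need the frequency of each distinct word rather than the overall total, the countByValue action returns a map from each element to its number of occurrences. The following is a minimal sketch; the res number and the ordering of the map entries will vary between sessions:

scala> flattenFile.countByValue
res3: scala.collection.Map[String,Long] = Map(Michael -> 1, 29 -> 1, Andy -> 1, 30 -> 1, Justin -> 1, 19 -> 1)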

Whenever any action such as count gets called, Spark creates a directed acyclic graph (DAG) to depict the lineage dependencies of each RDD. Spark provides the toDebugString method to show such lineage dependencies of an RDD:

scala> flattenFile.toDebugString

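The output looks similar to the following; the RDD IDs match the ones printed when the RDDs were created above, while the partition count in parentheses depends on your configuration:

res6: String =
(2) MapPartitionsRDD[5] at flatMap at <console>:26 []
 |  /usr/local/spark/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24 []
 |  /usr/local/spark/examples/src/main/resources/people.txt HadoopRDD[0] at textFile at <console>:24 []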

The indentation represents a shuffle boundary, while the number in the parentheses indicates the parallelism level (the number of partitions) at each stage.
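
You can cross-check that number against the RDD's partition count directly with getNumPartitions; the value of two shown here is only representative and may differ in your session:

scala> flattenFile.getNumPartitions
res7: Int = 2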

In this section, we became familiar with some Spark CLI concepts. In the next section, we will discuss the various components of a Spark job.
