Creating and filtering an RDD

Let's start by creating an RDD of strings:

scala> val stringRdd = sc.parallelize(Array("Java", "Scala", "Python", "Ruby", "JavaScript", "Java"))
stringRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

Now, we will filter this RDD to keep only those strings that start with the letter J:

scala> val filteredRdd = stringRdd.filter(s => s.startsWith("J"))
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26

In the first chapter, we learned that if an operation on an RDD returns another RDD, it is a transformation; otherwise, it is an action.

The output of the preceding command shows that the filter operation returned an RDD, so filter is a transformation.

Now, we will run an action on filteredRdd to see its elements. Let's run collect on filteredRdd:

scala> val list = filteredRdd.collect
list: Array[String] = Array(Java, JavaScript, Java)

The output of the previous command shows that the collect operation returned an array of strings rather than an RDD, so it is an action.
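The return types tell the same story, and writing them out lets the compiler check the distinction. The following is a minimal sketch for the Spark shell, assuming sc is the SparkContext that the shell provides. The names languages, jOnly, asArray, and howMany are made up for illustration, and count is an additional action that is not used elsewhere in this section.

```scala
import org.apache.spark.rdd.RDD

// Assumption: this runs inside spark-shell, where `sc` is already defined.
val languages: RDD[String] =
  sc.parallelize(Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java"))

// filter returns an RDD[String], so it is a transformation.
val jOnly: RDD[String] = languages.filter(_.startsWith("J"))

// collect returns an Array[String] and count returns a Long.
// Neither result is an RDD, so both operations are actions.
val asArray: Array[String] = jOnly.collect()
val howMany: Long = jOnly.count()
```

If one of the annotations does not match, for example declaring the result of filter as Array[String], the shell reports a type mismatch instead of creating the value.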

Now, let's see the elements of the list variable:

scala> list
res5: Array[String] = Array(Java, JavaScript, Java)

We are left with only the elements that start with J, which was our desired outcome.
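The same steps can also be run outside the shell. The sketch below is one possible standalone version, not the book's own code: it assumes spark-core is on the classpath and uses a local master (local[*]) purely for illustration. The object name FilterRddExample and the helper keepByPrefix are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterRddExample {

  // Hypothetical helper: builds an RDD from `words`, keeps the strings that
  // start with `prefix` (a transformation), and collects them (an action).
  def keepByPrefix(sc: SparkContext, words: Seq[String], prefix: String): Array[String] = {
    val stringRdd = sc.parallelize(words)
    val filteredRdd = stringRdd.filter(s => s.startsWith(prefix))
    filteredRdd.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this small example.
    val conf = new SparkConf().setAppName("FilterRddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val words = Seq("Java", "Scala", "Python", "Ruby", "JavaScript", "Java")
      val result = keepByPrefix(sc, words, "J")
      println(result.mkString(", ")) // prints: Java, JavaScript, Java
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

Outside the shell there is no predefined sc, so the program creates its own SparkContext and stops it when the work is done.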
