- Mastering Apache Spark 2.x (Second Edition)
- Romeo Kienzler
- 2021-07-02 18:55:31
The Dataset API in action
We conclude our discussion of Datasets with a final aggregation example using the relational Dataset API. Note that Datasets also offer methods inspired by RDDs, so we can mix in the map function known from RDDs as follows:
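// Needs import spark.implicits._ (automatic in spark-shell) for $"..." and the tuple encoder
val dsNew = ds
  .filter(r => r.age >= 18)           // keep clients aged 18 or older
  .map(c => (c.age, c.countryCode))   // project to a tuple: _1 = age, _2 = countryCode
  .groupBy($"_2")                     // group by countryCode
  .avg()                              // average the numeric non-grouping columns, here the age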
Let's understand how this works step by step:
- First, we take the Dataset and filter it down to the rows containing clients aged 18 or older.
- Then, from the client object c, we take only the age and countryCode fields. This step is again a projection and could have been done using the select method. The map method is used here only to show that lambda functions can be combined with Datasets without directly touching the underlying RDD.
- Next, we group by countryCode. In the groupBy method we use the so-called Catalyst DSL (Domain Specific Language) to refer to the second element (_2) of the tuple that we created in the previous step.
- Finally, we average over the groups created in the previous step, which amounts to averaging the age per country.
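The remark that the projection could equally be written with select can be sketched as follows. This is a minimal illustration, not the book's own code: the Client case class, its field types, and the names ProjectionVariants, withMap, and withSelect are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class: the text only mentions age and countryCode; the field types are assumed.
case class Client(age: Int, countryCode: String)

object ProjectionVariants {
  // Variant from the text: the projection is a lambda passed to map,
  // so the columns of the resulting tuple are named _1 and _2.
  def withMap(ds: Dataset[Client], spark: SparkSession): DataFrame = {
    import spark.implicits._
    ds.filter(r => r.age >= 18)
      .map(c => (c.age, c.countryCode))
      .groupBy($"_2")
      .avg("_1")
  }

  // Same query written with column expressions and select:
  // the original column names are kept throughout.
  def withSelect(ds: Dataset[Client], spark: SparkSession): DataFrame = {
    import spark.implicits._
    ds.filter($"age" >= 18)
      .select($"age", $"countryCode")
      .groupBy($"countryCode")
      .avg("age")
  }
}
```

Both variants compute the same averages. The select version refers to columns by name, which tends to be easier to read than the positional _1 and _2 of the tuple.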
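The result is a new Dataset containing the average age of adults by country (strictly speaking a DataFrame, that is, a Dataset[Row], because groupBy and avg belong to the untyped relational API):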
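To make the shape of such a result concrete, here is a small self-contained sketch that runs the pipeline on a few made-up rows. The Client case class, its field types, the sample data, and the names AverageAdultAge and compute are assumptions for illustration only. Since avg returns a DataFrame, the sketch converts the rows back into a typed Dataset of tuples with as.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class; only age and countryCode are named in the text, the types are assumed.
case class Client(age: Int, countryCode: String)

object AverageAdultAge {
  // Returns (countryCode, average age of clients aged 18 or older).
  def compute(ds: Dataset[Client], spark: SparkSession): Dataset[(String, Double)] = {
    import spark.implicits._
    ds.filter(r => r.age >= 18)
      .map(c => (c.age, c.countryCode))
      .groupBy($"_2")
      .avg("_1")               // DataFrame with the columns _2 and avg(_1)
      .as[(String, Double)]    // tuple types are mapped to columns by position
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-api-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, for illustration only.
    val ds = Seq(Client(17, "DE"), Client(30, "DE"), Client(40, "DE"), Client(20, "CH")).toDS()
    compute(ds, spark).show()
    spark.stop()
  }
}
```

With these sample rows, the 17-year-old is filtered out, so the average is 35.0 for DE and 20.0 for CH.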

Now we have a fairly complete picture of all the first-class citizens of Apache Spark SQL, as shown in the following figure:

This shows that the RDD is still the central data processing API on which everything else builds. DataFrames add a structured data API, whereas Datasets top it off with statically typed domain objects, available only in Scala and Java. Both APIs can be used through SQL or through the relational API, so we can also run SQL queries against Datasets, as the following example illustrates:
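Such a query might look like the following minimal sketch, which registers a Dataset as a temporary view and runs the same aggregation in SQL. The Client case class, the view name clients, the alias avgAge, and the names SqlOnDataset and averageAdultAgeSql are assumptions for illustration and do not come from the original text.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class; only age and countryCode are named in the text, the types are assumed.
case class Client(age: Int, countryCode: String)

object SqlOnDataset {
  // Registers the Dataset under the (hypothetical) view name "clients"
  // and computes the average age of adults per country with plain SQL.
  def averageAdultAgeSql(ds: Dataset[Client], spark: SparkSession): DataFrame = {
    ds.createOrReplaceTempView("clients")
    spark.sql(
      """SELECT countryCode, AVG(age) AS avgAge
        |FROM clients
        |WHERE age >= 18
        |GROUP BY countryCode""".stripMargin)
  }
}
```

The temporary view lives only as long as the SparkSession that created it, and the query returns a DataFrame, just like the relational groupBy and avg calls.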

This gives us some idea of the SQL-based functionality within Apache Spark, but what if we find that the method we need is not available? Perhaps we need a new function. This is where user-defined functions (UDFs) are useful. We will cover them in the next section.