
DataFrames

We have already used DataFrames in previous examples; they are based on a columnar format. Temporary tables can be created from a DataFrame, but we will expand on this in the next section. A DataFrame offers many methods for data manipulation and processing.

Let's start with a simple example and load some JSON data coming from an IoT sensor on a washing machine. We are again using the Apache Spark DataSource API under the hood to read and parse JSON data. The result of the parser is a data frame. It is possible to display a data frame schema as shown here:
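A minimal sketch of this step in Scala, assuming an existing SparkSession named `spark` and that the sensor data lives in a local file called `washing.json` (the file name and path are illustrative):

```scala
// Read and parse the JSON file through the DataSource API;
// the result is a DataFrame whose schema is inferred from the data
val washingDf = spark.read.json("washing.json")

// Print the inferred (nested) schema to the console
washingDf.printSchema()
```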

As you can see, this is a nested data structure. The doc field contains all the information that we are interested in, and we want to get rid of the meta information that Cloudant/Apache CouchDB added to the original JSON file. This can be accomplished by a call to the select method on the DataFrame:
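Assuming the nested payload sits in a column called `doc` and the raw data was loaded into a DataFrame named `washingDf` (the variable name is illustrative), the meta fields can be stripped like this:

```scala
// Keep only the sub-fields of the nested doc column, dropping the
// Cloudant/Apache CouchDB meta fields (such as _id and _rev)
val df = washingDf.select("doc.*")
df.printSchema()
```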

This is the first time that we are using the DataFrame API for data processing. Similar to RDDs, a set of methods composes a relational API whose expressiveness matches, or even exceeds, that of SQL. It is also possible to use the select method to filter columns from the data. In SQL or relational algebra, this is called projection. Let's now look at an example to better understand the concept:
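A projection in DataFrame terms, assuming the sensor DataFrame is named `df` and has columns `voltage` and `frequency` (column names taken from the examples below, the variable name is illustrative):

```scala
// Projection: keep only two columns, then display the first three rows
df.select("voltage", "frequency").show(3)
```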

If we want to see the contents of a DataFrame, we can call the show method on it. By default, the first 20 rows are returned. In this case, we've passed 3 as an optional parameter limiting the output to the first three rows.

Of course, the show method is only useful for debugging because it produces plain text that cannot be used for further downstream processing. However, we can chain calls together very easily.

Note that a method called on a DataFrame returns a DataFrame again, similar to the concept of RDD methods returning RDDs. This means that method calls can be chained, as we can see in the next example.

It is possible to filter the data returned from the DataFrame using the filter method. Here, we filter on voltage and select voltage and frequency:
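One way to express this chain, assuming the same DataFrame `df` with columns `voltage` and `frequency`; the threshold value is illustrative:

```scala
// filter returns a DataFrame, so select can be chained directly onto it
df.filter("voltage > 230")
  .select("voltage", "frequency")
  .show()
```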

Semantically, the preceding statement is the same regardless of whether we first filter and then select, or vice versa. However, the chosen order might make a difference to performance. Fortunately, we don't have to take care of this, because DataFrames, like RDDs, are lazy. This means that until we call a materialization method such as show, no data processing takes place. In fact, Apache Spark SQL optimizes the order of execution under the hood. How this works is covered in Chapter 3, The Catalyst Optimizer.
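To peek at what the optimizer does without materializing any data, the query plan can be inspected with the explain method. This sketch reuses the assumed `df` and threshold from the previous example:

```scala
// Print the parsed, analyzed, optimized, and physical plans;
// no rows are processed because explain does not materialize the query
df.filter("voltage > 230")
  .select("voltage", "frequency")
  .explain(true)
```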

There is also a groupBy method to determine counts per group within a Dataset. So let's check the number of rows where we had an acceptable fluidlevel:
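A sketch of that count, assuming `fluidlevel` is a column on the same DataFrame `df` and that "acceptable" is one of its values (the exact value is illustrative):

```scala
// Group by the fluidlevel column, count the rows per group,
// then keep only the group with an acceptable fluid level
df.groupBy("fluidlevel")
  .count()
  .filter("fluidlevel = 'acceptable'")
  .show()
```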

So, SQL-like operations can be carried out against DataFrames, including select, filter, sort, groupBy, and show. The next section shows you how tables can be created from DataFrames and how SQL-based actions are carried out against them.
