- Mastering Apache Spark 2.x(Second Edition)
- Romeo Kienzler
- 2021-07-02 18:55:29
DataFrames
We have already used DataFrames in previous examples. A DataFrame is based on a columnar format. Temporary tables can be created from it, but we will expand on this in the next section. A data frame offers many methods for data manipulation and processing.
Let's start with a simple example and load some JSON data coming from an IoT sensor on a washing machine. We are again using the Apache Spark DataSource API under the hood to read and parse the JSON data. The result of the parser is a data frame, and its schema can be displayed once it has been loaded.
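A minimal Scala sketch of this step could look as follows. The sample records, their values, and the object name are made up for illustration; they only mimic the shape of a database export with a nested doc field. The sketch assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String]; with a real file you would pass its path instead.

```scala
import org.apache.spark.sql.SparkSession

object WashingSchema {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; in spark-shell a session named `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("washing-schema")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample records: sensor readings sit in `doc`,
    // the remaining fields stand in for database metadata.
    val json = Seq(
      """{"id":"a1","key":"a1","value":{"rev":"1-x"},"doc":{"voltage":230,"frequency":50,"fluidlevel":"acceptable"}}""",
      """{"id":"a2","key":"a2","value":{"rev":"1-y"},"doc":{"voltage":241,"frequency":51,"fluidlevel":"low"}}"""
    ).toDS()

    // The JSON data source parses the records and infers the schema.
    val df = spark.read.json(json)

    // Prints the inferred schema as an indented tree; nested fields appear under `doc`.
    df.printSchema()

    spark.stop()
  }
}
```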

As you can see, this is a nested data structure. The doc field contains all the information that we are interested in, and we want to get rid of the meta information that Cloudant/Apache CouchDB added to the original JSON file. This can be accomplished by a call to the select method on the DataFrame.
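One way to do this, sketched under the same assumptions as before (made-up sample records, Spark 2.2 or later), is to select "doc.*", which expands every field of the doc struct into a top-level column and leaves the metadata behind:

```scala
import org.apache.spark.sql.SparkSession

object WashingFlatten {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("washing-flatten")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical nested records, shaped like a database export.
    val json = Seq(
      """{"id":"a1","key":"a1","value":{"rev":"1-x"},"doc":{"voltage":230,"frequency":50,"fluidlevel":"acceptable"}}""",
      """{"id":"a2","key":"a2","value":{"rev":"1-y"},"doc":{"voltage":241,"frequency":51,"fluidlevel":"low"}}"""
    ).toDS()
    val nested = spark.read.json(json)

    // "doc.*" expands the fields of the `doc` struct into top-level columns;
    // the metadata columns are not selected and therefore dropped.
    val flat = nested.select("doc.*")

    // The schema now lists only the sensor fields, without nesting.
    flat.printSchema()

    spark.stop()
  }
}
```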
This is the first time that we are using the DataFrame API for data processing. As with RDDs, a set of methods makes up a relational API whose expressiveness matches, or even exceeds, that of SQL. The select method can also be used to restrict the data to certain columns. In SQL or relational algebra, this is called projection. For example, a query that needs only two of the sensor readings can select just those two columns.
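A projection can be sketched as follows. The rows and the object name are hypothetical, and the data frame is built in memory to stand in for the flattened sensor data; only the select and show calls matter here.

```scala
import org.apache.spark.sql.SparkSession

object WashingProjection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("washing-projection")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical flattened sensor rows.
    val df = Seq(
      (230, 50, "acceptable"),
      (241, 51, "low"),
      (228, 49, "acceptable")
    ).toDF("voltage", "frequency", "fluidlevel")

    // Projection: keep only two of the three columns.
    val projected = df.select("voltage", "frequency")

    // show() prints the first rows as a plain-text table.
    projected.show()

    spark.stop()
  }
}
```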

Of course, the show method is only useful for debugging because it prints plain text that cannot be used for further downstream processing. However, we can chain calls together very easily.
It is possible to filter the data returned from the DataFrame using the filter method. For example, we can filter on voltage and then select voltage and frequency.
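Chaining filter and select might look like the sketch below. The sample rows and the threshold of 235 are assumptions chosen for illustration, not values taken from the sensor data.

```scala
import org.apache.spark.sql.SparkSession

object WashingFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("washing-filter")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical flattened sensor rows.
    val df = Seq(
      (230, 50, "acceptable"),
      (241, 51, "low"),
      (228, 49, "acceptable")
    ).toDF("voltage", "frequency", "fluidlevel")

    // filter accepts a SQL-like condition string; 235 is an arbitrary example threshold.
    // Each call returns a new DataFrame, so the calls can be chained.
    df.filter("voltage > 235")
      .select("voltage", "frequency")
      .show()

    spark.stop()
  }
}
```

The same condition can also be written as a column expression, df.filter($"voltage" > 235), once spark.implicits._ has been imported.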
There is also a groupBy method to determine volume counts within a Dataset. So let's check the number of rows where we had an acceptable fluidlevel.
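A sketch of such a count is shown below, again on made-up rows. Grouping by fluidlevel returns one row per distinct value together with its count; if only the acceptable rows are of interest, a filter followed by count gives a single number.

```scala
import org.apache.spark.sql.SparkSession

object WashingGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("washing-groupby")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical flattened sensor rows.
    val df = Seq(
      (230, 50, "acceptable"),
      (241, 51, "low"),
      (228, 49, "acceptable")
    ).toDF("voltage", "frequency", "fluidlevel")

    // One output row per distinct fluidlevel, with the number of rows in each group.
    df.groupBy("fluidlevel").count().show()

    // Alternative: count only the rows with an acceptable fluidlevel.
    val acceptable = df.filter($"fluidlevel" === "acceptable").count()
    println(s"acceptable rows: $acceptable") // 2 for the sample rows above

    spark.stop()
  }
}
```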

So, SQL-like actions can be carried out against DataFrames, including select, filter, sort, and groupBy, and the results can be printed with show. The next section shows you how tables can be created from DataFrames and how SQL-based actions are carried out against them.