Defining schemas manually
So first, we have to import some classes:
import org.apache.spark.sql.types._
So let's define a schema for a CSV file. In order to create one, we can simply write the DataFrame from the previous section to HDFS (again using the Apache Spark DataSource API):
washing_flat.write.csv("hdfs://localhost:9000/tmp/washing_flat.csv")
Let's double-check the contents of the directory in HDFS:
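One way to do this from the same Spark shell session is Hadoop's FileSystem API; the following is only a minimal sketch, where the path is the one used in the write above:
import org.apache.hadoop.fs.Path
// List the part files Spark wrote under the target directory
val dir = new Path("hdfs://localhost:9000/tmp/washing_flat.csv")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(dir).foreach(status => println(status.getPath))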

Finally, double-check the content of one file:
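A minimal sketch of this check, picking the first part file via the Hadoop FileSystem API (the dir, fs, and partFile names are just illustrative):
import org.apache.hadoop.fs.Path
import scala.io.Source
// Grab the first part file in the directory and print a few raw lines
val dir = new Path("hdfs://localhost:9000/tmp/washing_flat.csv")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(dir, "part-*")).head.getPath
val in = fs.open(partFile)
Source.fromInputStream(in).getLines().take(3).foreach(println)
in.close()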

So we are fine; we've lost the schema information, but the rest of the data is preserved. We can see this if we use the DataSource API to load this CSV again:
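A minimal sketch of that read (the csvNoSchema name is just for illustration):
// Read the CSV back without supplying a schema
val csvNoSchema = spark.read.csv("hdfs://localhost:9000/tmp/washing_flat.csv")
// Without a header or an explicit schema, Spark falls back to string columns
// with generated names such as _c0, _c1, ...
csvNoSchema.printSchema()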
This shows that we've lost the schema information: all columns are now identified as strings, and the column names are gone as well. Now let's create the schema manually:
val schema = StructType(
  StructField("_id", StringType, true) ::
  StructField("_rev", StringType, true) ::
  StructField("count", LongType, true) ::
  StructField("flowrate", LongType, true) ::
  StructField("fluidlevel", StringType, true) ::
  StructField("frequency", LongType, true) ::
  StructField("hardness", LongType, true) ::
  StructField("speed", LongType, true) ::
  StructField("temperature", LongType, true) ::
  StructField("ts", LongType, true) ::
  StructField("voltage", LongType, true) ::
  Nil)
If we now load rawRDD, we basically get a list of strings, one string per row:
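A sketch of that load, using the same path as before:
// Load the raw CSV text; each element of rawRDD is one unparsed line
val rawRDD = spark.sparkContext.textFile("hdfs://localhost:9000/tmp/washing_flat.csv")
rawRDD.take(3).foreach(println)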

Now we have to transform this rawRDD into a slightly more usable RDD containing Row objects by splitting the row strings and creating the respective Row objects. In addition, we convert to the appropriate data types where necessary:
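A sketch of that transformation, assuming the column order in the file matches the schema defined above (the rowRDD name is illustrative):
import org.apache.spark.sql.Row
// Split each line on commas and convert the numeric fields to Long,
// matching the types declared in the schema
val rowRDD = rawRDD.map(_.split(",")).map(fields => Row(
  fields(0),          // _id
  fields(1),          // _rev
  fields(2).toLong,   // count
  fields(3).toLong,   // flowrate
  fields(4),          // fluidlevel
  fields(5).toLong,   // frequency
  fields(6).toLong,   // hardness
  fields(7).toLong,   // speed
  fields(8).toLong,   // temperature
  fields(9).toLong,   // ts
  fields(10).toLong   // voltage
))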

Finally, we recreate our DataFrame object using the following code:
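A sketch, reusing the rowRDD and schema from the sketches above and recreating the DataFrame under its original name:
// Combine the Row RDD with the manually defined schema
val washing_flat = spark.createDataFrame(rowRDD, schema)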

If we now print the schema, we notice that it is the same again:
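Calling printSchema on the recreated DataFrame from the sketch above shows the manually defined column names and types:
// The manually defined column names and types are back
washing_flat.printSchema()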
