So first, we have to import some classes:
import org.apache.spark.sql.types._
Now let's define a schema for a CSV file. In order to obtain such a file in the first place, we can simply write the DataFrame from the previous section to HDFS in CSV format (again using the Apache Spark DataSource API):
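The write might look roughly like the following sketch; the DataFrame name df, the SparkSession name spark, and the HDFS target path /tmp/df_csv are assumptions, not taken from the previous section:

// Sketch: write the DataFrame as CSV to a placeholder HDFS path
df.write.format("csv").save("/tmp/df_csv")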
Let's double-check the contents of the directory in HDFS:
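One way to do this from the same Scala session is the Hadoop FileSystem API; this is only a sketch and reuses the placeholder path from above:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the part files produced by the write (path is the placeholder used above)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/tmp/df_csv")).foreach(status => println(status.getPath))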
Finally, double-check the content of one file:
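A quick way to peek into the written data, again sketched against the placeholder path, is to read the output back as plain text and print a few lines (to inspect a single file, point the reader at one part file instead of the whole directory):

// Print the first lines of the raw CSV output
spark.read.textFile("/tmp/df_csv").show(5, truncate = false)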
So we are fine: the schema information is lost, but the rest of the data is preserved. This becomes apparent if we use the DataSource API to load this CSV again:
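Such a reload could look like this sketch (variable names and path are again placeholders):

// Re-read the CSV without providing a schema; Spark falls back to generic
// column names (_c0, _c1, ...) and treats every column as a string
val reloaded = spark.read.format("csv").load("/tmp/df_csv")
reloaded.printSchema()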
This shows that the schema information is indeed lost: all columns are now identified as strings, and the original column names are gone as well. Now let's create the schema manually:
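As a sketch, a manually defined schema could look like the following; the column names and types are placeholders standing in for the actual columns of the DataFrame from the previous section:

import org.apache.spark.sql.types._  // already imported at the top of this section

// Placeholder schema: three columns with illustrative names and types
val schema = StructType(Array(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)
))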
If we now load the data into rawRDD, we basically get an RDD of strings, one string per row:
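Creating and inspecting rawRDD might look like this sketch, using the placeholder path from above:

// One string per CSV line; no parsing or type conversion has happened yet
val rawRDD = spark.sparkContext.textFile("/tmp/df_csv")
rawRDD.take(5).foreach(println)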
Now we have to transform rawRDD into a slightly more usable RDD of Row objects by splitting each row string and creating the corresponding Row object. In addition, we convert the fields to the appropriate data types where necessary:
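A sketch of this transformation, matching the placeholder schema defined above (the field positions and types are assumptions), could be:

import org.apache.spark.sql.Row

// Split each line on the comma and convert the fields to the placeholder
// schema's types (Int, String, Double)
val rowRDD = rawRDD.map(_.split(",")).map { fields =>
  Row(fields(0).toInt, fields(1), fields(2).toDouble)
}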
Finally, we recreate our DataFrame object using the following code:
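Under the same assumptions, the recreation boils down to combining the RDD of Rows with the manual schema:

// Rebuild the DataFrame from the Row RDD and the manually defined schema
val recreatedDF = spark.createDataFrame(rowRDD, schema)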
If we now print the schema, we notice that it is the same again:
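Using the placeholder name from the previous sketch:

// The printed schema now matches the manually defined one
recreatedDF.printSchema()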