
Processing JSON files

JavaScript Object Notation (JSON) is a text-based data interchange format that originated in the JavaScript ecosystem. In terms of expressiveness it is comparable to XML, for instance. The following example uses the SparkSession method read.json to load the HDFS-based JSON data file named adult.json. Under the hood this uses the Apache Spark DataSource API to read and parse JSON files, but we will come back to that later.

val dframe = spark.read.json("hdfs:///data/spark/adult.json")

The result is a DataFrame.
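
Although not shown in the original listing, you can verify the load by inspecting the inferred schema and the first few rows; the exact columns depend on the contents of adult.json, so the output will vary with your file:

// Print the schema Spark inferred from the JSON records
dframe.printSchema()
// Show the first five rows
dframe.show(5)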

Data can be saved in the JSON format using the DataSource API as well, as shown by the following example:

import spark.implicits._
// Create a single-column DataFrame (column name: value) from a local collection
val df = sc.parallelize(Array(1,2,3)).toDF
// Write the DataFrame to HDFS in JSON format (one JSON document per line)
df.write.json("hdfs://localhost:9000/tmp/test.json")

The resulting data can now be seen on HDFS; the Hadoop filesystem ls command shows that the data resides in the target directory as a _SUCCESS file and eight part files. This is because, even though the dataset is small, the underlying RDD was created with eight partitions, and therefore eight partition files have been written.
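
You can reproduce this listing yourself with the command below; note that the exact part file names (part-00000 through part-00007, possibly with a version-dependent suffix) vary between Spark releases, so treat this as a sketch:

hdfs dfs -ls hdfs://localhost:9000/tmp/test.json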

What if we want to obtain a single file? This can be accomplished by repartitioning the DataFrame into a single partition:

val df1 = df.repartition(1)
df1.write.json("hdfs://localhost:9000/tmp/test_single_partition.json")

If we now have a look at the folder, it contains a single part file.

There are two important things to know. First, we still get the file wrapped in a subfolder, but this is not a problem: Hadoop-based readers treat a folder of files the same as a single file, and as long as the contained files stick to the same format, this works transparently. So, if you refer to /tmp/test_single_partition.json, which is a folder, you can also use it like a single file.
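
As a minimal sketch of this behavior (assuming the write above succeeded), the folder path can be passed straight back to read.json:

// Pass the folder path where a file path would go;
// Spark reads all contained part files as one dataset
val readBack = spark.read.json("hdfs://localhost:9000/tmp/test_single_partition.json")
readBack.show()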

In addition, all files starting with _ are ignored when reading. This brings us to the second point: the _SUCCESS file. This is a framework-independent way to tell users of that file (or folder, respectively) that the job writing it has completed successfully. Using the Hadoop filesystem's cat command, it is possible to display the contents of the JSON data:
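
hdfs dfs -cat hdfs://localhost:9000/tmp/test_single_partition.json/part-*

Since toDF was called without arguments, the column is named value, so each record should appear as one JSON document per line, along the lines of the following (the row order after the repartition shuffle is not guaranteed):

{"value":1}
{"value":2}
{"value":3}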

If you want to dive deeper into partitioning and what it means when used in conjunction with HDFS, it is recommended that you start with the following discussion thread on StackOverflow:
http://stackoverflow.com/questions/10666488/what-are-success-and-part-r-00000-files-in-hadoop.

Processing Parquet data is very similar, as we will see next.
