Summary

In this chapter, we have seen why datasets should always be thoroughly understood before too much exploration work is undertaken. We have discussed the details of structured data and dimensional modeling, particularly with respect to how this applies to the GDELT dataset, and have expanded the GKG model to show its underlying complexity.

We have explained the difference between traditional ETL and the newer schema-on-read ELT techniques, and have touched upon some of the issues that data engineers face regarding data storage, compression, and data formats, specifically the advantages and implementations of Avro and Parquet. We have also demonstrated that there are several ways to explore data using the various Spark APIs, including examples of how to use SQL in the Spark shell.

We can conclude this chapter by mentioning that the code in our repository pulls everything together and is a full model for reading in raw GKG files (use the Apache NiFi GDELT data ingest pipeline from Chapter 1, Data Acquisition, if you require some data).

In the next chapter, we will dive deeper into the GKG model by examining the techniques used to explore and analyze data at scale. We will see how to develop and enrich our GKG data model using SQL, and investigate how Apache Zeppelin notebooks can provide a richer data science experience.