  • Mastering Spark for Data Science
  • Andrew Morgan Antoine Amend David George Matthew Hallett
  • 2021-07-09 18:49:34

Chapter 4. Exploratory Data Analysis

Exploratory Data Analysis (EDA) performed in commercial settings is generally commissioned as part of a larger piece of work that is organized and executed along the lines of a feasibility assessment. The aim of this feasibility assessment, and thus the focus of what we can term an extended EDA, is to answer a broad set of questions about whether the data examined is fit for purpose and thus worthy of further investment.

Under this general remit, the data investigations are expected to cover several aspects of feasibility. These include the practical aspects of using the data in production, such as its timeliness, quality, complexity, and coverage, as well as whether the data is appropriate for the hypothesis to be tested. While some of these aspects are potentially less fun from a data science perspective, these data-quality-led investigations are no less important than purely statistical insights. This is especially true when the datasets in question are very large and complex, and when the investment needed to prepare the data for data science work might be significant. To illustrate this point, and to bring the topic to life, we present methods for doing an EDA of the vast and complex Global Knowledge Graph (GKG) data feeds, made available by the Global Database of Events, Language and Tone (GDELT) project.

In this chapter, we will create and interpret an EDA while covering the following topics:

  • Understanding the problems and design goals for planning and structuring an Extended Exploratory Data Analysis
  • What data profiling is, with examples, and how the technique can form the basis of a general framework for continuous data quality monitoring
  • How to construct a general mask-based data profiler around the method
  • How to store the exploratory metrics in a standard schema, with examples, to facilitate the study of data drift in the metrics over time
  • How to use Apache Zeppelin notebooks for quick EDA work, and for plotting charts and graphs
  • How to extract and study the GCAM sentiments in GDELT, both as time series and as spatio-temporal datasets
  • How to extend Apache Zeppelin to generate custom charts using the plot.ly library