Mastering Spark for Data Science
Andrew Morgan, Antoine Amend, David George, Matthew Hallett
The problem, principles and planning
In this section, we will explore why an EDA might be required and discuss the important considerations for creating one.
Understanding the EDA problem
A difficult question that precedes an EDA project is: "Can you give me an estimate and breakdown of your proposed EDA costs, please?"
How we answer this question ultimately shapes our EDA strategy and tactics. In days gone by, the answer to this question typically started like this: "Basically, you pay by the column...". This rule of thumb is based on the premise that there is an iterable unit of data exploration work, and these units of work drive the estimate of effort and thus the rough price of performing the EDA.
What's interesting about this idea is that the units of work are quoted in terms of the data structures to investigate, rather than the functions that need writing. The reason is simple: the data processing pipelines of functions are assumed to already exist, rather than being new work, so the quotation offered is really the implied cost of configuring our standard data-exploration pipelines to accept the new input's data structures.
This thinking brings us to the main EDA problem: exploration seems hard to pin down when it comes to planning tasks and estimating timings. The recommended approach is to treat explorations as configuration-driven tasks. This helps us structure and estimate the work more effectively, and it frames configuration as the central challenge, rather than the writing of a lot of ad hoc, throwaway code.
The process of configuring data exploration also drives us to consider the processing templates we might need, configured according to the form of the data we explore. For instance, we would need a standard exploration pipeline for structured data, for text data, for graph-shaped data, for image data, for sound data, for time series data, and for spatial data. Once we have these templates, we simply need to map our input data to them and configure our ingestion filters to deliver a focused lens over the data.
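To make this concrete, the following is a minimal sketch of what configuration-driven exploration templates might look like in Spark. The names (`DataForm`, `ExplorationConfig`, `explore`), the supported forms, and the reader options are illustrative assumptions, not code from a production pipeline:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical data forms; a real pipeline would also cover image, sound,
// time series and spatial data.
sealed trait DataForm
case object Structured extends DataForm
case object Text       extends DataForm
case object Graph      extends DataForm

// One configuration record per input dataset: the template to apply plus
// the few knobs it needs.
case class ExplorationConfig(
    path: String,                     // staged location, e.g. an HDFS URI
    form: DataForm,                   // which exploration template to use
    delimiter: String = ",",
    keyColumns: Seq[String] = Nil)

// Map a configuration onto the matching exploration template.
def explore(spark: SparkSession, cfg: ExplorationConfig): DataFrame = cfg.form match {
  case Structured =>
    spark.read
      .option("header", "true")
      .option("delimiter", cfg.delimiter)
      .option("inferSchema", "true")
      .csv(cfg.path)
  case Text =>
    spark.read.text(cfg.path)         // one record per line, ready for profiling
  case Graph =>
    spark.read
      .option("header", "true")
      .csv(cfg.path)
      .select(cfg.keyColumns.map(col): _*)   // e.g. source and target vertex ids
}
```

The point is not the specific readers used, but that a new dataset only requires a new `ExplorationConfig` entry, rather than new code.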
Design principles
Modernizing these ideas for Apache Spark-based EDA processing means designing our configurable EDA functions and code with some general principles in mind:
- Easily reusable functions/features: Define functions that work on general data structures in general ways, so that they produce good exploratory features and can be configured for new datasets with minimal effort
- Minimize intermediate data structures: Avoid proliferating intermediate schemas, which keeps intermediate configurations to a minimum, and create reusable data structures where possible
- Data driven configuration: Where possible, generate configurations from metadata to reduce manual boilerplate work (illustrated in the sketch at the end of this section)
- Templated visualizations: Build general, reusable visualizations driven by common input schemas and metadata
Lastly, although it is not a strict principle per se, we need to construct exploratory tools that are flexible enough to discover data structures, rather than depending on rigid, pre-defined configurations. This pays off when things go wrong, because it lets us reverse engineer the file content, the encodings, or potential errors in the file definitions as we come across them.
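As an illustration of the first and third principles, the sketch below derives a per-column profile directly from a DataFrame's schema metadata, so no per-dataset configuration has to be written by hand. The function name and the specific aggregations chosen are assumptions made for brevity:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{NumericType, StringType}

// Build the profiling aggregations from the schema itself: numeric columns
// get range statistics, string columns get a cardinality count, everything
// else just gets a populated-value count.
def profile(df: DataFrame): DataFrame = {
  val exprs = df.schema.fields.flatMap { field =>
    field.dataType match {
      case _: NumericType => Seq(
        min(col(field.name)).as(s"${field.name}_min"),
        max(col(field.name)).as(s"${field.name}_max"))
      case StringType => Seq(
        countDistinct(col(field.name)).as(s"${field.name}_cardinality"))
      case _ => Seq(
        count(col(field.name)).as(s"${field.name}_populated"))
    }
  }
  df.agg(exprs.head, exprs.tail: _*)
}
```

The same function works unchanged on any structured dataset, which is exactly the reuse the principles are asking for.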
General plan of exploration
The early stages of all EDA work are invariably based on the simple goal of establishing whether the data is of good quality. If we focus here, we can create a widely applicable, getting-started plan and lay down a general set of tasks.
These tasks create the general shape of a proposed EDA project plan, which is as follows:
- Prepare source tools, source our input datasets, review the documentation, and so on. Review the security of the data where necessary.
- Obtain, decrypt, and stage the data in HDFS; collect non-functional requirements (NFRs) for planning.
- Run code point-level frequency reports on the file content (see the first sketch after this list).
- Run a population check on the amount of missing data in each file's fields (see the population report sketch after this list).
- Run a low-grain format profiler over the high-cardinality fields in the files (see the format mask sketch after this list).
- Run a high-grain format profiler over the format-controlled fields in the files.
- Run referential integrity checks, where appropriate.
- Run in-dictionary checks to verify external dimensions.
- Run basic statistical explorations of the numeric data.
- Run richer, visualization-based explorations of the key data of interest.
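For step 3, a code point-level frequency report can be produced directly from the raw text, before any parsing assumptions are made. This is a minimal sketch; the path is a placeholder and the Spark session setup is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codePointReport").getOrCreate()
import spark.implicits._

// Read the file as plain text so that encoding problems are not hidden by a parser.
val raw = spark.read.textFile("hdfs:///staging/input/file.txt")   // placeholder path

// Explode every line into its Unicode code points and count how often each occurs.
val codePointFrequencies = raw
  .flatMap(line => line.codePoints.toArray.toSeq)   // the code points of each line, as Ints
  .toDF("codePoint")
  .groupBy("codePoint")
  .count()
  .orderBy($"count".desc)

// Unexpected or rare code points (control characters, mixed encodings) show up
// at the tail of this report.
codePointFrequencies.show(50, truncate = false)
```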
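Step 4 is also straightforward to template. The sketch below counts missing values per column for any structured DataFrame; the function name and the definition of "missing" (null, or blank for string columns) are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Population report: for every column, how many rows are missing a value.
def populationReport(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  val total = df.count()
  val perColumn = df.schema.fields.map { field =>
    val missingCondition =
      if (field.dataType == StringType) col(field.name).isNull || trim(col(field.name)) === ""
      else col(field.name).isNull
    val missing = df.filter(missingCondition).count()
    (field.name, total, missing, if (total == 0) 0.0 else missing.toDouble / total)
  }
  perColumn.toSeq.toDF("column", "rows", "missing", "missingRatio")
}
```

For step 9, Spark's built-in `df.describe()` (or `df.summary()` on more recent versions) already reports counts, means, standard deviations, and extrema for numeric columns, so it is often enough as a first pass.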
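Steps 5 and 6 can both be driven by a format mask: collapse each character into a class symbol, then count the distinct masks seen in a field. Low-grain profiling applies this to high-cardinality fields, where only a handful of masks should emerge; high-grain profiling compares the masks of format-controlled fields against the expected pattern. The class symbols chosen below (A, 9, S, P) are an assumption:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Collapse every character into a class: A = letter, 9 = digit,
// S = whitespace, P = everything else.
def formatMask(c: Column): Column =
  regexp_replace(
    regexp_replace(
      regexp_replace(
        regexp_replace(c.cast("string"), "[A-Za-z]", "A"),
        "[0-9]", "9"),
      "\\s", "S"),
    "[^A9S]", "P")

// Frequency of each mask observed in a column; unexpected masks stand out
// at the tail of the report.
def maskReport(df: DataFrame, column: String): DataFrame =
  df.select(formatMask(col(column)).as("mask"))
    .groupBy("mask")
    .count()
    .orderBy(col("count").desc)

// Example usage against a hypothetical column:
// maskReport(df, "postcode").show(20, truncate = false)
```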
Note
In character encoding terminology, a code point or code position is any of the numerical values that make up the code space. Many code points represent single characters, but they can also have other meanings, such as for formatting.