Applied Supervised Learning with Python
Benjamin Johnston, Ishita Mathur
Introduction
Say we have a problem statement that involves predicting whether a particular earthquake caused a tsunami or not. How do we decide what model to use? What do we know about the data we have? Nothing! But if we don't know and understand our data, chances are we'll end up building a model that's not very interpretable or reliable.
When it comes to data science, it's important to have a thorough understanding of the data we're dealing with, in order to generate features that are highly informative and, consequently, to build accurate and powerful models.
In order to gain this understanding, we perform an exploratory analysis on the data to see what the data can tell us about the relationships between the features and the target variable. Getting to know our data will even help us interpret the model we build and identify ways we can improve its accuracy.
The approach we take to achieve this is to let the data reveal its own structure, which often gives us new, sometimes unsuspected, insight into the data. Let's learn more about this approach.
Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is defined as an approach to analyzing datasets to summarize their main characteristics, often with visual methods.
The purpose of EDA is to:
- Discover patterns within a dataset
- Spot anomalies
- Form hypotheses about the behavior of data
- Validate assumptions
Everything from basic summary statistics to complex visualizations helps us gain an intuitive understanding of the data itself, which is essential when it comes to forming new hypotheses about the data and uncovering which parameters affect the target variable. Often, discovering how the target variable varies across a single feature gives us an indication of how important that feature might be, and variation across a combination of several features helps us come up with ideas for new informative features to engineer.
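As a minimal sketch of this idea in pandas: the DataFrame df, the feature magnitude_band, and the toy target values below are hypothetical placeholders, not from the original text:
import pandas as pd
# Toy data: a categorical feature and a binary target
df = pd.DataFrame({'magnitude_band': ['low', 'low', 'high', 'high', 'high'],
                   'flag_tsunami': [0, 0, 1, 0, 1]})
# The mean of a 0/1 target per group is the positive rate within that group;
# large differences between groups suggest the feature is informative
print(df.groupby('magnitude_band')['flag_tsunami'].mean())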
Most exploration and visualization is intended to understand the relationship between the features and the target variable. This is because we want to find out what relationships exist (or don't exist) between the data we have and the values we want to predict.
Some basic domain knowledge is usually necessary to understand both the problem statement itself and what the data is telling us. In this chapter, we'll look at ways to get to know our data better by analyzing its features.
EDA can tell us about:
- Features that are unclean, have missing values, or have outliers
- Features that are informative and are a good indicator of the target
- The kind of relationships features have with the target
- Further features that the data might need that we don't already have
- Edge cases you might need to account for separately
- Filters you might need to apply on the dataset
- The presence of incorrect or fake data points
Now that we've looked at why EDA is important and what it can tell us, let's talk about what exactly EDA involves. It can involve anything from looking at basic summary statistics to visualizing complex trends over multiple variables. However, even simple statistics and plots can be powerful tools, as they may reveal important facts about the data that could change our modeling perspective. Plots make trends and patterns far easier to detect than raw numbers alone, and they allow us to ask questions such as "How?" and "Why?", and to form hypotheses about the dataset that can be validated by further visualizations. This is a continuous process that leads to a deeper understanding of the data. This chapter will introduce you to some of the basic tools that can be used to explore any dataset while keeping in mind the ultimate problem statement.
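To give a sense of how little code such a first look requires, here is a hedged sketch of the most basic checks; the toy DataFrame and column name are illustrative stand-ins for a real dataset:
import pandas as pd
import matplotlib.pyplot as plt
# Toy numeric data standing in for a real dataset
df = pd.DataFrame({'magnitude': [5.1, 6.3, None, 7.0, 5.8, 9.2]})
print(df.describe())      # count, mean, std, and quartiles per numeric column
print(df.isnull().sum())  # number of missing values per column
df['magnitude'].hist(bins=5)  # a histogram reveals skew and outliers at a glance
plt.show()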
We'll start by walking through some basic summary statistics and how to interpret them, followed by a section on finding, analyzing, and dealing with missing values. Then we'll look at univariate relationships, that is, distributions and the behavior of individual variables. This will be followed by the final section on exploring relationships between variables. This chapter will introduce the types of plots that can be used to gain a basic overview of a dataset and its features, show how to gain insights by creating visualizations that combine several features, and then work through some examples of how they can be used.
The dataset we will use for our exploratory analysis and visualizations is taken from NOAA's Significant Earthquake Database, available as a public dataset on Google BigQuery (table ID: 'bigquery-public-data.noaa_significant_earthquakes.earthquakes'). We will use a subset of the available columns, the metadata for which is available at https://console.cloud.google.com/bigquery?project=packt-data&folder&organizationId&p=bigquery-public-data&d=noaa_significant_earthquakes&t=earthquakes&page=table, and load it into a pandas DataFrame to perform the exploration. We'll primarily use Matplotlib for our visualizations, along with Seaborn and Missingno for some. Note, however, that Seaborn provides a wrapper over Matplotlib's functionality, so anything plotted using Seaborn can also be plotted using Matplotlib. We'll keep things interesting by mixing visualizations from both libraries.
The exploration and analysis will be conducted keeping in mind a sample problem statement: Given the data we have, we want to predict whether an earthquake caused a tsunami or not. This will be a classification problem (more on this in Chapter 4, Classification) where the target variable is the flag_tsunami column.
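Before any modeling, a quick look at the target's class balance is worthwhile, since a strong imbalance shapes both the metrics and the models we choose. A minimal sketch, assuming the BigQuery table has been exported to a local CSV file (the filename earthquakes.csv is an assumption, not from the original text):
import pandas as pd
# Hypothetical: load the exported earthquake data from a local CSV
data = pd.read_csv('earthquakes.csv')
# Counts per class of the target, including missing values
print(data['flag_tsunami'].value_counts(dropna=False))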
Exercise 10: Importing Libraries for Data Exploration
Before we begin, let's first import the required libraries, which we will be using for most of our data manipulations and visualizations:
- In a Jupyter notebook, import the following libraries:
import json                               # to read the column metadata file
import pandas as pd                       # data loading and manipulation
import numpy as np                        # numerical operations
import missingno as msno                  # visualizing missing values
from sklearn.impute import SimpleImputer  # imputing missing values
%matplotlib inline
import matplotlib.pyplot as plt           # core plotting library
import seaborn as sns                     # statistical plots on top of Matplotlib
The %matplotlib inline command allows Jupyter to display the plots inline within the notebook itself.
- We can also read in the metadata containing the data types for each column, which is stored in the form of a JSON file. Do this using the following commands, which open the file in read mode and use the json library to read it into a dictionary:
with open('dtypes.json', 'r') as jsonfile:
    dtyp = json.load(jsonfile)
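Although the exercise stops here, a typical next step would be to pass this dictionary to pandas when reading the data. A minimal sketch, where the CSV filename is an assumption and not from the original text:
# Hypothetical follow-up: read_csv parses each column with its intended type;
# any date columns would normally be handled via parse_dates rather than dtype
data = pd.read_csv('earthquakes.csv', dtype=dtyp)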
Now, let's get started.