
Introduction

Say we have a problem statement that involves predicting whether a particular earthquake caused a tsunami or not. How do we decide what model to use? What do we know about the data we have? Nothing! But if we don't know and understand our data, chances are we'll end up building a model that's not very interpretable or reliable.

When it comes to data science, it's important to have a thorough understanding of the data we're dealing with, in order to generate features that are highly informative and, consequently, to build accurate and powerful models.

In order to gain this understanding, we perform an exploratory analysis on the data to see what the data can tell us about the relationships between the features and the target variable. Getting to know our data will even help us interpret the model we build and identify ways we can improve its accuracy.

The approach we take to achieve this is to let the data reveal its own structure, which helps us gain new, often unsuspected, insights into it. Let's learn more about this approach.

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is defined as an approach to analyzing datasets to summarize their main characteristics, often with visual methods.

The purpose of EDA is to:

  • Discover patterns within a dataset
  • Spot anomalies
  • Form hypotheses about the behavior of data
  • Validate assumptions

Everything from basic summary statistics to complex visualizations helps us gain an intuitive understanding of the data itself, which is highly important when it comes to forming new hypotheses about the data and uncovering which features affect the target variable. Often, discovering how the target variable varies across a single feature gives us an indication of how important that feature might be, and its variation across a combination of several features helps us come up with ideas for new informative features to engineer.

Most exploration and visualization is intended to understand the relationship between the features and the target variable. This is because we want to find out what relationships exist (or don't exist) between the data we have and the values we want to predict.
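For instance, a quick way to see how a target varies across a single feature is to bin the feature and look at the target rate per bin. Here's a minimal sketch on toy data; the column names (magnitude, caused_tsunami) are placeholders of our own, not the dataset's actual schema, which we introduce later in this chapter:

    import pandas as pd

    # Toy example: a numeric feature and a binary target (placeholder names)
    data = pd.DataFrame({
        'magnitude': [5.1, 6.8, 7.2, 5.5, 8.0, 6.1, 7.9, 5.0],
        'caused_tsunami': [0, 1, 1, 0, 1, 0, 1, 0],
    })

    # Bin the feature and compute the target rate per bin; a clear trend
    # across bins suggests the feature is informative
    bins = pd.cut(data['magnitude'], bins=[5, 6, 7, 8], include_lowest=True)
    print(data.groupby(bins, observed=True)['caused_tsunami'].mean())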

At least basic domain knowledge is usually necessary to understand both the problem statement itself and what the data is telling us. In this chapter, we'll look at ways to get to know the data better by analyzing its features.

EDA can tell us about:

  • Features that are unclean, have missing values, or contain outliers (a couple of quick first-pass checks are sketched just after this list)
  • Features that are informative and are a good indicator of the target
  • The kind of relationships features have with the target
  • Additional features that we might need but don't already have
  • Edge cases we might need to account for separately
  • Filters we might need to apply to the dataset
  • The presence of incorrect or fake data points
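As a first pass at several of these points, a couple of pandas one-liners go a long way. The sketch below assumes the data has already been loaded into a DataFrame named data:

    # Count missing values per column
    print(data.isnull().sum())

    # Summary statistics; min/max values far from the quartiles
    # can hint at outliers or bad records
    print(data.describe())

    # Column dtypes and non-null counts at a glance
    data.info()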

Now that we've looked at why EDA is important and what it can tell us, let's talk about what exactly EDA involves. EDA can involve anything from looking at basic summary statistics to visualizing complex trends over multiple variables. However, even simple statistics and plots can be powerful tools, as they may reveal important facts about the data that could change our modeling perspective. Plots make trends and patterns far easier to detect than raw data and numbers alone. These visualizations also allow us to ask questions such as "How?" and "Why?", and to form hypotheses about the dataset that can be validated by further visualizations. This is a continuous process that leads to a deeper understanding of the data. This chapter will introduce you to some of the basic tools that can be used to explore any dataset while keeping the ultimate problem statement in mind.

We'll start by walking through some basic summary statistics and how to interpret them, followed by a section on finding, analyzing, and dealing with missing values. Then we'll look at univariate relationships, that is, the distributions and behavior of individual variables. This will be followed by the final section on exploring relationships between variables. Along the way, you will be introduced to the types of plots that can be used to gain a basic overview of the dataset and its features, as well as to visualizations that combine several features to yield deeper insights, and we'll work through examples of how they can be used.
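As a small preview of the kinds of plots involved, the sketch below draws a univariate histogram and a bivariate scatter plot with Matplotlib. It assumes a loaded DataFrame named data, and the column names are placeholders of our own:

    import matplotlib.pyplot as plt

    # Univariate: the distribution of a single numeric feature
    plt.hist(data['magnitude'], bins=20)
    plt.xlabel('magnitude')
    plt.ylabel('count')
    plt.show()

    # Bivariate: how two features vary together
    plt.scatter(data['magnitude'], data['depth'], alpha=0.5)
    plt.xlabel('magnitude')
    plt.ylabel('depth')
    plt.show()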

The dataset that we will use for our exploratory analysis and visualizations has been taken from NOAA's Significant Earthquake Database, available as a public dataset on Google BigQuery (table ID: 'bigquery-public-data.noaa_significant_earthquakes.earthquakes'). We will be using a subset of the available columns, the metadata for which is available at https://console.cloud.google.com/bigquery?project=packt-data&folder&organizationId&p=bigquery-public-data&d=noaa_significant_earthquakes&t=earthquakes&page=table, and loading it into a pandas DataFrame to perform the exploration. We'll primarily be using Matplotlib for our visualizations, along with Seaborn and Missingno for some. Note, however, that Seaborn is a wrapper over Matplotlib's functionality, so anything plotted using Seaborn can also be plotted using Matplotlib directly. We'll try to keep things interesting by mixing up visualizations from both libraries.
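If you'd like to pull the table yourself rather than work from a pre-exported file, one possible route (our assumption, not a required step) is the google-cloud-bigquery client, which needs Google Cloud credentials configured:

    # Requires the google-cloud-bigquery package and GCP credentials
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT *
        FROM `bigquery-public-data.noaa_significant_earthquakes.earthquakes`
    """
    df = client.query(query).to_dataframe()  # a full table pull; may take a while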

The exploration and analysis will be conducted keeping in mind a sample problem statement: Given the data we have, we want to predict whether an earthquake caused a tsunami or not. This will be a classification problem (more on this in Chapter 4, Classification) where the target variable is the flag_tsunami column.
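Since this is a classification problem, one of the first things worth checking is how balanced the target classes are. A minimal sketch, assuming the data has been loaded into a DataFrame named data, as we'll do in the exercises that follow:

    # Distribution of the target classes, including missing values;
    # a heavy class imbalance affects both modeling and evaluation
    print(data['flag_tsunami'].value_counts(dropna=False))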

Exercise 10: Importing Libraries for Data Exploration

Before we begin, let's first import the required libraries, which we will be using for most of our data manipulations and visualizations:

  1. In a Jupyter notebook, import the following libraries:

    import json
    import pandas as pd
    import numpy as np
    import missingno as msno
    from sklearn.impute import SimpleImputer

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns

    The %matplotlib inline command allows Jupyter to display the plots inline within the notebook itself.

  2. We can also read in the metadata containing the data types for each column, which is stored in the form of a JSON file. Do this using the following code, which opens the file in read mode and uses the json library to load its contents into a dictionary:

    with open('dtypes.json', 'r') as jsonfile:
        dtyp = json.load(jsonfile)
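    With the dtypes dictionary in hand, a natural next step is to pass it to pandas when reading the data. The filename below is hypothetical; substitute whatever file your copy of the dataset is saved as:

    data = pd.read_csv('earthquake_data.csv', dtype=dtyp)  # hypothetical filename
    data.head()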

Now, let's get started.
