Understanding the Science Behind EDA

In layman's terms, we can define EDA as the science of understanding data. A more formal definition is the process of analyzing and exploring datasets to summarize their characteristics, properties, and latent relationships using statistical, visual, or analytical techniques, or a combination of them.

To cement our understanding, let's break down the definition further. A dataset is typically a combination of numeric and categorical features. To study the data, we might need to explore features individually, and to study relationships, we might need to explore features together. Depending on the number and type of features, we may cross paths with different types of EDA.

To simplify, we can broadly classify the process of EDA as follows:

  • Univariate analysis: Studying a single feature
  • Bivariate analysis: Studying the relationship between two features
  • Multivariate analysis: Studying the relationship between more than two features

For now, we will restrict the scope of the chapter to univariate and bivariate analysis. A few forms of multivariate analysis, such as regression, will be covered in the upcoming chapters.

To accomplish each of the previously mentioned analyses, we can use visualization techniques such as boxplots, scatter plots, and bar charts; statistical techniques such as hypothesis testing; or simple analytical techniques such as averages, frequency counts, and so on.
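As a minimal sketch of the simple analytical techniques mentioned above, the following Python snippet computes an average, a frequency count, and a correlation using only the standard library. The dataset, its column names (`age`, `balance`, `job`), and the helper `pearson` are hypothetical and exist only for illustration; they are not taken from the book's exercises.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy dataset (illustrative values only): one dict per record.
rows = [
    {"age": 25, "balance": 1200.0, "job": "admin"},
    {"age": 40, "balance": 3000.0, "job": "technician"},
    {"age": 35, "balance": 2500.0, "job": "admin"},
    {"age": 50, "balance": 4100.0, "job": "retired"},
]

ages = [r["age"] for r in rows]
balances = [r["balance"] for r in rows]
jobs = [r["job"] for r in rows]

# Univariate analysis of a numeric feature: central tendency.
age_mean = mean(ages)
age_median = median(ages)

# Univariate analysis of a categorical feature: frequency counts.
job_counts = Counter(jobs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Bivariate analysis of two numeric features: linear association.
age_balance_corr = pearson(ages, balances)

print(age_mean, age_median)
print(job_counts.most_common())
print(round(age_balance_corr, 3))
```

The same three summaries map onto the classification above: the mean and the frequency count are univariate, while the correlation is bivariate.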

Breaking this down further, we have another dimension to consider: the type of feature, numeric or categorical. In each type of analysis mentioned (univariate and bivariate), the appropriate visual technique depends on the type of the feature. For univariate analysis of a numeric variable, we could use a histogram or a boxplot, whereas for a categorical variable we might use a frequency bar chart. We will get into the details of the overall exercise of EDA using a lazy programming approach; that is, we will explore the context and details of each analysis as and when it occurs in the book.
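The choice of technique by feature type can be sketched as a small dispatch function. The snippet below is an illustration under stated assumptions, not a prescribed method: the function names (`is_numeric`, `histogram_counts`, `univariate_summary`) and the default of four equal-width bins are hypothetical, and the function returns the data behind each chart rather than drawing it.

```python
from collections import Counter
from numbers import Real


def is_numeric(values):
    """True if every value is a real number (bools are treated as non-numeric)."""
    return all(isinstance(v, Real) and not isinstance(v, bool) for v in values)


def histogram_counts(values, bins=4):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def univariate_summary(values):
    """Return (suggested chart, data behind it) based on the feature type."""
    if is_numeric(values):
        return "histogram", histogram_counts(values)
    return "frequency bar chart", dict(Counter(values))


print(univariate_summary([1, 2, 3, 4, 5, 6, 7, 8]))
print(univariate_summary(["yes", "no", "yes", "yes"]))
```

A plotting library would normally take over from here; the sketch only shows that the numeric and categorical branches lead to different summaries.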

With the basic background context set for the exercise, let's get ready for a specific EDA exercise.
