
Understanding the Science Behind EDA

In layman's terms, we can define EDA as the science of understanding data. More formally, it is the process of analyzing and exploring datasets to summarize their characteristics, properties, and latent relationships using statistical, visual, or analytical techniques, or a combination of them.

To cement our understanding, let's break down the definition further. A dataset is typically a combination of numeric and categorical features. To study the data, we might need to explore features individually, and to study relationships, we might need to explore features together. Depending on the number and type of features involved, we will encounter different types of EDA.

To simplify, we can broadly classify the process of EDA as follows:

  • Univariate analysis: Studying a single feature
  • Bivariate analysis: Studying the relationship between two features
  • Multivariate analysis: Studying the relationship between more than two features

For now, we will restrict the scope of the chapter to univariate and bivariate analysis. A few forms of multivariate analysis, such as regression, will be covered in the upcoming chapters.

To accomplish each of the previously mentioned analyses, we can use visualization techniques such as boxplots, scatter plots, and bar charts; statistical techniques such as hypothesis testing; or simple analytical techniques such as averages, frequency counts, and so on.
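As a rough illustration of the simple analytical techniques mentioned above, the following sketch computes an average, a median, and frequency counts (univariate), and then a Pearson correlation between two numeric features (bivariate). The dataset, its feature names (`age`, `balance`, `job`), and the helper functions `column` and `pearson` are made up for this example and are not taken from the book; only the Python standard library is assumed.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy dataset: two numeric features and one categorical feature.
records = [
    {"age": 25, "balance": 1200, "job": "admin"},
    {"age": 32, "balance": 1500, "job": "technician"},
    {"age": 47, "balance": 3100, "job": "admin"},
    {"age": 51, "balance": 2900, "job": "services"},
    {"age": 38, "balance": 2000, "job": "technician"},
    {"age": 29, "balance": 1300, "job": "admin"},
]


def column(name):
    """Return all values of one feature as a list."""
    return [record[name] for record in records]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Univariate analysis of a numeric feature: central tendency.
age_mean = mean(column("age"))
age_median = median(column("age"))

# Univariate analysis of a categorical feature: frequency counts.
job_counts = Counter(column("job"))

# Bivariate analysis of two numeric features: linear association.
age_balance_corr = pearson(column("age"), column("balance"))

print(age_mean, age_median)
print(job_counts.most_common())
print(round(age_balance_corr, 3))
```

The same three questions (what does one feature look like, how often does each category occur, and how do two features move together) are what the visual and statistical techniques answer in more detail.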

Breaking this down further, we have another dimension to cater to: the type of feature, numeric or categorical. For each type of analysis mentioned (univariate and bivariate), the appropriate visual technique depends on the type of the feature. So, for univariate analysis of a numeric variable, we could use a histogram or a boxplot, whereas we might use a frequency bar chart for a categorical variable. We will get into the details of the overall exercise of EDA using a lazy programming approach, that is, we will explore the context and details of each analysis as and when it occurs in the book.
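The pairing of feature type and visual technique described above can be written down as a small lookup. This is only a sketch: the names `feature_type`, `UNIVARIATE_PLOTS`, and `suggest_univariate_plots` are hypothetical, and the rule used to decide whether a feature is numeric (every value is a number) is a simplifying assumption rather than a general-purpose test.

```python
from numbers import Number

# Plot suggestions taken from the text: histogram or boxplot for numeric
# features, frequency bar chart for categorical features.
UNIVARIATE_PLOTS = {
    "numeric": ["histogram", "boxplot"],
    "categorical": ["frequency bar chart"],
}


def feature_type(values):
    """Classify a feature as 'numeric' if every value is a number, else 'categorical'."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    # and treated as categorical here (an assumption of this sketch).
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_numeric else "categorical"


def suggest_univariate_plots(values):
    """Return the plot types suggested for a single feature."""
    return UNIVARIATE_PLOTS[feature_type(values)]


print(suggest_univariate_plots([25, 32.5, 47]))
print(suggest_univariate_plots(["admin", "services", "admin"]))
```

In practice, a numeric column can still be categorical in meaning (a postal code, for instance), so a check like this is a starting point rather than a substitute for looking at the data.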

With this background in place, let's get ready for a specific EDA exercise.
