- Go Machine Learning Projects
- Xuanyi Chew
- 2021-06-10 18:46:33
Exploratory data analysis
Exploratory data analysis is part and parcel of any model-building process. Understanding the algorithm at play, too, is important. Given that this chapter revolves around linear regression, it might be worth it to explore the data through the lens of understanding linear regression.
But first, let's look at the data. One of the first things I recommend any budding data scientist keen on machine learning to do is to explore the data, or a subset of it, to get a feel for it. I usually do it in a spreadsheet application such as Excel or Google Sheets. I then try to understand, in human ways, the meaning of the data.
This dataset comes with a description of fields, which I can't enumerate in full here. A snapshot, however, would be illuminating for the rest of the discussion in this chapter:
- SalePrice: The property's sale price in dollars. This is the dependent variable that we're trying to predict.
- MSSubClass: The building class.
- MSZoning: The general zoning classification.
- LotFrontage: The linear feet of the street connected to the property.
- LotArea: The lot size in square feet.
There are multiple ways of understanding linear regression. One of my favorites ties directly into exploratory data analysis: looking at linear regression through the lens of the conditional expectation function (CEF) of the dependent variable, given the independent variables.
The conditional expectation function of a variable is simply the expected value of that variable, conditional on the value of another variable. This seems like a rather dense subject to get through, so I shall offer three different views of the same topic in an attempt to clarify:
- Statistical point of view: The conditional expectation function of a dependent variable $Y$ given a vector of covariates $X$ is simply the expected value of $Y$ (the average) when $X$ is fixed to $x$, that is, $E[Y \mid X = x]$.
- Programming point of view in pseudo-SQL: select avg(Y) from dataset where X = 'Xi'. When conditioning upon multiple conditions, it's simply this: select avg(Y) from dataset where X1 = 'Xik' and X2 = 'Xjl'.
- Concrete example: What is the expected house price if one of the independent variables, say MSZoning, is RL? The expected house price is the population average, which translates to: of all the houses in Ames, what is the average price of the houses sold whose zoning type is RL?
As it stands, this is a pretty bastardized version of what the CEF is. There are some subtleties involved in the definition of the CEF, but they are not within the scope of this book, so we shall leave them aside. For now, this rough understanding of the CEF is enough to get us started with our exploratory data analysis.
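To make the pseudo-SQL view concrete, here is a minimal Go sketch of the same idea: filter the rows by a condition, then average SalePrice over what remains. The `House` struct, the `conditionalMean` and `sampleRows` helpers, and the sample values are all assumptions made up for illustration; they are not part of the dataset's tooling or of this chapter's actual code.

```go
package main

import "fmt"

// House is a hypothetical, pared-down record holding only the fields
// used in this sketch; the real dataset has many more columns.
type House struct {
	MSZoning   string  // general zoning classification, e.g. "RL"
	MSSubClass string  // building class, treated here as a categorical code
	SalePrice  float64 // sale price in dollars
}

// conditionalMean averages SalePrice over the rows for which keep returns
// true, mirroring `select avg(Y) from dataset where ...`.
// The boolean is false when no row matches, because the mean is undefined.
func conditionalMean(rows []House, keep func(House) bool) (float64, bool) {
	var sum float64
	var n int
	for _, r := range rows {
		if keep(r) {
			sum += r.SalePrice
			n++
		}
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

// sampleRows returns made-up values for illustration only.
func sampleRows() []House {
	return []House{
		{MSZoning: "RL", MSSubClass: "20", SalePrice: 200000},
		{MSZoning: "RL", MSSubClass: "60", SalePrice: 250000},
		{MSZoning: "RM", MSSubClass: "20", SalePrice: 100000},
		{MSZoning: "RL", MSSubClass: "20", SalePrice: 150000},
	}
}

func main() {
	rows := sampleRows()

	// One condition: where MSZoning = 'RL'.
	mean, ok := conditionalMean(rows, func(h House) bool {
		return h.MSZoning == "RL"
	})
	fmt.Println("avg SalePrice given MSZoning = RL:", mean, ok)

	// Two conditions: where MSZoning = 'RL' and MSSubClass = '20'.
	mean, ok = conditionalMean(rows, func(h House) bool {
		return h.MSZoning == "RL" && h.MSSubClass == "20"
	})
	fmt.Println("avg SalePrice given MSZoning = RL and MSSubClass = 20:", mean, ok)
}
```

Passing the condition as a function keeps the single-condition and multiple-condition cases in one helper, at the cost of scanning every row for each query.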
The programming point of view in pseudo-SQL is useful because it informs us about what we would need so that we can quickly calculate the aggregate of data. We would need to create indices. Because our dataset is small, we can be relatively blasé about the data structures used to index the data.
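One relaxed way to index a small dataset is a map from each category value to the positions of the rows that carry it. The sketch below assumes a hypothetical `House` struct and hypothetical `buildIndex` and `groupMean` helpers, with made-up sample values; it illustrates the idea rather than the data structures this chapter ends up using.

```go
package main

import "fmt"

// House is a hypothetical, pared-down record for this sketch.
type House struct {
	MSZoning  string  // general zoning classification
	SalePrice float64 // sale price in dollars
}

// buildIndex maps each MSZoning value to the positions of the rows
// that have that value.
func buildIndex(rows []House) map[string][]int {
	idx := make(map[string][]int)
	for i, r := range rows {
		idx[r.MSZoning] = append(idx[r.MSZoning], i)
	}
	return idx
}

// groupMean averages SalePrice over the rows at the given positions.
// The boolean is false when positions is empty.
func groupMean(rows []House, positions []int) (float64, bool) {
	if len(positions) == 0 {
		return 0, false
	}
	var sum float64
	for _, i := range positions {
		sum += rows[i].SalePrice
	}
	return sum / float64(len(positions)), true
}

func main() {
	// Made-up values for illustration only.
	rows := []House{
		{MSZoning: "RL", SalePrice: 200000},
		{MSZoning: "RM", SalePrice: 100000},
		{MSZoning: "RL", SalePrice: 250000},
		{MSZoning: "RL", SalePrice: 150000},
	}

	idx := buildIndex(rows)

	// Map iteration order is not specified in Go, so the zones may be
	// printed in a different order on each run.
	for zone, positions := range idx {
		mean, _ := groupMean(rows, positions)
		fmt.Printf("%s: n=%d mean=%.0f\n", zone, len(positions), mean)
	}
}
```

Once the index is built, each conditional average touches only the rows in its group instead of rescanning the whole dataset.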