Exploratory data analysis

Exploratory data analysis is part and parcel of any model-building process. Understanding the algorithm at play, too, is important. Given that this chapter revolves around linear regression, it might be worth it to explore the data through the lens of understanding linear regression.

But first, let's look at the data. One of the first things I recommend that any budding data scientist keen on machine learning do is explore the data, or a subset of it, to get a feel for it. I usually do this in a spreadsheet application such as Excel or Google Sheets. I then try to understand, in human terms, what the data means.

This dataset comes with a description of fields, which I can't enumerate in full here. A snapshot, however, would be illuminating for the rest of the discussion in this chapter:

  • SalePrice: The property's sale price in dollars. This is the dependent variable that we're trying to predict.
  • MSSubClass: The building class.
  • MSZoning: The general zoning classification.
  • LotFrontage: The linear feet of the street connected to the property.
  • LotArea: The lot size in square feet.

There are multiple ways of understanding linear regression. One of my favorites ties directly into exploratory data analysis: looking at linear regression through the lens of the conditional expectation function (CEF) of the dependent variable given the independent variables.

The conditional expectation function of a variable is simply the expected value of that variable, conditional on the value of another variable. This can seem like a rather dense subject, so I shall offer three different views of the same topic in an attempt to clarify:

  • Statistical point of view: The conditional expectation function of a dependent variable Y given a vector of covariates X is simply the expected value of Y (the average) when X is fixed to x.
  • Programming point of view in pseudo-SQL: select avg(Y) from dataset where X = 'Xi'. When conditioning upon multiple conditions, it's simply this: select avg(Y) from dataset where X1 = 'Xik' and X2 = 'Xjl'.
  • Concrete example: What is the expected house price if one of the independent variables (say, MSZoning) is RL? The expected house price is the population average, which translates to: of all the houses in Ames, what is the average price of a house sold whose zoning type is RL?
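The three views above describe the same computation, which a few lines of code can make concrete. What follows is only a minimal sketch in Python: the rows are made up for illustration (they are not taken from the dataset), and `conditional_mean` is a hypothetical helper name, not something the chapter defines.

```python
from statistics import mean

# Illustrative rows only: these values are invented for the example,
# not drawn from the real dataset.
rows = [
    {"MSZoning": "RL", "SalePrice": 200000},
    {"MSZoning": "RL", "SalePrice": 180000},
    {"MSZoning": "RM", "SalePrice": 120000},
    {"MSZoning": "RL", "SalePrice": 220000},
    {"MSZoning": "RM", "SalePrice": 100000},
]


def conditional_mean(rows, target, **conditions):
    """Average of `target` over the rows that match every condition.

    Mirrors the pseudo-SQL:
        select avg(target) from rows where k1 = v1 and k2 = v2 ...
    Returns None when no row matches, since the average is undefined.
    """
    matched = [
        row[target]
        for row in rows
        if all(row[key] == value for key, value in conditions.items())
    ]
    return mean(matched) if matched else None


# Sample analogue of "expected SalePrice given MSZoning = RL".
rl_price = conditional_mean(rows, "SalePrice", MSZoning="RL")
rm_price = conditional_mean(rows, "SalePrice", MSZoning="RM")
```

Note that this computes a sample average over the rows at hand, which is only an estimate of the population quantity that the CEF refers to.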

As it stands, this is a pretty bastardized version of what the CEF is. There are some subtleties involved in the definition of the CEF, but they are beyond the scope of this book, so we shall set them aside. For now, this rough understanding of the CEF is enough to get us started with our exploratory data analysis.

The programming point of view in pseudo-SQL is useful because it tells us what we need in order to compute aggregates of the data quickly: indices. Because our dataset is small, we can be relatively blasé about the data structures used to index the data.
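To make the idea of an index concrete, here is one simple possibility, sketched in Python: a dictionary that maps each distinct value of a column to the positions of the rows that hold it. The choice of data structure is an assumption made for illustration, as are the invented rows and the hypothetical helper names `build_index` and `indexed_mean`.

```python
from collections import defaultdict

# Illustrative rows only: invented values, not drawn from the real dataset.
rows = [
    {"MSZoning": "RL", "SalePrice": 200000},
    {"MSZoning": "RL", "SalePrice": 180000},
    {"MSZoning": "RM", "SalePrice": 120000},
    {"MSZoning": "RL", "SalePrice": 220000},
    {"MSZoning": "RM", "SalePrice": 100000},
]


def build_index(rows, column):
    """Map each distinct value of `column` to the row positions holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return dict(index)


def indexed_mean(rows, index, value, target):
    """Average `target` over the rows listed under `value` in the index.

    Returns None when the value never occurs in the indexed column.
    """
    positions = index.get(value, [])
    if not positions:
        return None
    return sum(rows[p][target] for p in positions) / len(positions)


# Build the index once, then answer each conditional average by lookup
# instead of rescanning every row.
zoning_index = build_index(rows, "MSZoning")
rl_price = indexed_mean(rows, zoning_index, "RL", "SalePrice")
```

With a dataset this small, a full scan per query would also be fine; the index mainly pays off when many conditional averages are computed over the same column.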
