官术网_书友最值得收藏!

Validating the dataset

Not all datasets lend themselves to linear modeling. There are several conditions that the samples must verify for your linear model to make sense. Some conditions are strict, others can be relaxed.

In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):

  • Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but more constrained range of values. Standardization of predictors puts all the predictors on the same level.
  • Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series where samples depend on previous values, using the sample to sample difference as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
  • No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
  • Heteroskedasticity: The standard deviation of a predictor is constant across the whole range of its values.
  • Gaussian distribution of the residuals: This is more than a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimation. The linear regression is considered relevant if these residuals follow a Gaussian distribution.

These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.

主站蜘蛛池模板: 哈巴河县| 西贡区| 子长县| 丰县| 绥芬河市| 新乐市| 麟游县| 广昌县| 芷江| 内黄县| 肥乡县| 嘉鱼县| 基隆市| 夏津县| 陕西省| 阳东县| 大关县| 长兴县| 法库县| 京山县| 台东县| 郑州市| 涟源市| 博乐市| 鹤峰县| 台前县| 文昌市| 西乌珠穆沁旗| 新竹市| 嘉兴市| 神木县| 临邑县| 于田县| 延川县| 高淳县| 神农架林区| 高尔夫| 汉中市| 昭通市| 绥宁县| 上饶市|