官术网_书友最值得收藏!

Validating the dataset

Not all datasets lend themselves to linear modeling. There are several conditions that the samples must verify for your linear model to make sense. Some conditions are strict, others can be relaxed.

In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):

  • Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but more constrained range of values. Standardization of predictors puts all the predictors on the same level.
  • Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series where samples depend on previous values, using the sample to sample difference as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
  • No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
  • Heteroskedasticity: The standard deviation of a predictor is constant across the whole range of its values.
  • Gaussian distribution of the residuals: This is more than a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimation. The linear regression is considered relevant if these residuals follow a Gaussian distribution.

These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.

主站蜘蛛池模板: 夏邑县| 南通市| 汽车| 肥城市| 日土县| 方城县| 咸阳市| 灵武市| 枣庄市| 中阳县| 武夷山市| 石台县| 比如县| 府谷县| 灵丘县| 邵阳县| 车致| 运城市| 梁平县| 舟曲县| 南和县| 新巴尔虎左旗| 彭阳县| 宕昌县| 双鸭山市| 响水县| 双城市| 新民市| 西贡区| 灌阳县| 南溪县| 光泽县| 汾阳市| 常宁市| 富锦市| 申扎县| 岑巩县| 沅江市| 龙泉市| 轮台县| 桦甸市|