- Effective Amazon Machine Learning
- Alexis Perrier
Validating the dataset
Not all datasets lend themselves to linear modeling. There are several conditions that the samples must satisfy for your linear model to make sense. Some conditions are strict, others can be relaxed.
In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):
- Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but a more constrained range of values. Standardizing the predictors puts them all on the same level.
- Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series, where samples depend on previous values, using sample-to-sample differences as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
- No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
- Homoskedasticity: The variance of the residuals is constant across the whole range of predicted values. When this assumption is violated (heteroskedasticity), the errors spread out more for some ranges of the data than for others, which undermines the model's estimates.
- Gaussian distribution of the residuals: This is more of an a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimates. The linear regression is considered relevant if these residuals follow a Gaussian distribution.
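To make the standardization and residual checks above concrete, here is a minimal sketch using NumPy on synthetic data (all variable names and the generating process are hypothetical, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic predictors on very different scales.
x1 = rng.uniform(0, 1, n)        # small range
x2 = rng.uniform(0, 5000, n)     # much larger range
X = np.column_stack([x1, x2])

# True linear relationship with Gaussian noise.
y = 3.0 * x1 + 0.002 * x2 + rng.normal(0, 0.5, n)

# Standardize each predictor: zero mean, unit standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares fit, with an intercept column.
A = np.column_stack([np.ones(n), X_std])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# If the linear model is appropriate, the residuals should be
# centered on zero and look roughly Gaussian.
residuals = y - A @ coef
print("residual mean:", residuals.mean())
print("residual std: ", residuals.std())
```

In practice you would also plot the residuals (a histogram or a Q-Q plot) rather than rely on summary statistics alone.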
These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.
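One common way to detect the multicollinearity problem mentioned above is the variance inflation factor (VIF): regress each predictor on all the others and compute 1/(1 - R²); values well above 5 or 10 flag a predictor that is close to a linear combination of the others. The following sketch uses synthetic data, and the `vif` helper is illustrative rather than taken from any specific library:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    Each column is regressed (with an intercept) on the remaining
    columns; VIF_j = 1 / (1 - R2_j).
    """
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.01 * rng.normal(size=300)  # nearly a linear copy of a
X = np.column_stack([a, b, c])

print(vif(X))  # columns a and c get very large VIFs; b stays near 1
```

Dropping or combining the offending predictors (here, `a` or `c`) is the usual remedy before fitting the linear model.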