- Effective Amazon Machine Learning
- Alexis Perrier
Validating the dataset
Not all datasets lend themselves to linear modeling. There are several conditions that the samples must satisfy for your linear model to make sense. Some conditions are strict; others can be relaxed.
In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):
- Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may overshadow predictors with better predictive power but a more constrained range of values. Standardizing the predictors puts them all on the same level.
- Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not fully independent from each other. In the case of time series, where each sample depends on previous values, using the sample-to-sample differences as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
- No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that one predictor is not a linear composition of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
- Homoscedasticity: The variance of the residuals is constant across the whole range of predicted values; in other words, there is no heteroskedasticity.
- Gaussian distribution of the residuals: This is more of an a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimates. The linear regression is considered relevant if these residuals follow a Gaussian distribution.
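Two of these conditions, comparable scales and the absence of multicollinearity, can be checked directly on the predictor matrix before any model is fit. The following is a minimal sketch on synthetic data (the variable names and the use of NumPy are illustrative assumptions, not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two predictors on very different scales, plus a third that is
# almost a linear copy of the first (multicollinearity).
x1 = rng.normal(50, 10, size=200)             # large scale
x2 = rng.normal(0, 0.1, size=200)             # small scale
x3 = 2 * x1 + rng.normal(0, 0.01, size=200)   # nearly collinear with x1

X = np.column_stack([x1, x2, x3])

# Standardization: zero mean and unit standard deviation per predictor.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Multicollinearity check: pairwise correlations close to +/-1
# signal that one predictor is a near-linear combination of another.
corr = np.corrcoef(X_std, rowvar=False)
print(corr[0, 2])  # close to 1.0: x3 should be dropped or combined with x1
```

In practice a correlation matrix only detects pairwise collinearity; variance inflation factors are the usual tool for detecting a predictor that is a combination of several others.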
These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.
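One such transformation, mentioned above for the i.i.d. condition, is differencing a time series: a random walk is strongly autocorrelated, but its sample-to-sample differences are approximately independent. A minimal sketch on synthetic data (the helper function is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# A random walk: each sample depends on the previous one (not independent).
steps = rng.normal(0, 1, size=1000)
walk = np.cumsum(steps)

# Sample-to-sample differences recover the (approximately i.i.d.) increments.
diff = np.diff(walk)

def lag1_autocorr(x):
    """Lag-1 autocorrelation: near 1 for a random walk, near 0 for noise."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print(lag1_autocorr(walk))  # close to 1: strong dependence between samples
print(lag1_autocorr(diff))  # close to 0: differencing removed the dependence
```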