官术网_书友最值得收藏!

Understanding linear regression

Regression algorithms are an important algorithm in a data scientist's toolkit as they can be used for various non-binary prediction tasks. The linear regression algorithm models the relationship between a dependent variable that we are trying to predict with a vector of independent variables. The vector of variables is also called the regressor in the context of regression algorithms. Linear regression assumes that there is a linear relationship between the vector of independent variables and the dependent variable that we are trying to predict. Hence, linear regression models learn the unknown variables and constants of a linear function using the training data so that the linear function best fits the training data. 

Linear regression can be applied in cases where the goal is to predict or forecast the dependent variable based on the regressor variables. We will use an example to explain how linear regression trains based on data.

The following table shows a sample dataset where the goal is to predict the price of a house based on three variables:

In this dataset, the Floor Size, Number of Bedrooms, and Number of Bathrooms variables are assumed as independent in linear regression. Our goal is to predict the House Price value based on the variables. 

Let's simplify this problem. Let's only consider the Floor Size variable to predict the house price. Creating linear regression from only one variable or regressor is referred to as a simple linear regression. If we create a scatterplot of these two values, we can see that there is a relationship between them:

Although there is not an exact linear relationship between the two variables, we can create an approximate line that represents the trend. The aim of the modeling algorithm is to minimize the error in creating this approximate line. 

As we know, a straight line can be represented by the following equation:

                              

Hence, the approximately linear relationship in the preceding diagram can also be represented using the same formula, and the task of the linear regression model is to learn the values of  and . Moreover, since we know that the relationship between the predicted variable and the regressors is not strictly linear, we can add a random error variable to the equation that models the noise in the dataset. The following formula represents how the simple linear regression model is represented:

                     

Now, let's consider a dataset with multiple regressors. Instead of just representing the linear relationship between two variables,  and , we will represent a set of regressors as . We will assume that a linear relationship between the dependent variable  and the regressors  is linear.  Thus, a linear regression model with multiple regressors is represented by the following formula:

       

Linear regression, as we've already discussed, assumes that there is a linear relationship between the regressors and the dependent variable that we are trying to predict. This is an important assumption that may not hold true in all datasets. Hence, for a data scientist, using linear regression may look attractive due to its fast training time. However, if the dataset variables do not have a linear relationship with the dependent variable, it may lead to significant errors. In such cases, data scientists may also try algorithms such as Bernoulli regression, Poisson regression, or multinomial regression to improve prediction precision. We will also discuss logistic regression later in this chapter, which is used when the dependent variable is binary. 

During the training phase, linear regression can use various techniques for parameter estimation to learn the values of , and . We will not go into the details of these techniques in this book. However, we recommend that you try using these parameter estimation techniques in the examples that follow and observe their effect on the training time of the algorithm and the accuracy of prediction. 

To fit a linear model to the data, we first need to be able to determine how well a linear model fits the data. There are various models being developed for parameter estimation in linear regression. Parameter estimation is the process of estimating the values of and . In the following sections, will briefly explain the following estimation techniques. 

主站蜘蛛池模板: 丰顺县| 磴口县| 拜城县| 平定县| 兴义市| 谷城县| 鹤山市| 舒兰市| 北碚区| 克什克腾旗| 敦化市| 翁牛特旗| 共和县| 通榆县| 资源县| 巨鹿县| 营口市| 汝阳县| 喜德县| 安图县| 宜城市| 永泰县| 老河口市| 台北市| 阿荣旗| 繁峙县| 彩票| 高清| 定安县| 石嘴山市| 宜兰县| 蓝田县| 扶风县| 田阳县| 隆安县| 蓬安县| 台前县| 普兰县| 湖南省| 乳山市| 崇明县|