
  • The Data Science Workshop
  • Anthony So Thomas V. Joseph Robert Thas John Andrew Worsley Dr. Samuel Asare
  • 2021-06-11 18:27:21

Multiple Regression Analysis

In the exercises and activity so far, we have used only one independent variable in our regression analysis. In practice, as we have seen with the Boston Housing dataset, processes and phenomena of analytic interest are rarely influenced by only one feature. To be able to model the variability to a higher level of accuracy, therefore, it is necessary to investigate all the independent variables that may contribute significantly toward explaining the variability in the dependent variable. Multiple regression analysis is the method that is used to achieve this.

Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API

In this exercise, we will be using the plus operator (+) in the patsy formula string to define a linear regression model that includes more than one independent variable.

To complete this exercise, run the code in the following steps in your Colab notebook:

  1. Open a new Colab notebook file and import the required packages.

    import statsmodels.formula.api as smf
    import pandas as pd
    from sklearn.model_selection import train_test_split

  2. Execute Steps 2 to 11 from Exercise 2.01, Loading and Preparing the Data for Analysis.
  3. Use the plus operator (+) of the Patsy formula language to define a linear model that regresses crimeRatePerCapita on pctLowerStatus, radialHighwaysAccess, medianValue_Ks, and nitrixOxide_pp10m and assign it to a variable named multiLinearModel. Use the Python line continuation symbol (\) to continue your code on a new line should you run out of space:

    multiLinearModel = smf.ols\
                       (formula = 'crimeRatePerCapita \
                                   ~ pctLowerStatus \
                                   + radialHighwaysAccess \
                                   + medianValue_Ks \
                                   + nitrixOxide_pp10m', \
                                   data=train_data)

  4. Call the fit method of the model instance and assign the results of the method to a variable:

    multiLinearModResult = multiLinearModel.fit()

  5. Print a summary of the results stored in the variable created in Step 4:

    print(multiLinearModResult.summary())

The output is as follows:

Figure 2.18: A summary of multiple linear regression results

Note

To access the source code for this specific section, please refer to https://packt.live/34cJgOK.

You can also run this example online at https://packt.live/3h1CKOt.
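As the number of independent variables grows, typing the formula string by hand becomes error-prone. The following is a minimal sketch, not part of the original exercise, of one way to assemble the same patsy formula from a list of column names. The helper name build_formula is hypothetical; the column names are the ones used in Step 3.

```python
def build_formula(target, features):
    """Join a target and feature names into a patsy-style formula string.

    build_formula is a hypothetical helper, not a statsmodels or patsy function.
    """
    if not features:
        raise ValueError("at least one independent variable is required")
    # The plus operator (+) adds each independent variable to the model.
    return f"{target} ~ {' + '.join(features)}"


# Column names taken from Step 3 of this exercise.
features = [
    "pctLowerStatus",
    "radialHighwaysAccess",
    "medianValue_Ks",
    "nitrixOxide_pp10m",
]
formula = build_formula("crimeRatePerCapita", features)
print(formula)
```

The resulting string can be passed as the formula argument of smf.ols in place of the hand-written one, assuming train_data contains all of the listed columns.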

If the exercise was correctly followed, Figure 2.18 will be the result of the analysis. In Activity 2.01, the R-squared statistic was used to assess the model for goodness of fit. When multiple independent variables are involved, the goodness of fit of the model created is assessed using the adjusted R-squared statistic.

The adjusted R-squared statistic accounts for the number of independent variables in the model. It corrects for the inflation in R-squared that occurs simply because more independent variables have been added, whether or not they genuinely improve the model.
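To make the correction concrete, here is a small sketch of the standard adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of independent variables. The input values below are hypothetical and chosen only for illustration; they are not taken from Figure 2.18. In statsmodels, the fitted results object exposes the same statistic as the rsquared_adj attribute.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Return adjusted R-squared for a model with an intercept.

    n_obs is the number of observations and n_predictors the number of
    independent variables (excluding the intercept).
    """
    residual_dof = n_obs - n_predictors - 1
    if residual_dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / residual_dof


# Hypothetical values: the same raw R-squared is penalized more heavily
# when it was obtained with more independent variables.
one_predictor = adjusted_r_squared(0.40, n_obs=100, n_predictors=1)
four_predictors = adjusted_r_squared(0.40, n_obs=100, n_predictors=4)
print(one_predictor, four_predictors)
```

For a fixed R-squared and sample size, the adjusted value falls as the number of independent variables rises, which is why it is the fairer measure when comparing models of different sizes.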

The key takeaway from this exercise is the improved goodness of fit reported in Section 1 of Figure 2.18. When only one independent variable was used to model the variability in crimeRatePerCapita in Exercise 2.04, Fitting a Simple Linear Regression Model Using the Statsmodels formula API, the R-squared value was only 14.4 percent. In this exercise, we used four independent variables, and the resulting model has an adjusted R-squared of 39.1 percent, an increase of 24.7 percentage points.

We learn that including independent variables that are correlated with the dependent variable can help explain more of the variability in the dependent variable. However, a considerable share of that variability, about 60.9 percent, is still not explained by our model.

There is still room for improvement if we want a model that does a good job of explaining the variability we see in crimeRatePerCapita. In Section 2 of Figure 2.18, the intercept and all the independent variables in our model are listed together with their coefficients. If we denote the predicted crimeRatePerCapita by y, pctLowerStatus by x1, radialHighwaysAccess by x2, medianValue_Ks by x3, and nitrixOxide_pp10m by x4, a mathematical expression for the model created can be written as y ≈ 0.8912 + 0.1028x1 + 0.4948x2 - 0.1103x3 - 2.1039x4.

The expression just stated defines the model created in this exercise, and it is comparable to the expression for multiple linear regression provided in Figure 2.5 earlier.
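As an illustration of how this expression is used, the sketch below evaluates it by hand for a single observation. The coefficients are the ones quoted above from Figure 2.18; the input values in example_row are hypothetical and do not come from the dataset, and predict_crime_rate is an illustrative helper rather than a statsmodels function.

```python
# Coefficients as quoted in the text from Section 2 of Figure 2.18.
INTERCEPT = 0.8912
COEFFICIENTS = {
    "pctLowerStatus": 0.1028,
    "radialHighwaysAccess": 0.4948,
    "medianValue_Ks": -0.1103,
    "nitrixOxide_pp10m": -2.1039,
}


def predict_crime_rate(row):
    """Evaluate the fitted linear expression for one observation (a dict)."""
    return INTERCEPT + sum(
        coefficient * row[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical input values, for illustration only.
example_row = {
    "pctLowerStatus": 10.0,
    "radialHighwaysAccess": 5.0,
    "medianValue_Ks": 20.0,
    "nitrixOxide_pp10m": 0.5,
}
prediction = predict_crime_rate(example_row)
print(round(prediction, 4))
```

In practice, the predict method of the fitted results object performs the same calculation when given a DataFrame containing these four columns, so the manual version is mainly useful for checking that the coefficients are being read correctly.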
