官术网_书友最值得收藏!

How it works...

In Step 1 and Step 2, we looked at the variables with missing values in absolute and percentage terms. We noticed that the Alley variable had more than 93% of its values missing. However, from the data description, we figured out that the Alley variable had a No Access to Alley value, which is codified as NA in the dataset. When this value was read in Python, all instances of NA were treated as missing values. In Step 3, we replaced the NA in Alley with No Access.

Note that we used %matplotlib inline in Step 2. This is a magic function that renders the plot in the notebook itself. 

In Step 4, we used the seaborn library to plot the missing value chart. In this chart, we identified the variables that had missing values. The missing values were denoted in white, while the presence of data was denoted in color. We noticed from the chart that Alley had no more missing values.

In Step 4, we used cubehelix_palette() from the seaborn library,  which  produces a color map with linearly decreasing (or increasing) brightness. The seaborn library also provides us with options including light_palette() and dark_palette(). light_palette() gives a sequential palette that blends from light to color, while dark_palette() produces a sequential palette that blends from dark to color.

In Step 5, we noticed that one of the numerical variables, LotFrontage, had more than 17% of its values missing. We decided to impute the missing values with the median of this variable. We revisited the missing value chart in Step 6 to see whether the variables were left with any missing values. We noticed that Alley and LotFrontage showed no white marks, indicating that neither of the two variables had any further missing values.

In Step 7, we identified a handful of variables that had data codified with NA. This caused the same problem we encountered previously, as Python treated them as missing values. We replaced all such codified values with actual information.

We then revisited the missing value chart in Step 8. We saw that almost all the variables then had no missing values, except for MasVnrType, MasVnrArea, and Electrical.

In Step 9 and 10, we filled in the missing values for the MasVnrType and MasVnrArea variables. We noticed that MasVnrType is None whenever MasVnrArea is 0.0, except for some rare occasions. So, we imputed the MasVnrType variable with None, and MasVnrArea with 0.0 wherever those two variables had missing values. We were then only left with one variable with missing values, Electrical.

In Step 11, we looked at what type of house was missing the Electrical value. We noticed that MSSubClass denoted the dwelling type and, for the missing Electrical value, the MSSubClass was 80, which meant it was split or multi-level. In Step 12, we checked the distribution of Electrical by the dwelling type, which was MSSubClass. We noticed that when MSSubClass equals 80, the majority of the values of Electrical are SBrkr, which stands for standard circuit breakers and Romex. For this reason, we decided to impute the missing value in Electrical with SBrkr.

Finally, in Step 14, we again revisited the missing value chart and saw that there were no more missing values in the dataset.

主站蜘蛛池模板: 平度市| 东兴市| 长岛县| 丰宁| 克拉玛依市| 富源县| 信阳市| 东台市| 新乐市| 思茅市| 弥勒县| 长汀县| 屏东市| 始兴县| 崇文区| 蓬莱市| 喀喇| 永泰县| 密云县| 旺苍县| 辽宁省| 清新县| 壶关县| 岳阳市| 高阳县| 永顺县| 榆中县| 北安市| 灌阳县| 黎平县| 西华县| 沅江市| 加查县| 河北省| 和龙市| 昌邑市| 余江县| 潼南县| 三江| 大荔县| 洛南县|