官术网_书友最值得收藏!

How it works...

In Step 1 and Step 2, we looked at the variables with missing values in absolute and percentage terms. We noticed that the Alley variable had more than 93% of its values missing. However, from the data description, we figured out that the Alley variable had a No Access to Alley value, which is codified as NA in the dataset. When this value was read in Python, all instances of NA were treated as missing values. In Step 3, we replaced the NA in Alley with No Access.

Note that we used %matplotlib inline in Step 2. This is a magic function that renders the plot in the notebook itself. 

In Step 4, we used the seaborn library to plot the missing value chart. In this chart, we identified the variables that had missing values. The missing values were denoted in white, while the presence of data was denoted in color. We noticed from the chart that Alley had no more missing values.

In Step 4, we used cubehelix_palette() from the seaborn library,  which  produces a color map with linearly decreasing (or increasing) brightness. The seaborn library also provides us with options including light_palette() and dark_palette(). light_palette() gives a sequential palette that blends from light to color, while dark_palette() produces a sequential palette that blends from dark to color.

In Step 5, we noticed that one of the numerical variables, LotFrontage, had more than 17% of its values missing. We decided to impute the missing values with the median of this variable. We revisited the missing value chart in Step 6 to see whether the variables were left with any missing values. We noticed that Alley and LotFrontage showed no white marks, indicating that neither of the two variables had any further missing values.

In Step 7, we identified a handful of variables that had data codified with NA. This caused the same problem we encountered previously, as Python treated them as missing values. We replaced all such codified values with actual information.

We then revisited the missing value chart in Step 8. We saw that almost all the variables then had no missing values, except for MasVnrType, MasVnrArea, and Electrical.

In Step 9 and 10, we filled in the missing values for the MasVnrType and MasVnrArea variables. We noticed that MasVnrType is None whenever MasVnrArea is 0.0, except for some rare occasions. So, we imputed the MasVnrType variable with None, and MasVnrArea with 0.0 wherever those two variables had missing values. We were then only left with one variable with missing values, Electrical.

In Step 11, we looked at what type of house was missing the Electrical value. We noticed that MSSubClass denoted the dwelling type and, for the missing Electrical value, the MSSubClass was 80, which meant it was split or multi-level. In Step 12, we checked the distribution of Electrical by the dwelling type, which was MSSubClass. We noticed that when MSSubClass equals 80, the majority of the values of Electrical are SBrkr, which stands for standard circuit breakers and Romex. For this reason, we decided to impute the missing value in Electrical with SBrkr.

Finally, in Step 14, we again revisited the missing value chart and saw that there were no more missing values in the dataset.

主站蜘蛛池模板: 高陵县| 陵水| 安化县| 邮箱| 乌鲁木齐市| 库尔勒市| 五常市| 灌南县| 佛冈县| 南雄市| 唐河县| 封开县| 陆河县| 原平市| 牟定县| 嵊泗县| 浠水县| 普兰店市| 乃东县| 搜索| 银川市| 柘荣县| 焦作市| 勃利县| 博客| 柳州市| 张家港市| 长春市| 托克逊县| 泽库县| 溆浦县| 同德县| 安泽县| 南宫市| 城固县| 昌图县| 铁力市| 宝清县| 定安县| 蓝山县| 德钦县|