- Go Machine Learning Projects
- Xuanyi Chew
- 447 words
- 2021-06-10 18:46:34
Handling bad numbers
Another part of the janitorial work is handling bad numbers. A good example is the LotFrontage variable. From the data description, we know that this is supposed to be a continuous variable, so every value should be directly convertible to float64. Looking at the data, however, we see that this is not the case: some entries are NA.
LotFrontage, according to the description, is the linear feet of the street connected to property. NA could mean one of two things:
- We have no information on whether there is a street connected to the property
- There is no street connected to the property
In either case, it is reasonable to replace NA with 0, because the second lowest value in LotFrontage is 21. There are other ways of imputing the data, of course, and often those imputations will lead to better models. But for now, we'll impute it with 0.
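The imputation described above can be sketched in a few lines of Go. This is a minimal illustration, not the book's own code: the helper name parseOrZero and the sample cells are assumptions, and treating an empty cell the same as NA is an extra assumption made here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOrZero is a hypothetical helper: it converts a raw CSV cell to
// float64 and imputes 0 when the cell is "NA" (or empty, by assumption).
// Any other cell that fails to parse is reported as an error, so that
// genuinely malformed data is not silently hidden.
func parseOrZero(cell string) (float64, error) {
	s := strings.TrimSpace(cell)
	if s == "" || s == "NA" {
		return 0, nil
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Illustrative LotFrontage cells, not taken from the real dataset.
	for _, cell := range []string{"65", "NA", "21"} {
		v, err := parseOrZero(cell)
		if err != nil {
			fmt.Println("bad value:", err)
			continue
		}
		fmt.Println(v)
	}
}
```

Returning an error for anything other than NA keeps the imputation narrow: only the case discussed here is replaced with 0.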
We can do the same with the other continuous variables in this dataset, simply because they make sense when you replace the NA with 0. One tip is to use the variable in a sentence: "this house has an Unknown GarageArea." If that is the case, what would be the best guess? Well, it'd be helpful to assume that the house has no garage, so it's OK to replace NA with 0.
Note that this may not be the case in other machine learning projects. Remember: human insight may be fallible, but it's often the best solution for a lot of irregularities in the data. If you happen to be a realtor and have a lot more domain knowledge, you can infuse that knowledge into the imputation phase. For example, you can use some variables to calculate and estimate others.
As for the categorical variables, we can for the most part treat NA as the zero value of the variable, so nothing needs to change when an NA appears. There is some categorical data, however, for which NA or None wouldn't make sense. This is where the aforementioned clever encoding of categories comes in handy. For the following variables, we'll use the most commonly found value as the zero value:
- MSZoning
- BsmtFullBath
- BsmtHalfBath
- Utilities
- Functional
- Electrical
- KitchenQual
- SaleType
- Exterior1st
- Exterior2nd
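One way to read "use the most commonly found value as the zero value" is to replace each NA in such a column with the column's mode. The sketch below does exactly that. The function names, the tie-breaking rule, and the sample MSZoning values are assumptions for illustration, not the book's implementation.

```go
package main

import "fmt"

// mostCommon returns the most frequent non-NA value in a column.
// Ties are broken by taking the lexically smaller value so that the
// result is deterministic. It returns "" if the column has no real values.
func mostCommon(col []string) string {
	counts := make(map[string]int)
	for _, v := range col {
		if v == "NA" || v == "" {
			continue
		}
		counts[v]++
	}
	best, bestN := "", 0
	for v, n := range counts {
		if n > bestN || (n == bestN && v < best) {
			best, bestN = v, n
		}
	}
	return best
}

// imputeMode returns a copy of col in which NA (or empty) cells are
// replaced by the most common value.
func imputeMode(col []string) []string {
	mode := mostCommon(col)
	out := make([]string, len(col))
	for i, v := range col {
		if v == "NA" || v == "" {
			out[i] = mode
		} else {
			out[i] = v
		}
	}
	return out
}

func main() {
	// Illustrative values only.
	msZoning := []string{"RL", "RM", "NA", "RL", "FV"}
	fmt.Println(imputeMode(msZoning)) // prints [RL RM RL RL FV]
}
```

An equivalent design is to leave the data untouched and instead assign index 0 to the mode when building the category encoding, so that an NA naturally falls back to it.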
Furthermore, some variables are categorical even though their data is numerical. An example in this dataset is the MSSubclass variable: its values are numbers, but they act as category labels. When encoding this kind of categorical data, it makes sense to sort the labels numerically, so that the 0 value is indeed the lowest value.
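A minimal sketch of that numeric ordering follows. It assumes the labels arrive as strings (as they would from a CSV reader) and that each label is a plain integer. The function name numericIndex and the sample labels are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// numericIndex maps each distinct numeric label to an index, assigning
// indices in ascending numeric order so that the smallest label gets 0.
func numericIndex(col []string) (map[string]int, error) {
	seen := make(map[int]string) // numeric value -> original label
	for _, s := range col {
		n, err := strconv.Atoi(s)
		if err != nil {
			return nil, err
		}
		seen[n] = s
	}
	keys := make([]int, 0, len(seen))
	for n := range seen {
		keys = append(keys, n)
	}
	sort.Ints(keys)
	index := make(map[string]int, len(keys))
	for i, n := range keys {
		index[seen[n]] = i
	}
	return index, nil
}

func main() {
	// Illustrative labels only.
	idx, err := numericIndex([]string{"60", "20", "120", "20", "60"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(idx["20"], idx["60"], idx["120"]) // prints 0 1 2
}
```

Sorting the labels as strings would instead place "120" before "20", which is why the labels are converted to integers before sorting.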