
Using a regression or another simple model to predict the values of missing variables

This is the approach that we will use for the Age feature of the Titanic example. Age is an important feature for predicting the survival of passengers, and filling its gaps with the mean, as in the previous approach, would discard some of the information it carries.
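To make the information loss concrete, here is a minimal sketch on a handful of made-up ages (illustrative values, not the real Titanic data). Every missing entry receives the same number, so the spread of the column shrinks:

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; NaN marks a missing value
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Mean imputation: every gap gets the mean of the observed ages
mean_filled = ages.fillna(ages.mean())

# pandas skips NaN when computing std, so this compares the observed
# spread with the spread after the gaps were filled with one constant
spread_before = ages.std()
spread_after = mean_filled.std()
print(spread_before, spread_after)
```

The filled column has a smaller standard deviation than the observed ages, because the imputed passengers all sit exactly at the mean regardless of their other features.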

To predict the missing values, you need a supervised learning algorithm that takes the other available features as input and the known values of the feature you want to fill in as the target. Because Age is a continuous value, the following code snippet uses a random forest regressor (sklearn's RandomForestRegressor) to predict the missing values of the Age feature:

from sklearn.ensemble import RandomForestRegressor

# Define a helper function that uses RandomForestRegressor to handle the missing values of the Age variable
def set_missing_ages():
    global df_titanic_data

    # Age must stay in the first column: it is the target, the remaining columns are the inputs
    age_data = df_titanic_data[
        ['Age', 'Embarked', 'Fare', 'Parch', 'SibSp', 'Title_id', 'Pclass', 'Names', 'CabinLetter']]
    input_values_RF = age_data.loc[df_titanic_data.Age.notnull()].values[:, 1:]
    target_values_RF = age_data.loc[df_titanic_data.Age.notnull()].values[:, 0]

    # Create a random forest regressor object from sklearn (see the sklearn documentation for more details)
    regressor = RandomForestRegressor(n_estimators=2000, n_jobs=-1)

    # Build the model from the input values and target values above
    regressor.fit(input_values_RF, target_values_RF)

    # Use the trained model to predict the missing values
    predicted_ages = regressor.predict(age_data.loc[df_titanic_data.Age.isnull()].values[:, 1:])

    # Fill the predicted ages into the original Titanic dataframe
    # (assigning to age_data would only change a temporary copy)
    df_titanic_data.loc[df_titanic_data.Age.isnull(), 'Age'] = predicted_ages
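
The same idea can be sketched without a global dataframe. The function name impute_with_regressor, its parameters, and the toy values below are assumptions made for illustration; they are not part of the Titanic example. Only RandomForestRegressor and the pandas calls come from the libraries themselves:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_regressor(df, target, features, n_estimators=100, random_state=0):
    """Return a copy of df whose missing `target` values are predicted from `features`.

    Hypothetical helper for illustration; it assumes every feature column
    is numeric and contains no missing values.
    """
    filled = df.copy()
    known = filled[target].notnull()
    missing = ~known
    if not missing.any():
        return filled  # nothing to impute

    # Train only on rows where the target is observed
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(filled.loc[known, features], filled.loc[known, target])

    # Predict the target for rows where it is missing and write it back
    filled.loc[missing, target] = model.predict(filled.loc[missing, features])
    return filled


# Toy data, made up for illustration (not the real Titanic dataset)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13],
    "Pclass": [3, 1, 3, 1, 3, 1, 3, 3],
    "SibSp": [1, 1, 0, 1, 0, 0, 3, 0],
})
toy_filled = impute_with_regressor(toy, "Age", ["Fare", "Pclass", "SibSp"])
print(toy_filled["Age"])
```

Returning a filled copy instead of mutating a global makes it easy to compare the column before and after imputation. The observed ages are left untouched, and only the rows that were missing receive predictions.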