Using a regression or another simple model to predict the values of missing variables

This is the approach that we will use for the Age feature of the Titanic example. Age is an important feature for predicting the survival of passengers, and filling its missing values with the mean, as in the previous approach, would discard some of that information.

To predict the missing values, you train a supervised learning algorithm on the rows where the feature is present: the other available features are the input, and the known values of the feature you want to fill in are the target. The trained model then predicts the feature for the rows where it is missing. Because Age is a continuous variable, this is a regression problem, so in the following code snippet we use a random forest regressor to predict the missing values of the Age feature:

from sklearn.ensemble import RandomForestRegressor


# Define a helper function that uses RandomForestRegressor to fill in the missing values of the Age variable
def set_missing_ages():
    global df_titanic_data

    age_data = df_titanic_data[
        ['Age', 'Embarked', 'Fare', 'Parch', 'SibSp', 'Title_id', 'Pclass', 'Names', 'CabinLetter']]

    # Age is the first column, so column 0 is the target and the remaining columns are the inputs
    input_values_RF = age_data.loc[df_titanic_data.Age.notnull()].values[:, 1:]
    target_values_RF = age_data.loc[df_titanic_data.Age.notnull()].values[:, 0]

    # Creating a random forest regressor from sklearn (see the scikit-learn documentation for more details)
    regressor = RandomForestRegressor(n_estimators=2000, n_jobs=-1)

    # Building the model based on the input values and target values above
    regressor.fit(input_values_RF, target_values_RF)

    # Using the trained model to predict the missing values
    predicted_ages = regressor.predict(age_data.loc[df_titanic_data.Age.isnull()].values[:, 1:])

    # Filling the predicted ages in the original titanic dataframe
    # (assigning into age_data would only change a copy of the selected columns)
    df_titanic_data.loc[df_titanic_data.Age.isnull(), 'Age'] = predicted_ages
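
The same fit-on-known, predict-on-missing pattern works with any regression model. As a minimal sketch of the idea, the example below replaces the random forest with a one-variable least-squares line written in plain Python. The function names `fit_line` and `impute_by_regression`, and the toy passenger values, are made up for illustration and are not part of the Titanic code above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def impute_by_regression(rows, target, predictor):
    """Return copies of rows in which missing (None) target values are predicted."""
    # Train only on the rows where the target is known
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[predictor] for r in known],
                                [r[target] for r in known])

    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows are left untouched
        if r[target] is None:
            r[target] = slope * r[predictor] + intercept
        filled.append(r)
    return filled


# Toy data (hypothetical values): Age is missing for the last passenger
passengers = [
    {"Fare": 10.0, "Age": 20.0},
    {"Fare": 20.0, "Age": 30.0},
    {"Fare": 30.0, "Age": 40.0},
    {"Fare": 25.0, "Age": None},
]

filled = impute_by_regression(passengers, "Age", "Fare")
print(filled[3]["Age"])  # 35.0
```

In the toy data the known ages lie exactly on the line Age = Fare + 10, so the missing age is filled with 35.0 instead of the mean of 30.0. A random forest follows the same steps but can use several features at once and capture non-linear relationships.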