Titanic example revisited

In this section, we go through the Titanic example again, this time from the perspective of feature engineering. In case you skipped Chapter 2, Data Modeling in Action - The Titanic Example, the Titanic example is a Kaggle competition whose goal is to predict whether a specific passenger survived or not.

During this revisit of the Titanic example, we are going to use the scikit-learn and pandas libraries. Let's start by reading the train and test sets and combining them into a single data frame:

import pandas as pd

# reading the train and test sets using pandas
train_data = pd.read_csv('data/train.csv', header=0)
test_data = pd.read_csv('data/test.csv', header=0)

# concatenate the train and test sets so that feature engineering sees all of the data
df_titanic_data = pd.concat([train_data, test_data])

# remove the duplicate index labels that result from combining the two sets;
# drop=True discards the old index instead of keeping it as an 'index' column
df_titanic_data.reset_index(drop=True, inplace=True)

# reorder the columns to match the column order of the train set
# (reindex_axis() was removed in pandas 1.0; reindex(columns=...) replaces it)
df_titanic_data = df_titanic_data.reindex(columns=train_data.columns)

We need to point out a few things about the preceding code snippet:

  • As shown, we have used the concat function of pandas to combine the data frames of the train and test sets. This is useful for the feature engineering task as we need a full view of the distribution of the input variables/features.
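  • After combining both data frames, we need to tidy up the output data frame: the row index now contains duplicate labels, so we reset it, and we reorder the columns so that they follow the column order of the train set.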
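
To make these two points concrete, here is a minimal sketch on two tiny, made-up data frames. The names train_toy and test_toy and their values are hypothetical stand-ins for the real CSV files, which have many more columns. It shows the duplicate index labels that concat produces, and that the rows coming from the test set get NaN in the Survived column because the test set has no such column:

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv
train_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})
# Like the real test set, this frame has no 'Survived' column
test_toy = pd.DataFrame({
    "PassengerId": [4, 5],
    "Age": [34.5, 47.0],
})

combined = pd.concat([train_toy, test_toy])

# Each frame keeps its own row labels, so labels 0 and 1 appear twice
labels_before = combined.index.tolist()
print(labels_before)  # [0, 1, 2, 0, 1]

# Rebuild a clean 0..n-1 index and restore the train set's column order
combined = combined.reset_index(drop=True)
combined = combined.reindex(columns=train_toy.columns)

print(combined.index.tolist())      # [0, 1, 2, 3, 4]
print(combined["Survived"].isna())  # True only for the two test rows
```

Because of those missing values, the Survived column can later be used to tell the train rows apart from the test rows when the combined frame is split again.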