Titanic example revisited

In this section, we are going to go through the Titanic example again, but from a different perspective, using feature engineering tools. In case you skipped Chapter 2, Data Modeling in Action - The Titanic Example, the Titanic example is a Kaggle competition whose purpose is to predict whether a specific passenger survived or not.

During this revisit of the Titanic example, we are going to use the scikit-learn and pandas libraries. So first off, let's start by reading the train and test sets and combining them into a single data frame:

import pandas as pd

# reading the train and test sets using pandas
train_data = pd.read_csv('data/train.csv', header=0)
test_data = pd.read_csv('data/test.csv', header=0)

# concatenate the train and test sets so feature engineering sees all the data
df_titanic_data = pd.concat([train_data, test_data])

# removing duplicate indices caused by combining the train and test sets by re-indexing the data
df_titanic_data.reset_index(inplace=True)

# removing the index column the reset_index() function generates
df_titanic_data.drop('index', axis=1, inplace=True)

# reorder the columns to match the train set
# (reindex_axis was removed in pandas 1.0; reindex(columns=...) is the equivalent)
df_titanic_data = df_titanic_data.reindex(columns=train_data.columns)

We need to point out a few things about the preceding code snippet:

  • As shown, we have used the concat function of pandas to combine the data frames of the train and test sets. This is useful for the feature engineering task as we need a full view of the distribution of the input variables/features.
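  • After combining both data frames, we need to make some modifications to the output data frame: each set keeps its own index, so the combined frame has duplicate index labels that we reset; we drop the extra index column that reset_index() generates; and we restore the column order of the train set.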
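
The effect of these modifications is easier to see on a tiny example. The following is a minimal sketch, assuming two small in-memory data frames as hypothetical stand-ins for train.csv and test.csv (the values are illustrative, not taken from the Kaggle files). It also uses reset_index(drop=True), which resets the index and discards the old one in a single step:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test sets (illustrative values only)
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
test = pd.DataFrame({"PassengerId": [3, 4], "Age": [34.5, 47.0]})

# Each frame carries its own 0-based index, so the labels repeat after concat
combined = pd.concat([train, test])
has_duplicate_labels = combined.index.duplicated().any()

# drop=True discards the old index instead of adding an 'index' column
combined = combined.reset_index(drop=True)

# Put the columns back in the train set's order
combined = combined.reindex(columns=train.columns)

print(has_duplicate_labels)
print(combined)
```

Note that the test rows end up with a missing value in the Survived column, because the test set does not contain the label.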