Engineering new features

In the previous few examples, we saw that changing the features can have quite a large impact on the performance of the algorithm. Through our small amount of testing, we had more than 10 percent variance just from the features.

You can create features that come from a simple function in pandas by doing something like this:

dataset["New Feature"] = feature_creator()

The feature_creator function must return a list containing the feature's value for each sample in the dataset, in the same order as the rows. A common pattern is to pass the dataset in as a parameter:

dataset["New Feature"] = feature_creator(dataset)
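As a minimal sketch of this pattern, the following assumes a small made-up dataset with HomePts and VisitorPts columns. The team names, the scores, and the "Point Margin" feature are illustrative assumptions, not values from the chapter's data:

```python
import pandas as pd

# Tiny illustrative dataset; the column names and values are assumptions.
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Hawks"],
    "Visitor Team": ["Bulls", "Hawks", "Heat"],
    "HomePts": [101, 95, 88],
    "VisitorPts": [99, 97, 88],
})


def feature_creator(dataset):
    """Return one value per row: home points minus visitor points."""
    return (dataset["HomePts"] - dataset["VisitorPts"]).tolist()


# The returned list has one entry per row, so it can be assigned directly.
dataset["Point Margin"] = feature_creator(dataset)
print(dataset)
```

Because the function works on whole columns, it never needs to loop over individual rows.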

You can create those features more directly by setting all the values to a single default value, like 0 in the next line:

dataset["My New Feature"] = 0

You can then iterate over the dataset, computing the features as you go. We used this format in this chapter to create many of our features:

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Some calculation here that produces feature_value for this row
    # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
    dataset.at[index, "FeatureName"] = feature_value

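Keep in mind that this pattern isn't very efficient, because it visits the rows one at a time in Python. If you are going to use it, compute all of your features in the same loop rather than running a separate loop for each one.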

A common best practice is to touch every sample as little as possible, preferably only once.
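One way to follow this advice is sketched below: a single pass over the rows collects several features into plain lists, which are then assigned as columns in one step. The sample data and the feature names (HomeGamesPlayed, HomeWinsSoFar, and so on) are assumptions made for illustration:

```python
from collections import defaultdict

import pandas as pd

# Illustrative data; the team names and results are made up.
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Hawks", "Heat"],
    "Visitor Team": ["Bulls", "Hawks", "Heat", "Bulls"],
    "HomeWin": [True, False, True, False],
})

games_played = defaultdict(int)  # games seen so far, per team
wins = defaultdict(int)          # wins seen so far, per team

home_games, visitor_games = [], []
home_wins, visitor_wins = [], []

# One pass over the rows, computing every feature for a row at once.
for home, visitor, home_win in zip(
    dataset["Home Team"], dataset["Visitor Team"], dataset["HomeWin"]
):
    # Record the features using only information from earlier games.
    home_games.append(games_played[home])
    visitor_games.append(games_played[visitor])
    home_wins.append(wins[home])
    visitor_wins.append(wins[visitor])

    # Then update the running totals with this game's result.
    games_played[home] += 1
    games_played[visitor] += 1
    wins[home if home_win else visitor] += 1

# Assign each finished list as a column in a single step.
dataset["HomeGamesPlayed"] = home_games
dataset["VisitorGamesPlayed"] = visitor_games
dataset["HomeWinsSoFar"] = home_wins
dataset["VisitorWinsSoFar"] = visitor_wins
print(dataset)
```

Recording the features before updating the totals keeps the current game's result out of its own features.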

Some example features that you could try and implement are as follows:

  • How many days has it been since each team's previous match? Teams may be tired if they play too many games in a short time frame.
  • How many games of the last five did each team win? This will give a more stable form of the HomeLastWin and VisitorLastWin features we extracted earlier (and can be extracted in a very similar way).
  • Do teams have a good record when visiting certain other teams? For instance, one team may play well in a particular stadium, even if they are the visitors.
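The first two ideas above could be sketched roughly as follows. This is one possible approach, not the chapter's implementation. It assumes a Date column holding parsed dates and a boolean HomeWin column, and the data and the new feature names are made up for illustration:

```python
from collections import defaultdict, deque

import pandas as pd

# Illustrative data; the dates, teams, and results are assumptions.
dataset = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-10-29", "2013-10-30", "2013-11-01", "2013-11-02"]
    ),
    "Home Team": ["Hawks", "Bulls", "Hawks", "Heat"],
    "Visitor Team": ["Bulls", "Heat", "Heat", "Bulls"],
    "HomeWin": [True, False, True, False],
}).sort_values("Date").reset_index(drop=True)

last_played = {}  # date of each team's most recent game
# Each team's most recent results (1 = win, 0 = loss), at most five.
recent_results = defaultdict(lambda: deque(maxlen=5))

home_rest, visitor_rest = [], []
home_form, visitor_form = [], []

for date, home, visitor, home_win in zip(
    dataset["Date"], dataset["Home Team"],
    dataset["Visitor Team"], dataset["HomeWin"]
):
    for team, rest, form in (
        (home, home_rest, home_form),
        (visitor, visitor_rest, visitor_form),
    ):
        previous = last_played.get(team)
        # NaN marks a team's first game, where no previous match exists.
        rest.append((date - previous).days if previous is not None
                    else float("nan"))
        # Wins among the team's last (up to) five earlier games.
        form.append(sum(recent_results[team]))

    # Update the history only after the features have been recorded.
    last_played[home] = date
    last_played[visitor] = date
    recent_results[home].append(int(home_win))
    recent_results[visitor].append(int(not home_win))

dataset["HomeDaysRest"] = home_rest
dataset["VisitorDaysRest"] = visitor_rest
dataset["HomeWinsLastFive"] = home_form
dataset["VisitorWinsLastFive"] = visitor_form
print(dataset)
```

A deque with maxlen=5 discards the oldest result automatically, so each team's history never holds more than five games. The third idea could follow the same shape, keyed on the pair of home and visitor teams instead of a single team.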

If you have trouble extracting features of these types, check the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/ for help. Alternatively, you can try an online forum such as Stack Overflow for assistance.

More extreme examples could use player data to estimate the strength of each team and predict who will win. These types of complex features are used every day by gamblers and sports betting agencies to try to turn a profit by predicting the outcome of sports matches.
