Extracting new features

We will now extract some features from this dataset by combining and comparing the existing data. First, we need to specify our class value, which gives our classification algorithm something to compare its predictions against. This could be encoded in a number of ways; for this application, we will specify our class as 1 if the home team wins and 0 if the visitor team wins. In basketball, the team with the most points wins, so while the dataset doesn't state the winner directly, we can easily compute it.

We can compute the class value with the following:

dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]

We then copy those values into a NumPy array to use later for our scikit-learn classifiers. There is not currently a clean integration between pandas and scikit-learn, but they work nicely together through the use of NumPy arrays. While we will use pandas to extract features, we will need to extract the values to use them with scikit-learn:

y_true = dataset["HomeWin"].values

The preceding array now holds our class values in a format that scikit-learn can read.
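As a quick check of the two steps above, here is a minimal sketch on a hypothetical three-game sample. The team names and scores are invented for illustration; only the column names follow the dataset used in this chapter.

```python
import pandas as pd

# Hypothetical three-game sample; teams and scores are invented.
toy = pd.DataFrame({
    "Visitor Team": ["A", "B", "C"],
    "VisitorPts": [98, 110, 87],
    "Home Team": ["B", "C", "A"],
    "HomePts": [101, 95, 102],
})

# True where the home team outscored the visitors.
toy["HomeWin"] = toy["VisitorPts"] < toy["HomePts"]

# Extract the class values as a plain NumPy array.
toy_y_true = toy["HomeWin"].values
print(toy_y_true)
```

The comparison produces Boolean values, which behave as 1 and 0 in arithmetic, so they match the 1/0 encoding described earlier.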

Incidentally, a better baseline for sports prediction than random guessing is to predict the home team in every game. Home teams have been shown to have an advantage in nearly all sports across the world. How big is this advantage? Let's have a look:

dataset["HomeWin"].mean()

The resulting value, around 0.59, indicates that the home team wins 59 percent of games on average. This is higher than 50 percent from random chance and is a simple rule that applies to most sports.
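To see why the mean of the class column is the accuracy of the "always pick the home team" rule, consider this small sketch. The ten outcomes are hypothetical and are not drawn from the real dataset.

```python
import numpy as np

# Hypothetical outcomes for ten games: True means the home team won.
home_win = np.array([True, True, False, True, False,
                     True, True, False, True, False])

# The baseline rule is a constant prediction of "home team wins".
always_home = np.ones_like(home_win, dtype=bool)

# Accuracy of the rule: the fraction of games where it was right.
baseline_accuracy = (always_home == home_win).mean()
print(baseline_accuracy, home_win.mean())
```

Both numbers are identical, because the constant prediction is correct exactly when the home team wins.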

We can also start creating some features to use in our data mining for the input values (the X array). While sometimes we can just throw the raw data into our classifier, we often need to derive continuous numerical or categorical features from our data.

For our current dataset, we can't really use the features already present (in their current form) to make a prediction. We would not know the scores of a game at the time we need to predict its outcome, so we cannot use them as features. While this might sound obvious, it can be easy to miss.

The first two features we want to create to help us predict which team will win are whether either of those two teams won their previous game. This would roughly approximate which team is currently playing well.

We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether the team won the last time we saw them.

We first create a (default) dictionary to store the team's last result:

from collections import defaultdict 
won_last = defaultdict(int)

We then create two new columns on our dataset to store the values of our new features:

dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0

The key of this dictionary will be the team and the value will be whether they won their previous game. We can then iterate over all the rows and update the current row with the team's last result:

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Record each team's previous result on the current row
    dataset.at[index, "HomeLastWin"] = won_last[home_team]
    dataset.at[index, "VisitorLastWin"] = won_last[visitor_team]
    # Store this game's outcome for each team's next game
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])

Note that the preceding code relies on our dataset being in chronological order. Our dataset is in order; however, if you are using a dataset that is not in order, you will need to replace dataset.iterrows() with dataset.sort_values("Date").iterrows(). (Older pandas versions used dataset.sort("Date"), which has since been removed.)
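A minimal sketch of putting a dataset into chronological order, assuming the Date column arrives as ISO-formatted strings (the sample rows here are hypothetical). Parsing to datetimes first avoids relying on string ordering, which breaks for formats such as "Tue, Oct 29, 2013".

```python
import pandas as pd

# Hypothetical out-of-order sample; dates and teams are invented.
games = pd.DataFrame({
    "Date": ["2013-11-02", "2013-10-29", "2013-10-30"],
    "Home Team": ["C", "A", "B"],
})

# Parse the strings so that sorting is chronological, not alphabetical.
games["Date"] = pd.to_datetime(games["Date"])

# Sort by date and renumber the rows from zero.
ordered = games.sort_values("Date").reset_index(drop=True)
print(ordered)
```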

Those last two lines in the loop update our dictionary with either a 1 or a 0, depending on which team won the current game. This information is used for the next game each team plays.
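To make the behavior of the loop concrete, here is a self-contained sketch on a hypothetical four-game schedule (teams A, B, and C are invented). It uses the `.at` accessor for the cell updates, which is an assumption suited to recent pandas versions.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical schedule, already in chronological order.
toy = pd.DataFrame({
    "Home Team": ["A", "B", "A", "C"],
    "Visitor Team": ["B", "C", "C", "B"],
    "HomeWin": [True, False, False, True],
})
toy["HomeLastWin"] = 0
toy["VisitorLastWin"] = 0

# Teams not seen yet default to 0 (treated as "lost last game").
last_result = defaultdict(int)

for index, row in toy.iterrows():
    home, visitor = row["Home Team"], row["Visitor Team"]
    # Look up each team's previous result before updating it.
    toy.at[index, "HomeLastWin"] = last_result[home]
    toy.at[index, "VisitorLastWin"] = last_result[visitor]
    # Record this game's outcome for each team's next game.
    last_result[home] = int(row["HomeWin"])
    last_result[visitor] = 1 - int(row["HomeWin"])

print(toy[["Home Team", "Visitor Team", "HomeLastWin", "VisitorLastWin"]])
```

In the third game, both A and C won their previous games, so both features are 1. In the first two games every team is being seen for the first time, so both features stay at 0.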

After the preceding code runs, we will have two new features: HomeLastWin and VisitorLastWin. Have a look at the dataset using dataset.head(6) to see an example of a home team and a visitor team that won their previous game. You can look at other parts of the dataset using the pandas label-based indexer:

dataset.loc[1000:1005]

Currently, this gives a value of 0 (as if the team had lost its last game) to every team when it is first seen, including the previous year's champion! We could improve this feature using the previous year's data, but we will not do that in this chapter.
