Cleaning up the dataset
After looking at the output, we can see a number of problems:
- The date is just a string and not a date object
- From visually inspecting the results, the headings aren't complete or correct
These issues come from the data itself, and we could fix them by altering the raw data. However, in doing this, we could forget the steps we took or misapply them; that is, we couldn't replicate our results. As in the previous section, where we used pipelines to track the transformations we made to a dataset, we will use pandas to apply the transformations to the raw data as it is loaded.
The pandas.read_csv function has parameters to fix each of these issues, which we can specify when loading the file. We can also change the headings after loading the file, as shown in the following code:
dataset = pd.read_csv(data_filename, parse_dates=["Date"])
dataset.columns = ["Date", "Start (ET)", "Visitor Team", "VisitorPts",
                   "Home Team", "HomePts", "OT?", "Score Type", "Notes"]
The results have significantly improved, as we can see if we print out the resulting data frame:
dataset.head()
The output shows the first five rows of the DataFrame, now with the correct column headings and the Date column parsed as a datetime type.
Even in well-compiled data sources such as this one, you need to make some adjustments. Different systems have different nuances, resulting in data files that are not quite compatible with each other. When loading a dataset for the first time, always inspect the loaded data (even if it's in a known format) and also check the data types of each column. In pandas, this can be done with the following code:
print(dataset.dtypes)
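Beyond eyeballing the dtypes, you can assert them programmatically. The following is a minimal sketch (assuming the dataset and column names loaded above) that checks the Date column really was parsed into a datetime type rather than left as strings:
from pandas.api.types import is_datetime64_any_dtype
# Fail loudly if parse_dates did not take effect for the "Date" column
assert is_datetime64_any_dtype(dataset["Date"]), "Date column was not parsed as dates"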
Now that we have our dataset in a consistent format, we can compute a baseline: the accuracy that a very simple, naive approach achieves on the problem. Any decent data mining solution should beat this baseline figure.
For a product recommendation system, a good baseline is to simply recommend the most popular product.
For a classification task, a good baseline is to always predict the most frequent class, or alternatively to apply a very simple classification algorithm such as OneR.
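As a sketch of the most-frequent-class baseline, scikit-learn's DummyClassifier can be used. The toy features and labels below are hypothetical placeholders, not part of this chapter's dataset; they simply show that an imbalanced label set (70/30 here) yields a baseline accuracy of 0.7:
from sklearn.dummy import DummyClassifier
import numpy as np

X = np.zeros((100, 1))              # features are ignored by a DummyClassifier
y = np.array([1] * 70 + [0] * 30)   # hypothetical imbalanced labels

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))         # 0.7: accuracy of always predicting the majority class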
For our dataset, each match has two teams: a home team and a visitor team. An obvious baseline for this task is 50 percent, which is our expected accuracy if we simply guessed a winner at random. In other words, choosing the predicted winning team randomly will (over time) result in an accuracy of around 50 percent. With a little domain knowledge, however, we can use a better baseline for this task, which we will see in the next section.
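To see why random guessing sits at around 50 percent, the following is a minimal simulation sketch. The outcomes here are randomly generated stand-ins, not the real match results; with the actual dataset you would compare your guesses against the true winners instead:
import numpy as np

rng = np.random.default_rng(14)
n_games = 10000
actual_home_win = rng.random(n_games) < 0.5    # stand-in for real outcomes
guessed_home_win = rng.random(n_games) < 0.5   # guess the winner at random
accuracy = np.mean(actual_home_win == guessed_home_win)
print("Random-guess accuracy: {:.3f}".format(accuracy))  # roughly 0.5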