Using pandas to load the dataset

The pandas library is used for loading, managing, and manipulating data. It handles data structures behind the scenes and supports data analysis functions, such as computing the mean and grouping data by value.
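As a quick illustration of the two analysis functions just mentioned, here is a minimal sketch that computes a mean and groups data by value. The toy table and its column names (team, points) are invented for this example and are not part of the book's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
scores = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [100, 90, 110, 80],
})

# Mean of a single column across all rows.
overall_mean = scores["points"].mean()

# Group rows by the value in "team", then take the mean of "points" per group.
mean_by_team = scores.groupby("team")["points"].mean()

print(overall_mean)
print(mean_by_team)
```

The grouped result is a pandas Series indexed by team, so an individual value can be read with mean_by_team["A"].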

When doing multiple data mining experiments, you will find that you write many of the same functions again and again, such as reading files and extracting features. Each time you reimplement one of them, you run the risk of introducing bugs. Using a high-quality library such as pandas significantly reduces the amount of work these tasks require, and also gives you more confidence, because well-tested code underlies your own programs.

Throughout this book, we will be using pandas a lot, introducing use cases as we go and new functions as needed.

We can load the dataset using the read_csv function:

import pandas as pd
data_filename = "basketball.csv"
dataset = pd.read_csv(data_filename)

The result is a pandas DataFrame, which has some useful methods that we will use later on. Looking at the resulting dataset, we can see some issues. Type the following and run the code to see the first five rows of the dataset:

dataset.head(5)

Here's the output:

Reading the data with only the default parameters produced quite a usable dataset, but it has some issues, which we will address in the next section.
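If you do not have basketball.csv at hand, the same two calls can be tried on a small in-memory stand-in. The following is only a sketch: the column names and rows are invented for illustration and should not be taken as the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for basketball.csv.
# The columns and values below are made up, not taken from the real dataset.
csv_lines = [
    "Date,Visitor,VisitorPts,Home,HomePts",
    "2013-10-29,Team A,87,Team B,97",
    "2013-10-29,Team C,95,Team D,107",
    "2013-10-29,Team E,103,Team F,116",
    "2013-10-30,Team G,93,Team H,87",
    "2013-10-30,Team I,98,Team J,94",
    "2013-10-30,Team K,101,Team L,105",
]
csv_text = "\n".join(csv_lines)

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# head(5) returns a new DataFrame holding the first five rows.
first_rows = sample.head(5)
print(first_rows)
```

With the defaults, read_csv treats the first line as the header row and infers a type for each column, which is why the point columns come back as integers while the date column stays as plain text.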