- The Data Science Workshop
- Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Python for Data Science
Python offers an incredible number of packages for data science. A package is a collection of prebuilt functions and classes shared publicly by its author(s). These packages extend the core functionality of Python. The Python Package Index (https://packt.live/37iTRXc) lists the packages that have been published for Python.
In this section, we will present to you two of the most popular ones: pandas and scikit-learn.
The pandas Package
The pandas package provides an extensive set of APIs for manipulating data structures. Its two main data structures are DataFrame and Series.
DataFrame and Series
A DataFrame is a tabular data structure that is represented as a two-dimensional table. It is composed of rows, columns, indexes, and cells. It is very similar to a sheet in Excel or a table in a database:

Figure 1.28: Components of a DataFrame
In Figure 1.28, there are three different columns: algorithm, learning, and type. Each of these columns (also called variables) contains a specific type of information. For instance, the algorithm variable lists the names of different machine learning algorithms.
A row stores the information related to a record (also called an observation). For instance, row number 2 (index number 2) refers to the RandomForest record and all its attributes are stored in the different columns.
Finally, a cell is the value at the intersection of a given row and column. For example, Clustering is the value of the cell at row index 3 (the k-means record) and the type column.
So, a DataFrame is a structured representation of some data organized by rows and columns. A row represents an observation and each column contains the value of its attributes. This is the most common data structure used in data science.
In pandas, a DataFrame is represented by the DataFrame class. A pandas DataFrame is composed of pandas Series, which are one-dimensional arrays. A pandas Series is essentially a single column of a DataFrame.
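To make these concepts concrete, here is a minimal sketch that builds the DataFrame from Figure 1.28 by hand (using the same data as the CSV example later in this section), extracts one column as a Series, and reads a single cell:

```python
import pandas as pd

# Recreate the DataFrame from Figure 1.28.
df = pd.DataFrame({
    'algorithm': ['Linear Regression', 'Logistic Regression',
                  'RandomForest', 'k-means'],
    'learning': ['Supervised', 'Supervised', 'Supervised', 'Unsupervised'],
    'type': ['Regression', 'Classification',
             'Regression or Classification', 'Clustering'],
})

# Each column of a DataFrame is a pandas Series (a one-dimensional array).
algo_series = df['algorithm']
print(type(algo_series))        # <class 'pandas.core.series.Series'>

# A cell is the intersection of a row index and a column name.
print(df.loc[2, 'algorithm'])   # RandomForest
```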
Data is usually classified into two groups: structured and unstructured. Think of structured data as database tables or Excel spreadsheets where each column and row has a predefined structure. For example, in a table or spreadsheet that lists all the employees of a company, every record will follow the same pattern, such as the first column containing the date of birth, the second and third ones being for first and last names, and so on.
On the other hand, unstructured data is not organized with predefined, static patterns. Text and images are good examples of unstructured data. If you read a book and look at each sentence, it will not be possible for you to say that the second word of a sentence is always a verb or a person's name; it can be anything, depending on how the author chose to convey the information. Each sentence has its own structure and will differ from the next. Similarly, for a group of images, you can't say that pixels 20 to 30 will always represent the eye of a person or the wheel of a car: it will be different for each image.
Data can come from different sources: flat files, databases, or Application Programming Interface (API) feeds, for example. In this book, we will work with flat files such as CSV, Excel, and JSON files. Each of these file types stores information with its own format and structure.
We'll have a look at the CSV file first.
CSV Files
CSV files use the comma character (,) to separate columns and a newline character to mark the end of each row. The previous example of a DataFrame would look like this in a CSV file:
algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
In Python, you need to first import the packages you require before being able to use them. To do so, you will have to use the import command. You can create an alias of each imported package using the as keyword. It is quite common to import the pandas package with the alias pd:
import pandas as pd
pandas provides a .read_csv() method to easily load a CSV file directly into a DataFrame. You just need to provide the path or the URL to the CSV file, as shown below.
Note
Watch out for the slashes in the string below. Remember that the backslashes (\) are used to split the code across multiple lines, while the forward slashes (/) are part of the URL.
pd.read_csv('https://raw.githubusercontent.com/PacktWorkshops'\
'/The-Data-Science-Workshop/master/Chapter01/'\
'Dataset/csv_example.csv')
You should get the following output:

Figure 1.29: DataFrame after loading a CSV file
Note
In this book, we will be loading datasets stored in the Packt GitHub repository: https://packt.live/2ucwsId.
GitHub renders stored files in its own web interface. To load the original version of a dataset, you will need its raw version: click the Raw button and copy the URL shown in your browser.
Have a look at Figure 1.30:

Figure 1.30: Getting the URL of a raw dataset on GitHub
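If you want to experiment with .read_csv() without downloading anything, you can wrap CSV text in io.StringIO, which makes a string behave like a file. The following sketch parses the same CSV content shown earlier, entirely in memory:

```python
import io
import pandas as pd

csv_text = """algorithm,learning,type
Linear Regression,Supervised,Regression
Logistic Regression,Supervised,Classification
RandomForest,Supervised,Regression or Classification
k-means,Unsupervised,Clustering
"""

# io.StringIO makes the string behave like a file, so read_csv can
# parse it without touching the disk or the network.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (4, 3)
print(list(df.columns))  # ['algorithm', 'learning', 'type']
```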
Excel Spreadsheets
Excel is a Microsoft tool and is very popular in the industry. It has its own internal structure for recording additional information, such as the data type of each cell or even Excel formulas. There is a specific method in pandas to load Excel spreadsheets called .read_excel():
pd.read_excel('https://github.com/PacktWorkshops'\
'/The-Data-Science-Workshop/blob/master'\
'/Chapter01/Dataset/excel_example.xlsx?raw=true')
You should get the following output:

Figure 1.31: DataFrame after loading an Excel spreadsheet
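You can also try .read_excel() without a remote file by writing a DataFrame to an in-memory Excel file and reading it back. This is a minimal sketch, assuming the openpyxl package (the default .xlsx engine for pandas) is installed:

```python
import io
import pandas as pd

df = pd.DataFrame({'algorithm': ['k-means'], 'type': ['Clustering']})

# Write the DataFrame to an in-memory Excel file, then read it back.
# Requires openpyxl, which pandas uses for .xlsx files by default.
buffer = io.BytesIO()
df.to_excel(buffer, index=False)
buffer.seek(0)
df_back = pd.read_excel(buffer)
print(df_back.equals(df))  # True
```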
JSON
JSON is a very popular file format, mainly used for transferring data from web APIs. Its structure is very similar to that of a Python dictionary with key-value pairs. The example DataFrame we used before would look like this in JSON format:
{
"algorithm":{
"0":"Linear Regression",
"1":"Logistic Regression",
"2":"RandomForest",
"3":"k-means"
},
"learning":{
"0":"Supervised",
"1":"Supervised",
"2":"Supervised",
"3":"Unsupervised"
},
"type":{
"0":"Regression",
"1":"Classification",
"2":"Regression or Classification",
"3":"Clustering"
}
}
As you may have guessed, there is a pandas method for reading JSON data as well, and it is called .read_json():
pd.read_json('https://raw.githubusercontent.com/PacktWorkshops'\
'/The-Data-Science-Workshop/master/Chapter01'\
'/Dataset/json_example.json')
You should get the following output:

Figure 1.32: DataFrame after loading JSON data
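As with CSV files, you can test .read_json() offline by wrapping JSON text in io.StringIO (recent pandas versions expect a file-like object rather than a raw string). The sketch below parses a shortened version of the JSON shown above:

```python
import io
import pandas as pd

json_text = '''{
  "algorithm": {"0": "Linear Regression", "1": "k-means"},
  "learning":  {"0": "Supervised", "1": "Unsupervised"},
  "type":      {"0": "Regression", "1": "Clustering"}
}'''

# Each top-level key becomes a column; the nested keys become the index.
df = pd.read_json(io.StringIO(json_text))
print(df.shape)  # (2, 3)
```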
pandas provides more methods to load other types of files. The full list can be found in the following documentation: https://packt.live/2FiYB2O.
pandas is not limited to only loading data into DataFrames; it also provides a lot of other APIs for creating, analyzing, or transforming DataFrames. You will be introduced to some of its most useful methods in the following chapters.
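As a quick taste of those other APIs, the sketch below applies a few common inspection methods to a small DataFrame (the method names used here, .head() and .value_counts(), are standard pandas APIs):

```python
import pandas as pd

df = pd.DataFrame({
    'algorithm': ['Linear Regression', 'Logistic Regression',
                  'RandomForest', 'k-means'],
    'learning': ['Supervised', 'Supervised', 'Supervised', 'Unsupervised'],
})

# .head(n) returns the first n rows of a DataFrame.
print(df.head(2))

# .value_counts() counts how often each value appears in a Series.
print(df['learning'].value_counts())
```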
Exercise 1.02: Loading Data of Different Formats into a pandas DataFrame
In this exercise, we will practice loading different data formats, such as CSV, TSV, and XLSX, into pandas DataFrames. The dataset we will use is the Top 10 Postcodes for the First Home Owner Grants dataset (this is a grant provided by the Australian government to help first-time real estate buyers). It lists the 10 postcodes (also known as zip codes) with the highest number of First Home Owner grants.
In this dataset, you will find the number of First Home Owner grant applications for each postcode and the corresponding suburb.
Note
This dataset can be found on our GitHub repository at https://packt.live/2FgAT7d.
Also, it is publicly available here: https://packt.live/2ZJBYhi.
The following steps will help you complete the exercise:
- Open a new Colab notebook.
- Import the pandas package, as shown in the following code snippet:
import pandas as pd
- Create a new variable called csv_url containing the URL to the raw CSV file:
csv_url = 'https://raw.githubusercontent.com/PacktWorkshops'\
'/The-Data-Science-Workshop/master/Chapter01'\
'/Dataset/overall_topten_2012-2013.csv'
- Load the CSV file into a DataFrame using the pandas .read_csv() method. The first row of this CSV file contains the name of the file, which you can see if you open the file directly. You will need to exclude this row by using the skiprows=1 parameter. Save the result in a variable called csv_df and print it:
csv_df = pd.read_csv(csv_url, skiprows=1)
csv_df
You should get the following output:
Figure 1.33: The DataFrame after loading the CSV file
- Create a new variable called tsv_url containing the URL to the raw TSV file:
tsv_url = 'https://raw.githubusercontent.com/PacktWorkshops'\
'/The-Data-Science-Workshop/master/Chapter01'\
'/Dataset/overall_topten_2012-2013.tsv'
Note
A TSV file is similar to a CSV file but instead of using the comma character (,) as a separator, it uses the tab character (\t).
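The sketch below (using a tiny made-up TSV string, not the exercise dataset) shows why the sep parameter matters: without it, read_csv looks for commas, finds none, and collapses everything into a single column.

```python
import io
import pandas as pd

tsv_text = "algorithm\tlearning\nk-means\tUnsupervised\n"

# Default separator is a comma, so the whole line becomes one column.
one_col = pd.read_csv(io.StringIO(tsv_text))
print(one_col.shape)   # (1, 1)

# With sep='\t', tab characters are treated as column separators.
two_cols = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(two_cols.shape)  # (1, 2)
```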
- Load the TSV file into a DataFrame using the pandas .read_csv() method and specify the skiprows=1 and sep='\t' parameters. Save the result in a variable called tsv_df and print it:
tsv_df = pd.read_csv(tsv_url, skiprows=1, sep='\t')
tsv_df
You should get the following output:
Figure 1.34: The DataFrame after loading the TSV file
- Create a new variable called xlsx_url containing the URL to the raw Excel spreadsheet:
xlsx_url = 'https://github.com/PacktWorkshops'\
'/The-Data-Science-Workshop/blob/master/'\
'Chapter01/Dataset'\
'/overall_topten_2012-2013.xlsx?raw=true'
- Load the Excel spreadsheet into a DataFrame using the pandas .read_excel() method. Save the result in a variable called xlsx_df and print it:
xlsx_df = pd.read_excel(xlsx_url)
xlsx_df
You should get the following output:
Figure 1.35: The DataFrame after loading the Excel spreadsheet
By default, .read_excel() loads the first sheet of an Excel spreadsheet. In this example, the data we're looking for is actually stored in the second sheet.
- Load the Excel spreadsheet into a DataFrame using the pandas .read_excel() method and specify the skiprows=1 and sheet_name=1 parameters. (Note that the sheet_name parameter is zero-indexed, so sheet_name=0 returns the first sheet, while sheet_name=1 returns the second sheet.) Save the result in a variable called xlsx_df1 and print it:
xlsx_df1 = pd.read_excel(xlsx_url, skiprows=1, sheet_name=1)
xlsx_df1
You should get the following output:
Figure 1.36: The DataFrame after loading the second sheet of the Excel spreadsheet
Note
To access the source code for this specific section, please refer to https://packt.live/2Yajzuq.
You can also run this example online at https://packt.live/2Q4dThe.
In this exercise, we learned how to load the Top 10 Postcodes for the First Home Owner Grants dataset from different file formats.
In the next section, we will be introduced to scikit-learn.