- Mastering Python for Data Science
- Samir Madhavan
- 2021-07-16 20:14:17
Empowering data analysis with pandas
The pandas library was developed by Wes McKinney while he was working at AQR Capital Management. He wanted a tool flexible enough to perform quantitative analysis on financial data. Later, Chang She joined him and helped develop the package further.
The pandas library is an open source Python library designed specifically for data analysis. It is built on top of NumPy and makes it easy to handle data. NumPy is a fairly low-level tool that handles matrices really well.
The pandas library brings the richness of R into the world of Python for handling data. It has efficient data structures for processing data, performing fast joins, and reading data from various sources, to name a few.
The data structures of pandas
The pandas library essentially has three data structures:
- Series
- DataFrame
- Panel
Series is a one-dimensional array, which can hold any type of data, such as integers, floats, strings, and Python objects too. A series can be created by calling the following:
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))
0    0.733810
1   -1.274658
2   -1.602298
3    0.460944
4   -0.632756
dtype: float64
The np.random.randn function is part of the NumPy package and generates random numbers drawn from a standard normal distribution. The pd.Series constructor creates a pandas series: the first column of the output is the index and the second column holds the random values. The last line of the output shows the datatype of the series.
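The three parts of a series described above (index, values, and datatype) can also be inspected directly. The following is a minimal sketch, not taken from the original text; the seeded generator and the variable names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Seeded generator so the values are reproducible (the seed 0 is arbitrary).
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(5))

index_labels = list(s.index)  # default integer labels 0 to 4
values = s.to_numpy()         # the underlying NumPy array of random values
dtype_name = str(s.dtype)     # the datatype shown at the bottom of the output

print(index_labels, dtype_name)
```

The default index is simply the position of each value, which is why the first column of the earlier output runs from 0 to 4.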
The index of the series can be customized by calling the following:
>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a   -0.929494
b   -0.571423
c   -1.197866
d    0.081107
e   -0.035091
dtype: float64
A series can be derived from a Python dict too:
>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)
A    10
B    20
C    30
dtype: int64
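Once a series has labels, whether they come from the index argument or from dict keys, values can be looked up by label. The following is a short sketch of that idea using the same dict as above; the variable names are illustrative only:

```python
import pandas as pd

d = {'A': 10, 'B': 20, 'C': 30}
s = pd.Series(d)

value_b = s['B']        # look up a single value by its label
subset = s[['A', 'C']]  # select several labels at once, returning a Series
doubled = s * 2         # arithmetic is applied element-wise and keeps the labels

print(value_b)
print(doubled)
```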
DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a table. A DataFrame can be formed from the following data structures:
- A NumPy array
- Lists
- Dicts
- Series
- A 2D NumPy array
A DataFrame can be created from a dict of series by calling the following commands:
>>> d = {'c1': pd.Series(['A', 'B', 'C']),
...      'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df
    c1  c2
0    A   1
1    B   2
2    C   3
3  NaN   4
The DataFrame can be created using a dict of lists too:
>>> d = {'c1': ['A', 'B', 'C', 'D'],
...      'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
  c1  c2
0  A   1
1  B   2
2  C   3
3  D   4
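In the dict-of-series example, the two series have different lengths, and pandas aligns them on their index, filling the gap with NaN. The following sketch, with illustrative variable names, shows one way to check the resulting shape, the missing entry, and the column datatype:

```python
import pandas as pd

d = {'c1': pd.Series(['A', 'B', 'C']),
     'c2': pd.Series([1, 2., 3., 4.])}
df = pd.DataFrame(d)

shape = df.shape                    # (rows, columns) after index alignment
missing = df['c1'].isna().tolist()  # True where c1 had no value for that row
c2_dtype = str(df['c2'].dtype)      # mixing 1 with 2., 3., 4. gives a float column

print(shape, missing, c2_dtype)
```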
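A Panel is a data structure that handles 3D data. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the following command, which creates panel data, runs only on older versions: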
>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
...      'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
The preceding output shows that the panel holds two DataFrames, one per item. Each has four rows, indexed along the major axis, and three columns, indexed along the minor axis.
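On pandas versions that no longer include Panel, the same items-by-rows-by-columns layout can be approximated with a DataFrame that has a two-level row index. The following is a sketch of that substitute technique, not the approach used in the original text; the level names item and row and the seeded generator are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
d = {'Item1': pd.DataFrame(rng.standard_normal((4, 3))),
     'Item2': pd.DataFrame(rng.standard_normal((4, 2)))}

# Stack the DataFrames vertically; the dict keys become the outer index level.
stacked = pd.concat(d, names=['item', 'row'])

# Select one item, much like indexing a panel by its item name.
item2 = stacked.loc['Item2']

print(stacked.shape)
print(item2)
```

Because Item2 has only two columns, its third column is filled with NaN after the concatenation.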
Inserting and exporting data
Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library makes it convenient to read data from these formats or to export to them. We'll use a dataset that contains the weight statistics of school students in the U.S.
We'll be using a file with the following structure:

To read data from a .csv file, the following read_csv function can be used:
>>> d = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> d[0:5]['AREA NAME']
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
The read_csv function takes the path of the .csv file as input and returns a DataFrame. The command after this prints the first five rows of the AREA NAME column in the data.
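The same call can be tried without the dataset on disk, because read_csv also accepts a file-like object. The following is a minimal sketch on made-up data: only the AREA NAME column name comes from the text above, while the EXAMPLE COUNT column and all of its values are hypothetical:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
csv_text = (
    "AREA NAME,EXAMPLE COUNT\n"
    "COHOES CITY SCHOOL DISTRICT,10\n"
    "COHOES CITY SCHOOL DISTRICT,12\n"
    "RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT,7\n"
)

d = pd.read_csv(io.StringIO(csv_text))  # the first line is used as the header
columns = list(d.columns)
first_two = d[0:2]['AREA NAME']         # slice rows, then pick one column

print(first_two)
```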
To write data to a .csv file, the following to_csv function can be used:
>>> d = {'c1': pd.Series(['A', 'B', 'C']),
...      'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df.to_csv('sample_data.csv')
The DataFrame is written to a .csv file by using the to_csv method. The argument is the path, including the filename, where the file should be created.
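By default, to_csv also writes the row index as an unnamed first column, which then reappears as an extra column when the file is read back. The following sketch shows the difference that index=False makes; it omits the path so that to_csv returns the CSV text as a string instead of creating a file, and the variable names are illustrative:

```python
import io
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C', 'D'],
                   'c2': [1.0, 2.0, 3.0, 4.0]})

with_index = df.to_csv()                # no path given: returns the CSV as a string
without_index = df.to_csv(index=False)  # leave out the row index column

# Reading the index-free text back gives the original two columns.
restored = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```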
In addition to the pandas package, the xlrd package needs to be installed for pandas to read the data from an Excel file:
>>> d = pd.read_excel('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xls')
The preceding function is similar to the CSV reading command. To write to an Excel file, the xlwt package needs to be installed:
>>> df.to_excel('sample_data.xls')
To read the data from a JSON file, Python's standard json package can be used. The following commands help in reading the file:
>>> import json
>>> json_data = open('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.json')
>>> data = json.load(json_data)
>>> json_data.close()
In the preceding commands, the open() function opens the file and returns a file object. The json.load() function parses the file contents into Python objects. The json_data.close() method closes the file.
The pandas library also provides a function to read a JSON file directly into a pandas object, which can be accessed using pd.read_json().
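The two approaches give different results: json.loads (the string counterpart of json.load) returns plain Python objects, whereas pd.read_json returns a DataFrame. The following sketch compares them on a small in-memory document; the records and the field names area and count are hypothetical and do not reflect the structure of the dataset file:

```python
import io
import json
import pandas as pd

# Hypothetical records, serialized to a JSON string in place of a file.
records = [{"area": "COHOES", "count": 10},
           {"area": "RAVENA", "count": 7}]
json_text = json.dumps(records)

data = json.loads(json_text)  # plain Python list of dicts
df = pd.read_json(io.StringIO(json_text), orient='records')  # DataFrame

print(type(data).__name__)
print(df)
```

For a file on disk, a path can be passed to pd.read_json in place of the StringIO object.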
To read data from a database, the following function can be used:
>>> pd.read_sql_table(table_name, con)
The preceding command reads the SQL table named table_name through the SQLAlchemy engine con and returns its contents as a DataFrame. This function does not support DBAPI connections. The following are descriptions of the parameters used:
- table_name: This refers to the name of the SQL table in a database
- con: This refers to the SQLAlchemy engine
The following command reads the result of an SQL query into a DataFrame:
>>> pd.read_sql_query(sql, con)
The following are descriptions of the parameters used:
- sql: This refers to the SQL query that is to be executed
- con: This refers to the SQLAlchemy engine
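To try read_sql_query without setting up a database server, an in-memory SQLite database can be used. The following is a minimal sketch under a few assumptions: it passes a connection from Python's standard sqlite3 module as con, which read_sql_query also accepts in place of an SQLAlchemy engine, and the students table with its name and weight columns is made up for illustration:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a small hypothetical table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE students (name TEXT, weight REAL)')
con.executemany('INSERT INTO students VALUES (?, ?)',
                [('a', 40.5), ('b', 52.0)])
con.commit()

# The query result is returned as a DataFrame, one column per selected field.
sql = 'SELECT name, weight FROM students WHERE weight > 45'
df = pd.read_sql_query(sql, con)
con.close()

print(df)
```

The read_sql_table function, by contrast, requires an SQLAlchemy engine, so this sqlite3 shortcut applies only to the query form.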