Empowering data analysis with pandas

The pandas library was developed by Wes McKinney while he was working at AQR Capital Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial data. Later, Chang She joined him and helped develop the package further.

The pandas library is an open source Python library, specially designed for data analysis. It has been built on NumPy and makes it easy to handle data. NumPy is a fairly low-level tool that handles matrices really well.

The pandas library brings the richness of R to the world of Python for handling data. It has efficient data structures to process data, perform fast joins, and read data from various sources, to name a few.

The data structure of pandas

The pandas library essentially has three data structures:

  1. Series
  2. DataFrame
  3. Panel

Series

Series is a one-dimensional array, which can hold any type of data, such as integers, floats, strings, and Python objects too. A series can be created by calling the following:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))

0 0.733810
1 -1.274658
2 -1.602298
3 0.460944
4 -0.632756
dtype: float64

The np.random.randn function is part of the NumPy package and generates random numbers drawn from the standard normal distribution. The pd.Series constructor creates a pandas series; in the output, the first column is the index and the second column holds the random values. The datatype of the series is shown at the bottom of the output.

The index of the series can be customized by calling the following:

>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

a -0.929494
b -0.571423
c -1.197866
d 0.081107
e -0.035091
dtype: float64
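As a small sketch of how the index, the values, and the datatype of a series can be inspected, the following uses fixed numbers instead of random ones so that the result is reproducible. The labels and values are made up for illustration:

```python
import pandas as pd

# Fixed values (not random) so the result is the same on every run.
s = pd.Series([0.5, -1.25, 2.0], index=['a', 'b', 'c'])

labels = list(s.index)   # the index labels: 'a', 'b', 'c'
values = s.values        # the underlying NumPy array of values
value_b = s['b']         # look up a value by its label
first = s.iloc[0]        # look up a value by its position
dtype_name = str(s.dtype)
```

Label-based lookup with s['b'] relies on the custom index; position-based lookup with s.iloc works regardless of which labels are used.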

A series can be derived from a Python dict too:

>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)

A 10
B 20
C 30
dtype: int64
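When a dict is combined with an explicit index, pandas keeps only the keys named in the index and fills in NaN for labels that are missing from the dict. A minimal sketch, where the label 'D' is deliberately absent from the dict:

```python
import pandas as pd

d = {'A': 10, 'B': 20, 'C': 30}

# 'C' is dropped because it is not in the index;
# 'D' is not a key of the dict, so its value becomes NaN.
s = pd.Series(d, index=['A', 'B', 'D'])

has_missing = bool(s.isna().any())
```

Because NaN is a floating-point value, the resulting series is stored as float64 rather than int64.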

DataFrame

DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a table. A DataFrame can be formed from the following data structures:

  • A NumPy array
  • Lists
  • Dicts
  • Series
  • A 2D NumPy array
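The following sketch illustrates a few of these sources. The column names x, y, and z and all of the values are made up for the example:

```python
import numpy as np
import pandas as pd

# From a 2D NumPy array: one row per array row, columns named explicitly.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

# From a list of dicts: each dict becomes a row, keys become columns.
records = [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]
df_from_records = pd.DataFrame(records)

# From a single Series: the series name becomes the column name.
df_from_series = pd.Series([10, 20, 30], name='z').to_frame()
```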

A DataFrame can be created from a dict of series by calling the following commands:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
 'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df

 c1 c2
0 A 1
1 B 2
2 C 3
3 NaN 4
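The NaN in the last row appears because the two series have different lengths: pandas aligns them on their index and fills the gaps with NaN. A short sketch of how the missing entry could be detected, reusing the same data:

```python
import pandas as pd

d = {'c1': pd.Series(['A', 'B', 'C']),
     'c2': pd.Series([1, 2., 3., 4.])}
df = pd.DataFrame(d)

# c1 has no entry for index 3, so that cell is NaN after alignment.
missing_mask = df['c1'].isna()
n_missing = int(missing_mask.sum())
rows_with_gaps = list(df.index[missing_mask])
```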

The DataFrame can be created using a dict of lists too:

>>> d = {'c1': ['A', 'B', 'C', 'D'],
 'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
 c1 c2
0 A 1
1 B 2
2 C 3
3 D 4

Panel

A Panel is a data structure that handles 3D data. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the following example runs only on older versions:

>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
 'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pd.Panel(d)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

The preceding output shows that the panel holds two DataFrames, one per item. The major axis has four rows and the minor axis has three columns; because Item2 has only two columns, its third column is filled with NaN.
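On pandas versions that no longer provide Panel, one possible substitute (a sketch, not the only option) is to stack the same DataFrames into a single DataFrame with a two-level index. The level names item and row are chosen here for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator for reproducibility
frames = {
    'Item1': pd.DataFrame(rng.standard_normal((4, 3))),
    'Item2': pd.DataFrame(rng.standard_normal((4, 2))),
}

# The dict keys become the outer index level; columns are the union
# of both frames, so Item2 gets NaN in its missing third column.
stacked = pd.concat(frames, names=['item', 'row'])

# Selecting one item returns an ordinary 2D DataFrame again.
item2 = stacked.loc['Item2']
```

Selecting stacked.loc['Item1'] or stacked.loc['Item2'] plays the role that indexing a panel by item used to play.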

Inserting and exporting data

Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library makes it convenient to read data from these formats and to export to them. We'll use a dataset that contains the weight statistics of school students in the U.S.

We'll be using a file with the following structure:

CSV

To read data from a .csv file, the following read_csv function can be used:

>>> d = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> d[0:5]['AREA NAME']

0 RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1 RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2 RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3 COHOES CITY SCHOOL DISTRICT
4 COHOES CITY SCHOOL DISTRICT

The read_csv function takes the path of the .csv file and returns its contents as a DataFrame. The second command prints the first five rows of the AREA NAME column in the data.
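Since the dataset file itself may not be at hand, the same calls can be tried on a small in-memory CSV. In this sketch only the AREA NAME column name comes from the dataset; the other columns and all of the values are invented for illustration:

```python
import io
import pandas as pd

# Made-up sample data; only the 'AREA NAME' header mirrors the real file.
csv_text = """AREA NAME,COUNTY,PCT OBESE
DISTRICT A,ALBANY,17.5
DISTRICT A,ALBANY,19.0
DISTRICT B,ALBANY,21.2
"""

# read_csv accepts a file-like object as well as a path.
d = pd.read_csv(io.StringIO(csv_text))

# Slice the first two rows, then select one column by name.
first_two = d[0:2]['AREA NAME']
```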

To write data to a .csv file, the following to_csv method can be used:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
 'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df.to_csv('sample_data.csv')

The DataFrame is written to a .csv file by using the to_csv method. The argument gives the path and filename of the file to be created.
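By default, to_csv also writes the index as the first column. The following sketch writes to an in-memory buffer instead of a file, passes index=False to leave the index out, and reads the text back to check the round trip:

```python
import io
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C', 'D'],
                   'c2': [1.0, 2.0, 3.0, 4.0]})

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row labels
csv_text = buffer.getvalue()

# Read the text back to confirm the data survives the round trip.
restored = pd.read_csv(io.StringIO(csv_text))
```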

XLS

In addition to the pandas package, the xlrd package needs to be installed for pandas to read the data from an Excel file:

>>> d=pd.read_excel('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xls')

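The preceding function is used in the same way as read_csv. To write to an .xls file, the xlwt package needs to be installed (pandas 2.0 removed xlwt support; newer versions write .xlsx files through the openpyxl package instead):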

>>> df.to_excel('sample_data.xls')

JSON

To read data from a JSON file, Python's standard json module can be used. The following commands read the file:

>>> import json
>>> json_data = open('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.json')
>>> data = json.load(json_data)
>>> json_data.close()

In the preceding commands, the open() function opens the file and returns a file object. The json.load() function parses the file's contents into Python objects. The json_data.close() method then closes the file.

The pandas library also provides a function to read the JSON file, which can be accessed using pd.read_json().
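The two approaches can be compared on a small JSON document. In this sketch the JSON text, including the area and pct field names, is made up; it assumes the JSON is a list of records, which may not match the layout of the dataset file:

```python
import io
import json
import pandas as pd

# Hypothetical JSON: a list of records with invented field names.
json_text = ('[{"area": "DISTRICT A", "pct": 17.5},'
             ' {"area": "DISTRICT B", "pct": 21.2}]')

# Approach 1: parse with the json module, then build a DataFrame.
records = json.loads(json_text)
df_from_records = pd.DataFrame(records)

# Approach 2: let pandas parse the JSON from a file-like object.
df_from_json = pd.read_json(io.StringIO(json_text))
```

For a list of flat records, both approaches give a DataFrame with one row per record and one column per field.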

Database

To read data from a database, the following function can be used:

>>> pd.read_sql_table(table_name, con)

Given a table name and an SQLAlchemy engine, the preceding function reads the whole table and returns it as a DataFrame. This function does not support DBAPI connections. The parameters are as follows:

  • table_name: This refers to the name of the SQL table in a database
  • con: This refers to the SQLAlchemy engine

The following command reads the result of an SQL query into a DataFrame:

>>> pd.read_sql_query(sql, con)

The parameters are as follows:

  • sql: This refers to the SQL query that is to be executed
  • con: This refers to the SQLAlchemy engine; a sqlite3 DBAPI connection is also accepted
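As a self-contained sketch of read_sql_query, the following uses an in-memory SQLite database through Python's standard sqlite3 module, which pandas accepts as con without SQLAlchemy. The students table and its rows are made up for the example:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical 'students' table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE students (name TEXT, weight REAL)')
con.executemany('INSERT INTO students VALUES (?, ?)',
                [('Ann', 50.5), ('Bob', 62.0), ('Cy', 58.25)])
con.commit()

# Run the query and load the result set into a DataFrame.
sql = 'SELECT name, weight FROM students WHERE weight > 55 ORDER BY name'
heavy = pd.read_sql_query(sql, con)

con.close()
```

The column names of the resulting DataFrame come from the SELECT clause, so renaming a column in SQL with AS also renames it in the DataFrame.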