- Python Data Analysis (Second Edition)
- Armando Fandango
- 2021-07-09 19:04:07
The Pandas DataFrames
A Pandas DataFrame is a labeled two-dimensional data structure, similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or to a relational database table. The columns of a Pandas DataFrame can be of different types. A similar concept was originally introduced in the R programming language (for more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in the following ways:
- From another DataFrame.
- From a NumPy array, or a composite of arrays, that has a two-dimensional shape.
- From another Pandas data structure called Series. We will learn about Series in the following section.
- From a file, such as a CSV file.
- From a dictionary of one-dimensional structures, such as one-dimensional NumPy arrays, lists, dicts, or Pandas Series.
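As a minimal sketch of the in-memory creation routes listed above, the following builds four small DataFrames. The variable names, column names, and values are invented for illustration and are not taken from the WHO data.

```python
import numpy as np
import pandas as pd

# From a dictionary of one-dimensional structures (plain lists here).
from_dict = pd.DataFrame({"country": ["A", "B", "C"], "population": [10, 20, 30]})

# From a two-dimensional NumPy array; column labels are supplied explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a Pandas Series; the Series name becomes the column label.
from_series = pd.DataFrame(pd.Series([1.5, 2.5, 3.5], name="rate"))

# From another DataFrame; copy=True avoids sharing data with the original.
from_frame = pd.DataFrame(from_dict, copy=True)

print(from_dict.shape, from_array.shape, from_series.shape, from_frame.shape)
```

Each call goes through the same `pd.DataFrame` constructor; only the type of the first argument changes.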
As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which contains only the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header:
Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
Afghanistan,1,1,151,28,,,,26088
In the next steps, we will take a look at a Pandas DataFrame and its attributes:
- To kick off, load the data file into a DataFrame and print it on the screen:
from pandas.io.parsers import read_csv
df = read_csv("WHO_first9cols.csv")
print("Dataframe", df)
The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines:
199 21732.0
200 11696.0
201 13228.0
[202 rows x 9 columns]
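If the CSV file is not at hand, the loading step can be tried on the two sample lines quoted earlier. This is only a sketch: the names `sample` and `sample_df` are hypothetical, and `io.StringIO` stands in for the real file.

```python
import io

import pandas as pd

# The header and first data row from the text, held in memory instead of a file.
sample = (
    "Country,CountryID,Continent,Adolescent fertility rate (%),"
    "Adult literacy rate (%),"
    "Gross national income per capita (PPP international $),"
    "Net primary school enrolment ratio female (%),"
    "Net primary school enrolment ratio male (%),"
    "Population (in thousands) total\n"
    "Afghanistan,1,1,151,28,,,,26088\n"
)

sample_df = pd.read_csv(io.StringIO(sample))

# One data row, nine columns.
print(sample_df.shape)

# The three empty fields in the row are parsed as missing values (NaN).
print(int(sample_df.isna().sum().sum()))
```

With the real file, only the argument to `read_csv` changes from the in-memory buffer to the file name.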
- The DataFrame has an attribute that holds its shape as a tuple, similar to ndarray. Query the number of rows of a DataFrame as follows:
print("Shape", df.shape)
print("Length", len(df))
The values we obtain match the printout of the preceding step:
Shape (202, 9)
Length 202
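The relationship between these values can be checked on a small, hypothetical frame (the name `small` and its contents are assumptions made for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 3 frame used only to illustrate the attributes.
small = pd.DataFrame(np.zeros((4, 3)), columns=["a", "b", "c"])

n_rows, n_cols = small.shape  # shape is a (rows, columns) tuple
print("Shape", small.shape)
print("Length", len(small))   # len() counts rows, so it equals shape[0]
print("Size", small.size)     # total number of cells: rows * columns
```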
- Check the column headers and data types with the other attributes:
print("Column Headers", df.columns)
print("Data types", df.dtypes)
We receive the column headers in a special data structure:
Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object')
The data types are printed as follows:
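The same two attributes can also be explored on a small, hypothetical frame rather than the WHO data. The sketch below assumes invented column names and shows that `dtypes` reports one type per column, and that a missing value pushes a numeric column to a floating-point type.

```python
import pandas as pd

# Hypothetical frame with one text, one integer, and one float column.
mixed = pd.DataFrame(
    {"name": ["x", "y"], "count": [1, 2], "ratio": [0.5, None]}
)

print(list(mixed.columns))   # column labels as a plain list
print(mixed.dtypes)          # one dtype per column, returned as a Series

# The missing value forces the "ratio" column to a floating-point dtype.
print(mixed["ratio"].dtype)
```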
- The Pandas DataFrame has an index, which is like the primary key of relational database tables. We can either specify the index or have Pandas create it automatically. The index can be accessed with a corresponding property, as follows:
print("Index", df.index)
An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row:
Index Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
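The two options mentioned above, an automatic index and a specified one, can be sketched as follows. The frame and its `code` column are hypothetical. Note that recent pandas releases typically display the automatic index as a `RangeIndex` rather than the `Int64Index` shown in the printout.

```python
import pandas as pd

# Hypothetical data; "code" plays the role of a primary key.
frame = pd.DataFrame({"code": ["af", "al", "dz"], "value": [1, 2, 3]})

# Without an explicit index, pandas numbers the rows 0, 1, 2, ...
print(list(frame.index))

# Specifying the index: promote the "code" column to row labels.
keyed = frame.set_index("code")
print(list(keyed.index))

# Label-based lookup through the index.
print(keyed.loc["al", "value"])
```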
- Sometimes, we wish to iterate over the underlying data of a DataFrame. Iterating over column values can be inefficient if we utilize the Pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The Pandas DataFrame has an attribute that can aid with this as well:
print("Values", df.values)
Please note that some values are designated nan in the output, for 'not a number'. These values come from empty fields in the input data file:
Values [['Afghanistan' 1 1 ..., nan nan 26088.0]
 ['Albania' 2 2 ..., 93.0 94.0 3172.0]
 ['Algeria' 3 3 ..., 94.0 96.0 33351.0]
 ...,
 ['Yemen' 200 1 ..., 65.0 85.0 21732.0]
 ['Zambia' 201 3 ..., 94.0 90.0 11696.0]
 ['Zimbabwe' 202 3 ..., 88.0 87.0 13228.0]]
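To illustrate why working on the underlying array is attractive, here is a small sketch on a hypothetical two-column numeric frame whose values are modeled on the output above. It uses `to_numpy()`, which returns the same data as the `values` attribute, together with NumPy functions that operate on whole columns at once and skip nan entries.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with missing entries in the first row.
nums = pd.DataFrame(
    {"female": [np.nan, 93.0, 94.0], "male": [np.nan, 94.0, 96.0]}
)

arr = nums.to_numpy()          # same data as nums.values, as a NumPy array
print(arr.shape)

# Vectorized NumPy operations replace explicit row-by-row iteration.
col_means = np.nanmean(arr, axis=0)   # column means, ignoring NaN
print(col_means)

# Count the missing cells with a boolean mask.
print(int(np.isnan(arr).sum()))
```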
The preceding code is in the Python notebook ch-03.ipynb, which is part of the code bundle of this book.