- Mastering Python for Data Science
- Samir Madhavan
- 807字
- 2021-07-16 20:14:18
Data operations
Once the missing data is handled, various operations can be performed on the data.
Aggregation operations
There are a number of aggregation operations, such as average, sum, and so on, which you would like to perform on a numerical field. These are the methods used to perform it:
- Average: To find out the average number of students in the
ELEMENTARY
school who are obese, we'll first filter theELEMENTARY
data with the following command:>>> data = d[d['GRADE LEVEL'] == 'ELEMENTARY'] 213.41593780369291
Now, we'll find the mean using the following command:
>>> data['NO. OBESE'].mean()
The elementary grade level data is filtered and stored in the data object. The
NO. OBESE
column is selected, which contains the number of obese students and using themean()
method, the average is taken out. - SUM: To find out the total number of elementary students who are obese across all the school, use the following command:
>>> data['NO. OBESE'].sum() 219605.0
- MAX: To get the maximum number of students that are obese in an elementary school, use the following command:
>>> data['NO. OBESE'].max() 48843.0
- MIN: To get the minimum number of students that are obese in an elementary school, use the following command:
>>> data['NO. OBESE'].min() 5.0
- STD: To get the standard deviation of the number of obese students, use the following command:
>>> data['NO. OBESE'].std() 1690.3831128098113
- COUNT: To count the total number of schools with the
ELEMENTARY
grade in theDELAWARE
county, use the following command:>>> data = df[(d['GRADE LEVEL'] == 'ELEMENTARY') & (d['COUNTY'] == 'DELAWARE')] >>> data['COUNTY'].count() 19
The table is filtered for the
ELEMENTARY
grade and theDELAWARE
county. Notice that the conditions are enclosed in parentheses. This is to ensure that individual conditions are evaluated and if the parentheses are not provided, then Python will throw an error.
Joins
SQL-like joins can be performed on the DataFrame using pandas. Let's define a lookup DataFrame, which assigns levels to each of the grades using the following command:
>>> grade_lookup = {'GRADE LEVEL': pd.Series(['ELEMENTARY', 'MIDDLE/HIGH', 'MISC']), 'LEVEL': pd.Series([1, 2, 3])} >>> grade_lookup = DataFrame(grade_lookup)
Let's take the first five rows of the GRADE
data column as an example for performing the joins:
>>> df[['GRADE LEVEL']][0:5] GRADE LEVEL 0 DISTRICT TOTAL 1 ELEMENTARY 2 MIDDLE/HIGH 3 DISTRICT TOTAL 4 ELEMENTARY
The following image is a sample of an inner join:

An inner join can be performed with the following command:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='inner') >>> d_sub[['GRADE LEVEL', 'LEVEL']] GRADE LEVEL LEVEL 1 ELEMENTARY 1 4 ELEMENTARY 1 2 MIDDLE/HIGH 2
The join takes place with the join()
method. The first argument takes the DataFrame on which the lookup takes place. Note that the grade_lookup
DataFrame's index is being set by the set_index()
method. This is essential for a join, as without it, the join method won't know on which column to join the DataFrame to.
The second argument takes a column of the d
DataFrame to join the data. The third argument defines the join as an inner join.
The following image is a sample of a left outer join:

A left outer join can be performed with the following commands:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='left') >>> d_sub[['GRADE LEVEL', 'LEVEL']] GRADE LEVEL LEVEL 0 DISTRICT TOTAL NaN 1 ELEMENTARY 1 2 MIDDLE/HIGH 2 3 DISTRICT TOTAL NaN 4 ELEMENTARY 1
You can notice that DISTRICT TOTAL
has missing values for a level column, as the grade_lookup
DataFrame does not have an instance for DISTRICT TOTAL
.
The following image is a sample of a full outer join:

The full outer join can be performed with the following commands:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='outer') >>> d_sub[['GRADE LEVEL', 'LEVEL']] GRADE LEVEL LEVEL 0 DISTRICT TOTAL NaN 3 DISTRICT TOTAL NaN 1 ELEMENTARY 1 4 ELEMENTARY 1 2 MIDDLE/HIGH 2 4 MISC 3
It's easy to do an SQL-like group by operation with pandas. Let's say, if you want to find the sum of the number of obese students in each of the grades, then you can use the following command:
>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).sum() GRADE LEVEL DISTRICT TOTAL 127101 ELEMENTARY 72880 MIDDLE/HIGH 53089
This command chooses the number of obese students column, then uses the group by method to group the data-based group level, and finally, the sum method sums up the number. The same can be achieved by the following function too:
>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate(sum)
Here, the aggregate method is utilized. The sum function is passed to obtain the required results.
It's also possible to obtain multiple kinds of aggregations on the same metric. This can be achieved by the following command:
>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([sum, mean, std]) sum mean std GRADE LEVEL DISTRICT TOTAL 127101 128.384848 158.933263 ELEMENTARY 72880 76.958817 100.289578 MIDDLE/HIGH 53089 59.251116 65.905591
- FuelPHP Application Development Blueprints
- LabVIEW Graphical Programming Cookbook
- C/C++算法從菜鳥到達人
- 實戰低代碼
- aelf區塊鏈應用架構指南
- Spring實戰(第5版)
- Drupal 8 Configuration Management
- Android驅動開發權威指南
- 快速入門與進階:Creo 4·0全實例精講
- Learning Docker Networking
- Android嵌入式系統程序開發:基于Cortex-A8(第2版)
- Mastering XenApp?
- R語言:邁向大數據之路
- 嵌入式網絡編程
- Java面向對象程序設計(第3版)