飞鸟派对怎么种几千倍的

書名： Mastering Python for Data Science
作者名： Samir Madhavan
本章字數： 807字
更新時間： 2021-07-16 20:14:18

Data operations

Once the missing data is handled, various operations can be performed on the data.

Aggregation operations

There are a number of aggregation operations, such as average, sum, and so on, which you would like to perform on a numerical field. These are the methods used to perform it:

Average: To find out the average number of students in the ELEMENTARY school who are obese, we'll first filter the ELEMENTARY data with the following command:
```
>>> data = d[d['GRADE LEVEL'] == 'ELEMENTARY']
213.41593780369291
```
Now, we'll find the mean using the following command:
```
>>> data['NO. OBESE'].mean()
```
The elementary grade level data is filtered and stored in the data object. The NO. OBESE column is selected, which contains the number of obese students and using the mean() method, the average is taken out.
SUM: To find out the total number of elementary students who are obese across all the school, use the following command:
```
>>> data['NO. OBESE'].sum()
219605.0
```
MAX: To get the maximum number of students that are obese in an elementary school, use the following command:
```
>>> data['NO. OBESE'].max()
48843.0
```
MIN: To get the minimum number of students that are obese in an elementary school, use the following command:
```
>>> data['NO. OBESE'].min()
5.0
```
STD: To get the standard deviation of the number of obese students, use the following command:
```
>>> data['NO. OBESE'].std()
1690.3831128098113
```
COUNT: To count the total number of schools with the ELEMENTARY grade in the DELAWARE county, use the following command:
```
>>> data = df[(d['GRADE LEVEL'] == 'ELEMENTARY') & (d['COUNTY'] == 'DELAWARE')]
>>> data['COUNTY'].count()
19
```
The table is filtered for the ELEMENTARY grade and the DELAWARE county. Notice that the conditions are enclosed in parentheses. This is to ensure that individual conditions are evaluated and if the parentheses are not provided, then Python will throw an error.

Joins

SQL-like joins can be performed on the DataFrame using pandas. Let's define a lookup DataFrame, which assigns levels to each of the grades using the following command:

>>> grade_lookup = {'GRADE LEVEL': pd.Series(['ELEMENTARY', 'MIDDLE/HIGH', 'MISC']),
 'LEVEL': pd.Series([1, 2, 3])}

>>> grade_lookup = DataFrame(grade_lookup)

Let's take the first five rows of the GRADE data column as an example for performing the joins:

>>> df[['GRADE LEVEL']][0:5]
 GRADE LEVEL
0 DISTRICT TOTAL
1 ELEMENTARY
2 MIDDLE/HIGH
3 DISTRICT TOTAL
4 ELEMENTARY

The inner join

The following image is a sample of an inner join:

An inner join can be performed with the following command:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='inner')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
1 ELEMENTARY 1
4 ELEMENTARY 1
2 MIDDLE/HIGH 2

The join takes place with the join() method. The first argument takes the DataFrame on which the lookup takes place. Note that the grade_lookup DataFrame's index is being set by the set_index() method. This is essential for a join, as without it, the join method won't know on which column to join the DataFrame to.

The second argument takes a column of the d DataFrame to join the data. The third argument defines the join as an inner join.

The left outer join

The following image is a sample of a left outer join:

A left outer join can be performed with the following commands:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='left')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
0 DISTRICT TOTAL NaN
1 ELEMENTARY 1
2 MIDDLE/HIGH 2
3 DISTRICT TOTAL NaN
4 ELEMENTARY 1

You can notice that DISTRICT TOTAL has missing values for a level column, as the grade_lookup DataFrame does not have an instance for DISTRICT TOTAL.

The full outer join

The following image is a sample of a full outer join:

The full outer join can be performed with the following commands:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='outer')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
0 DISTRICT TOTAL NaN
3 DISTRICT TOTAL NaN
1 ELEMENTARY 1
4 ELEMENTARY 1
2 MIDDLE/HIGH 2
4 MISC 3

The groupby function

It's easy to do an SQL-like group by operation with pandas. Let's say, if you want to find the sum of the number of obese students in each of the grades, then you can use the following command:

>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).sum()
GRADE LEVEL
DISTRICT TOTAL 127101
ELEMENTARY 72880
MIDDLE/HIGH 53089

This command chooses the number of obese students column, then uses the group by method to group the data-based group level, and finally, the sum method sums up the number. The same can be achieved by the following function too:

>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate(sum)

Here, the aggregate method is utilized. The sum function is passed to obtain the required results.

It's also possible to obtain multiple kinds of aggregations on the same metric. This can be achieved by the following command:

>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([sum, mean, std])
 sum mean std
GRADE LEVEL 
DISTRICT TOTAL 127101 128.384848 158.933263
ELEMENTARY 72880 76.958817 100.289578
MIDDLE/HIGH 53089 59.251116 65.905591

官术网_书友最值得收藏!

Mastering Python for Data Science