官术网_书友最值得收藏!

Data operations

Once the missing data is handled, various operations can be performed on the data.

Aggregation operations

There are a number of aggregation operations, such as average, sum, and so on, which you would like to perform on a numerical field. These are the methods used to perform it:

  • Average: To find out the average number of students in the ELEMENTARY school who are obese, we'll first filter the ELEMENTARY data with the following command:
    >>> data = d[d['GRADE LEVEL'] == 'ELEMENTARY']
    213.41593780369291
    

    Now, we'll find the mean using the following command:

    >>> data['NO. OBESE'].mean()
    

    The elementary grade level data is filtered and stored in the data object. The NO. OBESE column is selected, which contains the number of obese students and using the mean() method, the average is taken out.

  • SUM: To find out the total number of elementary students who are obese across all the school, use the following command:
    >>> data['NO. OBESE'].sum()
    219605.0
    
  • MAX: To get the maximum number of students that are obese in an elementary school, use the following command:
    >>> data['NO. OBESE'].max()
    48843.0
    
  • MIN: To get the minimum number of students that are obese in an elementary school, use the following command:
    >>> data['NO. OBESE'].min()
    5.0
    
  • STD: To get the standard deviation of the number of obese students, use the following command:
    >>> data['NO. OBESE'].std()
    1690.3831128098113
    
  • COUNT: To count the total number of schools with the ELEMENTARY grade in the DELAWARE county, use the following command:
    >>> data = df[(d['GRADE LEVEL'] == 'ELEMENTARY') & (d['COUNTY'] == 'DELAWARE')]
    >>> data['COUNTY'].count()
    19
    

    The table is filtered for the ELEMENTARY grade and the DELAWARE county. Notice that the conditions are enclosed in parentheses. This is to ensure that individual conditions are evaluated and if the parentheses are not provided, then Python will throw an error.

Joins

SQL-like joins can be performed on the DataFrame using pandas. Let's define a lookup DataFrame, which assigns levels to each of the grades using the following command:

>>> grade_lookup = {'GRADE LEVEL': pd.Series(['ELEMENTARY', 'MIDDLE/HIGH', 'MISC']),
 'LEVEL': pd.Series([1, 2, 3])}

>>> grade_lookup = DataFrame(grade_lookup)

Let's take the first five rows of the GRADE data column as an example for performing the joins:

>>> df[['GRADE LEVEL']][0:5]
 GRADE LEVEL
0 DISTRICT TOTAL
1 ELEMENTARY
2 MIDDLE/HIGH
3 DISTRICT TOTAL
4 ELEMENTARY

The inner join

The following image is a sample of an inner join:

The inner join

An inner join can be performed with the following command:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='inner')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
1 ELEMENTARY 1
4 ELEMENTARY 1
2 MIDDLE/HIGH 2

The join takes place with the join() method. The first argument takes the DataFrame on which the lookup takes place. Note that the grade_lookup DataFrame's index is being set by the set_index() method. This is essential for a join, as without it, the join method won't know on which column to join the DataFrame to.

The second argument takes a column of the d DataFrame to join the data. The third argument defines the join as an inner join.

The left outer join

The following image is a sample of a left outer join:

The left outer join

A left outer join can be performed with the following commands:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='left')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
0 DISTRICT TOTAL NaN
1 ELEMENTARY 1
2 MIDDLE/HIGH 2
3 DISTRICT TOTAL NaN
4 ELEMENTARY 1

You can notice that DISTRICT TOTAL has missing values for a level column, as the grade_lookup DataFrame does not have an instance for DISTRICT TOTAL.

The full outer join

The following image is a sample of a full outer join:

The full outer join

The full outer join can be performed with the following commands:

>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='outer')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]

 GRADE LEVEL LEVEL
0 DISTRICT TOTAL NaN
3 DISTRICT TOTAL NaN
1 ELEMENTARY 1
4 ELEMENTARY 1
2 MIDDLE/HIGH 2
4 MISC 3

The groupby function

It's easy to do an SQL-like group by operation with pandas. Let's say, if you want to find the sum of the number of obese students in each of the grades, then you can use the following command:

>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).sum()
GRADE LEVEL
DISTRICT TOTAL 127101
ELEMENTARY 72880
MIDDLE/HIGH 53089

This command chooses the number of obese students column, then uses the group by method to group the data-based group level, and finally, the sum method sums up the number. The same can be achieved by the following function too:

>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate(sum)

Here, the aggregate method is utilized. The sum function is passed to obtain the required results.

It's also possible to obtain multiple kinds of aggregations on the same metric. This can be achieved by the following command:

>>> df['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([sum, mean, std])
 sum mean std
GRADE LEVEL 
DISTRICT TOTAL 127101 128.384848 158.933263
ELEMENTARY 72880 76.958817 100.289578
MIDDLE/HIGH 53089 59.251116 65.905591
主站蜘蛛池模板: 六盘水市| 太和县| 西昌市| 太白县| 望谟县| 仁化县| 霍山县| 萨迦县| 马鞍山市| 汕尾市| 莆田市| 潼关县| 泰顺县| 秦安县| 南华县| 武川县| 宁乡县| 江口县| 洞头县| 深圳市| 拜城县| 滦南县| 平遥县| 磐安县| 凤城市| 昌黎县| 大连市| 呼玛县| 南安市| 修武县| 宁都县| 乌拉特中旗| 江达县| 温泉县| 日土县| 天全县| 靖安县| 庄河市| 电白县| 偏关县| 西和县|