官术网_书友最值得收藏!

Data aggregation

As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:

>>> df.head()
 country city date hours
0 Germany Hamburg 2015-06-01 8
1 Germany Hamburg 2015-06-02 10
2 Germany Hamburg 2015-06-03 9
3 Germany Hamburg 2015-06-04 7
4 Germany Hamburg 2015-06-05 3

To view a summary per city, we use the describe function on the grouped data set:

>>> df.groupby("city").describe()
 hours
city
Berlin count 10.000000
 mean 6.000000
 std 3.741657
 min 0.000000
 25% 4.000000
 50% 6.000000
 75% 9.750000
 max 10.000000
Birmingham count 10.000000
 mean 5.100000
 std 2.078995
 min 2.000000
 25% 4.000000
 50% 5.500000
 75% 6.750000
 max 8.000000

On certain data sets, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:

>>> df.groupby(["country", "date"]).describe()
 hours country date
France 2015-06-01 count 5.000000
 mean 6.200000
 std 1.095445
 min 5.000000
 25% 5.000000
 50% 7.000000
 75% 7.000000
 max 7.000000
 2015-06-02 count 5.000000
 mean 3.600000
 std 3.577709
 min 0.000000
 25% 0.000000
 50% 4.000000
 75% 6.000000
 max 8.000000
UK 2015-06-07 std 3.872983
 min 0.000000
 25% 2.000000
 50% 6.000000
 75% 8.000000
 max 9.000000

We can compute single statistics as well:

>>> df.groupby("city").mean()
 hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
Hamburg 5.5
Leipzig 5.0
London 4.8
Lyon 5.0
Manchester 5.2
Marseille 6.2
Munich 6.6
Nice 3.9
Paris 6.3

Finally, we can define any function to be applied on the groups with the agg method. The above could have been written in terms of agg like this:

>>> df.groupby("city").agg(np.mean)
hours
city
Berlin 6.0
Birmingham 5.1
Bordeax 4.7
Edinburgh 7.5
Frankfurt 5.8
Glasgow 4.8
...

But arbitrary functions are possible. As a last example, we define a custom function, which takes an input of a series object and computes the difference between the smallest and the largest element:

>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s)))
 hours
city
Berlin 10
Birmingham 6
Bordeax 10
Edinburgh 8
Frankfurt 9
Glasgow 10
Hamburg 10
Leipzig 9
London 10
Lyon 8
Manchester 10
Marseille 10
Munich 9
Nice 10
Paris 9
主站蜘蛛池模板: 文安县| 永修县| 通州区| 湛江市| 西藏| 兴隆县| 新巴尔虎右旗| 尉氏县| 大化| 伊吾县| 孟州市| 大安市| 临潭县| 都江堰市| 太白县| 辽阳市| 上思县| 梧州市| 阿尔山市| 江达县| 宁南县| 马龙县| 黄骅市| 永嘉县| 丹阳市| 开远市| 永定县| 于都县| 泽普县| 淮安市| 太保市| 宜昌市| 神池县| 贡嘎县| 凌云县| 阿拉善左旗| 鱼台县| 泸水县| 大理市| 远安县| 密云县|