- Python:Data Analytics and Visualization
- Phuong Vo.T.H Martin Czygan Ashish Kumar Kirthi Raman
- 400字
- 2021-07-09 18:51:45
Data aggregation
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe
function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:
>>> df.head() country city date hours 0 Germany Hamburg 2015-06-01 8 1 Germany Hamburg 2015-06-02 10 2 Germany Hamburg 2015-06-03 9 3 Germany Hamburg 2015-06-04 7 4 Germany Hamburg 2015-06-05 3
To view a summary per city
, we use the describe
function on the grouped data set:
>>> df.groupby("city").describe() hours city Berlin count 10.000000 mean 6.000000 std 3.741657 min 0.000000 25% 4.000000 50% 6.000000 75% 9.750000 max 10.000000 Birmingham count 10.000000 mean 5.100000 std 2.078995 min 2.000000 25% 4.000000 50% 5.500000 75% 6.750000 max 8.000000
On certain data sets
, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:
>>> df.groupby(["country", "date"]).describe() hours country date France 2015-06-01 count 5.000000 mean 6.200000 std 1.095445 min 5.000000 25% 5.000000 50% 7.000000 75% 7.000000 max 7.000000 2015-06-02 count 5.000000 mean 3.600000 std 3.577709 min 0.000000 25% 0.000000 50% 4.000000 75% 6.000000 max 8.000000 UK 2015-06-07 std 3.872983 min 0.000000 25% 2.000000 50% 6.000000 75% 8.000000 max 9.000000
We can compute single statistics as well:
>>> df.groupby("city").mean() hours city Berlin 6.0 Birmingham 5.1 Bordeax 4.7 Edinburgh 7.5 Frankfurt 5.8 Glasgow 4.8 Hamburg 5.5 Leipzig 5.0 London 4.8 Lyon 5.0 Manchester 5.2 Marseille 6.2 Munich 6.6 Nice 3.9 Paris 6.3
Finally, we can define any function to be applied on the groups with the agg
method. The above could have been written in terms of agg
like this:
>>> df.groupby("city").agg(np.mean) hours city Berlin 6.0 Birmingham 5.1 Bordeax 4.7 Edinburgh 7.5 Frankfurt 5.8 Glasgow 4.8 ...
But arbitrary functions are possible. As a last example, we define a custom
function, which takes an input of a series object and computes the difference between the smallest and the largest element:
>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s))) hours city Berlin 10 Birmingham 6 Bordeax 10 Edinburgh 8 Frankfurt 9 Glasgow 10 Hamburg 10 Leipzig 9 London 10 Lyon 8 Manchester 10 Marseille 10 Munich 9 Nice 10 Paris 9
- GNU-Linux Rapid Embedded Programming
- 過程控制工程及仿真
- 大數據專業英語
- Python Algorithmic Trading Cookbook
- 可編程控制器技術應用(西門子S7系列)
- 四向穿梭式自動化密集倉儲系統的設計與控制
- 網絡化分布式系統預測控制
- INSTANT Drools Starter
- Kubernetes for Developers
- 邊緣智能:關鍵技術與落地實踐
- Visual C++項目開發案例精粹
- 生成對抗網絡項目實戰
- 樂高創意機器人教程(中級 上冊 10~16歲) (青少年iCAN+創新創意實踐指導叢書)
- 中老年人學電腦與上網
- Practical Network Automation