by Jawad Haider
Chapter 1 - Introduction to NumPy
03 - Aggregations: Min, Max, and Everything in Between
- Aggregations: Min, Max, and Everything in Between
- Multidimensional aggregates
- Some aggregation functions
- Example: What Is the Average Height of US Presidents?
Aggregations: Min, Max, and Everything in Between
Often when you are faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the “typical” values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
47.51294159911191
47.5129415991119
6.73 ms ± 88.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
31.5 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
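The two totals and two timings above are consistent with a comparison between Python's built-in sum and NumPy's np.sum on the same array. The cells that produced them are not shown, so the following is only a sketch of such a comparison; the array size, the seed, and the use of the timeit module are assumptions, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Hypothetical test data: the size and seed are assumptions, not from the source.
rng = np.random.default_rng(0)
values = rng.random(100_000)

# Built-in sum iterates over the elements one by one in the interpreter.
builtin_total = sum(values)
# np.sum performs the reduction in compiled code.
numpy_total = np.sum(values)
print(builtin_total, numpy_total)

# Time ten calls of each and report the average per call.
repeats = 10
builtin_time = timeit.timeit(lambda: sum(values), number=repeats)
numpy_time = timeit.timeit(lambda: np.sum(values), number=repeats)
print(f"built-in sum: {builtin_time / repeats:.6f} s per call")
print(f"np.sum:       {numpy_time / repeats:.6f} s per call")
```

The two totals can differ in the last few digits because the additions are carried out in a different order.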
Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion! In particular, their optional arguments have different meanings, and np.sum is aware of multiple array dimensions.
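A small illustration of that difference, using a made-up 2x3 array (the name grid is hypothetical): the second positional argument of the built-in sum is a starting value, while for np.sum it is the axis to collapse.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Built-in sum: the second argument is the start value. Iteration runs over
# the rows, so this computes 1 + [0, 1, 2] + [3, 4, 5] elementwise.
builtin_result = sum(grid, 1)

# np.sum: the second argument is the axis, so each row is summed.
numpy_result = np.sum(grid, 1)

print(builtin_result)  # [4 6 8]
print(numpy_result)    # [ 3 12]
```

Passing the axis by keyword, as in np.sum(grid, axis=1), avoids the ambiguity.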
(1.4444103570987465e-06, 0.9999881721555508)
(1.4444103570987465e-06, 0.9999881721555508)
5.44 ms ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
35.5 µs ± 64.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Multidimensional aggregates
[[0.15322668 0.10058762 0.85504471 0.19779527]
[0.24515716 0.61526756 0.9677193 0.0045308 ]
[0.57826711 0.49512073 0.68039294 0.43271134]]
5.325821234912031
# Aggregation functions take an additional argument specifying the axis along which the aggregate is computed.
m.min(axis=0)
array([0.15322668, 0.10058762, 0.68039294, 0.0045308 ])
0.9677193034993363
array([0.57826711, 0.61526756, 0.9677193 , 0.43271134])
array([0.10058762, 0.0045308 , 0.43271134])
The way the axis is specified here can be confusing to users coming from other languages. The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. So specifying axis=0 means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
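The collapsing behaviour can be checked on the 3x4 array printed earlier in this section. The sketch below retypes those printed (rounded) values by hand rather than regenerating them randomly, so the total may differ from the displayed sum in the last digits.

```python
import numpy as np

# Values copied from the array printed above (rounded to 8 decimals).
m = np.array([
    [0.15322668, 0.10058762, 0.85504471, 0.19779527],
    [0.24515716, 0.61526756, 0.9677193, 0.0045308],
    [0.57826711, 0.49512073, 0.68039294, 0.43271134],
])

total = m.sum()          # no axis: collapse everything to a scalar
col_min = m.min(axis=0)  # collapse axis 0 (rows): one value per column
col_max = m.max(axis=0)  # same collapse, taking the maximum instead
row_min = m.min(axis=1)  # collapse axis 1 (columns): one value per row

print(m.shape, col_min.shape, row_min.shape)  # (3, 4) (4,) (3,)
print(col_min)
print(row_min)
```

The shapes make the rule visible: the axis named in the call is the one that disappears from the result.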
Some aggregation functions
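The listing that belongs under this heading is not present here. As a stand-in, the sketch below exercises a handful of NumPy's aggregation functions, together with some of their NaN-safe counterparts, on small made-up arrays (the names clean and with_gap are hypothetical). It is not a complete catalogue.

```python
import numpy as np

clean = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

print(np.sum(clean), np.prod(clean))         # 14.0 60.0
print(np.mean(clean), np.median(clean))      # 2.8 3.0
print(np.min(clean), np.max(clean))          # 1.0 5.0
print(np.argmin(clean), np.argmax(clean))    # 1 4
print(np.any(clean > 4), np.all(clean > 0))  # True True

# A missing value (NaN) propagates through the ordinary aggregates...
with_gap = np.array([3.0, np.nan, 4.0])
print(np.sum(with_gap))      # nan

# ...while the NaN-safe versions ignore it.
print(np.nansum(with_gap))   # 7.0
print(np.nanmean(with_gap))  # 3.5
print(np.nanmax(with_gap))   # 4.0
```

Most of these are also available as array methods, for example clean.sum() or clean.argmax().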
Example: What Is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values. As a simple example, let’s consider the heights of all US presidents.
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
import numpy as np
import pandas as pd

# Load the CSV and pull the height column out as a NumPy array
data = pd.read_csv("../data/president_heights.csv")
height = np.array(data["height(cm)"])
print(height)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
177 185 188 188 182 185]
| | order | name | height(cm) |
|---|---|---|---|
| 0 | 1 | George Washington | 189 |
| 1 | 2 | John Adams | 170 |
| 2 | 3 | Thomas Jefferson | 189 |
| 3 | 4 | James Madison | 163 |
| 4 | 5 | James Monroe | 183 |
Mean Height = 179.73809523809524
St.dv Height = 6.931843442745892
Max Height = 193
Min Height = 163
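The cells that produced these figures are not shown. A sketch that computes the same four statistics, with the heights retyped from the array printed above so that it runs without the CSV file, might look like this:

```python
import numpy as np

# Heights in cm, copied from the printed array above (42 values).
height = np.array([
    189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178,
    183, 193, 178, 173, 174, 183, 183, 168, 170, 178, 182, 180, 183, 178,
    182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185,
])

mean_height = height.mean()
std_height = height.std()  # population standard deviation (ddof=0)
max_height = height.max()
min_height = height.min()

print(f"Mean Height = {mean_height}")
print(f"St.dv Height = {std_height}")
print(f"Max Height = {max_height}")
print(f"Min Height = {min_height}")
```

Each aggregate reduces the whole array to a single value, which is why no axis argument is needed for this one-dimensional case.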
print(f"25th Percentile Height = {np.percentile(height, 25)}")
print(f"Median Height = {np.median(height)}")
print(f"75th Percentile Height = {np.percentile(height, 75)}")
25th Percentile Height = 174.25
Median Height = 182.0
75th Percentile Height = 183.0
We see that the median height of US presidents is 182 cm, or just shy of six feet.