================ by Jawad Haider
Chpt 2 - Data Manipulation with Pandas¶
Example: Birthrate Data¶
Example: Birthrate Data¶
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at link births.csv (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post)
# shell command to download the data:
# !curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/
# master/births.csv
year | month | day | gender | births | |
---|---|---|---|---|---|
0 | 1969 | 1 | 1.0 | F | 4046 |
1 | 1969 | 1 | 1.0 | M | 4440 |
2 | 1969 | 1 | 2.0 | F | 4454 |
3 | 1969 | 1 | 2.0 | M | 4548 |
4 | 1969 | 1 | 3.0 | F | 4548 |
We can start to understand this data a bit more by using a pivot table. Let’s add a dec‐ ade column, and take a look at male and female births as a function of decade:
year | month | day | gender | births | decade | |
---|---|---|---|---|---|---|
0 | 1969 | 1 | 1.0 | F | 4046 | 1960 |
1 | 1969 | 1 | 1.0 | M | 4440 | 1960 |
2 | 1969 | 1 | 2.0 | F | 4454 | 1960 |
3 | 1969 | 1 | 2.0 | M | 4548 | 1960 |
4 | 1969 | 1 | 3.0 | F | 4548 | 1960 |
gender | F | M |
---|---|---|
decade | ||
1960 | 1753634 | 1846572 |
1970 | 16263075 | 17121550 |
1980 | 18310351 | 19243452 |
1990 | 19479454 | 20420553 |
2000 | 18229309 | 19106428 |
We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visual‐ ize the total number of births by year
sns.set()
births.pivot_table('births', index='year',columns='gender', aggfunc='sum').plot()
plt.ylabel("total birth per year")
Text(0, 0.5, 'total birth per year')
Further data exploration¶
Though this doesn’t necessarily relate to the pivot table, there are a few more interest‐ ing features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation
quartiles = np.percentile(births['births'],[25,50,75])
mu=quartiles[1]
sig=0.74*(quartiles[2]-quartiles[0])
4814.0
689.31
This final line is a robust estimate of the sample mean, where the 0.74
comes from the interquartile range of a Gaussian distribution.
With this we can use the query() method to filter out rows with births
outside these values:
year | month | day | gender | births | decade | |
---|---|---|---|---|---|---|
0 | 1969 | 1 | 1.0 | F | 4046 | 1960 |
1 | 1969 | 1 | 1.0 | M | 4440 | 1960 |
2 | 1969 | 1 | 2.0 | F | 4454 | 1960 |
3 | 1969 | 1 | 2.0 | M | 4548 | 1960 |
4 | 1969 | 1 | 3.0 | F | 4548 | 1960 |
... | ... | ... | ... | ... | ... | ... |
15062 | 1988 | 12 | 29.0 | M | 5944 | 1980 |
15063 | 1988 | 12 | 30.0 | F | 5742 | 1980 |
15064 | 1988 | 12 | 30.0 | M | 6095 | 1980 |
15065 | 1988 | 12 | 31.0 | F | 4435 | 1980 |
15066 | 1988 | 12 | 31.0 | M | 4698 | 1980 |
14610 rows × 6 columns
Next we set the day column to integers; previously it had been a string because some columns in the dataset contained the value ‘null’:
/tmp/ipykernel_18639/3805690895.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
births['day']=births['day'].astype(int)
year | month | day | gender | births | decade | |
---|---|---|---|---|---|---|
0 | 1969 | 1 | 1 | F | 4046 | 1960 |
1 | 1969 | 1 | 1 | M | 4440 | 1960 |
2 | 1969 | 1 | 2 | F | 4454 | 1960 |
3 | 1969 | 1 | 2 | M | 4548 | 1960 |
4 | 1969 | 1 | 3 | F | 4548 | 1960 |
... | ... | ... | ... | ... | ... | ... |
15062 | 1988 | 12 | 29 | M | 5944 | 1980 |
15063 | 1988 | 12 | 30 | F | 5742 | 1980 |
15064 | 1988 | 12 | 30 | M | 6095 | 1980 |
15065 | 1988 | 12 | 31 | F | 4435 | 1980 |
15066 | 1988 | 12 | 31 | M | 4698 | 1980 |
14610 rows × 6 columns
Finally, we can combine the day, month, and year to create a Date index This allows us to quickly compute the weekday corresponding to each row:
# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
# Using this we can plot the births by weekday for several decades
import matplotlib.pyplot as plt
import matplotlib as mpl
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
/tmp/ipykernel_18639/3967923407.py:6: UserWarning: FixedFormatter should only be used together with FixedLocator
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
births | ||
---|---|---|
1 | 1 | 4009.225 |
2 | 4247.400 | |
3 | 4500.900 | |
4 | 4571.350 | |
5 | 4603.625 | |
... | ... | ... |
12 | 27 | 4850.150 |
28 | 5044.200 | |
29 | 5120.150 | |
30 | 5172.350 | |
31 | 4859.200 |
366 rows × 1 columns
/tmp/ipykernel_18639/1749910599.py:1: FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
births_by_date.index=[pd.datetime(2012,month,day) for (month,day)in births_by_date.index]
births | |
---|---|
2012-01-01 | 4009.225 |
2012-01-02 | 4247.400 |
2012-01-03 | 4500.900 |
2012-01-04 | 4571.350 |
2012-01-05 | 4603.625 |
... | ... |
2012-12-27 | 4850.150 |
2012-12-28 | 5044.200 |
2012-12-29 | 5120.150 |
2012-12-30 | 5172.350 |
2012-12-31 | 4859.200 |
366 rows × 1 columns
# Focusing on the month and day only, we now have a time series reflecting the average
#number of births by date of the year.
fig, ax = plt.subplots(figsize=(12,4))
births_by_date.plot(ax=ax)
<AxesSubplot:>