Skip to content

================ by Jawad Haider

Chpt 2 - Data Manipulation with Pandas

Example: Birthrate Data



Example: Birthrate Data

As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at link births.csv (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post)

# shell command to download the data:
# !curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/
# master/births.csv
import numpy as np
import pandas as pd
births = pd.read_csv("../data/births.csv")
births.head()
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548

We can start to understand this data a bit more by using a pivot table. Let’s add a dec‐ ade column, and take a look at male and female births as a function of decade:

births['decade']=10* (births['year']//10)
births.head()
year month day gender births decade
0 1969 1 1.0 F 4046 1960
1 1969 1 1.0 M 4440 1960
2 1969 1 2.0 F 4454 1960
3 1969 1 2.0 M 4548 1960
4 1969 1 3.0 F 4548 1960
births.pivot_table('births', index='decade',columns='gender', aggfunc='sum')
gender F M
decade
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428

We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visual‐ ize the total number of births by year

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
births.pivot_table('births', index='year',columns='gender', aggfunc='sum').plot()
plt.ylabel("total birth per year")
Text(0, 0.5, 'total birth per year')

Further data exploration

Though this doesn’t necessarily relate to the pivot table, there are a few more interest‐ ing features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation

quartiles = np.percentile(births['births'],[25,50,75])
mu=quartiles[1]
sig=0.74*(quartiles[2]-quartiles[0])
mu
4814.0
sig
689.31

This final line is a robust estimate of the sample mean, where the 0.74 comes from the interquartile range of a Gaussian distribution.
With this we can use the query() method to filter out rows with births outside these values:

births=births.query('(births > @mu -5 * @sig) & (births<@mu + 5*@sig)')
births
year month day gender births decade
0 1969 1 1.0 F 4046 1960
1 1969 1 1.0 M 4440 1960
2 1969 1 2.0 F 4454 1960
3 1969 1 2.0 M 4548 1960
4 1969 1 3.0 F 4548 1960
... ... ... ... ... ... ...
15062 1988 12 29.0 M 5944 1980
15063 1988 12 30.0 F 5742 1980
15064 1988 12 30.0 M 6095 1980
15065 1988 12 31.0 F 4435 1980
15066 1988 12 31.0 M 4698 1980

14610 rows × 6 columns

Next we set the day column to integers; previously it had been a string because some columns in the dataset contained the value ‘null’:

births['day']=births['day'].astype(int)
/tmp/ipykernel_18639/3805690895.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  births['day']=births['day'].astype(int)
births
year month day gender births decade
0 1969 1 1 F 4046 1960
1 1969 1 1 M 4440 1960
2 1969 1 2 F 4454 1960
3 1969 1 2 M 4548 1960
4 1969 1 3 F 4548 1960
... ... ... ... ... ... ...
15062 1988 12 29 M 5944 1980
15063 1988 12 30 F 5742 1980
15064 1988 12 30 M 6095 1980
15065 1988 12 31 F 4435 1980
15066 1988 12 31 M 4698 1980

14610 rows × 6 columns

Finally, we can combine the day, month, and year to create a Date index This allows us to quickly compute the weekday corresponding to each row:

# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek']=births.index.day_of_week
# Using this we can plot the births by weekday for several decades
import matplotlib.pyplot as plt
import matplotlib as mpl
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
/tmp/ipykernel_18639/3967923407.py:6: UserWarning: FixedFormatter should only be used together with FixedLocator
  plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])

births_by_date=births.pivot_table('births',
                                 [births.index.month, births.index.day])
births_by_date
births
1 1 4009.225
2 4247.400
3 4500.900
4 4571.350
5 4603.625
... ... ...
12 27 4850.150
28 5044.200
29 5120.150
30 5172.350
31 4859.200

366 rows × 1 columns

births_by_date.index=[pd.datetime(2012,month,day) for (month,day)in births_by_date.index]
/tmp/ipykernel_18639/1749910599.py:1: FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
  births_by_date.index=[pd.datetime(2012,month,day) for (month,day)in births_by_date.index]
births_by_date
births
2012-01-01 4009.225
2012-01-02 4247.400
2012-01-03 4500.900
2012-01-04 4571.350
2012-01-05 4603.625
... ...
2012-12-27 4850.150
2012-12-28 5044.200
2012-12-29 5120.150
2012-12-30 5172.350
2012-12-31 4859.200

366 rows × 1 columns

# Focusing on the month and day only, we now have a time series reflecting the average
#number of births by date of the year.
fig, ax = plt.subplots(figsize=(12,4))
births_by_date.plot(ax=ax)
<AxesSubplot:>