Skip to content

================ by Jawad Haider

Chpt 2 - Data Manipulation with Pandas

Example: US States Data



Example: US States Data

Merge and join operations come up most often when one is combining data from dif‐ ferent sources. Here we will consider an example of some data about US states and their populations.

import numpy as np
import pandas as pd
pop = pd.read_csv("../data/state-population.csv")
areas= pd.read_csv("../data/state-areas.csv")
abbrevs= pd.read_csv("../data/state-abbrevs.csv")
pop.head()
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
areas.head()
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
abbrevs.head()
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA

Given this information, say we want to compute a relatively straightforward result:rank US states and territories by their 2010 population density

merged=pd.merge(pop,abbrevs, how='outer',left_on='state/region',right_on='abbreviation')
merged
state/region ages year population state abbreviation
0 AL under18 2012 1117489.0 Alabama AL
1 AL total 2012 4817528.0 Alabama AL
2 AL under18 2010 1130966.0 Alabama AL
3 AL total 2010 4785570.0 Alabama AL
4 AL under18 2011 1125763.0 Alabama AL
... ... ... ... ... ... ...
2539 USA total 2010 309326295.0 NaN NaN
2540 USA under18 2011 73902222.0 NaN NaN
2541 USA total 2011 311582564.0 NaN NaN
2542 USA under18 2012 73708179.0 NaN NaN
2543 USA total 2012 313873685.0 NaN NaN

2544 rows × 6 columns

merged=merged.drop('abbreviation',1)
/tmp/ipykernel_88168/2168094788.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
  merged=merged.drop('abbreviation',1)
merged.head()
state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
# CHeck if there is any mismatch,
merged.isnull().any()
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool
## Some of the population and state info is null, lets check which one
merged[merged['population'].isnull()].head()
state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
merged[merged['state'].isnull()].head()
state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
merged.loc[merged['state'].isnull(), 'state/region'].unique()
array(['PR', 'USA'], dtype=object)
# To fix the missing values of PR, USA
merged.loc[merged['state/region']=='PR', 'state']='Puerto Rico'
merged.loc[merged['state/region']=='USA', 'state']='United State'
merged.isnull().any()
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool
#Now lets merged the result wiht the area dataset
final=pd.merge(merged, areas, on='state',how='left')
final.head()
state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
final.isnull().any()
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool
# Lets check the regions which areas is null
final['state'][final['area (sq. mi)'].isnull()].unique()
array(['United State'], dtype=object)
# No area value for USA; we can either insert it by suming all the areas or just drop it
final.dropna(inplace=True)
final.head()
state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0

Now we have all the data we need. To answer the question of interest, let’s first select the portion of the data corresponding with the year 2000, and the total population. We’ll use the query() function to do this quickly (this requires the numexpr package to be installed;

data2000=final.query("year==2000 & ages=='total'")
data2000.head()
state/region ages year population state area (sq. mi)
28 AL total 2000 4452173.0 Alabama 52423.0
68 AK total 2000 627963.0 Alaska 656425.0
124 AZ total 2000 5160586.0 Arizona 114006.0
162 AR total 2000 2678588.0 Arkansas 53182.0
220 CA total 2000 33987977.0 California 163707.0

Now let’s compute the population density and display it in order. We’ll start by rein‐ dexing our data on the state, and then compute the result:

data2000.set_index('state', inplace=True)
density=data2000['population']/data2000['area (sq. mi)']
density.sort_values(ascending=False, inplace=True)
density.head()
state
District of Columbia    8412.441176
Puerto Rico             1084.098151
New Jersey               966.592639
Rhode Island             679.785113
Connecticut              615.399892
dtype: float64
density.tail()
state
South Dakota    9.800755
North Dakota    9.080434
Montana         6.146192
Wyoming         5.053262
Alaska          0.956641
dtype: float64