================ by Jawad Haider

Chpt 2 - Data Manipulation with Pandas¶

Example: US States Data¶

Example: US States Data

Example: US States Data¶

Merge and join operations come up most often when one is combining data from dif‐ ferent sources. Here we will consider an example of some data about US states and their populations.

import numpy as np
import pandas as pd

pop = pd.read_csv("../data/state-population.csv")
areas= pd.read_csv("../data/state-areas.csv")
abbrevs= pd.read_csv("../data/state-abbrevs.csv")

pop.head()

	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0
2	AL	under18	2010	1130966.0
3	AL	total	2010	4785570.0
4	AL	under18	2011	1125763.0

areas.head()

	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425
2	Arizona	114006
3	Arkansas	53182
4	California	163707

abbrevs.head()

	state	abbreviation
0	Alabama	AL
1	Alaska	AK
2	Arizona	AZ
3	Arkansas	AR
4	California	CA

Given this information, say we want to compute a relatively straightforward result:rank US states and territories by their 2010 population density

merged=pd.merge(pop,abbrevs, how='outer',left_on='state/region',right_on='abbreviation')

merged

	state/region	ages	year	population	state	abbreviation
0	AL	under18	2012	1117489.0	Alabama	AL
1	AL	total	2012	4817528.0	Alabama	AL
2	AL	under18	2010	1130966.0	Alabama	AL
3	AL	total	2010	4785570.0	Alabama	AL
4	AL	under18	2011	1125763.0	Alabama	AL
...	...	...	...	...	...	...
2539	USA	total	2010	309326295.0	NaN	NaN
2540	USA	under18	2011	73902222.0	NaN	NaN
2541	USA	total	2011	311582564.0	NaN	NaN
2542	USA	under18	2012	73708179.0	NaN	NaN
2543	USA	total	2012	313873685.0	NaN	NaN

2544 rows × 6 columns

merged=merged.drop('abbreviation',1)

/tmp/ipykernel_88168/2168094788.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
  merged=merged.drop('abbreviation',1)

merged.head()

	state/region	ages	year	population	state
0	AL	under18	2012	1117489.0	Alabama
1	AL	total	2012	4817528.0	Alabama
2	AL	under18	2010	1130966.0	Alabama
3	AL	total	2010	4785570.0	Alabama
4	AL	under18	2011	1125763.0	Alabama

# CHeck if there is any mismatch,
merged.isnull().any()

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

## Some of the population and state info is null, lets check which one
merged[merged['population'].isnull()].head()

	state/region	ages	year	population	state
2448	PR	under18	1990	NaN	NaN
2449	PR	total	1990	NaN	NaN
2450	PR	total	1991	NaN	NaN
2451	PR	under18	1991	NaN	NaN
2452	PR	total	1993	NaN	NaN

merged[merged['state'].isnull()].head()

	state/region	ages	year	population	state
2448	PR	under18	1990	NaN	NaN
2449	PR	total	1990	NaN	NaN
2450	PR	total	1991	NaN	NaN
2451	PR	under18	1991	NaN	NaN
2452	PR	total	1993	NaN	NaN

merged.loc[merged['state'].isnull(), 'state/region'].unique()

array(['PR', 'USA'], dtype=object)

# To fix the missing values of PR, USA
merged.loc[merged['state/region']=='PR', 'state']='Puerto Rico'
merged.loc[merged['state/region']=='USA', 'state']='United State'
merged.isnull().any()

state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

#Now lets merged the result wiht the area dataset
final=pd.merge(merged, areas, on='state',how='left')
final.head()

	state/region	ages	year	population	state	area (sq. mi)
0	AL	under18	2012	1117489.0	Alabama	52423.0
1	AL	total	2012	4817528.0	Alabama	52423.0
2	AL	under18	2010	1130966.0	Alabama	52423.0
3	AL	total	2010	4785570.0	Alabama	52423.0
4	AL	under18	2011	1125763.0	Alabama	52423.0

final.isnull().any()

state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

# Lets check the regions which areas is null
final['state'][final['area (sq. mi)'].isnull()].unique()

array(['United State'], dtype=object)

# No area value for USA; we can either insert it by suming all the areas or just drop it
final.dropna(inplace=True)
final.head()

	state/region	ages	year	population	state	area (sq. mi)
0	AL	under18	2012	1117489.0	Alabama	52423.0
1	AL	total	2012	4817528.0	Alabama	52423.0
2	AL	under18	2010	1130966.0	Alabama	52423.0
3	AL	total	2010	4785570.0	Alabama	52423.0
4	AL	under18	2011	1125763.0	Alabama	52423.0

Now we have all the data we need. To answer the question of interest, let’s first select the portion of the data corresponding with the year 2000, and the total population. We’ll use the query() function to do this quickly (this requires the numexpr package to be installed;

data2000=final.query("year==2000 & ages=='total'")

data2000.head()

	state/region	ages	year	population	state	area (sq. mi)
28	AL	total	2000	4452173.0	Alabama	52423.0
68	AK	total	2000	627963.0	Alaska	656425.0
124	AZ	total	2000	5160586.0	Arizona	114006.0
162	AR	total	2000	2678588.0	Arkansas	53182.0
220	CA	total	2000	33987977.0	California	163707.0

Now let’s compute the population density and display it in order. We’ll start by rein‐ dexing our data on the state, and then compute the result:

data2000.set_index('state', inplace=True)
density=data2000['population']/data2000['area (sq. mi)']

density.sort_values(ascending=False, inplace=True)
density.head()

state
District of Columbia    8412.441176
Puerto Rico             1084.098151
New Jersey               966.592639
Rhode Island             679.785113
Connecticut              615.399892
dtype: float64

density.tail()

state
South Dakota    9.800755
North Dakota    9.080434
Montana         6.146192
Wyoming         5.053262
Alaska          0.956641
dtype: float64