Merge and join operations come up most often when one is combining data
from dif‐ ferent sources. Here we will consider an example of some data
about US states and their populations.
/tmp/ipykernel_88168/2168094788.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
merged=merged.drop('abbreviation',1)
merged.head()
state/region
ages
year
population
state
0
AL
under18
2012
1117489.0
Alabama
1
AL
total
2012
4817528.0
Alabama
2
AL
under18
2010
1130966.0
Alabama
3
AL
total
2010
4785570.0
Alabama
4
AL
under18
2011
1125763.0
Alabama
# CHeck if there is any mismatch,merged.isnull().any()
state/region False
ages False
year False
population True
state True
dtype: bool
## Some of the population and state info is null, lets check which onemerged[merged['population'].isnull()].head()
# To fix the missing values of PR, USAmerged.loc[merged['state/region']=='PR','state']='Puerto Rico'merged.loc[merged['state/region']=='USA','state']='United State'merged.isnull().any()
state/region False
ages False
year False
population True
state False
dtype: bool
#Now lets merged the result wiht the area datasetfinal=pd.merge(merged,areas,on='state',how='left')final.head()
state/region
ages
year
population
state
area (sq. mi)
0
AL
under18
2012
1117489.0
Alabama
52423.0
1
AL
total
2012
4817528.0
Alabama
52423.0
2
AL
under18
2010
1130966.0
Alabama
52423.0
3
AL
total
2010
4785570.0
Alabama
52423.0
4
AL
under18
2011
1125763.0
Alabama
52423.0
final.isnull().any()
state/region False
ages False
year False
population True
state False
area (sq. mi) True
dtype: bool
# Lets check the regions which areas is nullfinal['state'][final['area (sq. mi)'].isnull()].unique()
array(['United State'], dtype=object)
# No area value for USA; we can either insert it by suming all the areas or just drop itfinal.dropna(inplace=True)final.head()
state/region
ages
year
population
state
area (sq. mi)
0
AL
under18
2012
1117489.0
Alabama
52423.0
1
AL
total
2012
4817528.0
Alabama
52423.0
2
AL
under18
2010
1130966.0
Alabama
52423.0
3
AL
total
2010
4785570.0
Alabama
52423.0
4
AL
under18
2011
1125763.0
Alabama
52423.0
Now we have all the data we need. To answer the question of interest,
let’s first select the portion of the data corresponding with the year
2000, and the total population. We’ll use the query() function to do
this quickly (this requires the numexpr package to be installed;