
by Jawad Haider

Chapter 2 - Data Manipulation with Pandas

06 - Combining Datasets: Concat and Append



Combining Datasets: Concat and Append

Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

import numpy as np
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose entries name their column and index."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

make_df('ABC', range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
# Recall NumPy concatenation
x=[1,2,3]
y=[4,3,6]
z=[9,8,0]
np.concatenate([x,y,z])
array([1, 2, 3, 4, 3, 6, 9, 8, 0])
x=[[2,4],[3,5]]
x=np.concatenate([x,x], axis=1)
x
array([[2, 4, 2, 4],
       [3, 5, 3, 5]])

Simple Concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we’ll discuss momentarily.

# Signature in Pandas v0.18
# pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
#           keys=None, levels=None, names=None, verify_integrity=False,
#           copy=True)

ser1=pd.Series(['A','B','C'],index=[1,2,3])
ser2=pd.Series(['D','E','F'], index=[4,5,6])
pd.concat([ser1,ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
df3=make_df("AB",[0,1])
df4=make_df("CD",[0,1])
df3
A B
0 A0 B0
1 A1 B1
df4
C D
0 C0 D0
1 C1 D1
pd.concat([df3,df4])
A B C D
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
0 NaN NaN C0 D0
1 NaN NaN C1 D1
pd.concat([df3,df4],axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
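
As an aside, recent versions of Pandas also accept the string alias axis='columns' in place of axis=1; the result is identical. A minimal sketch, assuming df3 and df4 are the small frames built with make_df above (they are re-created literally here so the snippet stands alone):

import pandas as pd

# literal stand-ins for make_df("AB", [0, 1]) and make_df("CD", [0, 1])
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

# axis='columns' is the string alias for axis=1
pd.concat([df3, df4], axis='columns')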

Duplicate indices

One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices!

x=make_df("AB",[0,1])
y=make_df("AB",[2,3])
x
A B
0 A0 B0
1 A1 B1
y
A B
2 A2 B2
3 A3 B3
y.index
Int64Index([2, 3], dtype='int64')
x.index=y.index
x.index
Int64Index([2, 3], dtype='int64')
x
A B
2 A0 B0
3 A1 B1
y
A B
2 A2 B2
3 A3 B3
pd.concat([x,y])
A B
2 A0 B0
3 A1 B1
2 A2 B2
3 A3 B3
pd.concat([x,y],axis=1)
A B A B
2 A0 B0 A2 B2
3 A1 B1 A3 B3
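
To see why duplicated indices can be awkward, note that label-based selection on the concatenated result no longer returns a single row. A small sketch, assuming x and y as above (re-created literally so it stands alone):

import pandas as pd

# stand-ins for x and y, which share the index [2, 3]
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[2, 3])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([x, y])

# the label 2 now matches two rows, so .loc returns a two-row DataFrame
combined.loc[2]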

Catching the repeats as an error. If you’d like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag. With this set to True, the concatenation will raise an exception if there are duplicate indices. Here is an example, where for clarity we’ll catch and print the error message:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
ValueError: Indexes have overlapping values: Int64Index([2, 3], dtype='int64')

Ignoring the index. Sometimes the index itself does not matter, and you would prefer it to simply be ignored. You can specify this option using the ignore_index flag. With this set to True, the concatenation will create a new integer index for the result:

print(x);print(y)
    A   B
2  A0  B0
3  A1  B1
    A   B
2  A2  B2
3  A3  B3
pd.concat([x,y],ignore_index=True)
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3

Adding MultiIndex keys. Another alternative is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed DataFrame containing the data:

pd.concat([x,y], keys=['x','y'])
A B
x 2 A0 B0
3 A1 B1
y 2 A2 B2
3 A3 B3
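
One benefit of the hierarchical index is that each original source can be recovered by its key. A minimal sketch, again assuming x and y as above:

import pandas as pd

# stand-ins for the x and y frames used above
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[2, 3])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([x, y], keys=['x', 'y'])

# select only the rows that came from x via the outer level of the MultiIndex
combined.loc['x']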

Concatenation with joins

In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case. Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:

df5=make_df('ABC',[1,2])
df6=make_df('BCD',[3,4])
df6
B C D
3 B3 C3 D3
4 B4 C4 D4
df5
A B C
1 A1 B1 C1
2 A2 B2 C2
pd.concat([df5,df6])
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4

By default, the entries for which no data is available are filled with NA values. To change this, we can specify the join parameter of the concatenation function (the join_axes parameter listed in the v0.18 signature above has been removed in recent versions of Pandas).

pd.concat([df5,df6],join='inner')
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
pd.concat([df5,df6],join='outer')
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
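
Since join_axes is gone from recent Pandas releases, a common way to keep exactly the columns of one input is to concatenate and then reindex the columns. A minimal sketch, assuming df5 and df6 as built above:

import pandas as pd

# literal stand-ins for make_df('ABC', [1, 2]) and make_df('BCD', [3, 4])
df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])

# keep only df5's columns; rows from df6 get NaN in column A
pd.concat([df5, df6]).reindex(columns=df5.columns)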

The append() method

Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):

df1=make_df('AB',[1,2])
df2=make_df('AB',[3,4])
df1
A B
1 A1 B1
2 A2 B2
df2
A B
3 A3 B3
4 A4 B4
df1.append(df2)
/tmp/ipykernel_36950/3062608662.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append(df2)
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
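
As the FutureWarning above indicates, append was deprecated and has since been removed in Pandas 2.0, so new code should use pd.concat directly. A minimal equivalent sketch:

import pandas as pd

# stand-ins for df1 and df2 from the cell above
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# modern replacement for df1.append(df2)
pd.concat([df1, df2])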