
by Jawad Haider

Chapter 2 - Data Manipulation with Pandas

06 - Combining Datasets: Concat and Append



Combining Datasets: Concat and Append

Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

import numpy as np
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame whose entries name their column and index."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

make_df('ABC', range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
# Recall NumPy concatenation
x=[1,2,3]
y=[4,3,6]
z=[9,8,0]
np.concatenate([x,y,z])
array([1, 2, 3, 4, 3, 6, 9, 8, 0])
x=[[2,4],[3,5]]
x=np.concatenate([x,x], axis=1)
x
array([[2, 4, 2, 4],
       [3, 5, 3, 5]])

Simple Concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we’ll discuss momentarily.

# Signature in Pandas v0.18
# pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
#           keys=None, levels=None, names=None, verify_integrity=False,
#           copy=True)

ser1=pd.Series(['A','B','C'],index=[1,2,3])
ser2=pd.Series(['D','E','F'], index=[4,5,6])
pd.concat([ser1,ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
df3=make_df("AB",[0,1])
df4=make_df("CD",[0,1])
df3
A B
0 A0 B0
1 A1 B1
df4
C D
0 C0 D0
1 C1 D1
pd.concat([df3,df4])
A B C D
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
0 NaN NaN C0 D0
1 NaN NaN C1 D1
pd.concat([df3,df4],axis=1)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
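
As an aside, recent versions of Pandas also accept the string alias axis='columns' in place of axis=1; the result is identical. A minimal sketch, assuming df3 and df4 are the small frames built with make_df above (they are re-created literally here so the snippet stands alone):

import pandas as pd

# literal stand-ins for make_df("AB", [0, 1]) and make_df("CD", [0, 1])
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

# axis='columns' is the string alias for axis=1
pd.concat([df3, df4], axis='columns')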

Duplicate indices

One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices!

x=make_df("AB",[0,1])
y=make_df("AB",[2,3])
x
A B
0 A0 B0
1 A1 B1
y
A B
2 A2 B2
3 A3 B3
y.index
Int64Index([2, 3], dtype='int64')
x.index=y.index
x.index
Int64Index([2, 3], dtype='int64')
x
A B
2 A0 B0
3 A1 B1
y
A B
2 A2 B2
3 A3 B3
pd.concat([x,y])
A B
2 A0 B0
3 A1 B1
2 A2 B2
3 A3 B3
pd.concat([x,y],axis=1)
A B A B
2 A0 B0 A2 B2
3 A1 B1 A3 B3
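
To see why duplicated indices can be awkward, note that label-based selection on the concatenated result no longer returns a single row. A small sketch, assuming x and y as above (re-created literally so it stands alone):

import pandas as pd

# stand-ins for x and y, which share the index [2, 3]
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[2, 3])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([x, y])

# the label 2 now matches two rows, so .loc returns a two-row DataFrame
combined.loc[2]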

Catching the repeats as an error. If you’d like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag. With this set to True, the concatenation will raise an exception if there are duplicate indices. Here is an example, where for clarity we’ll catch and print the error message:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
ValueError: Indexes have overlapping values: Int64Index([2, 3], dtype='int64')

Ignoring the index. Sometimes the index itself does not matter, and you would prefer it to simply be ignored. You can specify this option using the ignore_index flag. With this set to True, the concatenation will create a new integer index for the result:

print(x);print(y)
    A   B
2  A0  B0
3  A1  B1
    A   B
2  A2  B2
3  A3  B3
pd.concat([x,y],ignore_index=True)
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3

Adding MultiIndex keys. Another alternative is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed DataFrame containing the data:

pd.concat([x,y], keys=['x','y'])
A B
x 2 A0 B0
3 A1 B1
y 2 A2 B2
3 A3 B3
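
One benefit of the hierarchical index is that each original source can be recovered by its key. A minimal sketch, again assuming x and y as above:

import pandas as pd

# stand-ins for the x and y frames used above
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[2, 3])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([x, y], keys=['x', 'y'])

# select only the rows that came from x via the outer level of the MultiIndex
combined.loc['x']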

Concatenation with joins

In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case. Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:

df5=make_df('ABC',[1,2])
df6=make_df('BCD',[3,4])
df6
B C D
3 B3 C3 D3
4 B4 C4 D4
df5
A B C
1 A1 B1 C1
2 A2 B2 C2
pd.concat([df5,df6])
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4

By default, the entries for which no data is available are filled with NA values. To change this, we can specify the join parameter of the concatenation function (the join_axes parameter listed in the v0.18 signature above has been removed in recent versions of Pandas).

pd.concat([df5,df6],join='inner')
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
pd.concat([df5,df6],join='outer')
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
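
Since join_axes is gone from recent Pandas releases, a common way to keep exactly the columns of one input is to concatenate and then reindex the columns. A minimal sketch, assuming df5 and df6 as built above:

import pandas as pd

# literal stand-ins for make_df('ABC', [1, 2]) and make_df('BCD', [3, 4])
df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])

# keep only df5's columns; rows from df6 get NaN in column A
pd.concat([df5, df6]).reindex(columns=df5.columns)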

The append() method

Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):

df1=make_df('AB',[1,2])
df2=make_df('AB',[3,4])
df1
A B
1 A1 B1
2 A2 B2
df2
A B
3 A3 B3
4 A4 B4
df1.append(df2)
/tmp/ipykernel_36950/3062608662.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append(df2)
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
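
As the FutureWarning above indicates, append was deprecated and has since been removed in Pandas 2.0, so new code should use pd.concat directly. A minimal equivalent sketch:

import pandas as pd

# stand-ins for df1 and df2 from the cell above
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# modern replacement for df1.append(df2)
pd.concat([df1, df2])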