Some of the most interesting studies of data come from combining
different data sources. These operations can involve anything from very
straightforward concatenation of two different datasets to more
complicated database-style joins and merges that correctly handle any
overlaps between the datasets. Series and DataFrames are built with this
type of operation in mind, and Pandas includes functions and methods
that make this sort of data wrangling fast and straightforward.
One important difference between np.concatenate and pd.concat is that
Pandas concatenation preserves indices, even if the result will have
duplicate indices!
x = make_df("AB", [0, 1])
y = make_df("AB", [2, 3])
x
    A   B
0  A0  B0
1  A1  B1
y
    A   B
2  A2  B2
3  A3  B3
y.index
Int64Index([2, 3], dtype='int64')
x.index = y.index
x.index
Int64Index([2, 3], dtype='int64')
x
    A   B
2  A0  B0
3  A1  B1
y
    A   B
2  A2  B2
3  A3  B3
pd.concat([x, y])
    A   B
2  A0  B0
3  A1  B1
2  A2  B2
3  A3  B3
pd.concat([x, y], axis=1)
    A   B   A   B
2  A0  B0  A2  B2
3  A1  B1  A3  B3
Catching the repeats as an error. If you’d like to simply verify that the indices in the
result of pd.concat() do not overlap, you can specify the
verify_integrity flag. With this set to True, the concatenation will
raise an exception if there are duplicate indices. Here is an example,
where for clarity we’ll catch and print the error message:
ValueError: Indexes have overlapping values: Int64Index([2, 3], dtype='int64')
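A cell along the following lines would produce a message like the one above. This is a minimal sketch: make_df is rebuilt here as an assumed stand-in for the helper used throughout this section, and the exact wording of the index in the message depends on the installed pandas version.

```python
import pandas as pd


def make_df(cols, ind):
    """Assumed stand-in for this section's helper: cells like 'A0', 'B1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)


x = make_df("AB", [0, 1])
y = make_df("AB", [2, 3])
x.index = y.index  # give both frames the same index labels

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    # Keep the message so it can be inspected after the block runs
    message = str(e)
    print("ValueError:", message)
```

Without verify_integrity=True the same call succeeds and silently keeps the duplicate labels.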
Ignoring the index. Sometimes the index itself does not matter, and you would prefer
it to simply be ignored. You can specify this option using the
ignore_index flag. With this set to True, the concatenation will create
a new integer index for the result:
print(x); print(y)
    A   B
2  A0  B0
3  A1  B1
    A   B
2  A2  B2
3  A3  B3
pd.concat([x, y], ignore_index=True)
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3
Adding MultiIndex keys. Another alternative is to use the keys option to specify a label
for each data source; the result will be a hierarchically indexed object
containing the data:
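As a minimal sketch of the keys option, the x and y frames from the earlier examples can be labeled "x" and "y". The make_df definition and the key names are assumptions made for illustration.

```python
import pandas as pd


def make_df(cols, ind):
    """Assumed stand-in for this section's helper: cells like 'A0', 'B1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)


x = make_df("AB", [0, 1])
y = make_df("AB", [2, 3])
x.index = y.index  # duplicate index labels, as in the examples above

# Each key becomes the outer level of a MultiIndex on the rows
result = pd.concat([x, y], keys=["x", "y"])
print(result)

# Rows are now addressed by (source label, original index label)
print(result.loc[("y", 2), "A"])
```

Because the source label is part of the index, the duplicate inner labels 2 and 3 no longer collide.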
In the simple examples we just looked at, we were mainly concatenating
DataFrames with shared column names. In practice, data from different
sources might have different sets of column names, and pd.concat
offers several options in this case. Consider the concatenation of the
following two DataFrames, which have some (but not all!) columns in
common:
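A minimal sketch of such a pair, assuming the make_df helper used in this section; the names df5 and df6 and their column sets are chosen for illustration only.

```python
import pandas as pd


def make_df(cols, ind):
    """Assumed stand-in for this section's helper: cells like 'A0', 'B1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)


df5 = make_df("ABC", [1, 2])  # columns A, B, C
df6 = make_df("BCD", [3, 4])  # columns B, C, D

# With default settings the result carries the union of the columns
combined = pd.concat([df5, df6])
print(combined)
```

Only B and C are present in both inputs, so columns A and D each contain missing entries in the combined frame.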
By default, the entries for which no data is available are filled
with NA values. To change this, we can specify the join parameter of
pd.concat. Older versions also accepted a join_axes parameter, but it
was removed in pandas 1.0.
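The sketch below shows the effect of join="inner", plus one way to keep a chosen set of columns by reindexing before concatenating. The frames df5 and df6 and the make_df helper are assumptions made for illustration.

```python
import pandas as pd


def make_df(cols, ind):
    """Assumed stand-in for this section's helper: cells like 'A0', 'B1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)


df5 = make_df("ABC", [1, 2])
df6 = make_df("BCD", [3, 4])

# join="inner" keeps only the columns shared by every input
inner = pd.concat([df5, df6], join="inner")
print(inner)

# To keep exactly the columns of df5, align df6 to them first
aligned = pd.concat([df5, df6.reindex(columns=df5.columns)])
print(aligned)
```

The inner join drops A and D entirely, while the reindexed version keeps A, B, and C and leaves A empty for the rows that came from df6.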
Because direct concatenation is so common, Series and DataFrame
objects have long had an append method that accomplishes the same thing in
fewer keystrokes: rather than calling pd.concat([df1, df2]), you could
call df1.append(df2). The method is now deprecated in favor of pd.concat
and was removed entirely in pandas 2.0:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
df1
    A   B
1  A1  B1
2  A2  B2
df2
    A   B
3  A3  B3
4  A4  B4
df1.append(df2)
/tmp/ipykernel_36950/3062608662.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1.append(df2)
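As the warning suggests, the same result can be obtained with pd.concat. A minimal sketch, again assuming the make_df helper used in this section:

```python
import pandas as pd


def make_df(cols, ind):
    """Assumed stand-in for this section's helper: cells like 'A0', 'B1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)


df1 = make_df("AB", [1, 2])
df2 = make_df("AB", [3, 4])

# Replacement for df1.append(df2); returns a new frame, df1 is unchanged
combined = pd.concat([df1, df2])
print(combined)
```

Each call to pd.concat builds a new object and copies the data, so when many frames need combining it is usually better to collect them in a list and concatenate once than to concatenate repeatedly inside a loop.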