One essential feature offered by Pandas is its high-performance,
in-memory join and merge operations. If you have ever worked with
databases, you should be familiar with this type of data interaction.
The main interface for this is the pd.merge func‐ tion, and we’ll see a
few examples of how this can work in practice. ### Relational Algebra
The behavior implemented in pd.merge() is a subset of what is known as
relational algebra, which is a formal set of rules for manipulating
relational data, and forms the conceptual foundation of operations
available in most databases. The strength of the relational algebra
approach is that it proposes several primitive operations, which become
the building blocks of more complicated operations on any dataset. With
this lexicon of fundamental operations implemented efficiently in a
database or other pro‐ gram, a wide range of fairly complicated
composite operations can be performed. Pandas implements several of
these fundamental building blocks in the pd.merge() function and the
related join() method of Series and DataFrames. As we will see, these
let you efficiently link data from different sources.
The pd.merge() function implements a number of types of joins: the
one-to-one, many-to-one, and many-to-many joins. All three types of
joins are accessed via an identical call to the pd.merge() interface;
the type of join performed depends on the form of the input data.
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries. For the many-to-one case, the resulting DataFrame
will preserve those dupli‐ cate entries as appropriate.
Many-to-many joins are a bit confusing conceptually, but are
nevertheless well defined. If the key column in both the left and right
array contains duplicates, then the result is a many-to-many merge. This
will be perhaps most clear with a concrete example
We’ve already seen the default behavior of pd.merge(): it looks for one
or more matching column names between the two inputs, and uses this as
the key. However, often the column names will not match so nicely, and
pd.merge() provides a variety of options for handling this.
At times you may wish to merge two datasets with different column names;
for exam‐ ple, we may have a dataset in which the employee name is
labeled as “name” rather than “employee”. In this case, we can use the
left_on and right_on keywords to specify the two column names:
In all the preceding examples we have glossed over one important
consideration in performing a join: the type of set arithmetic used in
the join. This comes up when a value appears in one key column but not
the other.
**Because the output would have two conflicting column names, the
merge function automatically appends a suffix _x or _y to make the
output columns unique. If these defaults are inappropriate, it is
possible to specify a custom suffix using the suffixes keyword:**