Introduction to merging in SAS

What is a match-merge?

combines observations from two or more SAS data sets based on the values of specified common variables (one or more)

creates a new data set (the merged data set)

is done in a data step with the statements

MERGE to name the input data sets

BY to name the common variable(s) to be used for matching

Prerequisites for a match-merge

input data sets must have at least one common variable to merge on

input data sets must be sorted by the common variable(s) that will be used to merge on

From now on when we use the term "merge" it will mean "match-merge."

Examples of merging

When we want to combine two data sets by merging and we know beforehand that each observation in each data set
has a match in the other data set we can do a very straight-forward merge. Here are a couple of examples.

Example 1.

Often different data on the same cases are stored in two or more different data sets. For example,
you may have two person level data sets on exactly the same individuals but containing different information
on those individuals. To combine the data on those individuals into one data set requires a merge.
Here is the way the merge would work using the simplest example:

The data sets being merged in the example above contain different data on the same cases and the variable ID
is a unique identifier (no duplicates).

Example 2.

Often we want to combine data from two data sets where each observation has a match on the BY-variable(s) but
there are duplicate values of the BY-variable(s) in one of the data sets. For example, we may want to add
household level variables (in one data set) to persons from those same households (in a second data set).
Or, we may want to add community level variables (in one data set) to the households of those same communities (in a second data set).

As an example we will merge two data sets used in examples in the
section "Working with grouped observations"--a person level
data set named percps99 and a household level data set named hhcps99.
We know that the households represented are the same in both data sets
and the
household identifier is the variable H_SEQ. In the person level data
set the variable H_SEQ has duplicate values (when the household has
more than one member). In the household level data set, however, H_SEQ
is unique.

We want to add the household level variables (in hhcps99) to
the persons (in percps99). The two input data
sets to the merge are permanent SAS data sets in the same directory.
The merged data set will be a temporary SAS data set named indiv.

More on MERGE: when matching is not perfect

The second observation (ID=20) in data set one does not have a match on ID in data set two.
A match-merge of one and two by ID like this,

data three;
merge one two;
by id;
run;
NOTE: It is a good habit not to do data manipulation in the data step of a merge like shown in these examples.

results in an output data set three which looks like this:

ID A B C
10 1 2 0 both data sets contributed to this observation
20 3 4 . only data set one (left-hand data set) contributed to this observation
30 5 6 1 both data sets contributed to this observation
40 . . 1 only data set two (right-hand data set) contributed to this observation

The IN= variables

What if you want to keep in the output data set of a merge only the
matches (only those observations to which both input data sets
contribute)?
SAS will set up for you special temporary variables, called the "IN="
variables, so that you can do this and more.

Here's what you have to do:

signal to SAS on the MERGE statement that you need the IN= variables for the input data set(s)

use the IN= variables in the data step appropriately

So to keep only the matches in the match-merge above, ask for the IN= variables and use them: