Welcome to the Institute for Digital Research and Education

SAS Learning Module
Using proc sort and by statements

1. Introduction

This module will examine the use of proc sort
and use of the by statement with SAS procedures. The program below
creates a data file called auto that we will use in our examples. Note
that this file has a duplicate record for the BMW.

2. Sorting data with proc sort

We can use proc sort to sort this data
file. The program below sorts the auto data file on the variable foreign
(1=foreign car, 0=domestic car) and saves the sorted file as auto2. The
original file remains unchanged since we used out=auto2 to
specify that
the sorted data should be placed in auto2.

From the proc print below, you can
see that
auto2 is indeed sorted on foreign. The observations
where foreign is 0 precede all of the observations where foreign
is 1. Note that the order of the observations within each group remain unchanged,
(i.e., the observations where foreign is 0 remain in the same order).

Suppose you wanted the data sorted, but with the foreign
cars (foreign=1) first and the domestic cars (foreign=0)
second. The example below shows the use of the descending keyword to tell SAS that you
want to sort by foreign, but you want the sort order reversed (i.e.,
largest to smallest).

It is also possible to sort on more than one variable at
a time. Perhaps you would like the data sorted on foreign (this time we
will go back to the normal sort order for foreign) and then sorted by
rep78
within each level of foreign. The example below shows how this can be done.

You can see in the proc print below that the data are now ordered by
foreign, domestic cars (foreign=0)
followed by foreign (foreign=1) cars. Within the domestic
cars, the data are sorted by rep78 and within foreign cars
the data are also sorted by rep78.

In the output above, note how the missing values of rep78
were treated. Since a missing value is treated as the lowest value possible (e.g., negative
infinity), the missing values come before all other values of rep78.

3. Removing duplicates with proc sort

At the beginning of this page, we noted that there was a
duplicate observation in auto, that there were two identical records for
BMW. We can use proc sort to remove the duplicate observations from our
data file using the noduplicates option, as long as the duplicate
observations are next to each other. The example below sorts the data by
foreign
and removes the duplicates at the same time. Note that it did not matter what variable we
chose for sorting the data. As you see in the output below, the extra observation for BMW
was deleted.

It is common for duplicate observations to be next to
each other in the same file, but if the duplicate observations are not next to each other,
there is another strategy you can use to remove the duplicates. You can sort the
data file by all of the variables (which can be indicated with the special keyword
_ALL_),
which would force the duplicate observations to be next to each other. This is
illustrated below.

PROC SORT DATA=auto OUT=auto6 NODUPLICATES ;
BY _all_ ;
RUN ;

4. Obtaining separate analyses with sorted data

Sometimes you would like to obtain results separately
for different groups. For example, you might want to get the mean mpg and
weight separately for foreign and domestic
cars. As you see below, it is possible to use proc means with the
class
statement to get these results.

PROC MEANS DATA=auto ;
CLASS foreign ;
VAR mpg weight ;
RUN ;

However, what if you wanted to obtain the correlation of
weight and mpg separately for foreign and domestic cars?
Proc corr does not support a class statement like
proc means does, but you can use the by statement as in the example
below.

As you see in the output below, using the by
statement resulted in getting a proc corr for the domestic cars and a
proc corr for the foreign cars. In general, using the
by statement
requests that the proc be performed for every level of the
by
variable (in this case, for every level of foreign).

Here are other examples of where you might use a by
statement with the auto data file. (Note that some of these analyses are
not very practical because of the small size of the auto data file, so
please imagine that we would be analyzing a larger version of the auto
data file.)

You might use a by statement with proc univariate to request univariate statistics for mpg separately
for foreign and domestic cars so you can seen if mpg is normally
distributed for foreign cars and normally distributed for domestic cars. This also
allows you to generate side by side box and whisker plots allowing you
to compare the distributions of mpg for the separate groups.

You might use a by statement with proc reg if you would like to do separate regression analyses for foreign and domestic
cars.

You might use a by statement with proc means even though it has the class statement. If you wanted the
means displayed on separate pages, then using the by statement would give
you the kind of output you desire.

5. Problems to look out for

If you use a BY statement in a procedure, make
sure the data has been sorted first. For example, if you use
by foreign
then be sure that you have first sorted the file by foreign.

If you want to delete duplicate observations and
the duplicate observations are not next to each other, be sure to sort the data on all of
the variables (i.e., using by _ALL_; ) so the
noduplicates
option will work properly and indeed remove duplicate observations.

6. For more information

For more information about proc sort see
the chapter on PROC SORT in the
SAS Procedures Guide .

7. Web notes

You can view the SAS program associated with this module
by clicking sort.sas . While
viewing the file, you can save it by choosing File then Save As
from the pull-down menu of your web browser. In the Save As dialog box,
change the file name to sort.sas and then choose the directory where you
want to save the file, then click Save.