Complete Proc Tabulate Guide

Would you like to prepare high quality reports or presentation ready tables directly in SAS? Are you looking to move beyond the restricted output produced by PROC FREQ or PROC MEANS and prepare more informative one, two and even three dimensional tables? Are you tired of having to reformat and modify your SAS output with other tools such as Excel before presenting your results? If so, then PROC TABULATE is an excellent procedure to learn which can expand your SAS toolkit when it comes to computing and displaying descriptive statistics.

While the statistics produced by PROC TABULATE can be calculated with other procedures such as PROC MEANS or PROC FREQ, the main advantage of PROC TABULTE is the flexibility available in creating a wide variety of output tables. In this article, we will present a number of examples on how PROC TABULATE can be used to present easy to read one-dimensional, two-dimensional and three-dimensional tables.

Using examples with one, two and three dimensional tables, the following components of PROC TABULATE will be covered:

Adding multiple statistics to your tables

Using parentheses to shorten and simplify your code

Adding classification variables to one or more dimensions

Nesting classification variables to report on multiple by variables

Adding subtotals and totals to multi-dimensional tables

Adding row and column percentages

Applying formatting, adding or dropping row and column labels​

Software​

Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!​

Data Sets

In this article, the CARS dataset from the SASHELP library will be used to illustrate a number of different uses for the PROC TABULATE procedure. The CARS dataset contains, among other variables, MSRPs (Manufacturer suggested retail prices) on a wide range of vehicles. The examples that follow will primarily look at ways to examine the MSRPs and compare between different categories and classification variables found in the CARS dataset.​

The most basic usage of PROC TABULATE, a one-dimensional table, requires at least two components: (1) a VAR statement to specify an analysis variable and (2) a TABLE statement to specify which variables to include in the output table and where to put them in the table.

For this first set of examples, we are going to compute some statistics on MSRP (Manufacturers Suggested Retail Price) of the vehicles found in the SASHELP.CARS dataset. To use the default settings with PROC TABULATE in this example, we simply need to specify which dataset we are analyzing with the data argument (data=SASHELP.CARS), use the VAR statement to specify MSRP as the analysis variable (VAR MSRP) and the use the TABLE statement to specify that MSRP should be included in the output table (TABLE MSRP). The complete PROC TABULATE syntax is as follows:​

proc tabulate data = sashelp.cars; var msrp; table msrp;run;

As you can see in the output below, the default statistic from PROC TABULATE is to provide the SUM of the analysis variable, which is MSRP in this case:

To include a statistic other than SUM, you can modify the TABLE statement. To specify which statistic to include, you need to list the target variable name followed by an asterisks (*) and then the statistic you would like to include after the asterisks. In the PROC TABULATE TABLE statement, the asterisks serves as an operator, telling SAS which statistic to compute for which variable.

There is a wide variety of statistics that can be included with PROC TABULATE, of which a complete list can be found in the SAS documentation. Some of the most commonly used descriptive statistics which will be used in this article include MIN, MAX, MEAN, N and SUM.

So, to determine the max MSRP in the SASHELP.CARS dataset, we can add an asterisks(*) followed by MAX after MSRP in our TABLE statement:

proc tabulate data = sashelp.cars; var msrp; table msrp*max;run;

Which, as you can see below, now outputs the max MSRP found in the CARS dataset instead of the default SUM

To include more than one statistic for a variable, you can list the variable followed by the asterisks (*) operator and then each desired statistic in the TABLE statement. For example, if you would like the MEAN, MIN and MAX, you would specify the tables statement as follows:​

As you can see in the desired output, the 3 statistics for MSRP are shown in a table:​

To simplify your PROC TABULATE code, parentheses can be used to specify multiple statistics for the same variable, without having to specify the same variable name each time. For example, the same 3 statistics as above (MEAN, MIN and MAX) can also be obtained using the following syntax, where the variable name is specified once, followed by the asterisks (*) operator and then the desired statistic(s) within parentheses:

As you can see in the output below, grouping the statistics within parentheses also produces a cleaner output as the MSRP labels are merged into one heading row at the top of the table, reducing redundancy:

Your PROC TABULATE output can be further enhanced by adding grouping or classification variables. For example, say you would like to know what the average (mean) MSRP is for all vehicles by continent of origin (ORIGIN). To include the variable ORIGIN, there are two parts you need to modify in the call to PROC TABULATE.

First, a CLASS statement needs to be added to tell PROC TABULATE that you are adding the ORIGIN variable as a classification variable. Next, the TABLE statement needs to be modified to indicate that the mean MSRP values should be grouped by Origin. To add ORIGIN, simply include a second asterisks operator (*) after the chosen statistic (MEAN in this case) as shown below:

Now, the output table includes the mean MSRP values by Origin for all vehicles in the SASHELP.CARS dataset:

Adding Labels and Formatting Output

As with other SAS procedures, you can add formats and labels to enhance your output. In addition, PROC TABULATE also gives you the ability to remove row or column labels to reduce clutter in your output.

If for example in the table above you did not want to include the mean row label and also wanted to apply a label to the MSRP name, you can do so by adding an equal sign (=) after the variable name in the TABLE statement followed by the desired label in quotations. To remove a row or column label, simply leave a blank between the quotation marks. To format the mean numbers with dollar signs, commas and no decimals places, you could add the dollar8 format to the PROC TABULATE statement:

Now, as you can see below, the output is much easier to read. The “Mean” row label has been hidden, MSRP has been expanded to give a more descriptive label and the mean values have been converted to a much easier to read format for a dollar amount:​

As you can see below, the mean MSRP for the 3 continents of origin are split into 3 columns:

To transpose the results, and have the Origin values as rows instead of columns, a comma (,) is used in the tables statement. The argument(s) before the comma will become the rows in the table and the argument(s) after the comma become the columns. So, in this example our tables statement will have ORIGIN listed before the comma while MSRP and MEAN are added after the comma, as shown here:

As you can see, the table has now been transposed, with the origins as rows now instead of columns:​

Of course, transposing your table is not the only use case of the the comma operator in the tables statement. Say for example you would like to determine the Mean and Max MSRP of all vehicles in SASHELP.CARS by Origin.

While it is possible to produce these results as a one-dimensionak table without the comma, the table is not very elegant in appearance. To produce these results as a one-dimensional table, the table statement would start with MSRP (the variable of interest), followed by mean and max (the desired statistics) and finally the classification variable, origin:

As you can see in the results, the table is quite busy as the Origin’s need to be repeated twice for both Mean and Max.

​Producing the same results as a two-dimensional table creates a much easier to read table. In order to not have the origin values repeated twice, we can use the comma to specify that origin should become a row variable while the Mean and Max of MSRP will be placed after the comma to tell PROC TABULATE they will be the column values. The syntax to create this would be:

You can continue to build up your two-dimensional tables by adding an additional classification variable. For this example, let’s say you would like know the Mean MSRP for all vehicles in the SASHELP.CARS dataset by Origin as before, but you would also like the Mean MSRP’s broken down by vehicle type.

First, since we are adding a new classification variable, TYPE, we need to add the TYPE variable to the CLASS statement. Next, the table statement includes ORIGIN as the row variable before the comma. After the comma, MSRP, MEAN and TYPE variables are added to indicate that we would like to include them as column values:

As you can see in the output below, we now have the desired breakdown of mean vehicle MSRPs both by type and Origin.

Nesting Classification Variables

Another unique feature of PROC TABULATE is the ability to nest classification variables. Nesting classification variables allows you to add additional classification variables to the row, column or both components of your table.

For this next example, you would like to have the mean MSRP values for all vehicles types not only broken down by Origin, but also by Drivetrain (DRIVETRAIN). That is, for each origin, you would like to see how the mean MSRPs compare by Drivetrain for all vehicle types.

To generate this new table, there are two modifications that need to made to the SAS syntax from the previous example. First, since we are introducing a third classification variable, DRIVETRAIN, we need to add DRIVETRAIN to the CLASS statement. Next, DRIVETRAIN needs to be added after origin with the asterisks (*) operator to specify that you would like DRIVETRAIN as an additional classification variable for ORIGIN in the rows of the table:

The desired output is produced as shown below. You will notice now that we have introduced a large number of missing values into the table. The missing values result from having no vehicles fall under a given cross-tabulational scenario. For example, there are no Hybrid vehicles in the CARS dataset from Europe or USA. There are also no Truck’s from Asia that are Front wheel drive, and so on.

Additional Formatting Tips

As you continue to add layers to the table, they can become quite busy and difficult to read. Earlier we discussed how to add labels to variables by adding an equal sign followed by the label in quotation marks after each variable in the TABLE statement. From our previous example, adding labels can be done as follows:

However, now that we have a two-dimensional table with a nested classification variable, our TABLE statement is quite long and becomes difficult to read with all the variable and statistic labels added.

To make the code easier to read, you can add a LABEL statement to label variables and the KEYLABEL statement to label statistics so you don’t need to clutter up the TABLE statement. After the LABEL or KEYLABEL statements, simply specify the variable or statistic, followed by an equal sign and the desired label in quotation mark. Recall that by setting the mean statistic equal to a blank label, we can hide the row heading with “Mean” to tidy up the output table.

We can also add a label to the large empty box on the top left of the table using the BOX option with the TABLE statement, which can be used to give the entire table a title.

Individual statistics can also have formats applied. By adding an F=formatname to the desired statistic in the table statement, you can apply any standard format such as the Dollar8. format to the mean statistic.

​As you can see in the output below, the mean MSRP values are now formatted, more informative row and column labels have been added, and a title has been added to the top left box:

Totals, Subtotals and Percentages

Similar to the cross-tabulation output that you can produce with PROC FREQ, PROC TABULATE also has the flexibility to include totals, subtotals, row and column percentages in your table.

In this next example, we will determine how many vehicles (N) of each drivetrain (DRIVETRAIN) in SASHELP.CARS are from each continent (ORIGIN). In addition, we will also add the total number of cars for each DRIVETRAIN, regardless of origin.

To add these subtotals to your PROC TABULATE output, you simply need to include the ALL keyword in your TABLE statement. The important part is where you place the ALL statement. If you add ALL to the left side of the comma like in the example below, you will get the column subtotals:

​PROC TABULATE also provides a number of different options for including percentages in your table. Some of the most common percentages that you can include are ROWPCTN (row percentages), COLPCTN (column percentages) and PCTN (percentage of the total).

To include a percentage in your table, simply add one of the percentage keywords to your TABLE statement in the same way you would add a statistic. For example, to include the row, column, and total percentages in addition to the count of drivetrain to your table, you would use the following syntax:

​As you can see in the output shown below, we now have 3 percentage measures added to the table which shows the number of vehicles by origin and drivetrain:

Three Dimensional Table

Despite the name, three-dimensional tables created with PROC TABULATE are not truly three dimensional cubes by traditional definition. However, what PROC TABULATE does give you is the ability to easily create multiple two-dimensional tables for different classes or categorical variables without having to run PROC TABULATE multiple times or create multiple input datasets.

First, let’s start with a simple two-dimensional table. For this example, we are interested in the mean MSRP of vehicles by the number of cylinders in the engine (CYLINDERS) and the continent of origin (ORIGIN):

​Since we included origin on the left of the comma and MSRP, CYLINDERS and mean on the right side of the comma, the output has the continent of origin as rows with the mean MSRP by cylinders as the columns:

Next, we would also like to breakdown the mean MSRP values by DRIVETRAIN, in addition to ORIGIN and CYLINDERS. To do this, we now have two options for coding our PROC TABULATE call. Keeping with the two dimensional approach, we can add DRIVETRAIN as a classification variable by adding it to the CLASS statement and then also add DRIVETRAIN to the left side of the comma as a classification variable for origin:

​While the resulting table produces the desired results, the table is cluttered and can be difficult to read:

​Using the three-dimensional approach, we can tell PROC TABULATE to produce the original table multiple times, one table for each level of DRIVETRAIN. Similar to how we transformed a one-dimensional table into a two-dimensional table by adding the comma separator, we now add a second comma to the TABLE statement in order to add the “third dimension”.

The position for third dimension classification variable is also important. To produce a table for each drivetrain that has the origins as rows and the cylinders as columns, the TABLE statement needs to start with drivetrain (our “third dimension”) followed by a comma, then origin and finally a second comma before MSRP, cylinders and mean, as shown below: