Proc Transpose Tutorial

​Do you often need to transpose your data sets from long (multiple rows per subject) to wide (multiple columns per subject)? Are you looking for a quicker, more efficient way to transpose your data compared to using complex Data Step Code and DO LOOPs? Would you like to better understand the many options available in PROC TRANSPOSE?

This article will walk through the different uses of PROC TRANSPOSE, providing a variety of helpful examples along the way to help you gain a better understanding of the most important features and use cases of PROC TRANSPOSE.

The following topics will be covered in the article:

Transposing Long to Wide Datasets(a) Simple long to wide transposition(b) Applying the PREFIX, LABEL and NAME options(c) Selecting which variables to transpose (VAR Statement)(d) Expanding Transposed Variables into groups (BY Statement)(e) Using variables values to name transposed variables (ID Statement)

Transposing Wide to Long Datasets(a) Comparison with Long to Wide transpositions(b) Simple wide to long transposition​(c) Expanding Transposed Variables into groups (BY Statement)

Software​

Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!​

Data Sets

The SASHELP.SHOES dataset used in this article contains data on SHOE sales and inventory, broken down by shoe products as well as the geographic variables region and subsidiary. For the purposes of the examples below, a duplicate record must be removed from the SHOES dataset. To create a cleaned copy of SHOES to save in your WORK directory, simply run the following code to create the SHOES_CLEAN dataset in WORK for use in future examples found in this article.

Transposing Long to Wide Datasets

PROC TRANSPOSE provides the ability to go from a long dataset (where there are multiple rows for a given subject) to a wide dataset (where there are multiple columns for a subject).​You can find multiple examples in the SASHELP library to help illustrate what a long dataset looks like. The ORSALES dataset in SASHELP has sales data for different products, with one row for each quarter in a year. If you using the following SAS code to sort ORSALES, you can see that the product groups are repeated across multiple rows, resulting in a long dataset:

In the ORSALES_SORTED dataset created by the code above, you can see why this would be considered a “long” dataset”:

The SASHELP.SHOES dataset which contains sales and inventory data on shoes by region is another example of a long dataset. As you can see the in the dataset below, each row contains data about a single product from each of the subsidiaries. The subsidiary is repeated for each product and the products are repeated for each subsidiary:

​The following examples will illustrate how to utilize PROC TRANSPOSE to convert this long dataset (where there are multiple rows per subsidiary) into a wide dataset (where the rows are uniquely defined by each subsidiary).

First, let’s see what happens when you use PROC TRANSPOSE on SASHELP.SHOES without any options. In the SAS code below, SASHELP.SHOES is the original input dataset (specified using DATA=) and WORK.SHOES_TRANS is the output dataset which we would like to transpose (specified using OUT=):

proc transpose data = sashelp.shoes out = shoes_trans;run

As you can see in the Output Data, the SASHELP.SHOES dataset has now been transposed from long (with 395 rows and 7 columns) to wide (4 rows and 397 columns):

However, the results are not quite what we were looking for and there a few important points to note about the transposed dataset we just created.

First, notice that only the numeric variables have been kept and transposed in the output dataset (i.e. REGION, PRODUCT, SUBSIDIARY are not found in the data). This is because by default SAS will only transpose the numeric variables. Next, you can see that the column names are system generated and as a result are not very informative. Finally, with the important variables such as product, region and subsidiary information removed from the dataset, it becomes difficult to analyze or interpret in it’s current state.

To address these gaps in the newly created transposed data set, let’s walk through some of the options available within PROC TRANPOSE.

Variable Names (PREFIX, NAME and LABEL options)​

​First, let’s address how to rename the systems generated _NAME_, _LABEL_ and COL# variables found in our newly created SHOES_TRANS dataset. ​To rename the COL# variables, we use the prefix option to specify a prefix to the sequentially numbered variables. In the SAS code below, the COL# variables can be renamed to be Amount1, Amount2, Amount3 etc., by specifying prefix=Amount. Similarly, the _NAME_ variable can be renamed by using the NAME= argument and the LABEL= argument is used to rename the _LABEL_ variable, as shown here:

As you can
see in the revised output dataset, there are now more informative variable
names in the data:

Selecting which Variables to Transpose (VAR Statement)

As discussed earlier, the goal of this transposition exercise is to create a wide dataset for all the Sales amounts for all products by each Subsidiary. The next step is to select which variables to keep in our transposed dataset. Since we are not interested in the Inventory, Number of Stores or Total Returns at this time, we can add the VAR statement to tell SAS that only SALES should be transposed (and kept) in the output dataset:​

Now, as you can see in the partial Output Dataset shown below, the resulting data has only 1 row, since the Inventory, Stores and Total Returns have been excluded:

Expanding Transposed Variables into Groups (BY Statement)

The next step is to expand the data to break down the Sales amounts in groups by each Subsidiary. This can done easily using the BY statement. However, before we can use the BY statement, the input dataset must first be sorted by the BY variable (SUBSIDIARY in this example).

In the SAS code shown below, a PROC SORT statement is used to create SHOES_SORT which is sorted by SUBSIDIARY. Next, the BY statement is added to the PROC TRANSPOSE call to expand the SALES data for each SUBSIDIARY:

As you can see the Output Data SHOES_TRANS shown partially below, we now have 1 row for each subsidiary, with the Total Sales amount for each product in columns:

Using Row Values to Name Columns in a Transposed Dataset (ID Statement)

While the transposed dataset has now made it much easier to see the Sales data by Subsidiary, it’s not year clear which Amount corresponds to which product. Although the Amount label we added with the PREFIX option earlier is certainly more informative than the COL# variables which were created by default, the ID statement allows you to apply even more informative variable headings.

Since we know that each Amount variable corresponds to a given Product in the original SHOES dataset, we can tell SAS to use the values of PRODUCT to name the columns in our transposed dataset using the ID statement, as shown in the code below. Since the ID variable will now apply variable names to columns with Sales data, we no longer need to include the prefix option in this example, so it has been removed.

Note that the SHOES_CLEAN dataset, created using the SAS code at the beginning of the article, is being used in this example:

Building on the previous example, suppose now that you would also like to transpose the inventory amounts for each product within a given subsidiary. Unfortunately, simply adding the INVENTORY variable to the VAR statement does not yield the desired result, which can be verified by running the code below:

As you can see by running the above code, the Output Data set now contains 2 rows for each Subsidiary – one for Sales and one for Inventory:

Since the goal is to create a wide dataset that has only one row per subsidiary and multiple columns for Sales and Inventory, a new strategy needs to be implemented.

The most effective way to transpose multiple variables and uniquely define the rows of your dataset by a single subject (Subsidiary in this case) is to call PROC TRANSPOSE twice and then merge the resulting dataset together.

The first step in the process is to alter the dataset so that we can distinguish the product sales amounts from the product inventory amounts. To do this, a prefix label is added to denote either Sales or Inventory. Since Sales and Inventory will now be added as prefixes to the column names, the _NAME_ and _LABEL_ variables are no longer required, so we can drop the name and label parameters as well as the _NAME_ and _LABEL_ variables themselves.

To create both a transposed Sales and Inventory dataset, PROC TRANSPOSE is run twice, once with only SALES specified in the VAR statement and a second time with only INVENTORY specified in the VAR statement. Two new datasets, SHOES_TRANS_SALES and SHOES_TRANS_INV are then created in WORK using the code below:

Using the code above, the SHOES_TRANS_SALES and SHOES_TRANS_INV datasets, shown partially below, are produced:

​Next, to create a single wide dataset that has both the transposed Sales and Inventory variables, the two newly created datasets are merged together using SAS DATA STEP by their common unique variable, SUBSIDIARY:

data shoes_trans_all; merge shoes_trans_sales shoes_trans_inv;run;

​As you can see in the partial output below, we now have a combined dataset with 17 columns that contains both the transposed SALES and INVENTORY values for each product, uniquely defined by SUBSIDIARY:

Transposing Wide to Long Datasets

In addition to being able to transpose a long dataset to a wide dataset, PROC TRANSPOSE also has the ability to transpose a wide dataset to a long dataset.

The concepts and syntax for transposing wide to long datasets are essentially identical, but the goal when creating wide dataset from a long dataset is different. With a wide to long dataset transformation, the goal is to reduce the number of columns per subject and create a data structure where multiple rows are used to define the different attributes of a subject.

While the concepts of using PROC TRANPOSE to create a long dataset from a wide dataset shown earlier in this article can also be applied to wide to long dataset transpositions, the following examples illustrate how a wide to long transformation looks.

One example of a wide dataset in the SASHELP library is SASHELP.LEUTEST data set. In SASHELP.LEUTEST, there are more columns than rows and each row uniquely represents a genetic sample of different types of leukemia (y), while the 7129 x1-x7129 columns represent the genes:

Converting the LEUTEST dataset from a wide dataset to a long dataset is very similar to preforming a simple long to wide dataset conversion with PROC TRANSPOSE.

First, let’s start with a simple call to PROC TRANSPOSE using the following code and then examine the results:

proc transpose data = sashelp.leutest out = leutest_trans ;run;

Since we are dealing with only numeric variables in this example, by default SAS already selects all the numeric variables from the original SASHELP.LEUTEST dataset and transposes them to create the LEUTEST_TRANS dataset shown partially below.

As you can see below, we have now gone from having 1 row per sample and over 7000 columns for each gene, to a dataset with one row per gene and 34 columns representing each of the 34 different samples:

​While we have achieved the desired result with few options, we can further enhance the output data by adding new variable names. The system generated _NAME_ variable can be replaced using the NAME= and the PREFIX= option with the PROC TRANSPOSE statement can be used to rename the COL1-COLN variables:

Using the NAME and PREFIX options, we now have an easier to interpret output set which has been successfully transposed from wide (7130 columns and 34 rows ) to long (35 columns and 7130 rows), as you can see in the partial output data shown below:

As with long dataset to wide dataset transformations, the BY statement can also be used to with wide to long dataset transformations to expand transposed variables into by groups.

The SASHELP.APPLIANC dataset contains sales data for 24 appliances across 156 different sales cycles. Although there are more rows than columns in this example, the number of units sold for each appliance is divided across 24 columns. The goal with this transposition is to create a long dataset of appliances where there is only 1 column which contains all of counts of units sold for all appliances, and there are separate rows per appliance, per cycle.

In this example, we will start by running a simple PROC TRANSPOSE without any options.

proc transpose data = sashelp.applianc out = applianc_trans;run;

By default, SAS transposes the data in such a way that there is a separate row for each appliance, and there are now separate columns for the number of units sold for all cycles:

While this may be helpful in some cases, our desired transformation in this example was to transpose the data so that we have just a single column for all appliances that contains the number of units solid for all cycles.

Since we are only handling numeric variables in the case, SAS has already automatically selected all the variables to transpose. Thus, in this example, the only modification we need to make is to add a BY variable for cycle so that the number of units sold is broken down into separate rows by the cycle number. Recall that by adding a BY variable, the dataset must be sorted by that same variable, as shown in the code below:

In the output dataset shown partially below, you can see we have now created a very long dataset with 3744 rows (separate rows per unit and per cycle) and a only 4 columns (the cycle number, appliance number, label and the number of units sold):

​As before, the Output Data can be refined by adding more suitable variable names in place of _NAME_, _LABEL_, and COL1. Using the name, label and prefix options respectively, the auto generated _NAME_, _LABEL_, and COL1 variables can be easily renamed:

After adding the name, label and prefix options to the PROC TRANSPOSE statement, we now have an output dataset which is easier to understand, interpret and analyze:

One of the primary advantages of having data such as the number of appliances sold in a single column is that performing calculations such as descriptive statistics becomes much easier than if you needed to perform the same calculation across multiple variables.

For example, the following PROC MEANS code can be used to easily generate the overall total and mean value for number of appliances sold across the entire dataset: