Metrics Maven: Beyond Average - A Look at Mean in PostgreSQL

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. Over the next few articles, we'll get cozy with mean, median, and mode in PostgreSQL.

Mean, median, and mode: these three metrics can provide valuable insights into your data. Over the next few articles, we'll get deeply familiar with each of them and highlight some key considerations for understanding what's really contributing to the numbers we report.

Mean

You're probably already familiar with how to calculate a mean, which usually goes by its more common name — average. Here's a quick refresher, though... To find the mean of a set of numbers, you add them all up and then divide by the count. PostgreSQL provides the AVG aggregate function to compute this metric for you. We've used the AVG function a few times for various examples in other articles. In the example below, you'll see our use of the ROUND function with AVG, which we covered in our Making Data Pretty article:

SELECT
    ROUND(AVG(item_count), 2) AS avg_items,
    ROUND(AVG(order_value), 2) AS avg_value
FROM orders;

The mean values (a.k.a. "averages") we get back look like this:

 avg_items | avg_value
-----------+-----------
      1.91 |     19.98

Now, in looking at the data set for this query, you may have noticed that order 50005 has 0 items and has a NULL order value, so you may be wondering how it's treated in the average calculation. Let's look.

Nulls and Zeroes

Understanding how NULL values and 0 values are handled is essential when reporting metrics so that you can be sure you're getting the numbers you expect.
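To see the default behavior in isolation before we dig into our orders data, here's a quick one-off sanity check (not part of our report queries) showing that AVG skips NULLs entirely rather than counting them as 0:

```sql
-- AVG ignores the NULL row: the result is (1 + 2) / 2 = 1.5, not (1 + 2) / 3
SELECT AVG(x) FROM (VALUES (1), (2), (NULL)) AS t(x);
```

If AVG treated NULL as 0, we'd get 1.0 back instead of 1.5 — so by default, a NULL shrinks both the sum and the count.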

Let's look at the avg_items and avg_value metrics we calculated above. In the case of avg_items, the average is calculated using all 11 values (0 counts as a value by default), so the AVG function is performing this calculation: 21 total items ÷ 11 orders ≈ 1.91. For avg_value, on the other hand, the NULL is ignored by default, so the sum of 199.80 is divided by only the 10 orders that have a non-NULL value: 199.80 ÷ 10 = 19.98.

To alter this default behavior (either to ignore 0 values or to include NULL values), we can add some conditions to act on the AVG function. In this next example, we've chosen to ignore the 0 value for avg_items and to include the NULL value for the avg_value:

SELECT
    ROUND(AVG(NULLIF(item_count, 0)), 2) AS avg_items,
    ROUND(AVG(CASE WHEN order_value IS NULL THEN 0 ELSE order_value END), 2) AS avg_value
FROM orders;

In the above query, for avg_items, we're using NULLIF to set 0 values to NULL for the calculation so that they will be ignored by the default behavior. For avg_value, we're using a CASE WHEN conditional to convert NULL values to 0 so that they will be included by the default behavior. Here's our new result:

 avg_items | avg_value
-----------+-----------
      2.10 |     18.16
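As an aside, the CASE WHEN expression above can be written more concisely with COALESCE, which returns the first non-NULL argument it's given. This sketch is equivalent to the query above:

```sql
SELECT
    ROUND(AVG(NULLIF(item_count, 0)), 2) AS avg_items,
    -- COALESCE(order_value, 0) substitutes 0 for NULL, just like the CASE WHEN
    ROUND(AVG(COALESCE(order_value, 0)), 2) AS avg_value
FROM orders;
```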

What we actually want for our report, however, is for any orders with 0 or NULL values to be ignored, so our query will end up looking like this:

SELECT
    ROUND(AVG(NULLIF(item_count, 0)), 2) AS avg_items,
    ROUND(AVG(order_value), 2) AS avg_value
FROM orders;

And we'll have the following values for our final report:

 avg_items | avg_value
-----------+-----------
      2.10 |     19.98
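If you're on PostgreSQL 9.4 or later, the same exclusion of 0 values can also be expressed with the aggregate FILTER clause instead of NULLIF. A sketch of that variation:

```sql
SELECT
    -- rows where item_count = 0 are excluded from this aggregate only
    ROUND(AVG(item_count) FILTER (WHERE item_count <> 0), 2) AS avg_items,
    ROUND(AVG(order_value), 2) AS avg_value
FROM orders;
```

FILTER makes the intent more explicit, while NULLIF works on any PostgreSQL version; the results are the same either way.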

Since we've also got quite a few orders with the same item counts and order values, we may want to get an idea of what the means would be if we average only the unique values instead.

Distinct

By adding DISTINCT inside the AVG function, we can calculate the mean using only the unique values.
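Applying DISTINCT inside each AVG call, while still ignoring 0 values via NULLIF as in our final query above, we get something like:

```sql
SELECT
    -- DISTINCT makes each AVG consider every unique value only once
    ROUND(AVG(DISTINCT NULLIF(item_count, 0)), 2) AS avg_items,
    ROUND(AVG(DISTINCT order_value), 2) AS avg_value
FROM orders;
```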

As we can see, the values are quite a bit higher when we remove duplicate values for item count and order value. These would be the values we might expect if our orders were spread more evenly. What this tells us is that our orders tend more toward the lower end of the range for both the item count and the order value, which brings our average values down in our final report. This indicates that we're probably not getting the coverage across our catalog that we would want. If we saw lower values here than we got from our final report, then we'd know that our orders tended more toward the higher end.

It's always good to get a sense of whether your data is on the lower or higher end so that your business can decide how (or if) it wants to try to shift that data in the future. In this case, if we wanted to shift our orders more toward the higher end, we might do some kind of promotion to increase the number of items purchased per order and focus an advertising campaign on higher-priced items.