Metrics Maven: Making Data Pretty in PostgreSQL

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll look at how you can make your reports more understandable (aka "pretty") using built-in functions and operators in PostgreSQL.

The first step in producing any report is to know the audience. Will the report be used in the VP's quarterly slide deck to the C-level executives? Is it intended for use by the product marketing team to understand the impact of a recent ad campaign? Or will it be used as input to analyses by fellow data scientists? Depending on who will be using your reports (and for what purposes), you may need to make the output more high-level and human-friendly (or focus more on precision and detail). For our use case, we're going to continue looking at app downloads, but we want to add some additional human-friendly detail to the report and make the data easy to understand.

In our previous article about how to calculate a moving average, we successfully calculated both a simple and a cumulative moving average of app downloads by day. But the results left a little something to be desired since the average values were calculated to 16 decimal places. Here's a sample of what our output looked like for the simple moving average over 30 days:

That kind of precision is undoubtedly important in some scenarios, but for us it's overkill (read: "ugly"). In this article, we're going to use some built-in Postgres functions and operators to make it more human-friendly by adding some detail in some areas and simplifying in others. We're calling this "making the data pretty". We'll use this data from the simple 30 day rolling average calculation to walk through some options for our use case here.

Mathematical functions

Using PostgreSQL mathematical functions, we have a few different options to tackle the "too many decimal places" problem. We can round our results, truncate our results, or find the ceiling or floor. Any of these functions will make our data easier to read. Each of these functions has its own nuance, though, so let's compare them.

Before we dive in, an important thing to note is that the mathematical functions described below cannot be used with the OVER clause for window frames since it expects window functions (like RANK() or ROW_NUMBER()) or aggregations (like SUM() or AVG()). That puts a minor hitch in our plan, but we can easily handle it by using WITH to create a common table expression (CTE) or by using our existing query as a derived table wrapped with a new query. Since CTEs help make queries easier to read, we're going to use that option. Our CTE for the examples in this article, then, looks like this and we're calling it "avg_app_downloads_30_days_rolling_by_day":

```sql
WITH avg_app_downloads_30_days_rolling_by_day AS (
  SELECT ad.date,
         AVG(ad.downloads)
           OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
  FROM app_downloads_by_date ad
)
SELECT ....
```

To this base, we'll add our new SELECT query for creating our final report.

Rounding numbers

The first function we're going to look at to simplify our moving average values is round. The round function rounds the value up or down, and we can choose whether we want to round to a certain number of decimal places or just to the nearest integer value.

To round to a certain number of decimal places, we provide the number of decimal places we want to see as the second input to the function; the first input to the function is the value we want to round (in our case, the value from the "avg_downloads" field). Let's say we want to round to just 2 decimal places. Our query will look like this:

```sql
WITH avg_app_downloads_30_days_rolling_by_day AS (
  SELECT ad.date,
         AVG(ad.downloads)
           OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
  FROM app_downloads_by_date ad
)
SELECT date,
       round(avg_downloads, 2) AS avg_downloads_rounded_decimal
FROM avg_app_downloads_30_days_rolling_by_day;
```

We don't want to deal with partial downloads, though. It makes more sense for our use case to round to the nearest whole number. For that reason, we'll apply round to just the value from the "avg_downloads" field, like we show below, without specifying a number of decimal places. Also, from here on out in the examples in this article, since we know the CTE isn't changing, we'll focus on just the new SELECT query we're performing on the CTE:
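A sketch of that integer-rounding query against the CTE might look like this (the column alias here is our own choice):

```sql
SELECT date,
       round(avg_downloads) AS avg_downloads_rounded
FROM avg_app_downloads_30_days_rolling_by_day;
```

With no second input, round snaps each average to the nearest whole number.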

Notice how June 9th and 11th got rounded up to the next whole number while the other dates were rounded down.

Truncating numbers

Another option is to truncate the values. trunc can be used to truncate the value to a certain number of decimal places or to the integer value. The difference between trunc and round is that there is no rounding; any digits past the specified position are simply chopped off. trunc follows the same syntax as round, so here's what a query looks like for truncating to 2 decimal places:
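Following the same shape as our round queries, a sketch of the truncating version (alias assumed) is:

```sql
SELECT date,
       trunc(avg_downloads, 2) AS avg_downloads_truncated_decimal
FROM avg_app_downloads_30_days_rolling_by_day;
```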

For this data, the results are almost the same as when we were using round, except that June 8th didn't get rounded up. If we truncate to just the integer without specifying any decimal places, our results will look like this:
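For reference, the integer form of the query simply drops the second input (a sketch, with our own alias):

```sql
SELECT date,
       trunc(avg_downloads) AS avg_downloads_truncated
FROM avg_app_downloads_30_days_rolling_by_day;
```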

June 9th and June 11th are now one whole number different from the results we got when we used round. While this may seem insignificant for this data set, it can be very impactful for other data sets and could skew the results one way or another if this report were used for further calculations down the line. Understanding the nuances of how different functions affect the data is key to selecting the right ones for your purposes.

Number ceilings and floors

Before we leave mathematical functions, let's have a look at ceiling and floor. The ceiling function (you can also use the shortened form ceil) will round up to the nearest integer value. Conversely, floor will round down to the nearest integer value. These functions don't support decimal places; they deal in whole numbers only. Let's compare them to the functions we already reviewed and to each other.
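A side-by-side comparison query might look like this sketch (column aliases are our own):

```sql
SELECT date,
       ceiling(avg_downloads) AS avg_downloads_ceiling,
       floor(avg_downloads)   AS avg_downloads_floor
FROM avg_app_downloads_30_days_rolling_by_day;
```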

For this data, because floor rounds down to the nearest integer, the results are the same as if we'd used trunc, but each of them is one whole number different from the results of ceiling. That can be significant, so make sure to choose your functions wisely.

Now that we have an idea of how different mathematical functions can affect our results, we're going to stick with rounding to the nearest integer for our use case. That way we don't have to deal with the partial downloads represented by decimal places, and we can approximate the closest whole number the moving average is indicating.

Let's turn our attention to formatting the date next.

Formatting functions

Our date field is pretty simple and easy to read, but for the final report we want to generate, we'd like to have a more human-friendly textual label for the date. We can use formatting functions to play with the "date" field in our results. Specifically, we're going to use the to_char function to list out some elements of the date that will be useful for reporting and analysis, including day of the week, month name, and quarter of the year. The to_char function takes any numeric data type, as well as timestamps and intervals, and converts it to a string using a pattern that we can specify. Though our "date" field is not, strictly speaking, a timestamp, it will be automatically converted to a timestamp by the function for processing.

Let's start with a simple example to get the quarter using the to_char function:
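A sketch of that query, keeping the original date alongside the formatted value:

```sql
SELECT date,
       to_char(date, 'Q') AS quarter
FROM avg_app_downloads_30_days_rolling_by_day;
```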

Here, the to_char pattern we're using to convert the date to a string is "Q" for quarter. Pretty straightforward, but we've kept the original date value for reference. The result we get back is the quarter number of the year: a "1" represents months from January through March, "2" April through June, "3" July through September, and "4" October through December. Since we've been using dates in June in our examples, our results look like this:
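For reference, a query listing the patterns described below side by side might look like this sketch (the template patterns are standard to_char patterns; the aliases are our own):

```sql
SELECT date,
       to_char(date, 'YYYY') AS year,          -- 4-digit year
       to_char(date, 'Q')    AS quarter,       -- quarter number of the year
       to_char(date, 'MON')  AS month,         -- 3-character month, all caps
       to_char(date, 'W')    AS week_of_month, -- week number of the month
       to_char(date, 'DD')   AS day,           -- 2-digit day
       to_char(date, 'Day')  AS day_of_week    -- weekday, first letter capitalized
FROM avg_app_downloads_30_days_rolling_by_day;
```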

Based on the patterns we've specified in our query, we're going to get a 4-digit year, the quarter number of the year (which we looked at above), the 3-character shortened form of the month in all caps, the week number of the month, the day as a 2-digit number, and also the day of the week with the first letter capitalized.

Patterns can also be combined into a single request. Now that we have a sense of what some of the different patterns return, let's combine a few patterns, as an example, to produce a new "day_of_week" column:
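Such a combined pattern might be sketched like this (the column name "day_of_week" is the one we use in this article):

```sql
SELECT date,
       to_char(date, 'Month DD, Day') AS day_of_week
FROM avg_app_downloads_30_days_rolling_by_day;
```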

In the revised query above, we've indicated that the "day_of_week" column should have the full month name with a capitalized first letter, followed by the 2-digit day and a comma, then followed by the day of the week fully spelled out with the first letter capitalized.

Notice that the full name of the month and the full name of the day of the week in the "day_of_week" column are padded to 9 characters. If you use a shortened form of the name (such as if we'd used "MON" to get the capitalized shortened form of the month instead), the padding is not used.

The to_char formatting function is going to be useful for making the dates in our final report "pretty". It's not the only option for dates, though. There are also some handy date functions to be aware of.

Date functions

There are two date functions in particular we'll look at here briefly: date_part and date_trunc. These can be used to extract elements from or simplify timestamps (dates are converted to timestamps automatically for processing, same as with the to_char function).

Extracting date parts

date_part can be used to pull individual elements from a timestamp, time, or interval. Note that the extract function does the same thing as date_part. extract is SQL-standard compliant whereas date_part is based on Ingres, but since the extract function in PostgreSQL uses the date_part syntax under the hood, you can just call date_part directly. If you're concerned about the portability of your code to other SQL systems, then choose extract instead of date_part.

The difference between date_part and the to_char function we looked at above is that date_part returns a numeric value. Because of this, there are no formatting patterns for date_part. If you specify "month", you'll get back the numeric value of the month only. The to_char function can also return the number of the month, but it has the option of returning the 3-character short form of the month (with various capitalization options) or the complete name of the month spelled out (with various capitalization options).

Here's an example of extracting the month using date_part (notice how the syntax is different... the part of the date to return is the first input; the value to process is the second input):
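A sketch of that extraction (the alias is our own):

```sql
SELECT date,
       date_part('month', date) AS month  -- numeric month, e.g. 6 for June
FROM avg_app_downloads_30_days_rolling_by_day;
```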

Truncating dates

date_trunc is similar to the trunc function we looked at for number values in the section above about mathematical functions, but it operates on timestamps and intervals. There are different levels of truncation that can be applied. Since our dates don't go more granular than the daily level, we could truncate at the month level, quarter level, year level... all the way up to the millennium if that were necessary for the use case.

Let's look and see how date_trunc works at the month level (you'll notice it uses the same syntax as date_part):
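Here's a sketch of that query (alias assumed):

```sql
SELECT date,
       date_trunc('month', date) AS month  -- June dates collapse to the first of the month,
                                           -- e.g. 2016-06-01 00:00:00+00
FROM avg_app_downloads_30_days_rolling_by_day;
```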

As we mentioned, our date field gets automatically converted to a timestamp for processing. Then, at the month level, the time components of the timestamp are zeroed out (if we had any) and the day is set to "01" as a default. If we had other months represented in our example, they'd look similar: July would look like 2016-07-01 00:00:00+00, for example, and August would look like 2016-08-01 00:00:00+00. As you can see, there are some important differences between extracting date parts, truncating dates, and formatting using to_char.

String functions

Since we're going to use the to_char formatting function for dates in our report (rather than using date_part or date_trunc), and those values will therefore be converted to strings, let's have a look at string functions.

String functions allow you to manipulate string values in a variety of ways. There are options for case, trimming, substrings, length, and more. One commonly-used operation that we're going to look at here is concatenation. Even though we were able to do some interesting formatting using just the to_char function on our dates, concatenation will allow us to make our report more human-friendly. For example, let's use concatenation to create an easier-to-understand column for quarter:
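A sketch of that query (the alias is our own):

```sql
SELECT date,
       to_char(date, 'YYYY') || '-Q' || to_char(date, 'Q') AS quarter  -- e.g. "2016-Q2"
FROM avg_app_downloads_30_days_rolling_by_day;
```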

Here we're using the concatenation operator "||" to prepend the quarter number with the 4-digit year, a hyphen, and the letter "Q". We can't do that using just the to_char function because "Q" means quarter number and we'd end up duplicating the number (using to_char(date, 'QQ'), we'd get "22"). By concatenating, though, we can make it clear to the report user that values for the specified dates happened in "Q2" as opposed to the more ambiguous "2".

Let's go through another example using the concatenation operator to clean up the "day_of_week" column we created, as well as rearrange some of its elements. To do that, we'll also use the trim function to remove the extra padding around the month and day-of-week names. Here's how:
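Putting trim and "||" together, a sketch of that query looks like this:

```sql
SELECT date,
       trim(to_char(date, 'Day'))   || ', ' ||
       trim(to_char(date, 'Month')) || ' '  ||
       to_char(date, 'DD')          || ', ' ||
       to_char(date, 'YYYY') AS day_of_week  -- e.g. "Wednesday, June 08, 2016"
FROM avg_app_downloads_30_days_rolling_by_day;
```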

We're using the trim function on the trailing spaces of the full weekday name by wrapping our to_char function inside trim. Next, we concatenate a comma and a space for readability, then the month name, also wrapped in trim for its trailing spaces. We then concatenate another space, the 2-digit day, another comma and space, and finally the 4-digit year.

Now that we've reviewed some options to make our data "pretty", let's make a final report.

Final report

In our final report, we're going to use some of the functions we reviewed in this article, including round, to_char, trim, and the concatenation operator "||". Our report will have a column for quarter, one for month, and one for the display date to provide detail for the 30 day rolling average of app downloads. We're also going to make it clear in the field name that the average app downloads figure is a 30 day rolling average. Here's our query:
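A sketch of the full report query, built on the CTE from earlier (the column aliases "display_date" and "avg_downloads_30_day_rolling" are our own naming choices):

```sql
WITH avg_app_downloads_30_days_rolling_by_day AS (
  SELECT ad.date,
         AVG(ad.downloads)
           OVER(ORDER BY ad.date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_downloads
  FROM app_downloads_by_date ad
)
SELECT to_char(date, 'YYYY') || '-Q' || to_char(date, 'Q') AS quarter,  -- e.g. "2016-Q2"
       trim(to_char(date, 'Month')) AS month,                           -- e.g. "June"
       trim(to_char(date, 'Day'))   || ', ' ||
       trim(to_char(date, 'Month')) || ' '  ||
       to_char(date, 'DD')          || ', ' ||
       to_char(date, 'YYYY') AS display_date,                           -- e.g. "Wednesday, June 08, 2016"
       round(avg_downloads) AS avg_downloads_30_day_rolling             -- nearest whole number
FROM avg_app_downloads_30_days_rolling_by_day;
```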

Next steps

In this article we covered how to make our reports more suited to the audience and the use case. For us, that meant making our report a little more human-friendly. We looked at a variety of mathematical, formatting, date, and string functions to add some detail and make our data easy to understand.

There are a slew of additional options for how to present data in your reports. Depending on your data and the needs of your audience, you could look at further grouping and ordering of the data, perhaps even using the group by options for sets, rollups, and cubes that we wrote about when PostgreSQL 9.5 was released. You could even go a step further and use a data visualization tool for charting and dashboarding, like we demonstrated in our article about how we visualize data using Leftronic. There are a number of 3rd-party visualization tools on the market with a variety of features. Regardless of what additional steps you may want to take with your reports beyond what we discussed here, it all starts with the data you have at hand. By making that data "pretty", your reports will be easy to understand and therefore more useful to your audience.

In our next Metrics Maven article, we'll look at how to use crosstab to effectively create pivot tables in PostgreSQL so we can see our data in a different way.