Common Statistical Operations

Adding Summary Statistics to your SQL Table

At Periscope Data we continuously help people build, diagnose, and update their data. When approaching a table of data for the first time, it helps our understanding to see summarizing statistics.

By looking at such staples as the range, sum, and average, you can get a quick glance at the shape and distribution of the data before taking a deep dive.

Today we will demonstrate how we append this data to our SQL tables. ​

Understanding Blog Post Metrics

Whenever we write a blog post, we want to understand if readers are visiting the Periscope Data homepage and signing up for a trial. We have tables blog_pings, homepage_pings, and signup_pings that let us observe when users visit the homepage after reading a blog post.

We want to count visitors to the blog even if they did not visit the homepage. Since blog_pings includes everyone in homepage_pings, which in turn includes all of signup_pings, we use left join to merge these records without losing any of the rows.

select

post

, count(distinct blog_pings.cookie) as blog

, count(distinct homepage_pings.cookie) as homepage

, count(distinct signup_pings.cookie) as signups

from

blog_pings

leftjoin homepage_pings on

blog_pings.cookie = homepage_pings.cookie

leftjoin signup_pings on

blog_pings.cookie = signup_pings.cookie

groupby

1

Running this query gives us the counts of visitors who visited the blog page, the home page, and those who signed up for a trial of Periscope Data.

This table makes it very easy to analyze the conversion rates of the most active and least active blog posts, but not how an average post performed just by scrolling through the table. We can add the average and standard deviation as headers to the table to guide the analyst when checking up on a blog post.

First, we bundle the previous query into a temporary table using with blog_signups as (...)

with blog_signups as (

select

post

, count(distinct blog_pings.cookie) as blog

, count(distinct homepage_pings.cookie) as homepage

, count(distinct signup_pings.cookie) as signups

from

blog_pings

leftjoin homepage_pings on

blog_pings.cookie = homepage_pings.cookie

leftjoin signup_pings on

blog_pings.cookie = signup_pings.cookie

groupby

1

)

Now we can pull statistics like the total, mean, and standard deviation from blog_signups.

select'Total', sum(blog), sum(homepage),

sum(signups) from blog_signups

unionselect'Average', avg(blog), avg(homepage),

avg(signups) from blog_signups

unionselect'Standard Deviation', stddev(blog),

stddev(homepage), stddev(signups) from blog_signups

Having these statistics is great, but it becomes tedious to check multiple data sources. We can address this by simply folding our statistics into the main table by unioning them together.

select'Total', sum(blog), sum(homepage),

sum(signups) from blog_signups

unionselect'Average', avg(blog),

avg(homepage), avg(signups) from blog_signups

unionselect'Standard Deviation', stddev(blog),

stddev(homepage), stddev(signups) from blog_signups

unionselect*from blog_signups

This will make sorting the table difficult, since the Total, Average, and Standard Deviation are on either ends of the range of the data. To help us in sorting the data, we can add an index column that we will use as a primary sorting key. First, we add an index to blog_signups:

with blog_signups as (

select

1asindex

, post

, count(distinct blog_pings.cookie) as blog

, count(distinct homepage_pings.cookie) as homepage

, count(distinct signup_pings.cookie) as signups

from

blog_pings

leftjoin homepage_pings on

blog_pings.cookie = homepage_pings.cookie

leftjoin signup_pings on

blog_pings.cookie = signup_pings.cookie

groupby

1, 2

)

Then we index the statistics in our order preference:

select4, 'Total', sum(blog), sum(homepage),

sum(signups) from blog_signups

unionselect3, 'Average', avg(blog), avg(homepage),

avg(signups) from blog_signups

unionselect2, 'Standard Deviation', stddev(blog),

stddev(homepage), stddev(signups) from blog_signups

Now we wrap this entire query into another with block which we call signup_statistics:

with signup_statistics as (

with blog_signups as (...)

select4, 'Total', sum(blog), sum(homepage),

sum(signups) from blog_signups

unionselect3, 'Average', avg(blog), avg(homepage),

avg(signups) from blog_signups

unionselect2, 'Standard Deviation', stddev(blog),

stddev(homepage), stddev(signups) from blog_signups

unionselect*from blog_signups

)

select

post

, blog

, homepage

, signups

from

signup_statistics

orderby

indexdesc

, signups desc

And there we have it! Now whenever an author wants to check up on their post, they will be able to more easily understand how their post faired against the distribution.