Common Statistical Operations

Impact or No? -- Simple Event Study in SQL

Periscope Data

|

March 24, 2016

Timeseries charts can be quite powerful for depicting changes of a metric over time. For instance, this chart shows how revenue has grown over time:

From a cursory glance, it appears that revenue has really skyrocketed after our mobile launch on 12/24/14, but how can we be sure of the impact? In this blog, we dissect and quantify the effect of a single event on timeseries data.

Timeseries Data

A timeseries is a collection of measurements for one metric over time. Timeseries data comes in a variety of shapes and sizes. For instance, the height of one person collected over time is expected to increase with each data point dependent on the previous data points; whereas stock prices for Apple are expected to be dynamic and hard to predict.

If the past is a good predictor of the future, we can model timeseries data using an autoregressive model, which means tomorrow’s value depends linearly on today’s value plus some imperfectly predictable term.

Autoregressive models can be used to model revenue since, as an example, tomorrow’s revenue is probably very similar to today’s revenue with some noise added in. What complicates this analysis is the impact of a singular event on revenue.

Mobile Launch Scenario

Revisiting our earlier example, it’s really hard to precisely measure the impact that our mobile launch has had on revenue. However, we can actually build out an autoregressive model to measure this impact and then use the data to calculate the effect of the launch on revenue.

To measure the impact of a particular event, we will use a dummy variable D to indicate whether or not the mobile launch has happened yet. D will take on the value 1 when the mobile launch has already happened or 0 when it hasn’t. Thus, to generate the complete dataset for this series, we use the following in Postgres/Redshift SQL:

with revenue as

(

select d

date

, sum(price) as dailyrev

, casewhen date >='2014-12-24'

then1

else0

endas D

from purchases

groupby1, 3

)

We first build an AR(1) model – an auto-regressive model which uses the first lag of the metrics to predict its value. As with most timeseries models, we will be using the log of returns, where the return r is calculated as the proportional change in our metric x:

Doing the same in SQL, we create the CTE:

with

...

, log_revenue as

(

select date, log(dailyrev) as logrev

, log(lag(dailyrev) over (orderby date))

as laglogrev

, D

from revenue

)

Putting this together with our dummy variable, we get the following AR(1) model:

Calculating Coefficients for the Regression

Now that we have the model, we want to calculate the coefficients for all the beta values by running a regression analysis. To do this, we can use the ordinary least squares method to estimate these values:

Within SQL, we can create a few more CTEs — one for doing all the sigma calculations and one each for each beta:

with

...

, sigmas as

(

select

sum(D*D)-sum(D)*sum(D)/count(*) as SigSqD

, sum(laglogrev*logrev)-sum(laglogrev)*sum(logrev)

/count(*) as Sigy1y

, sum(laglogrev*D)-sum(laglogrev)*sum(D)/count(*) as Sigy1D

, sum(D*logrev)-sum(D)*sum(logrev)/count(*) as SigDy

, sum(laglogrev*laglogrev)-sum(laglogrev)*sum(laglogrev)

/count(*) as SigSqy1

, pow(sum(laglogrev*D)-sum(laglogrev)*sum(D)/count(*),2)

as SqSigy1D from log_revenue

)

, beta1 as

(

select

(SigSqD*Sigy1y-Sigy1D*SigDy)/(SigSqy1*SigSqD-SqSigy1D)

as beta1

from

sigmas

)

, beta2 as

(

select (SigSqy1*SigDy-Sigy1D*Sigy1y)/(SigSqy1*SigSqD-SqSigy1D)

as beta2

from

sigmas

)

, beta0 as

(

select

avg(logrev)-(select beta1 from beta1) *avg(laglogrev)

- (select beta2 from beta2)*avg(D)

as beta0

from

log_revenue

)

Now that we have the finalized regression coefficients, we can draw a nice trendline through the data:

with

...

, finalized_regression as

(

select

log_revenue.*

, beta1

, beta2

, beta0

from

log_revenue

, beta0

, beta1

, beta2

)

, predicted_revenue as (

select

finalized_regression.date

, finalized_regression.logrev as log_revenue

, (beta0 + beta1*log_revenue.laglogrev + beta2*log_revenue.D)

as predicted_log_revenue

from finalized_regression

join log_revenue using (date)

)

select*

from predicted_revenue

Looks like the trendline is a great fit!

Translating an Event into $$$

The beta values from our above regression turned out to be:

beta0 = 2.609

beta1 = 0.724

beta2 = 0.553

So what do these numbers tell us about our mobile launch on 12/24/14? The key number we’re looking at is beta2, which is the coefficient in front of the dummy variable we created to indicate post-launch metrics. Since we were examining the logarithm of revenue returns, we can simply take the exponential to find its impact:

e^(0.553*1) = 1.738x

From the above analysis, launching the mobile version of the game produced 74% higher daily revenue than pre-launch! The day after our launch, our revenue is $25k higher according to our model.