The YouMoz Blog

Statistics 101: Deviations

This entry was written by one of our members and submitted to our YouMoz section. The author's views below are entirely his or her own and may not reflect the views of Moz.

So, it’s time for my second swot-up on statistics. Last time we covered the different types of averages there are and how you might use them. This time, we’re looking at how we can measure deviations from those averages. I know I said there would be a week’s wait, but I've ended up swanning around Oxford and Jersey (the original, not New).

How is this useful? Well, deviations are all about how confident you can be in your data and your results. As I hinted in the last post, two other areas very applicable to SEO are multivariate testing and predicting search volume trends. However, as this has ended up as a very long post, I will have to cover those later. Be warned – this gets a bit technical towards the end.

What is standard deviation? Normal Distributions

The standard deviation of a normally-distributed population is the square root of its variance, meaning that its formal definition is

“the square root of the mean squared deviation from the mean”

or mathematically

σ = √( Σ(xi - µ)² / N )

where there are N samples in our population and µ (the Greek mu) is the population mean.

Now, if you've not encountered it before, this sounds horrendous, so let’s break it down. When looking at equations, it’s usually easiest to start in the middle. In our case, this is the squared deviation from the mean

(xi - µ)²

As this is all about deviations we can see why the deviation from the mean is here. But why squared? Well, imagine that you have a population set that looks something like {100, -101, 1} – its mean is 0, so the values are also the deviations. Summing all of the deviations together gives 0 – clearly something isn’t right. By squaring each one before summing, we get rid of the effects negative numbers have. After summing over all the squared deviations, divide by the population size to give the mean in the usual way. This gives us the variance, σ² (sigma squared).

So why the square root? Well, imagine we were working on distances or percentages. If we were to express our deviation with units, we would be saying things like “six metres plus or minus one metre squared” – our units don’t match, so the maths doesn’t make sense. By taking the square root, we get back to the units we should have. See, not that bad after all, was it?
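As a minimal sketch of that whole recipe (in Python rather than Excel, and using the {100, -101, 1} set from above):

```python
import math

def population_std_dev(xs):
    """The square root of the mean squared deviation from the mean."""
    mu = sum(xs) / len(xs)                               # population mean
    variance = sum((x - mu) ** 2 for x in xs) / len(xs)  # sigma squared
    return math.sqrt(variance)                           # back to the original units

# Raw deviations cancel each other out...
deviations = [100, -101, 1]
print(sum(deviations))                  # 0 - the spread is hidden
# ...but squaring first reveals the spread.
print(population_std_dev(deviations))
```

Excel's STDEVP() does the same job for a full population.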

What is standard deviation? Poisson Distributions

Normal distributions are great for situations where our events occur very close together, so that they are almost a continuous process. Over the course of a week, for example, if you get 10,000 visitors, you are getting roughly one visitor every minute – pretty much continuous. When you get down to, say, a few hundred visitors/conversions/transactions a week, each one becomes a rare event – there is a measurable gap between them. Your data will therefore not have a normal distribution, but a Poisson distribution (for those of you wondering about the difference, I refer you to the central limit theorem, from which you can show that the Poisson distribution becomes Gaussian when the expected number of events is sufficiently large). In this case, though, things are really easy – the standard deviation turns out to be the square root of the expected number of occurrences within your time period, λ. So if your data gives you an expected number of weekly conversions for the past six months of λ = 900, and from your modelling you predict that you should get 805 conversions next week, you could present the figure 805±30 weekly conversions to your HiPPO.
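The Poisson case really is that easy – here is the λ = 900 example as a couple of lines of Python:

```python
import math

# For a Poisson process the standard deviation is simply the square root of
# lambda, the expected number of events in the period.
expected_weekly = 900              # lambda, estimated from six months of data
sigma = math.sqrt(expected_weekly)

print(f"sigma = {sigma:.0f} conversions per week")  # sigma = 30
```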

How is this all useful?

As I noted at the start of the post, the most useful thing about standard deviations is being able to show how confident you are in your data. For the case of a normal distribution, one standard deviation either side of the mean contains 68.2% of data points, as illustrated by the classic bell-curve graph.

Two, or more precisely 1.96, standard deviations then give you your 95% confidence interval, and 99.7% of points fall within three standard deviations.

Applying this to SEO, imagine you have a situation where a client wants to know whether his email newsletter has had an effect. You look back at the data for however long makes sense to you and find the mean and standard deviation. If you can see a peak in direct and referral visits at the time of the newsletter, you can then take the mean number of visitors for those days (of course, if you don’t see a peak you know the answer already – back to the drawing board, I’m afraid). By seeing how many standard deviations this value is away from the mean, you can tell the client exactly how confident you are that the campaign raised visitor numbers. They will take you far more seriously for saying “the mean for the days where the email campaign had a noticeable effect was 1.2 standard deviations away from the mean for the rest of the period without those days included, so our data shows that we can only be 40% confident your email campaign boosted traffic” than for saying “yeah, it looks like it probably didn’t work too well”.
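In code, the heart of that comparison is just a z-score. The figures below are illustrative, not from any real campaign:

```python
# Hypothetical traffic figures - all three numbers are made up for illustration.
baseline_mean = 1000.0   # mean daily visits excluding the campaign days
baseline_sd = 150.0      # standard deviation over that baseline period
campaign_mean = 1180.0   # mean daily visits on the campaign days

# How many standard deviations above the baseline did the campaign sit?
z = (campaign_mean - baseline_mean) / baseline_sd
print(f"The campaign days are {z:.1f} standard deviations above the baseline mean")
```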

However, there is a caveat.

If you are using a small sample size you should use the standard error, as this puts in a bit of a correction. The standard error is just given by

SE = σ / √N

So:

If you only have a week of data (for example, comparing the performance of an educational website just after the Easter holidays to its performance during them), you should use the standard error

If you have a few weeks of data, you can just use the STDEV() function in Excel.
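To see the correction in action, here is a week's worth of made-up visit counts run through both calculations:

```python
import math
import statistics

# One week of daily visits - a small sample, so prefer the standard error.
daily_visits = [820, 790, 910, 760, 880, 840, 800]

sample_sd = statistics.stdev(daily_visits)             # what Excel's STDEV() returns
std_error = sample_sd / math.sqrt(len(daily_visits))   # SE = sd / sqrt(N)

print(f"sample SD = {sample_sd:.1f}, standard error = {std_error:.1f}")
```

The standard error shrinks as √N grows, which is why STDEV() alone is fine once you have a few weeks of data.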

Making Predictions – Conversion Rates

The great thing about the standard deviation is that it’s relatively easy to work with. Much of the finance industry is based on the idea that security prices, be they exchange rates, yields, or stock prices, have a Gaussian distribution. Unfortunately for them, in many cases that’s not true. Fortunately for us, their work can be used as a bank of knowledge (no pun intended) that has been built up ready for us to use. You don’t need to understand everything here, just the example and how to apply the formulae.

One very important calculation is Value at Risk (VaR). Investment being a risky game, customers often want to know

“If I invest a certain amount in this portfolio, what is the most I will lose at some point in the future?”

One simple formula for VaR, coming from a process called RiskMetrics and set down by JP Morgan in 1994, is given by

VaR = xt · α(ζ) · σ · √T

Where:

· xt is the total current investment at time t

· α(ζ) is a function of the confidence level ζ (zeta)

· σ is the conditional deviation – a measure of the standard deviation that depends on the previous variance

· T is the period – 30 days.

The RiskMetrics formula itself is

σt² = (1 - β)·xt-1² - β·σt-1²

Where β tells us how important the contribution from the past is – usually taken as β = 0.95 so that 1 - β = 0.05

In SEO terms, we can apply this directly to PPC. Again, this is a risky business, so people want to know the most they could lose. For example, a client may say

“I currently spend £100 per day on PPC advertising. I want to increase this to £200 next month – what is the minimum I will make in June?”

You find that in April the daily revenue from PPC was £1,000 with a daily standard deviation of 5%, and that so far in May the average daily revenue has been £960. Taking May as month zero, so t = 0, means that April was month t = -1 and June will be month t = 1. We don’t know what the standard deviation of this month’s data will be – we’re only halfway through – but the RiskMetrics formula shows us how to work this out. From the equation above, we can see that as the total revenue in April was x-1 = £30,000 and the overall standard deviation was σ-1 = 0.05 × £30,000 = £1,500, the conditional variance for May will be:

As a sanity check, this means that over May we would expect the conditional deviation of the total revenue to be £6,538 – the same order of magnitude as April’s result. For June, then, if we assume the May average will stay the same for the rest of the month, so that May's total revenue from PPC traffic is x0 = 30 × £960 = £28,800, we get that June's conditional variance will be:

This gives our conditional deviation for June as σ1 = £6,271, or £209 as a daily amount. As a proportion of the daily revenue, this is about 21%. Although this seems quite high, remember that we are now predicting a fair distance into the future. This makes everything that bit more uncertain.

Now we can apply our VaR formula. For a confidence level of 95%, alpha will be 1.65 (you’ll have to trust me on that one). So, assuming we start on an average day, since x1 = £60,000 we get

Hence there is a 5% chance that the portfolio will lose at least £24,012 of the total revenue of £60,000. As the ad spend will be £6,000, you can tell your client that

“We can be 95% confident that you will make at least £60,000 - £24,012 = £35,988 of revenue from your new PPC spend of £6,000 in June”

And there we are – a lovely prediction just as promised. OK, so that was really from a more advanced statistics or economics lesson than a Statistics 101, but as I say, you don’t need to understand all the maths behind it, just how to apply it.
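If you would rather apply it as code, here is a sketch of the general VaR recipe. Note the inputs below are illustrative stand-ins, not the exact figures worked through above:

```python
import math

def value_at_risk(x, alpha, daily_sigma, horizon_days):
    """Simple parametric VaR: position size x, confidence multiplier alpha
    (1.65 at 95%), daily volatility as a fraction, scaled by sqrt(T)."""
    return x * alpha * daily_sigma * math.sqrt(horizon_days)

# Illustrative inputs only - a month of revenue at 5% daily volatility.
var_95 = value_at_risk(x=60_000, alpha=1.65, daily_sigma=0.05, horizon_days=30)
print(f"95% one-month VaR: £{var_95:,.0f}")
```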

A nice way to end the post, I feel, and to let you head off and explore the world of deviations on your own.

About BenjaminMorel —
Benjamin Morel is an agency-based digital marketeer and project manager working for Obergine in Oxford, UK. He is passionate about inbound marketing, especially building strategies centered around data and communication. Follow him on Twitter @BenjaminMorel or on Google +.

If you have seasonal data, then evidently things become a little different. As BionicTurtle notes, the idea that your data points are normally distributed (i.e. completely random) is not always true.

One thing you could do is look at previous years' performance in the month/week you are interested in and compare it to the months/weeks prior to that one. You could then use that as an indicator of how things will go this year. However, this is not necessarily going to be the most reliable method and is probably best used as an indication.

To improve things you could correlate this data with Google Trends/Insights data; however, this still may not be a very accurate prediction IMO.

The best way of analysing trends would be using spectral density functions. Although this technique comes from physics (e.g. radio astronomy, which is where I know about it from), it does have a history of being applied to economics, looking at seasonal fluctuations in share price, as similar processes are at work. I've not had the opportunity to apply this to SEO as yet, but I see no reason why you shouldn't be able to use data on visitor numbers or revenue. If you don't like maths, though, the idea of complex numbers and reciprocal spaces may not appeal, and your prediction would only hold true if other factors didn't come into play, e.g. rankings or changes in referral traffic. But then that's always the way

Ok, am I allowed to say I do not know what just happened! [grins] Funny thing is where I lack the logic skills to follow your progression, my son does. I will pass this along to him and see what he comes up with.

“We can be 95% confident that you will make at least £60,000 - £24,012 = £35,988 of revenue from your new PPC spend of £6,000 in June”

Can I get the prediction for August? :) If I'm not going to make money in August, I would just as soon know now and take the month off!

Lol, you could make a prediction for August, but you would need to extend the formula for conditional volatility to take Beta_2 etc. into account. That would not only make it a lot more complex, but would also give a huge volatility.

It's like long-range weather forecasts - if they're 85% accurate each day, then tomorrow the forecasters have an 85% chance of getting it right, the next day there's a 72% chance of getting it right, and by day 4 you're down to 50/50. A nice case in point is that our Met Office predicted six weeks of good weather. I'm looking outside at rain and a force 5 wind, gusting 6.
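That decay is just repeated multiplication, assuming (unrealistically) that each day's forecast is independent:

```python
# If each day's forecast is independently right 85% of the time, the chance
# of a whole run of days all being right decays geometrically.
daily_accuracy = 0.85
for days_ahead in range(1, 5):
    print(f"day {days_ahead}: {daily_accuracy ** days_ahead:.0%}")
```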

I would be very happy with August off as well - although only if the weather improves ;-)

The RiskMetrics formula is an EWMA: the terms add, not subtract, as they are WEIGHTS that must sum to 1.0, so the formula is (1 - B) × recent squared return PLUS (+) B × recent variance. The sanity check doesn't work: with this data, updated variance would not necessarily increase. An EWMA sanity check has the current variance estimate slightly (5% or 6%) updated by the current squared return. Your RM result is random coincidence with no real meaning.

... and the VaR conclusion is dangerous. First, you've not established that $60K is justified (x2 linear?). Second, you've really confused the monthly to daily and back to monthly (SQRT(30)) scaling. And, third, there is an implicit confusion between variance/volatility of the spend and the revenue (which we might treat as independent versus dependent variables).

That last step introduces TWO variables (spend & revenue; you could generalize to include the MEAN and STDEV of one variable but even that is different...) into a univariate VaR. It makes no sense at all. I don't mean to harp but your illustration of VaR would only confuse somebody learning for the first time.

I agree that in a power series for the MA part of the model (before the sum) the sum of the weightings should be one. If we were to expand the above as a power series in Beta, the (1 - Beta) part means that this would hold. However, in the original equation we are combining an MA process and an autoregressive model (after the sum), so that Beta is acting on two independent variables. This means that in the equation the sum to one does not need to hold.

If you would like more information, search Google Books for "Analysis of Financial Time Series" by Ruey S. Tsay and go to Section 7.2. This also gives an heuristic proof for the VaR model, which rests as it does in my experience on RiskMetrics being a Martingale process (an interesting point which shows that this is a simple model).

Do you feel that there is a better way of tackling the VaR? I know that the 60k figure assumes that next month's mean daily margin is the same as this month's, but unless we are making on-site changes as well this should be the case to within some degree of statistical fluctuation. A doubling of spend should therefore yield a doubling of return, and since one is reciprocal to the other we may concentrate solely on return. Until the quote to the client, that is, because there the spend gives context.

How would you scale? My sanity check seems to work, although this is just a physicist's check and so looks for speed rather than complete accuracy. I cannot profess to being as accurate as a mathemagician or a statistician - as I note above, it is only order of magnitude. Still, better than in cosmology, where in order-of-magnitude checks pi can equal any number below five to make the sums easier.

Thanks, I own and have studied Tsay, too many times I fear. Yea, EWMA is a GARCH and GARCH generalizes to ARCH(m). And ARCH(m) i suppose generalizes to some as-yet-undiscovered universal stochastic process.

You say, "we are combining an MA process and an autoregressive model (after the sum)"... ummm, you are just teasing, right? I must have missed the SEO class where they explain why you'd select this model. Are we still talking about revenue and stuff? or is the goal to confuse

but i don't even get how your derivation satisfies the "martingale" (i have a hard time seriously using these big words in this context, where is the irony emoticon)?

just take the simplest possible case: a flat volatility @ 5% (eg), with recent "innovation" (return) of 5% ... i admit i sometimes use "innovation" to sound like i've studied this topic properly

but your last paragraph is interesting, i do agree we are going for approximations, to which i'd also add the virtue of simplicity, whenever possible

or let me summarize: IMO, if the context is revenues/expenses & SEO and not, say, particle physics (though i do have physics envy!), if we are merely doing approximations and order-of-magnitude predictions about the future, we should maybe not pull models like autoregressive moving averages (ARMA) off the shelf (which may even break under the simplest case!); maybe their precision contradicts our very approximation premise. RM EWMA, okay i don't even get why that, but if that, maybe the vanilla version?

(I bet you might even agree with me there is some comical irony in using an ARMA process yet a normal distribution -- implied by the one-tailed 1.645 deviate @ 95% -- which is probably the most unrealistic assumption here

I think the most productive approach in this *context* is to bag the fancy variance updating and spend some time tweaking the *distribution* into something more realistic)

I fear that until this post I had never come across Tsay - perhaps it is an American standard rather than a British one?

You are of course right about the AR process - the volatility is conditional. The reason for selecting this model is simplicity, its predictive power, and the fact that it leads easily to a VaR calculation. Perhaps next time I shall check the biog before talking about commenters' expert areas ;-)

I have to admit that I was a little blasé about applying the model. Having been presented with the version that I was, I thought that this was one of those cases where a plus or a minus bring out the same model. I did not test the model - the aim of the lecture was the VaR calculation - and I admit that it fails the martingale "test" even when applying FTs and such. I shall apply to Jen to edit this post in light of your comment - many thanks for your criticism.

My last paragraph to my last reply was not talking about approximations in the model, just in the error checking. I don't know whether there is a better way to check for errors when looking at economic processes?

I must disagree with your last point, though. This is the distribution used by both JP Morgan and Tsay in their discussions on the subject and leads to the VaR calculation. I agree that in *financial* situations this does not agree with reality (I feel the Cauchy distribution is a better predictor in every way, although I still prefer calling it the Lorentz distribution). However, while I have never had time to apply Hurst exponents or any other form of indicator to client data I do not see any reason why month-on-month you should not be able to use a Gaussian. Perhaps you can expand? I agree that it will not always work long-term (see comments below on seasonal distributions).

I'm glad I managed to engage someone with an arts background enough that they managed to finish the post and understand most of it. This series seems to be achieving what I set out to do!

In essence, the first part says don't rely on just averages. When you have a sample, break it down into chunks and find out how widely the means for those chunks vary. E.g. if you have an average bounce rate of, say, 40% for a month from GA, you can set that as the metric for the graph at the top of the UI and export this to get the daily bounce rates - use =STDEV(...) on the data to find the standard deviation. This will give you an idea of how accurate using that mean value is - if the standard deviation is 10%, then the majority of your days will have had between 20% and 60% bounce rate - that is two standard deviations either side. So if the bounce rate for the beginning of the next month is 50%, you don't need to worry too much; you haven't done anything wrong, it's just statistical variation playing its usual tricks. If you have a small sample, though, your SD won't be accurate, so it's better to use the standard error.
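That first part in code, with an illustrative month of daily bounce rates (the numbers are made up, not from GA):

```python
import statistics

# A month of illustrative daily bounce rates (%), as exported from GA.
daily_bounce = [38, 42, 35, 47, 40, 33, 44, 39, 51, 36,
                41, 45, 37, 43, 40, 34, 48, 39, 42, 38,
                46, 35, 41, 44, 37, 40, 49, 36, 43, 39]

mean = statistics.mean(daily_bounce)
sd = statistics.stdev(daily_bounce)   # the same as =STDEV(...) in Excel

print(f"mean {mean:.1f}%, SD {sd:.1f}%")
print(f"most days fall between {mean - 2*sd:.0f}% and {mean + 2*sd:.0f}%")
```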

The second part is more difficult to explain without maths. Basically, if you know the mean and SD for last month and this month for a particular metric, in this case revenue, you can tell your clients the worst, and best, cases for next month by plugging them into the RiskMetrics formula.

Ah, interesting. A nice gentle way of doing things rather than the usual get-it-done-in-five-seconds approach. I tend to make a paste with hot water and sugar even when I'm making a normal coffee. My girlfriend laughs, but it does seem to mix the sugar in better.

Thank you! If there is anything else you would like to know about the above, feel free to drop me an email or a PM.