Build Your Own Weighted Sort (GA&nbspStyle)

If you're a Google Analytics fan, you probably already know that Google released a new and incredibly useful featured called Weighted Sort. If you haven't seen it, here's a quick example – let's say you want to know which of your referring sites have the highest bounce rate. You could pull up your referrers, sort by bounce rate, and get something like this:

Fascinating, right? I now know that I lost 7 visitors due to 5 sites. If I could just get that bounce rate down to 60%, I'd have 3 more visitors. Wow. What did you really want to know, intuitively? Probably something more like this:

That's better – it's not the absolute highest bounce rate you wanted to know about, but the most important high bounce rate referrers. In a nutshell, that's the question weighted sort tries to answer.

How It Works

So, how does weighted sort work, exactly? Avinash Kaushik wrote a fascinating and very transparent post on the method behind Google's weighted sort algorithm. I encourage you to read his post and I don't want to copy it, but I'll try to do a very basic review here.

Google uses something called the "Estimated True Value" (ETV). ETV essentially says this – if the count column of the sort (in this case, Visits) is very low, assume that the column of interest (Bounce Rate) is roughly the average for the data in question. In other words, if a row has 1 visit and the average bounce rate is 75%, then set the ETV of bounce rate for that row to 75%. Since 1 visit isn't enough, statistically speaking, to make any really conclusions, we'll essentially ignore it.

On the other end of the spectrum, if you have a very high visit value, assume the bounce rate is accurate as is. Simple enough, right? What about values in the middle? Well, Google sets the ETV somewhere in between the average and the row's bounce rate. Exactly how much of each they use is the tricky part.

The Equation

This is where Avinash's post ends and mine really begins. I should warn you – it's not going to get Ben-complicated, but there is going to be some math. After a bout of 4am insomnia, I pieced together a simplified weighted sort equation. I'm going to present it first, explain it, and then provide an Excel spreadsheet with some real-life examples.

Let's assume we've got a data set exactly like above – visit counts and bounce rates for a set of referring sites. We're going to need 4 sets of variables:

V = Visits for Row X

B = Bounce Rate for Row X

MV = Max Visits for the data set

AB = Average (mean) Bounce Rate for the data set

For any given row, the ETV of Bounce Rate – ETV(B) – can be represented by the following equation:

ETV(B) = (V / MV * B) + ((1 - (V / MV)) * AB)

Crystal clear, right? It's not really as bad as it looks. Let's take an example – say we have the following data (same 4 variables as above):

V = 100

B = 80%

MV = 500

AB = 60%

The ETV(B) will consist of two components:

V / MV * B = 100 / 500 * 0.80 = 0.20 * 0.80 = 0.16

1 - (V / MV) * AB = (1 - (100 / 500)) * 0.60 = 0.80 * 0.60 = 0.48

ETV(B) = 0.16 + 0.48 = 0.64

Pay attention to the parts in bold – since 100 visits is 20% of the max visits for this data set, this row gets 20% of its bounce rate from the actual value and the rest (80%) from the average value for the data set. So, essentially, how much we use the "real" bounce rate for the row is a function of the proportion of that row's visit value to the visit value of the top referrer.

Build Your Own

Want to try it yourself? You can download my Excel spreadsheet and see the formula at work across a larger data set of actual referring visits from my own site. Although this replicates a function you already have in Google Analytics, it can be used for all sorts of applications that you don't have in GA, including PPC metrics (Visits by Quality Score, for example).

There are actually four sheets in the Excel workbook:

Basic ETV formula

Google's ETV sort

Weighted ETV formula

Log-based ETV formula

Those last two require a bit of explaining. In my very simple model (1), I calculate the average bounce rate by just taking an average across all the rows (for this data set = 70.6%). The thing is, that's not how Google calculates the average bounce rate. They actually weight it by the number of visits, which makes perfect sense. So, in Google Analytics, my bounce rate for this data set is 74.6%, which is what (3) shows. If you compare (2) to (3), you'll see that my weighted formula only differs in the Top 10 by rows #8 and #9 being swapped.

My approach is a pretty good approximation for this data set, but it's still just an approximation. If you have a very large range of visit values (1 to 100,000), you might find that rows with smaller but still interesting counts (1,000+) get unfairly ignored. Sheet (4) is a more complex formula that uses the Log (base 2) of visits instead of the raw visit value. This has the effect of de-emphasizing the visit count in favor of the "real" bounce rate for that row.

If you're still with me at this point, I hope you'll play around with the spreadsheet. If you find issues with your own data sets or discover some better/cooler way of doing it, please share it in the comments.

Hey Pete, as I was reading your post my eyes started getting blurry as you broke out the math. I think I'd better tackle this in the morning after coffee.

And while I am definitely a Google Analytics fan I'm probably the only one reading that didn't know about the new Weighted Sort [sigh] If I could only find a way to add 34 more hours to my days, I'm sure I'd stay on top of everything.

You know, I'm on Twitter and industry blogs so much that I just assume everyone knows about everything instantly, but some people have real jobs and a social life ;) I'm sure plenty of people haven't checked it out yet, as you don't really even see Weighted Sort until you click on an applicable sort column. It's well worth checking out, though.

A good intorduction to something we have needed for a long time now. No more faffing about using advanced filters to get rid of all pages with <200 vists and >1000 so that we can find actionable data - it will make my life just that little bit easier every day.

Between this and the latest AdWords Editor updates (at last I can see positions for campaigns and ads! no more flicking between online and off line! Why on earth didn't they give us this data before?) life in the future will involve a bit less fiddling to get the data and a bit more time doing something useful with it.

I did know of this news thanks to Twitter stream... but actually did not checked it seriously until now.

I've experimented with your worksheet with the Analytics of my own site (yes, Mike is right: to learn you have to "play" with things) and actually discovered some little keywords gems that could be worth to polish.

Thanks again Peter.

PD: if you do things like this Just for fun - as you tweeted - what when are you playing hard for real?

Old post but still very interesting. I have been playing around with this using different count and interest columns. One question though, how should you deal with instances when the interest column is zero and the count is small? Division by zero. I guess the correct answer should lie towards the mean rather than zero.

Sorry, it's been quite a while. The only division I'm seeing/remembering is against the maximum value for the metric you're weighting ("Visits", in the spreadsheet). If the maximum value for that column was zero, then you'd basically have no data. Happy to dig deeper, if you can help me understand what you're seeing.

Awesome, I actually didn't know about weighted sort until I read this. The excel sheet is a huge bonus too, saves me the trouble of making it myself but lets me see the equations and play around with them.

You just took me back to my high school Algebra days! Besides the point, great breakdown of how the new weighted system works. I've had the time to play with it, and I'm pretty impressed so far. But your post definitely gave me a new way to use the system. Keep up the great writing.

Good post Peter. As Avinash once said, "the problem with averages is that it gives you average results". So when you have a very large range of data set you get misleading and erroneous results like the one you specified in your post (1 to 100,000 visits). So it make sense to segment your data as much as possible before you play with it. For e.g. before you apply weighted sort, filter out branded keywords using filters on your reporting interface or better create a custom report. In this way you can minimize erroneous results one get through averages and your rows with smaller but interesting counts (1,000+) won't be ignored.

Brilliant summary and thanks for providing a simplified model that can be used (within reason) across various datasets. What hits this out of the park for me is the Excel download. I don't know if the old-timey learning classifications still hold up but I need to hear, see and play with stuff. I need it all if I am to truly internalize the knowledge. You have given me this.

Besides, it's time I nail a handful of useful math/marketing models. I'm an authority and can hold my own across all my areas of "professional" responsibility (both internally and especially with clients), but this is my weak spot. I'll be posting back any questions tomorrow evening. Today is packed full of deliverables.

...

Oh, I guess I can ask this... Does anyone have useful models they use for digital marketing that you would be willing to share? Personally, I have some I need to dust off from Conductor, SEOmoz, Laura Lippay, etc. Perhaps we can do this by PM as to not dilute this post?

Great post, I've been playing around with your worksheet and think it's great! I've tried it out on a few clients and it really made me think about some of the keyword they have been targeting. Everyone should do a quick run through with their clients and see the results. Double thumbs up!!