8 Lessons I’ve Learned About Using Google Analytics at Scale

Like many websites, we use Google Analytics to track data about our visitors and what they do on our sites. However, Tuts+ is a fair bit bigger than a lot of those sites, and at our size, we run into a few problems using it. Here's what I've learned about working with Google Analytics at scale.

1. Slice Data Into Smaller Date Ranges to Deal With Sampling Issues

Google Analytics will only base any report on up to 500,000 sessions (with a few exceptions). If you make a request whose date range covers more than this amount of data, Google Analytics will take a sample of 500,000 sessions and scale the results up as necessary.

For example, if you have five million visitors in September, and you ask Google Analytics for a report on where those visitors came from across the month, Google Analytics will select 10% of the data it has about your September visitors, figure out where those 10% of visitors came from, and multiply those numbers by 10 to generate the report it gives you.

This type of extrapolation is a common statistical technique, and a reasonable way for Google to cut down on processing time. Unfortunately, in my experience, the resulting reports can be misleading, or just plain wrong.

The solution, for getting accurate data, is to ask for multiple slices of data over shorter date ranges, and then stitch them back together. Carrying on with the above example: if you have five million visitors in September, then you have around 500,000 visitors every three days. This means that if you make ten queries, each asking for three days' worth of data, each three-day report Google Analytics gives you will be completely unsampled. Stitch these back together, and you have unsampled data for the whole of the month.
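The slicing itself is easy to automate. Here's a minimal sketch (the helper name and the three-day slice width are just illustrative, matching the example above):

```python
from datetime import date, timedelta

def date_slices(start, end, days_per_slice):
    """Split [start, end] into consecutive ranges of at most
    `days_per_slice` days, so each query stays below the threshold
    at which Google Analytics starts sampling."""
    slices = []
    cursor = start
    while cursor <= end:
        slice_end = min(cursor + timedelta(days=days_per_slice - 1), end)
        slices.append((cursor, slice_end))
        cursor = slice_end + timedelta(days=1)
    return slices

# September split into three-day chunks: ten unsampled queries.
chunks = date_slices(date(2014, 9, 1), date(2014, 9, 30), 3)
```

Run one query per chunk, then sum the results to get unsampled numbers for the full month.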

2. Learn to Use the API

The Google Analytics web interface is powerful, and great for exploring your data, getting a rough idea of the numbers, or quickly sharing a report with colleagues. However, slicing the data into shorter time periods, downloading it, and putting it back together via the web interface is a time-consuming, error-prone pain in the neck.

It's much easier to do this via the Core Reporting API. The Query Explorer is a fairly friendly way to get started constructing queries, and it's straightforward to go from there to grabbing data using Python, or another programming language.
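With the google-api-python-client library, a Core Reporting API (v3) query boils down to a small bundle of parameters passed to `service.data().ga().get(...).execute()`. Here's a sketch of assembling those parameters (the view ID is a placeholder, and the helper function is mine, not part of the API):

```python
def build_query(view_id, start_date, end_date, metrics, dimensions=None):
    """Assemble the keyword arguments for a Core Reporting API (v3)
    request, as passed to service.data().ga().get(**params).execute()."""
    params = {
        "ids": "ga:" + view_id,
        "start_date": start_date,  # e.g. "2014-09-01"
        "end_date": end_date,
        "metrics": metrics,        # comma-separated, e.g. "ga:pageviews"
    }
    if dimensions:
        params["dimensions"] = dimensions
    return params

params = build_query("12345678", "2014-09-01", "2014-09-03",
                     "ga:pageviews", dimensions="ga:pagePath")
```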

3. Store Your Downloaded Google Analytics Data in an Actual Database

Since you have to stitch these slices of data back together locally, you might as well store them in a central place where you can query and use them later. I use a clone of the Tuts+ back-end database, modified to add additional tables for Google Analytics data; this means that I can pair the traffic data to the relevant article, rather than to the URL (which is important, as our canonical URLs have changed over the years).

It also means that I can run complex queries like "show me the 100 posts with the most pageviews over the past month within a specific set of categories", or "get the most popular post from each instructor that has published at least five posts". It's difficult to do this with Google Analytics alone, since it doesn't understand our concepts of "category" or "instructor". Another advantage to having everything in a single database is that it's a lot easier to get it into external analysis and visualization tools like Tableau.
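As a toy illustration of the kind of query this enables (the schema here is a made-up miniature, not our actual back-end):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, category TEXT);
    CREATE TABLE pageviews (post_id INTEGER, day TEXT, views INTEGER);
""")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "Intro to Python", "Code"),
    (2, "Flexbox Basics", "Web Design"),
    (3, "Regex Deep Dive", "Code"),
])
conn.executemany("INSERT INTO pageviews VALUES (?, ?, ?)", [
    (1, "2014-09-01", 500), (1, "2014-09-02", 700),
    (2, "2014-09-01", 300),
    (3, "2014-09-01", 900),
])

# Most-viewed posts within one category -- a join Google Analytics
# can't do on its own, because it knows nothing about categories.
top_code_posts = conn.execute("""
    SELECT p.title, SUM(v.views) AS total
    FROM posts p JOIN pageviews v ON v.post_id = p.id
    WHERE p.category = 'Code'
    GROUP BY p.id
    ORDER BY total DESC
    LIMIT 100
""").fetchall()
# → [('Intro to Python', 1200), ('Regex Deep Dive', 900)]
```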

4. Make Use of Views

In Google Analytics, each reporting view has a limit of 10,000 API requests per day: that's the maximum number of times I can ping it to get some data via the API. We have 18,000 posts across Tuts+, so getting information for each one separately from a single view would take two days.

Ah, but views live inside a property, and each property has a total limit of 50,000 API requests per day. So, we have our Tuts+ property set up with a set of views each containing all traffic info for a single topic (like Code, Web Design, and Business), and an additional view containing all traffic info for Tuts+ as a whole. This way, we can query the Code view when we want data about Code posts, the Web Design view when we want data about Web Design posts, and so on, and make the best use of the 50,000 query limit instead of being restricted to 10,000.
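The routing logic can be as simple as a lookup table. A sketch (the view IDs are made up; the real ones live in the Google Analytics admin console):

```python
# Hypothetical view IDs, one per topic plus a whole-site fallback.
VIEW_IDS = {
    "Code": "11111111",
    "Web Design": "22222222",
    "Business": "33333333",
    "All": "99999999",
}

def view_for(topic=None):
    """Pick the narrowest view that can answer the query, so each
    request draws on a different per-view quota instead of
    exhausting a single one."""
    return VIEW_IDS.get(topic, VIEW_IDS["All"])
```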

5. Don't Request Data You Don't Need

At one point, I needed a pageviews-per-country report for around 250 posts. Requesting a per-country breakdown for every post and every day it had been online would have meant an enormous number of queries, most of which would have returned nothing. So I made a single query to Google Analytics for each of the 250 posts, requesting the overall pageviews for each day the post had been online. Then I made another query for each post and for each day that the post had more than zero pageviews, requesting pageviews-per-country. This massively reduced the number of queries I had to make overall, which sped things up and meant I wasn't burning through so much of our Google Analytics API quota.
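The second, detailed round of queries can be planned from the first round's results with a helper along these lines (a sketch; the data shapes and names are my own):

```python
def plan_country_queries(daily_pageviews):
    """Given {post_id: {day: pageviews}} from a first round of cheap
    queries, return the (post_id, day) pairs worth a second, detailed
    pageviews-per-country query -- i.e. only the days with traffic."""
    return [
        (post_id, day)
        for post_id, days in sorted(daily_pageviews.items())
        for day, views in sorted(days.items())
        if views > 0
    ]

daily = {
    "post-a": {"2014-09-01": 120, "2014-09-02": 0},
    "post-b": {"2014-09-01": 0, "2014-09-02": 45},
}
queries = plan_country_queries(daily)
# → [('post-a', '2014-09-01'), ('post-b', '2014-09-02')]
```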

6. Use Multithreading

I've found that it typically takes five or more seconds for Google Analytics to return the result of a single query. This means that making one request for each of our 18,000+ posts, one after the other, takes over 24 hours to process! This is why I do them in parallel instead.

Google Analytics lets you submit up to ten API requests per second, so you can submit 50 requests within the five seconds it takes to process your first one. This means that big queues of requests can be processed 50 times faster than if you made the requests in serial, which turns a 24-hour wait into a lunch-break wait.
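In Python, a thread pool makes this straightforward. A sketch with a stand-in for the real (slow) request, leaving out the rate limiting you'd need to stay under ten requests per second:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_report(post_id):
    # Stand-in for a real Google Analytics request, which would
    # typically take five or more seconds to return.
    return {"post_id": post_id, "pageviews": 0}

post_ids = list(range(100))

# Roughly 50 requests can be in flight at once if each takes ~5 s
# and you may submit 10 per second.
with ThreadPoolExecutor(max_workers=50) as pool:
    reports = list(pool.map(fetch_report, post_ids))
```

`pool.map` keeps the results in the same order as the inputs, which makes stitching them back together trivial.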

7. Make Sure You Can Pick Up Where You Left Off

If (or, rather, when) your code crashes, your internet connection drops, you use up all your Google Analytics API allowance, or something else goes wrong 90% of the way through downloading a huge set of data, you do not want to have to restart the download process.

Make sure you're saving the data to disk as you download it (rather than just storing it in memory), and make sure you can skip to wherever you left off when you start the download again.
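One simple way to do this is to append each finished row to a file on disk and, on restart, read back which IDs are already there. A sketch (the file format and helper names are my own):

```python
import json
import os

def downloaded_ids(path):
    """Read back which post IDs a previous run already saved."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["post_id"] for line in f}

def download_all(post_ids, fetch, path):
    """Fetch each post's data, appending one JSON line per post,
    skipping anything a previous (possibly crashed) run completed."""
    done = downloaded_ids(path)
    with open(path, "a") as f:
        for post_id in post_ids:
            if post_id in done:
                continue  # already on disk from an earlier run
            f.write(json.dumps(fetch(post_id)) + "\n")
            f.flush()  # flush per row, so a crash loses at most one
```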

8. Decouple the Code That Gets Data From Your Specific Requests

I used to write a whole new purpose-built script, copied and pasted from previous ones, whenever I wanted to request some data I hadn't requested before, but this was limiting and inflexible.

Eventually, I coded a more general solution that took into account all the lessons I'd learned above, and that could read external definition files that specified the data I wanted to download, like this example for downloading pageview data for all of our posts:
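Something along these lines, in spirit (a simplified, hypothetical sketch rather than our exact format):

```json
{
  "query": {
    "metrics": ["ga:pageviews"],
    "dimensions": ["ga:pagePath"],
    "slice_days": 3
  },
  "output": {
    "table": "post_pageviews",
    "match_on": "post_id"
  }
}
```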

Okay, these files aren't exactly human-readable and user-friendly, but they are fairly simple to understand and create if you have a good grasp of the Google Analytics API and the way our database is structured. It means that I can create new sets of instructions without touching the underlying code that executes them, and I can improve the underlying code without worrying about messing up any of the regular data downloads.

Conclusion

A lot of these tips feel as if they describe the common sense approach when I look at them now, and I can hardly believe that I wasn't handling our data downloads this way all along. But it's taken a lot of trial and error and frustration to reach this process that works for us; I hope that by sharing it I can help someone else bypass that.