An Englishman in Copenhagen writing about digital, music and anything else.

A few weeks ago I tweeted a beta version of a BigQuery Visualiser Shiny app that was well received, and got some valuable feedback on how it could be improved, in particular from @felipehoffa - thanks Felipe!

I got into BigQuery once it started to receive exports from Google Analytics Premium. Since these exports carry unsampled raw data and include unique userIds, it's a richer data source for analysis than the Google Analytics reporting API.

It also was a chance to create another Google API library called bigQueryR, the newest member of the googleAuthR family. Using googleAuthR meant Shiny support, and also meant bigQueryR can be used alongside googleAnalyticsR and searchConsoleR under one shared login flow. This is something exploited in this demo of RMarkdown, which pulls data from all three sources into a scheduled report.

Running your own BigQuery Visualiser

You can run the Shiny app locally on your computer within RStudio; within your own company intranet if it's running Shiny Server; or publicly like the original app on shinyapps.io.

Feedback

Please let me know what else could improve.

I have a pending issue on using JSON uploads for authentication that is waiting on a bug fix in httr, the underlying library.

In particular all the htmlwidgets() packages could be added - this wonderful R library creates an R to d3.js interface, which holds some of the nicest visualisations on the web.

In this first release, I favoured plots that could apply to as many different data sets as possible. For your own use cases you can be more restrictive on what data is requested, and so maybe more ambitious in the plots. If you want inspiration, timelyportfolio (he who wrote the listviewer library) has a blog where he makes lots of htmlwidgets libraries.

Enjoy! Hope it's of use, let me know if you build something cool with it.


MeasureCamp #7

I've just come back from #MeasureCamp, where I attended some great talks: on hierarchical models; the process of analysis; a demo of Hadoop processing Adobe Analytics hits; web scraping with Python and how machine learning will affect marketing in the future. Unfortunately the sad part of MeasureCamp is you also miss some excellent content when they clash, but that's the nature of an ad-hoc schedule. I also got to meet some excellent analytics bods and friends old and new. Many thanks to all the organisers!

My sessions on machine learning

After finishing my presentation I discovered I would need to talk waaay too quickly to fit it all in, so I decided to do a session on each example I had. The presentation is now available online here, so you can see what was intended.

I got some great feedback, as well as requests from people who had missed the session for some details, so this blog post will try to fill in some detail around the presentation we spoke about in the sessions.

Introduction

Machine Learning gives programs the ability to learn without being explicitly programmed for a particular dataset. They make models from input data to create useful output, commonly predictive analytics. (Arthur Samuel via Wikipedia)

There are plenty of machine learning resources, but not many that deal with web analytics in particular. The sessions are aimed at inspiring web analysts to use or add machine learning to their toolbox, showing two machine learning examples that detail:

What data to extract

How to process the data ready for the models

Running the model

Viewing and assessing the results

Tips on how to put into production

Machine learning isn't magic. You may be able to make a model that uses obscure features, but a lot of intuition will be lost as a result. It's much better to have a model that uses features you can understand, and scales up what a domain expert (e.g. you) could do if you had the time to go through all the data.

Types of Machine Learning

Machine learning models are commonly split between supervised and unsupervised learning. We deal with an example from each:

Supervised: Train the model against a training set with known outcomes. Examples include spam detection and our example today, classifying users based on what they eventually buy. The model we use is known as Random Forests.

Unsupervised: Let the model find its own results. Examples include the clustering of users that we do in the second example, using the k-means model.

Every machine learning project needs the below elements. They are not necessarily done in order but a successful project will need to incorporate them all:

Pose the question - This is the most important. We pose a question that our model needs to answer. We also review this question and may modify it to try and fit what the data can do as we work on the project.

Data preparation - This is the majority of work. It covers getting hold of the data, munging it so it fits the model and parsing the results. I've tried to include some R functions below that will help with this, including getting the data from Google Analytics into R.

Running the model - The sexy statistics part. Whilst superstar statistics skills are helpful to get the best results, you can still get useful output when applying the model defaults, which is what we do today. The important thing is to understand the methods.

Assessing the results - What you’ll be judged on. You will of course have a measure of how accurate the model is, but an important step is visualising this and being able to explain the model to non-technical people.

How to put it into production - the ROI and business impact. A model that just runs in your R code on your laptop may be of interest, but ultimately not as useful for the business as a whole if it does not recommend how to implement the model and results into production. Here you will probably need to talk to IT about how to call your model, or even rewrite your prototype into a more production level language.

Pitfalls Using Machine Learning in Web Analytics

There are some considerations when dealing with web analytics data in particular:

Web analytics is messy data - definitions can vary from website to website on various metrics, such as unique users, sessions or pageviews, so a thorough understanding of what you are working with is essential.

Most practical analysis needs robust unique userIds - For useful actionable output, machine learning models need to work on data that record useful dimensions, and for most websites that is your users. Unfortunately that is also the definition that is the most woolly in web analytics given the nature of different access points. Having a robust unique userID is very useful and made the examples in this blog post possible.

Time-series techniques are the quickest way in - If you don't have unique users, then you may want to look at time-series models instead, since web analytics is also a lot of count data over time. This is the reason I did GA Effect as one of my first data apps, since it could apply to most situations of web analytics.

Correlating confounders - It is common for web analytics to record highly correlated metrics, e.g. PPC clicks and cost. Watch out for these in your models as they can overweight results.

Self reinforcing results - Also be wary of applying models that will favour their own results. For example, a personalisation algo that places products at the top of the page will naturally get more clicks. To get around this, consider using weighted metrics, such as a click curve for page links. Always test.

Normalise your metrics - Make sure all metrics are on the same scale, otherwise some will dominate, e.g. pageviews and bounce rate in the same model.
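To illustrate the point about putting metrics on the same scale, here is a minimal sketch using base R's scale(). The data frame and its column names are made up for the example.

```r
# Hypothetical per-user metrics on very different scales
metrics <- data.frame(
  pageviews  = c(3, 12, 45, 7, 150),
  bounceRate = c(0.80, 0.35, 0.10, 0.60, 0.05)
)

# Centre each column on 0 and divide by its standard deviation
scaled <- scale(metrics)

# Each column now has mean 0 and standard deviation 1,
# so neither metric dominates purely because of its units
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```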

The Scenario

Here is the situation the following examples are based upon. Hopefully it will be something familiar to your own case:

You are in charge of a reward scheme website, where existing customers log in to spend their points. You want users to spend as many points as they can, so they have high perceived value. You capture a unique userId on login into custom dimension1 and use Google Analytics enhanced e-commerce to track which prizes users view and claim.

Notice this scenario involves a reliable user ID, since every user is logging in to use the website. This may be tricky to do on your own website, so you may need to only work with a subset of your users. In my view, the data gains you can make from reliable user identification mean it is worth encouraging the design of the website to involve logged-in content as much as possible.

Random Forests

Now we get into the first example. Random Forests are a popular machine learning tool as they typically give good results - in Kaggle competitions they're often the benchmark to beat.

Random Forests are a simple extension of decision trees: a collection of decision trees is a Random Forest. A problem with decision trees is that they will overfit your data - when you throw new data at one you will get misclassification. It turns out though, that if you aggregate many decision trees, each built on a subset of your original data, all those slightly worse models added up make one robust model, meaning when you throw new data at a Random Forest it's more likely to be a closer fit.

Example 1: Can we predict what prizes a user will claim from their view history?

Now we are back looking at our test scenario. We have noticed that a lot of users aren't claiming prizes despite browsing the website, and we want to see if we can encourage them to claim prizes, so they value the points more and spend more to get them.

We want to look at users who do claim, and see what prizes they look at before they claim. Next we will see if we can build a model to predict what a user will claim based on their view history. In production, we will use this to e-mail prize suggestions to users who have viewed but not claimed, to see if it improves uptake.

Fetching the data

Use your favourite Google Analytics to R library - I'm using my experimental new library, googleAnalyticsR, but it doesn't matter which, the important thing is looking at what is being fetched. In this example the user ID is being captured in custom dimension 1, and we're pulling out the product SKU code. This is transferable to other web analytics such as Adobe Analytics (perhaps via the RSiteCatalyst package)

Note we needed two API calls to get the views and transactions as these can't be queried in the same call. They will be merged later.

Transforming the data

We now need to put the data into a format that will work with Random Forests. We need a matrix of predictors to feed into the model, one column of response showing the desired output labels, and we split it so it is one row per user action:

Here is some R code to "widen" the data to get this format. We then split the data set randomly 75% for training, 25% for testing.
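A minimal sketch of that widening and split in base R follows. The data and the column names (userId, productSku, claimed) are made up for illustration, and for simplicity it uses one row per user rather than one row per user action.

```r
# Hypothetical long-format export: one row per user/product view
views <- data.frame(
  userId     = c("u1", "u1", "u2", "u2", "u3", "u4", "u4", "u4"),
  productSku = c("A",  "B",  "A",  "C",  "B",  "A",  "B",  "C"),
  stringsAsFactors = FALSE
)
# Hypothetical claims: the product each user eventually claimed
claims <- data.frame(
  userId  = c("u1", "u2", "u3", "u4"),
  claimed = c("B", "C", "B", "A"),
  stringsAsFactors = FALSE
)

# "Widen": count views per user (rows) and product (columns)
wide <- as.data.frame.matrix(table(views$userId, views$productSku))
names(wide) <- paste0("viewed_", names(wide))
wide$userId <- rownames(wide)

# Attach the response column (what was claimed) to the predictors
model_data <- merge(wide, claims, by = "userId")
model_data$claimed <- factor(model_data$claimed)

# Random 75% / 25% split into training and test sets
set.seed(42)
train_idx <- sample(nrow(model_data), size = floor(0.75 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]
```

The viewed_* columns are the matrix of predictors and claimed is the response column.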

Running RandomForest and assessing the results

We now run the model - this can take a long time for lots of dimensions (this can be much improved using PCA for dimension reduction, see later). We then test the model on the test data, and get an accuracy figure:
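As a rough sketch of this step, the code below fits a Random Forest with default settings and scores it on held-out data. It assumes the randomForest package from CRAN and uses simulated data in place of the real view history; the variable names and the rule generating the labels are invented for the example.

```r
# Accuracy = share of predictions that match the known outcome
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Simulated stand-in data: 200 users, 5 product-view counts, one claimed prize
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rpois(n * 5, lambda = 2), ncol = 5))
names(x) <- paste0("viewed_", LETTERS[1:5])
y <- factor(ifelse(x$viewed_A > x$viewed_B, "prizeA", "prizeB"))

# 75% of rows for training, the rest for testing
train_idx <- sample(n, size = floor(0.75 * n))

# Only run the model if the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  model <- randomForest::randomForest(x = x[train_idx, ], y = y[train_idx])
  predicted <- predict(model, newdata = x[-train_idx, ])
  print(accuracy(predicted, y[-train_idx]))
  # Confusion table of predicted vs actual prizes
  print(table(predicted = predicted, actual = y[-train_idx]))
}
```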

On my example test set I got ~70% accuracy on this initial run, which is not bad, but it is possible to get up to 90-95% with some tweaking. Anyhow, let's plot the test vs predicted product frequencies, to see how it looks:

This outputted the below plot. It can be seen in general the ~70% accuracy predicted many products but with a lot of error happening for a large outlier. Examining the data this product SKU was for a cash only prize. A next step would be to look at how to deal with this product in particular since eliminating it improves accuracy to ~85% in one swoop.

Next steps for the RandomForest

There I stop but there are lots of next steps that could be done to make the model applicable to the business. A non-exhaustive list is:

Introduction to K-means clustering

This video tutorial on k-means explains it well:

The above is an example with two dimensions, but k-means can apply to many more dimensions than that, we just can't visualise them easily. In our case we have 185 product views that will each serve as a dimension. However, problems with that many dimensions include long processing time alongside dangers of over-fitting the data, so we now look at PCA.

Principal Component Analysis (PCA)

We perform Principal Component Analysis (PCA) to see if there are important products that dominate the model - this could have been applied to the previous Random Forest example as well, and indeed a final production model could include output from one model like k-means to be fed into Random Forests.

The clustering we will do will actually be performed on the top rotated dimensions we find via PCA, and we will then map these back to the original pages for final output. This also takes care of situations such as if one product is always viewed in every cluster: PCA will minimize this dimension.

The code below looks for the principal components, then gives us some outputs to help decide how many dimensions to choose. A rule of thumb is to look for the components that together give us roughly 85% of the variance. For the below data this was actually 35 dimensions (reduced from the 185 before).
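A minimal sketch of this step with base R's prcomp() could look like the following. The view matrix is simulated (100 users by 10 products), so the number of components it selects will differ from the 35 quoted above.

```r
# Simulated stand-in for the user x product view matrix
set.seed(123)
views <- matrix(rpois(100 * 10, lambda = 3), nrow = 100)
colnames(views) <- paste0("product_", 1:10)

# Centre and scale so no product dominates purely through volume
pca <- prcomp(views, center = TRUE, scale. = TRUE)

# Proportion of variance per component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components that reaches the 85% rule of thumb
n_components <- which(cum_var >= 0.85)[1]

# Rotated data restricted to those components, ready for clustering
reduced <- pca$x[, 1:n_components, drop = FALSE]
```

Plotting var_explained (or calling summary(pca)) shows how quickly the variance tails off.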

The plot output from the above is below. We can see the first principal component accounts for 50% of the variance, but then the variation is flattish.

How many clusters?

How many clusters to pick for k-means can be a subjective experience. There are other clustering models that pick for you, but some kind of decision process will be dependent on what you need. There are however ways to help inform that decision.

Running k-means for an increasing number of clusters, we can look at an error measure (the within-cluster sum of squares) of how tightly the points sit in each cluster. When we plot this measure against the number of clusters, we can see where the graph falls sharply or levels off, and use that to help with our decision:
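One way to sketch this is with base R's kmeans(), recording the total within-cluster sum of squares for each number of clusters. The data below is simulated with three loose groups, and the range of 1 to 8 clusters is an arbitrary choice for the example.

```r
# Simulated stand-in for the PCA-reduced data: three groups in 2 dimensions
set.seed(2016)
reduced <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) {
  # nstart = 10 tries several random starts and keeps the best fit
  kmeans(reduced, centers = k, nstart = 10)$tot.withinss
})

# Pair each k with its error measure for inspection
elbow <- data.frame(clusters = ks, wss = wss)
```

Calling plot(elbow$clusters, elbow$wss, type = "b") draws the curve; the "elbow" where it flattens is a candidate for the number of clusters.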

The plot for determining the clusters is here - see the fall between 2-4 clusters. We went with 4 for this example, although a case could be made for 6:

Assessing the clusters and visualisation

I find heatmaps are a good way to assess clustering results, since they offer a good way to overview groupings. We are basically looking to see if the clusters found are different enough to make sense.

This gives the following visualisation. In an interactive RStudio or Shiny session, this is zoomable for finer detail, but here we just exported the image:

From the heatmap we can see that each cluster does have distinctly different product views.
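As a sketch of how such a heatmap can be prepared, the code below clusters on the leading principal components and then maps the clusters back to the original products by averaging views per cluster. The data is simulated, and the choice of 3 components and 4 clusters is for illustration only.

```r
# Simulated user x product view matrix (60 users, 6 products)
set.seed(7)
views <- matrix(rpois(60 * 6, lambda = 2), nrow = 60)
colnames(views) <- paste0("product_", 1:6)

# Cluster on the top rotated dimensions from PCA
pca <- prcomp(views, center = TRUE, scale. = TRUE)
clusters <- kmeans(pca$x[, 1:3], centers = 4, nstart = 10)

# Map back to the original products: mean views per product in each cluster
profile <- aggregate(as.data.frame(views),
                     by = list(cluster = clusters$cluster),
                     FUN = mean)
profile_matrix <- as.matrix(profile[, -1])
rownames(profile_matrix) <- paste0("cluster_", profile$cluster)
```

Passing the result to heatmap(profile_matrix, Rowv = NA, Colv = NA, scale = "column") gives a static version of the plot, with one row per cluster and one column per product.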

K-Means - Next Steps

The next step is to take these clusters and examine the products that are within them, looking for patterns. This is where your domain knowledge is needed, as all we have done here is group users together based on statistics - the "why" is not in here. When I've performed this in the past, I've tried to give a named persona to each cluster type. Examples include "Big Spenders" for those who visit the payment page a lot, "Sport Freaks" who tend to only look at sport goods etc. Again, this will largely depend on the number of clusters you have chosen, so you may want to vary this to tweak the results you are looking for.

Recommendations follow on how to group pages: A/B tests can then be performed to test if the clustering makes an impact.

Summary

I hope the above example workflows have inspired you to try it with your own data. Both examples can be improved, for instance we took no account of the order of product views or other metrics such as time on website, but the idea was to give you a way in to try these yourselves.

I chose k-means and Random Forests as they are two of the most popular models, but there are lots to choose from. This diagram from a python machine learning library, scikit-learn, offers an excellent overview on how to choose which other machine learning model you may want to use for your data:

All in all I hope some of the mystery around machine learning has been taken out, and that you can see how it can be applied to your work. If you are interested in really getting to grips with machine learning, the Coursera course was excellent and what set me on my way.

Do please let me know of any feedback, errors or what you have done with the above, I'd love to hear from you.

Good luck!


One of the problems with working with Google APIs is that quite often the hardest bit, authentication, comes right at the start. This presents a big hurdle for those who want to work with them; it certainly delayed me. In particular, having Google authentication work with Shiny is problematic, as the token itself needs to be reactive and only applicable to the user who is authenticating.

But no longer! googleAuthR provides helper functions to make it easy to work with Google APIs. And it's now available on CRAN (my first CRAN package!) so you can install it easily by typing install.packages("googleAuthR") in the R console.

After my experiences making shinyga and searchConsoleR, I decided reinventing the authentication wheel each time wasn't necessary, so I worked on this new R package that smooths out this pain point.

googleAuthR provides easy authentication within R or in a Shiny app for Google APIs. It provides a function factory you can use to generate your own functions, which call the API or do the actions you need.

At last count there are 83 APIs, many of which have no R library, so hopefully this library can help with that. Examples include the Google Prediction API, YouTube analytics API, Gmail API etc.

Example using googleAuthR

Here is an example of making a goo.gl R package using googleAuthR:

If you then want to make this multi-user in Shiny, then you just need to use the helper functions provided:

I'm excited about the possibilities with this package, as this new improved data is now available in a way to interact with all the thousands of other R packages.

If you'd like to see searchConsoleR capabilities, I have the package running an interactive demo here (very bare bones, but should demo the data well enough).

The first application I'll talk about in this post is archiving data into a .csv file, but expect more guides to come, in particular combining this data with Google Analytics.

Automatic search analytics data downloads

The 90 day limit still applies to the search analytics data, so one of the first applications should be archiving that data, so you can track the year on year, month on month and general development of your SEO rankings.

The below R script:

Downloads and installs the searchConsoleR package if it isn't installed already.

Lets you set some parameters you want to download.

Downloads the data via the search_analytics function.

Writes it to a csv in the same folder the script is run in.

The .csv file can be opened in Excel or similar.

This should give you nice juicy data.

Considerations

The first time you will need to run the scr_auth() script yourself so you can give the package access, but afterwards it will auto-refresh the authentication each time you run the script.

If you ever need a new user to be authenticated, run scr_auth(new_user=TRUE)

You may want to modify the script so it appends to a file instead, rather than having a daily dump, although I do this with a folder of .csv's to import them all into one R dataframe (which you could export again to one big .csv)
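The folder-of-.csv approach can be sketched in base R as below. The helper name combine_csvs and the demo files are made up for illustration; point the function at wherever your daily downloads live.

```r
# Combine every .csv in a folder into one data frame
combine_csvs <- function(folder) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  all_data <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, all_data)
}

# Demo with two small made-up daily files in a temporary folder
demo_folder <- file.path(tempdir(), "searchconsole_demo")
dir.create(demo_folder, showWarnings = FALSE)
write.csv(data.frame(date = "2015-08-01", query = "r shiny", clicks = 10),
          file.path(demo_folder, "2015-08-01.csv"), row.names = FALSE)
write.csv(data.frame(date = "2015-08-02", query = "r shiny", clicks = 12),
          file.path(demo_folder, "2015-08-02.csv"), row.names = FALSE)

archive <- combine_csvs(demo_folder)

# Export again as one big file, outside the daily folder so it is not re-read
write.csv(archive, file.path(tempdir(), "archive_all.csv"), row.names = FALSE)
```

This relies on every daily file having the same columns, which should hold if they all come from the same download script.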

Automation

You can now take the download script and use it in automated batch files, to run daily. In Windows, one way is to create a task in Task Scheduler:

browse to Rscript.exe, which should be located somewhere like "C:\Program Files\R\R-3.2.0\bin\x64\Rscript.exe"

input the name of your file in the parameters field

input the path where the script is to be found in the Start in field

go to the Triggers tab

create new trigger

choose that task should be done each day, month, ... repeated several times, or whatever you like

In Linux, you can probably work it out yourself :)

Conclusion

Hopefully this shows how with a few lines of R you can get access to this data set. I'll be doing more posts in the future using this package, so if you have any feedback let me know and I may be able to post about it. If you find any bugs or features you would like, please also report an issue on the searchConsoleR issues page on Github.


Introduction

The aim of this post is to give you the tools to enhance your Google Analytics data with R and present it on-line using Shiny. By following the steps below, you should have your own on-line GA dashboard, with these features:

Interactive trend graphs.

Auto-updating Google Analytics data.

Zoomable day-of-week heatmaps.

Top Level Trends via Year on Year, Month on Month and Last Month vs Month Last Year data modules.

A MySQL connection for data blending your own data with GA data.

An easy upload option to update a MySQL database.

Analysis of the impact of marketing events via Google's CausalImpact.

Detection of unusual time-points using Twitter's Anomaly Detection.

A lot of these features are either unavailable in the normal GA reports, or only possible in Google Analytics Premium. Under the hood, the dashboard is exporting the data via the Google Analytics Reporting API, transforming it with various R statistical packages and then publishing it on-line via Shiny.

Feature Detail

Here are some details on what modules are within the dashboard. A quick start guide on how to get the dashboard running with your own data is at the bottom.

Trend Graph

Most dashboards feature a trend plot, so you can quickly see how you are doing over time. The dashboard uses the dygraphs JavaScript library, which allows you to interact with the plot to zoom, pan and shift your date window. Plot smoothing has been provided at the day, week, month and annual level.

Additionally, the events you upload via the MySQL upload also appear here, as well as any unusual time points detected as anomalies. You can go into greater detail on these in the Analyse section.

Heatmap

Heatmaps use colour intensity to show metrics between categories. The heatmap here is split into weeks and days of the week, so you can quickly scan to see if a particular day of the week is popular - in the below plot, Monday/Tuesday look like they are the best days for traffic.

The data window is set by what you select in the trend graph, and you can zoom for more detail using the mouse.

Top Level Trends

Quite often headlines just need a number to quickly check. These data modules give you a quick glance into how you are doing, comparing last week to the week before, last month to the month before and last month to the same month the year before. Between them, you should see how your data is trending, accounting for seasonal variation.
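The comparisons behind these modules boil down to percentage changes between period totals. A minimal sketch in base R, using made-up daily sessions and a hypothetical helper pct_change (the dashboard's own code may differ):

```r
# Percentage change between two totals
pct_change <- function(current, previous) {
  (current - previous) / previous * 100
}

# Made-up daily sessions for two years
set.seed(10)
dates <- seq(as.Date("2014-01-01"), as.Date("2015-12-31"), by = "day")
ga <- data.frame(date = dates, sessions = rpois(length(dates), lambda = 500))

# Total sessions per calendar month, as a named vector
ga$month <- format(ga$date, "%Y-%m")
monthly <- sapply(split(ga$sessions, ga$month), sum)

# Last month vs the month before, and vs the same month a year earlier
month_on_month <- pct_change(monthly[["2015-12"]], monthly[["2015-11"]])
year_on_year   <- pct_change(monthly[["2015-12"]], monthly[["2014-12"]])
```

Comparing against the same month a year earlier is what accounts for seasonal variation.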

MySQL Connection

The code provides functions to connect to a MySQL database, which you can use to blend your data with Google Analytics, provided you have a key to link them on.

In the demo dashboard the key used is simply the date, but this can be expanded to include linking on a userID from say a CRM database to the Google Analytics CID, Transaction IDs to off-line sales data, or extra campaign information to your campaign IDs. An interface is also provided to let end users update the database by uploading a text file.
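Blending on a date key can be sketched with base R's merge(). Both data frames below are made up: ga stands in for a Google Analytics export and events for the result of a MySQL query, with column names chosen for the example.

```r
# Made-up Google Analytics export: sessions per day
ga <- data.frame(
  date     = as.Date("2015-03-01") + 0:4,
  sessions = c(120, 135, 128, 210, 140)
)

# Made-up events table, as it might come back from a MySQL query
events <- data.frame(
  date      = as.Date("2015-03-04"),
  eventname = "TV advert",
  stringsAsFactors = FALSE
)

# Left join on the shared key so every GA day is kept;
# days without an event get NA in eventname
blended <- merge(ga, events, by = "date", all.x = TRUE)
```

The same pattern applies to other keys, such as a userID or transaction ID, as long as both tables share the column.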

CausalImpact

In the demo dashboard, the MySQL connection is used to upload Event data, which is then used to compare with the Google Analytics data to see if the event had a statistically significant impact on your traffic. This replicates a lot of the functionality of the GA Effect dashboard.

Headline impact of the event is shown in the summary dashboard tab. If it's statistically significant, the impact is shown in blue.

Anomaly Detection

Twitter has released this R package to help detect unusual time points for use within their data streams, which is also handy for Google Analytics trend data.

The annotations on the main trend plot are indicated using this package, and you can go into more detail and tweak the results in the Analyse section.

Making the dashboard multi-user

In this demo I’ve taken the usual use case of an internal department just looking to report on one Google Analytics property, but if you would like end users to authenticate with their own Google Analytics property, it can be combined with my shinyga() package, which provides functions which enable self authentication, similar to my GA Effect/Rollup/Meta apps.

In production, you can publish the dashboard behind a Shinyapps authentication login (needs a paid plan), or deploy your own Shiny Server to publish the dashboard on your company intranet.

Quick Start

Now you have seen the features, the below goes through the process for getting this dashboard for yourself. This guide assumes you know of R and Shiny - if you don’t then start there: http://shiny.rstudio.com/

You don’t need to have the MySQL details ready to see the app in action, it will just lack persistent storage.

Setup the files

Find your GA View ID you want to pull data from. The quickest way to find it is to login to your Google Analytics account, go to the View then look at the URL: the number after “p” is the ID.

[Optional] Get your MySQL setup with a user and IP address. See next section on how this is done using Google Cloud SQL. You will also need to white-list the IP of where your app will sit, which will be your own Shiny Server or shinyapps.io. Add your local IP for testing too. If using shinyapps.io their IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75; 54.204.37.78.

Create a file called secrets.R in the same directory as the app with the below content filled in with your details.

Configuring R

1. Make sure you can install and run all the libraries needed by the app:

2. Run the below command locally first, to store the auth token in the same folder. You will be prompted to login with the Google account that has access to the GA View ID you put into step 3, and get a code to paste into the R console. The token will then be uploaded with the app and handle the authentication with Google Analytics when in production:

> rga::rga.open(where="token.rga")

3. Test the app by hitting the “Run App” button at the top right of the server.R or ui.R script in RStudio, or by running:

> shiny::runApp()

Using the dashboard

The app should now be running locally in a browser window with your own GA data. It can take up to 30 seconds for all the data to load first time.

Deploy the instance on-line to Shinyapps.io with a free account there, or to your own Shiny Server instance.

Customise your instance. If for any reason you don’t want certain features, then remove the feature in the ui.R script - the data is only called when the needed plot is viewed.

Getting a MySQL setup through Google Cloud SQL

If you want a MySQL database to use with the app, I use Google Cloud SQL. Setup is simple:

Make sure you have billing turned on with your billing accounts menu top right.

Go to Storage > Cloud SQL in the left hand menu.

Create a New Instance.

Create a new Database called “onlinegashiny”

Under “Access Control” you need to put in the IP of yourself where you test it, as well as the IPs of the Shiny Server/shinyapps.io. If you are using shinyapps.io the IPs are: 54.204.29.251; 54.204.34.9; 54.204.36.75;54.204.37.78

Under “IP Address” create a static IP (Charged at $0.24 a day)

You now should have all the access info you need to put in the app's secrets.R for MySQL access. The port should be the default 3306.

You can also limit the amount of data that is uploaded via the shiny.maxRequestSize option - the default is 5 MB.

Summary

Hopefully the above could help inspire what can be done with your Google Analytics data. Focus has been on trying to give you the tools that allow action to be made on your data.

There is a lot more you can do via the thousands of R packages available, but hopefully this gives a framework you can build upon.

I’d love to see what you build with it, so do please feel free to get in touch. :)

Meanwhile RStudio are releasing more and more packages that make it quicker and easier to create interactive graphics, with tools for connecting and reshaping data and then plotting using attractive JavaScript visualisation libraries or native interactive R plots. GA Effect is also being hosted using ShinyApps.io, an R server solution that enables you to publish straight from your console, or you can run your own server using Shiny Server.

Putting them together

Web Interaction

First off, using RStudio makes this all a lot easier as they have a lot of integration with their products.

ShinyDashboard is a custom theme of the more general Shiny. As detailed in the getting started guide, creating a blank webpage dashboard with shinydashboard takes 8 lines of R code. You can test or run everything locally first before publishing to the web via the “Publish” button at the top.

Probably the most difficult concept to get your head around is the reactive programming functions in a Shiny app. This is effectively how the interaction occurs, and sets up live relationships between inputs from your UX script (always called ui.R) and outputs from your server side script (called server.R). These are effectively your front-end and back-end in a traditional web environment. The Shiny package takes your R code and changes it into HTML5 and JavaScript. You can also import JavaScript of your own if you need it to cover what Shiny can’t.

The Shiny code then creates the UI for the app, and creates reactive versions of the datatables needed for the plots.

Google Authentication

The Google authentication flow uses OAuth2 and could be used for any Google API in the console, such as BigQuery, Gmail, Google Drive etc. I include the code used for the authentication dance below so you can use it in your own apps:

Fetching Google Analytics Data

Once a user has authenticated with Google, the user token is then passed to rga() to fetch the GA data, according to which metric and segment the user has selected.

This is done reactively, so each time you update the options a new data fetch to the API is made. Shiny apps are on a per user basis and work in RAM, so the data is forgotten once the app closes down.

Doing the Statistics

You can now manipulate the data however you wish. I put it through the CausalImpact package as that was the application goal, but you have a wealth of other R packages that could be used such as machine learning, text analysis, and all the other statistical packages available in the R universe. It really is only limited by your imagination.

Here is a link to the CausalImpact paper, if you really want to get in-depth with the methods used. It includes some nice examples of predicting the impact of search campaign clicks.

Here is how CausalImpact was implemented as a function in GA Effect:

Plotting

dygraphs() is an R package that takes R input and outputs the JavaScript needed to display it in your browser, and as it's made by RStudio they also made it compatible with Shiny. It is an application of HTMLwidgets, which lets you take any JavaScript library and make it compatible with R code. Here is an example of how the main result graph was generated:

Publishing

I’ve been testing the alpha of shinyapps.io for a year now, but it is just this month (Feb 2015) coming out of beta. If you have an account, then publishing your app is as simple as pushing the “Publish” button above your script, after which it appears at a public URL. With other paid plans, you can limit access to authenticated users only.

Next steps

This app only took me 3 days with my baby daughter on my lap during a sick weekend, so I’m sure you can come up with similar given time and experience. The components are all there now to make some seriously great apps for analytics. If you make something do please let me know!

I had an OS X 10.10.2 update on my 2011 MacBook Air, and left the laptop open last night. This put it into hibernation mode, which breaks the auto-installation, so when I tried to use the laptop this morning it booted to the Apple logo, but then the screen went totally black with no option to log in. The cursor was still live though.

The fix below will let you log in again. It will only work in the above scenario; if it's a broken backlight or something else, keep searching :)

The Measurement Protocol was launched at the same time as Universal Analytics, but I've seen less adoption of it with clients, so this post is an attempt to show what can be done with it with a practical example.

Three days later they open the email again at home, and click through to the offer on your website.

They complete the form on the page and convert.

Within GA, you will be able to see for that campaign 2 opens, 1 click/visit and 1 conversion for that user. As with all email open tracking, you are dependent on the user downloading the image, which is why I include the option to upload an image and not just a pixel, as it may be more enticing to allow images in your newsletter.

Intro

The Measurement Protocol lets you track beyond the website, without the need of client-side JavaScript. You construct the URL and when that URL is loaded, you see the hit in your Google Analytics account. That's it.

The clever bit is that you can link user sessions together via the CID (Client ID), so you can track the upcoming Internet of Things from off-line to on-line, but also things like email opens and affiliate thank you pages. It also works with things like enhanced e-commerce, so it can be used for customer refunds or product impressions.
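As a rough sketch of what "constructing the URL" means, the snippet below builds an email-open event hit in Python. The endpoint and the v, tid, cid, t, ec, ea, el and cn fields are standard Measurement Protocol parameters; the function name, the example client ID and the campaign name are made up for illustration.

```python
from urllib.parse import urlencode

# Standard Measurement Protocol collection endpoint
GA_ENDPOINT = "https://www.google-analytics.com/collect"


def build_hit_url(tracking_id, client_id, hit_type="event", **extra):
    """Return a Measurement Protocol URL; loading it records one hit in GA.

    build_hit_url is a hypothetical helper, not part of any library.
    """
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # GA property ID, e.g. UA-123456-1
        "cid": client_id,    # client ID: reuse it to link hits together
        "t": hit_type,       # hit type: "event", "pageview", ...
    }
    params.update(extra)     # e.g. event category/action/label, campaign
    return GA_ENDPOINT + "?" + urlencode(params)


# Illustrative values only: a made-up client ID and campaign name
open_hit = build_hit_url(
    "UA-123456-1",
    "35009a79-1a05-49d7-b876-2b884d0f825b",
    ec="email", ea="open", el="newsletter", cn="spring_offer",
)
print(open_hit)
```

Any later hit sent with the same cid value, whether from a server, a device or the website, is attributed to the same user in GA.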

This demo uses e-mail opens as its example, but it needs only minor modifications to track other things. For instance, I use a similar script to measure in GA when my Raspberry Pi is backing up our home computers via Time Machine.

Demo on App Engine

Using the Measurement Protocol in production will most likely need server-side code. I'm running a demo on Google App Engine coded in Python, which is pretty readable so should be fairly easy for a developer to replicate in their favourite language. App Engine is also a good choice if you want to run it in production, since it has a free tier that covers tracking 1000s of email opens a day, with the scalability to handle millions.

There are instructions on GitHub on how it works, but I'll run through some of the key concepts in this post.

What the code does

The example has four main URLs:

The homepage explaining the app

The image URL itself, that when loaded creates the hit to GA

A landing page with example custom GA tracking script

An upload image form to change the image you would display in the e-mail.

The URLs above are controlled server side with the code in main.py

Homepage

This does nothing server side aside from serving up the page.

Image URL

This is the main point of the app - it turns a GET request for the image uploaded into a POST with the parameters found in the URL. It handles the different options and sends the hit to GA as a virtual pageview or event, with a unique user CID and campaign name. An example URL here is:
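For illustration only, the GET-to-POST step described above could look something like the sketch below. The image URL, its query parameter names (cid, campaign, type, page), the defaults and the function names are all hypothetical, not the ones used by the demo; only the Measurement Protocol fields (v, tid, cid, t, ec, ea, el, dp, cn) are standard. The step that anonymises the CID is left out to keep it short.

```python
from urllib.parse import urlencode, urlparse, parse_qs
from urllib.request import urlopen

GA_ENDPOINT = "https://www.google-analytics.com/collect"


def payload_from_image_url(image_url, tracking_id):
    """Translate the query string of an image request into a GA hit payload."""
    query = parse_qs(urlparse(image_url).query)

    def first(key, default=None):
        # parse_qs returns a list per key; take the first value
        return query.get(key, [default])[0]

    cid = first("cid")
    if cid is None:
        raise ValueError("image URL must carry a cid parameter")

    campaign = first("campaign", "newsletter")  # hypothetical default
    payload = {"v": "1", "tid": tracking_id, "cid": cid, "cn": campaign}

    if first("type", "event") == "pageview":
        # Record the open as a virtual pageview
        payload.update({"t": "pageview", "dp": first("page", "/email-open")})
    else:
        # Record the open as an event
        payload.update({"t": "event", "ec": "email", "ea": "open", "el": campaign})
    return payload


def send_hit(payload):
    """POST the payload to GA. Defined for completeness, not called here."""
    data = urlencode(payload).encode("utf-8")
    return urlopen(GA_ENDPOINT, data=data, timeout=5)


# Hypothetical image URL as it might appear in an email template
example_url = "https://example.com/main.png?cid=user-42&campaign=spring_offer&type=event"
payload = payload_from_image_url(example_url, "UA-123456-1")
print(payload)
```

In a real handler the server would call something like send_hit(payload) and then return the image bytes, so the email client still displays the picture.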

Landing Page

This does little but take the cid you put in the email URL and output the CID that will be used in Google Analytics. If this is the same CID as in the image URL and the user clicks through from the email, those sessions will be linked. You can also add the GA campaign parameters, but the server-side script ignores those - the JavaScript on the page will take care of them. An example URL here is:

The CID in the landing page URL is then captured and turned into an anonymous CID for GA. This is then served up to the Universal Analytics JavaScript on the landing page, shown below. Use the same UA code for both the image URL and the landing page, otherwise it won't work (e.g. UA-123456-1).
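One way the server-side "turn it into an anonymous CID" step could work, sketched in Python: derive a UUID deterministically from the ID in the URL, so that the image URL and the landing page always arrive at the same CID. This is an assumption about the approach, not necessarily what the demo does; anonymous_cid and the namespace value are hypothetical.

```python
import uuid

# Hypothetical namespace; in practice keep your own value private, since
# anyone who knows it can recompute the CID from the original ID.
CID_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def anonymous_cid(email_cid):
    """Map the ID used in the email to a stable, UUID-formatted GA client ID."""
    # uuid5 is a SHA-1 based, deterministic UUID: same input, same output
    return str(uuid.uuid5(CID_NAMESPACE, str(email_cid)))


image_cid = anonymous_cid("user-42")    # computed when the image is requested
landing_cid = anonymous_cid("user-42")  # computed when the landing page loads
print(image_cid == landing_cid)
```

Because both requests produce the same value, the open and the later visit end up under one user in GA without the original ID ever being sent.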

Upload Image

This just handles the image upload and serves the image via App Engine's Blobstore. Nothing pertinent to GA here, so see the GitHub code if interested.

Summary

I hope this helps sell the Measurement Protocol to more developers, as it offers a solution to a lot of the problems with digital measurement today, such as attribution of users beyond the website. The implementation is reasonably simple, but the power is in what you send and in what situations. Hopefully this inspires ideas for what you could do with your own setup.

There are some limitations to be aware of: the CID linking won't stitch sessions together, it just discards a user's old CID if they already had one, so you may want to look at userID, or at how to customise the CID for users who visit your website before the email is sent. The best scenario would be if a user is logged in for every session, but this may not be practical. It may be that linking sessions becomes so valuable in the future that entire website strategies will focus on getting users to ID themselves, such as via social logins.

Always consider privacy: look for users to opt in, and make sure to use GA filters to take out any PII you may put into GA as a result. Current policy looks to be that if the data within GA cannot be traced to an individual (e.g. via a name, address or email) then you are able to record an anonymous personal ID that could be exported and linked to PII outside of GA. This is a bit of a moving target, but in all cases keeping things as user focused and not profit focused as possible should see you through any ethical questions.

CausalImpact is a package that looks to put some statistics behind changes you may have made in a marketing campaign. It examines the time-series of data before and after an event, and gives you some idea of whether any changes were just down to random variation, or whether the event actually made a difference.

You can now test this yourself in my Shiny app that automatically pulls in your Google Analytics data so that you can apply CausalImpact to it. This way you can A/B test changes for all your marketing channels, not just SEO. However, if you want to try it manually yourself, keep reading.

Considerations before getting the data

Suffice to say, it should only be applied to time-series data (i.e. there is a date or time on the x-axis), and it helps if the event was rolled out at only one of those time points. This may influence the choice of time unit you use, so if, say, it rolled out over a week it's probably better to use weekly data exports. Also consider the time period you choose. The package will use the time-series before the event to construct what it thinks should have happened vs what actually happened, so if anything unusual or any spikes occur in the test period it may affect your results.
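If you only have a daily export, switching the time unit is a small aggregation step. The sketch below is in Python rather than R and uses made-up numbers; weekly_totals is a hypothetical helper that groups daily visits into Monday-to-Sunday weeks.

```python
from datetime import date, timedelta


def weekly_totals(daily):
    """Sum (date, visits) pairs into (week_start, visits) pairs, weeks starting Monday."""
    totals = {}
    for day, visits in daily:
        week_start = day - timedelta(days=day.weekday())  # Monday of that week
        totals[week_start] = totals.get(week_start, 0) + visits
    return sorted(totals.items())


# Made-up data: 14 days of 100 visits each, starting Monday 5 Jan 2015
start = date(2015, 1, 5)
daily = [(start + timedelta(days=i), 100) for i in range(14)]

weekly = weekly_totals(daily)
for week_start, visits in weekly:
    print(week_start.isoformat(), visits)
```

With weekly data, an event that rolled out over a week falls inside a single time point instead of being smeared across seven.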

Metrics-wise, the example here is with visits. You could perhaps do it with conversions or revenue, but then you may be affected by factors outside of your control (the buy button breaking etc.), so for clean results try to take out as many confounding variables as possible.

Example with SEO Titles

For me though, I had an example where some title tag changes went live on one day, so I could compare the SEO traffic before and after to judge whether it had any effect and, more importantly, by how much traffic had increased.

Setup

First comes the setup: installing and loading the libraries if you haven't got them, and authenticating the GA account you want to pull data from.

Import GA data

I then pull in the data for the time period covering the event: SEO visits by date.

Apply CausalImpact

In this example, the title tags got updated on the 200th day of the time period I pulled. I want to examine what happened over the next 44 days.

Plot the Results

With the plot() function you get output like this:

The left vertical dotted line is where the estimate on what should have happened is calculated from.

The right vertical dotted line is the event itself. (SEO title tag update)

The original data you pulled is the top graph.

The middle graph shows the estimated impact of the event per day.

The bottom graph shows the estimated impact of the event overall.

In this example it can be seen that after 44 days there is an estimated 90,000 more SEO visits from the title tag changes. This then can be used to work out the ROI over time for that change.

Report the results

The $report element gives you a nice overview of the statistics in a verbose form, to help qualify your results. Here is a sample output:

"During the post-intervention period, the response variable had an average value of approx. 94. By contrast, in the absence of an intervention, we would have expected an average response of 74. The 95% interval of this counterfactual prediction is [67, 81]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 20 with a 95% interval of [14, 27]. For a discussion of the significance of this effect, see below.

Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 4.16K. By contrast, had the intervention not taken place, we would have expected a sum of 3.27K. The 95% interval of this prediction is [2.96K, 3.56K].

The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +27%. The 95% interval of this percentage is [+18%, +37%].

This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (20) to the original goal of the underlying intervention.

The probability of obtaining this effect by chance is very small (Bayesian tail-area probability p = 0.001). This means the causal effect can be considered statistically significant."
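The headline figures in that sample report can be reproduced from each other with simple arithmetic, which is a handy sanity check when reading your own output. The sketch below (in Python, using only the rounded numbers quoted above, so results are approximate) recomputes the absolute, relative and cumulative effects.

```python
# Rounded figures taken from the sample report above
observed_avg = 94.0      # average response after the intervention
predicted_avg = 74.0     # counterfactual average without the intervention
observed_sum = 4160.0    # 4.16K
predicted_sum = 3270.0   # 3.27K

absolute_effect = observed_avg - predicted_avg    # average lift per time point
relative_effect = absolute_effect / predicted_avg # lift relative to the prediction
cumulative_effect = observed_sum - predicted_sum  # total lift over the period
implied_points = observed_sum / observed_avg      # number of post-period time points

print("absolute effect:", absolute_effect)
print("relative effect: {:.0%}".format(relative_effect))
print("cumulative effect:", cumulative_effect)
print("implied post-period length:", round(implied_points))
```

The implied length of the post-intervention period comes out at about 44 time points, which matches the 44 days examined after the event.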

Next steps

This could then be repeated for things like UX changes, TV campaigns, etc. You just need the time of the event and the right metrics or KPIs to measure them against.


Mark Edmondson

Views on this blog are entirely my own and do not reflect my employers'. Content may also be unbelievably true, or exaggerated falsehoods. It's written for friends, so if I don't know you please act as a potential friend :) Any trolling or spamming will be punishable by death.
Contact: mark [at] markedmondson.me