Measuring the Super Bowl Ads through a Social Media Lens

Resource Interactive evaluated this year’s Super Bowl ads from a digital and social media perspective: how well did the ads integrate with digital channels (web sites, social media, mobile, and overall user experience) before and during the game? I got tapped to pull some hard data. It was an interesting experience!

A Different Kind of Measurement

This was a different kind of measurement from what I normally do. I definitely figured out a few things that we’ll be able to apply to client work in the future. On the surface, the exercise seemed like a slight variation on the performance measurement we already do day in and day out, but it actually has some pretty hefty differences:

Presumption of Common Objectives: we used a uniform set of criteria to measure the ads, which, by definition, means that we had to assume the ads were all basically trying to reach the same consumers and deliver the same results. Or, to be more accurate, we used a uniform set of criteria and then made some assumptions about each brand to inform how an ad and its digital integration were judged. That’s a little backwards from how a marketer would normally measure a campaign’s performance.

Over 30 Brands: the sheer volume of brands that advertise at the Super Bowl introduces a wrinkle. From Teleflora to PepsiMax to Kia to Groupon, the full list was longer than the set any single brand would normally watch as its “major competitors.”

Real-Time Assessment: we wanted our evaluation completed no later than first thing Monday morning. The reality of marketing, though, is that, for all its immediacy and real-time-ness, successful campaigns actually play out over time. In this case, we had to make a judgment within a few hours of the end of the game itself.

No Iterations: I certainly could (and did) do some test data pulls, but I really had no idea what the data was going to look like when The Game actually hit. So, we chose a host of metrics, and I laid out my scorecard with no idea as to how it would turn out once data was plugged in. Normally, I would want some time to iterate and adjust exactly what data was included and how it was presented. I would certainly start with a well-thought-out plan of what was being included and why, but knowing that I would likely find some not-useful pieces and some additions that were warranted.

It was a challenge, for sure!

The Approach

The data I provided, the most objective and quantitative part of the whole exercise, was not core to the overall scoring. The approach we took was pretty robust, though. (I had little to do with developing it; this is me applauding the work of some of my co-workers.)

Simply put, we broke the “digital” aspects of the experience into several different buckets, assigned a point person to each of those buckets, and then had that person and his/her team develop a set of heuristics against which they would evaluate each brand that was advertising. That made the process reasonably objective, and it acknowledged that we are far, far, far from having a way to directly and immediately quantify the impact of any campaign. Rather, we recognized that digital is what we do. Ad Age putting us at No. 4 on their Agency A-List was just further validation of what I already knew — we have some damn talented folk at RI, and their experience-based judgments hold sway.

For my part, I worked with Hayes Davis at TweetReach, Eric Peterson at Twitalyzer, and my mouse and keyboard at Microsoft Excel to set up seven basic measures of a brand’s results on Twitter and on Facebook. Each basic measure had either two or three breakdowns, so I wound up with a total of 17 specific measures. For each specific measure, I grouped the brands into one of three buckets: top performers (green), bottom performers (red), and all others (no color). My hope was that I would have a tight scorecard that would support the core teams’ scoring, perhaps causing a second look at a brand or two, but largely lining up with the experts’ assessment. And that is how things wound up playing out.

The Metrics

The metrics I included on my scorecard came from three different angles with three different intents:

Brand mentions on Twitter — these were measures related to the overall reach of the “buzz” generated for each brand during the game; we worked with TweetReach to build out a series of trackers that reported — overall and in 5-minute increments — the number of tweets, overall exposure, and unique contributors

Facebook page growth — this was a simple measure of the growth of the fans of the brand’s Facebook page

The first set of measures was captured during the game, and we normalized those measures by the total number of seconds of advertising that each brand ran. We assessed the latter two sets of measures against a pre-game baseline, using Monday, 1/31/2011, as our baseline date. Immediately following the game, there was a lot of manual data refreshing (of Facebook pages and of Twitalyzer), followed by a lot of data entry.
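To make the two kinds of adjustment concrete, here is a minimal Python sketch. It is not the spreadsheet we actually used. The field names and the numbers are made up for illustration, and expressing growth as a percentage of the baseline is an assumption rather than something the scorecard necessarily did.

```python
from dataclasses import dataclass


@dataclass
class BrandStats:
    # All field names are hypothetical; they are not TweetReach or Facebook fields.
    name: str
    ad_seconds: int       # total seconds of advertising the brand ran
    tweets: int           # brand mentions on Twitter during the game
    fans_baseline: int    # Facebook fans on the pre-game baseline date
    fans_post_game: int   # Facebook fans after the game


def tweets_per_ad_second(b: BrandStats) -> float:
    """During-the-game measure, normalized by seconds of advertising."""
    return b.tweets / b.ad_seconds


def fan_growth_pct(b: BrandStats) -> float:
    """Growth relative to the baseline, as a percentage (an assumed convention)."""
    return 100.0 * (b.fans_post_game - b.fans_baseline) / b.fans_baseline


# Made-up numbers, purely for illustration.
brand = BrandStats("ExampleBrand", ad_seconds=60, tweets=3000,
                   fans_baseline=20000, fans_post_game=21000)
print(tweets_per_ad_second(brand), fan_growth_pct(brand))
```

Normalizing by ad seconds keeps a brand that bought two minutes of airtime from automatically outscoring a brand that bought thirty seconds.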

As it turned out, many of the brands came up short when it came to integrating with their social media presence, which made for a pretty mixed bag of unimpressive results for the latter two categories above. Sure, BMW drove a big growth in fans of their page, but they did so by forcing fans to like the page to get to the content, which seems almost like requiring visitors to fill out a registration form on the home page of a web site before they can access any content.

The Results

In the end, I had a “Christmas Tree” one-pager: for each metric, the top 25% of the brands were highlighted in green and the bottom 25% were highlighted in red. I’m not generally a fan of these sorts of scorecards as an operational tool, but, to get a visual cue as to which brands generally performed well as opposed to those that generally performed poorly, it worked. It also “worked” in that there were no hands-down, across-the-board winners.
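The green/red logic is simple enough to sketch in a few lines of Python. This is an illustration rather than the Excel formatting I actually used. The function name, the brand names, and the scores are all hypothetical, and ties are broken by input order, which is an arbitrary choice.

```python
def color_code(scores: dict, fraction: float = 0.25) -> dict:
    """Label the top `fraction` of brands green and the bottom `fraction` red.

    `scores` maps brand name to a metric value where higher is better.
    Assumes fraction <= 0.5 so the two groups cannot overlap.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = int(len(ranked) * fraction)  # brands per colored group, rounded down
    colors = {brand: "" for brand in ranked}
    for brand in ranked[:k]:
        colors[brand] = "green"
    for brand in ranked[len(ranked) - k:]:
        colors[brand] = "red"
    return colors


# Made-up scores for eight hypothetical brands.
scores = {"A": 90, "B": 75, "C": 60, "D": 55, "E": 40, "F": 30, "G": 20, "H": 10}
colors = color_code(scores)
print(colors)
```

Running this once per metric and laying the results side by side gives the same kind of grid: one row per brand, one column per metric.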

What Else?

In addition to the overall scoring, we captured the raw TweetReach data and have started to look at it broken down into 5-minute increments to see which specific spots drove more or less social media conversation.
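For anyone who wants to try the same breakdown on their own data, here is a rough Python sketch of the 5-minute bucketing. It assumes you already have one timestamp per tweet. The function names and the sample timestamps are made up, and nothing here reflects the format of TweetReach's own export.

```python
from collections import Counter
from datetime import datetime


def five_minute_bucket(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def tweets_per_bucket(timestamps):
    """Count tweets in each 5-minute window."""
    return Counter(five_minute_bucket(ts) for ts in timestamps)


# Made-up timestamps, purely for illustration.
sample = [
    datetime(2011, 2, 6, 19, 1, 12),
    datetime(2011, 2, 6, 19, 4, 59),
    datetime(2011, 2, 6, 19, 5, 0),
    datetime(2011, 2, 6, 19, 12, 30),
]
counts = tweets_per_bucket(sample)
for window, n in sorted(counts.items()):
    print(window.strftime("%H:%M"), n)
```

Lining the windows up against the times each spot aired is what lets you tie a spike in conversation back to a specific ad.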

@Claire We certainly recognized that the sentiment of the coverage was important, but we relied on a more qualitative assessment for that (and hope to do a deeper dive there with the raw TweetReach data soon). Twitter is challenging when it comes to sentiment analysis — 140 characters makes for pretty limited context. I would have loved to generate a word cloud with the brand keywords removed for each brand, but time constraints didn’t make that feasible for the initial cut. It still would have been a qualitative assessment, but a qualitative assessment with a good data visualization to support it.
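The counting behind that kind of word cloud could be sketched roughly like this in Python. It is an illustration only, not something we ran: the stopword list is a tiny placeholder, the tweets and the brand name are made up, and the tokenizing is deliberately crude.

```python
import re
from collections import Counter

# A tiny placeholder stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "was", "that", "for", "in", "it"}


def word_counts(tweets, brand_keywords):
    """Count words across tweets, skipping stopwords and the brand's own keywords."""
    exclude = STOPWORDS | {k.lower() for k in brand_keywords}
    words = []
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in exclude:
                words.append(word)
    return Counter(words)


# Made-up tweets about a hypothetical brand.
tweets = [
    "That ExampleBrand ad was hilarious",
    "ExampleBrand ad: hilarious and weird",
    "Did not get the examplebrand ad",
]
counts = word_counts(tweets, ["ExampleBrand"])
print(counts.most_common(3))
```

Dropping the brand keywords matters because they appear in every tweet by construction and would otherwise drown out the words that actually hint at sentiment.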