Data Science is Hard: Anomalies Part 2

Since last time we’ve noticed that the vast majority of these incoming pings are duplicates. I don’t mean that they look similar; I mean that they are absolutely identical, down to their supposedly-unique document ids.

How could this happen?

Well, with a minimum of speculation we can assume that however these Firefox instances are being distributed, they are being distributed with full copies of the original profile data directory. This would contain not only the user’s configuration information, but also copies of all as-yet-unsent pings. Once the distributed Firefox instance was started in its new home, it would submit these pending pings, which would explain why they are all duplicated: the distributor copy-pasta’d them.

So if we want to learn anything about the population of machines that are actually running these instances, we need to ignore all of these duplicate pings. So I took my analysis from last time and tweaked it.

First off, to demonstrate just how much of the traffic spike we see is the same fifteen duplicate pings, here is a graph of ping volume vs unique ping volume:

The count of non-duplicated pings is minuscule. We can conclude from this that most of these distributed Firefox instances rarely get the opportunity to send more than one ping. (If they did, we’d see many more unique pings created on their new hosts.)
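For the curious, here is a minimal sketch of the kind of counting that sits behind a "volume vs unique volume" comparison. It is not the actual analysis code: the function name, the (day, document id) input shape, and the sample values are all made up for illustration, and it assumes the pings are fed in arrival order so that a document id is counted as unique only on the day it first shows up.

```python
from collections import Counter


def volume_by_day(pings):
    """Count total and first-seen (unique) pings per day.

    `pings` is a hypothetical iterable of (day, document_id) pairs
    in arrival order; it is not the real telemetry schema.
    """
    total = Counter()
    unique = Counter()
    seen_ids = set()
    for day, doc_id in pings:
        total[day] += 1
        # A document id only counts as unique the first time we see it.
        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            unique[day] += 1
    return total, unique


# Made-up sample: "doc-a" is the copy-pasta'd ping that keeps coming back.
sample = [
    ("day-1", "doc-a"),
    ("day-1", "doc-a"),
    ("day-1", "doc-b"),
    ("day-2", "doc-a"),
    ("day-2", "doc-c"),
]
total, unique = volume_by_day(sample)
for day in sorted(total):
    print(day, "total:", total[day], "unique:", unique[day])
```

Keeping every document id in a set is fine for a sketch, but at real ping volumes you would probably want something more memory-frugal.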

What can we say about these unique pings?

Besides how infrequent they are? They come from instances that all have the same Random Agent Spoofer addon that we saw in the original analysis. None of them are set as the user’s default browser. The hosts are most likely to have a 2.4GHz or 3.5GHz CPU. The hosts come from a geographically-diverse spread of areas, with a peculiarly-popular cluster in Montreal (maybe they like the bagels. I know I do).

All of the pings come from computers running Windows XP. I wish I were more surprised by this, but it really does turn out that running software over a decade past its best-before date is a bad idea.

Also of note: the length of time the browser is open for is far too short (60-75s mostly) for a human to get anything done with it:

(Telemetry needs 60s after Firefox starts up in order to send a ping so it’s possible that there are browsing sessions that are shorter than a minute that we aren’t seeing.)

What can/should be done about these pings?

These pings are coming in at a rate far exceeding what the entire Aurora 51 population had when it was an active release. Yet, Aurora 51 hasn’t been an active release for six months and Aurora itself is going away.

As such, though its volume seems to continue to increase, this anomaly is less and less likely to cause us real problems day-to-day. These pings are unlikely to accidentally corrupt a meaningful analysis or mis-scale a plot.

And with our duplicate detector identifying these pings as they come in, it isn’t clear that this actually poses an analysis risk at all.

So, should we do anything about this?

Well, it is approaching release-channel-levels of volume per-day, submitted evenly at all hours instead of in the hump-backed periodic wave that our population usually generates:

Hundreds of duplicates detected every minute means nearly a million pings a day. We can handle it (in the above plot I turned off release, whose low points coincide with aurora’s high points), but should we?
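The "hundreds a minute means nearly a million a day" arithmetic is just a multiplication by the 1440 minutes in a day. Here it is as a tiny sketch; the per-minute rates are illustrative stand-ins for "hundreds", not measured values.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def pings_per_day(pings_per_minute):
    """Scale a steady per-minute rate up to a per-day count."""
    return pings_per_minute * MINUTES_PER_DAY


# Illustrative rates only; the post just says "hundreds" per minute.
for rate in (300, 500, 700):
    print(rate, "per minute ->", pings_per_day(rate), "per day")
```

At 700 a minute that works out to 1,008,000 a day, so "nearly a million" holds for rates toward the upper end of "hundreds". The steady-rate assumption is reasonable here because these pings arrive evenly at all hours.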

Maybe for Mozilla’s server budget’s sake we should shut off this stream of data after all. There’s no point in receiving yet another billion copies of the exact same document. The only things that differ are the submission timestamp and submitting IP address.

Another point: it is unlikely that these hosts are participating in this distribution of their free will. The rate of growth, the length of sessions, the geographic spread, and the time of day the duplicates arrive at our servers strongly suggest that it isn’t humans who are operating these Firefox installs. Maybe for the health of these hosts on the Internet we should consider some way to hotpatch these wayward instances into quiescence.

I don’t know what we (Mozilla) should do. Heck, I don’t even know what we can do.

I’ll bring this up on fhr-dev and see if we’ll just leave this alone, waiting for people to shut off their Windows XP machines… or if we can come up with something we can (and should) do.