The diary of a dedicated Ubuntu user that lucked into his dream job working on the Ubuntu team.

Tuesday, August 10, 2010

Can we count users without uniquely identifying them?

Aaaah

Hi all. I'm just back from a rather nice holiday. Well, technically, I'm still on holiday, but there were a few things I wanted to take care of, so I popped in for a few hours of work yesterday and today. I saw that there was this post on Phoronix that triggered me writing a post that I've been meaning to do for the last few weeks, since the Canonical Platform Team got together in Prague three weeks ago, to be exact.

Pre-installed desktops ftw

One of the roles of Canonical relative to Ubuntu is to get Ubuntu pre-installed on as many computers as possible. This is one of the dreams of the Linux desktop. Pre-installs mean end users don't have to fiddle with configurations, installing drivers, etc... (at least when done well) and the users can make an apples to apples comparison between their free desktop and proprietary systems that normally come pre-installed.

Canonical does this by working with OEM customers. OEMs are companies that sell assembled computers to people. One of these customers asked Canonical if there was some way that they could know how many computers that they send out with Ubuntu on them keep Ubuntu on them. The customer's engineer came up with a system where they would create a unique identifier for each Ubuntu computer they sold, and then when the computers requested update info daily, it would send that unique identifier with it.

The customer didn't really want to use a unique identifier, though. Even though the identifier would be anonymous, the customer only wanted to *count* computers, and unique identifiers are for *tracking* (following a user over time). We mulled it over and over, and finally, based on our experience with web browsers, we hit upon a system of non-unique channel identifiers to do the counting. This would make tracking impossible, but of course, tracking is not the goal, counting is.

Non-unique channel identifiers

So, we flashed on this: if each install sent just the model name and the number of times it has updated, systems could be counted, but no unique data would ever be sent to the server. Now, I am not a mathematician, so each time I try to explain why I think this works, it takes me a while. But in the end, everyone is convinced. In fact, Matt Zimmerman ended up writing a test program to prove to himself that it worked. Let me try, stick with me here ...

Every day each computer from the customer sends its model name and the number of times it has already sent this data to the server. So if a model of a computer is called, say, "foo", the first day it sends "foo" and 0 to census.canonical.com. After sending the 0, the computer remembers that it already sent a 0, so it will send a 1 next time. When the server sees the foo.0 in the log data, it essentially starts a new counter for the model foo. The total number of foo.0 pings is the total number of machines of model foo ever activated.

Take one of those foo computers. The next day it will send foo.1, saying "this is a computer of model foo, and this is the 2nd time it has pinged that it's alive". Notice that neither foo nor the number 1 is unique data. Any number of computers will be reporting the exact same model name and increment number. When the server sees a 1 come in, it finds the first counter at 0 and increments that counter to 1. Now it knows the total number of computers ever activated (all the counters), and it can count all the counters that were incremented in a day and thereby know how many computers were online that day.
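To make the bookkeeping concrete, here is a minimal sketch of the server side in Python. It is not the code Canonical runs; the class name `CensusServer` and its methods are made up for illustration, and it assumes pings arrive as a (model, count) pair. Because counters holding the same value are interchangeable, the sketch stores only how many counters sit at each value, per model.

```python
from collections import Counter, defaultdict


class CensusServer:
    """Hypothetical sketch of the counting scheme described above."""

    def __init__(self):
        # model -> Counter mapping a counter value to how many
        # anonymous counters currently hold that value.
        self.counters = defaultdict(Counter)
        # model -> number of pings seen since the last end_of_day().
        self.pings_today = Counter()

    def receive(self, model, count):
        per_model = self.counters[model]
        if count > 0:
            # "Find the first counter at count - 1 and increment it."
            if per_model[count - 1] == 0:
                raise ValueError("no counter at %d for %s" % (count - 1, model))
            per_model[count - 1] -= 1
        # A count of 0 starts a brand new counter.
        per_model[count] += 1
        self.pings_today[model] += 1

    def total_activated(self, model):
        # Every counter was started by exactly one foo.0 ping.
        return sum(self.counters[model].values())

    def end_of_day(self):
        # Return the number of machines seen today, then reset the tally.
        active = dict(self.pings_today)
        self.pings_today.clear()
        return active


server = CensusServer()
for _ in range(3):            # day 1: three new foo machines activate
    server.receive("foo", 0)
day1 = server.end_of_day()
server.receive("foo", 1)      # day 2: two of them come back ...
server.receive("foo", 1)
server.receive("foo", 0)      # ... and one new machine activates
day2 = server.end_of_day()
```

The server never learns which of the day-one machines stayed offline on day two, only that one of the three counters was not advanced.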

Future?

Currently this system is only slated to be used by the specific OEM customer who requested it, and it will be up to the customer to disclose the data they collect as they wish. I wonder if it would be a good thing to install on normal ISOs though, but this would be part of our normal participatory community decision making process. Projects like this make me think that users would like to be counted, so long as they can't be tracked. We'll see how it plays out; it may be something to discuss at UDS if the community feels the data would be useful.

Privacy is hard, guys.

One cannot be sure that you don't collect IP addresses. So say I am the only one with foo.434 and so far I had always the same IP. Now I visit one of my secret lovers (I have many) and log on from her place. Now you know that I have probably moved my laptop.

Even if this was an opt-out, the numbers will not be accurate if a significant number of users deactivate it.

What about others who, for some reason (e.g. to discourage OEMs from doing such counting in the future), will send such beacons from random computers (not from that vendor) to skew the results?

@Tom

But you'll never be the only one with foo.434. There are literally millions of Ubuntu users; it's a fair bet that there's at least a half-dozen other foo.434's, unless you have some kind of crazy rare laptop.

And shouldn't you be more worried about, say, your email provider? They can uniquely identify you (since chances are you're the only one that uses your email account), and it's a known fact that most of them keep logs of IP addresses. Same goes for almost any website that you sign into. An Ubuntu counter system is the least of your worries.

Dieki:

There are literally millions of users? How do you know that? Up till this discussion about how to count users... there's been no actual _counting_ of users. The reality is no one knows how many Ubuntu users are out there. Your _millions_ is completely pulled out of thin air and is a faith-based estimate.

Beyond that, you are not taking into account the time distribution of how Ubuntu systems are installed. If you install in the post-release rush.. sure, you are probably somewhat anonymous in the numbers for a period of time. But if you install 1 month.. 2 months.. 3 months... out from release.. can you be sure that your low counter is not unique? And these late system activations are exactly how OEM installs would trickle in.
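The question about late activations can be explored with a few lines of Python. This is only a sketch under a strong simplifying assumption: every machine of one model pings every single day from its activation onward, so the counter it reports today equals the number of days since it was activated. The function name and the activation figures are invented for the example.

```python
from collections import Counter


def anonymity_sets(activation_days, today):
    """Map each counter value reported today to the number of machines
    reporting it, assuming one ping per machine per day with no gaps."""
    return Counter(today - day for day in activation_days if day <= today)


# Made-up activation pattern: a rush of 500 machines on day 0,
# then a trickle of late activations.
activations = [0] * 500 + [30, 61, 61, 95]
sets_today = anonymity_sets(activations, today=100)

# Counter values that only a single machine reports today.
alone = sorted(value for value, size in sets_today.items() if size == 1)
```

In this toy data the machines activated on days 30 and 95 each report a counter value nobody else shares, while the day-0 crowd hides among 500. Real machines skip days, which spreads counter values differently, so treat this as a way to ask the question, not as an answer to it.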

Please tell me how installing later in the release cycle somehow magically makes the counter in any way more specific to you. People don't update all at the same time, nor do OEMs that ship an OS all ship the OS at the same time. Dell is only more recently moving to 10.04 even, meaning they've been selling 9.10 systems for quite some time.

Ah thanks for the link. I've been pushing people for _any_ description on how the counting is happening for something like 2 years. That article is the first public statement I've seen that attempts to publicly describe how it's done. Much appreciated. Now to see if they will publish the algorithm used to boil down the number.. and to get specifics about the time window over which unique IP addresses are compiled.

I think the best solution would be to send MAC addresses to a census server (maybe even a hash of [MAC + BIOS info] - because some people would surely spoof their MACs). And this should be built into the standard ISO. Why should a user be bothered if his computer would send this data with the sole purpose of knowing how many Ubuntu users are out there? I don't think this invades the user's privacy at all...

You don't need to be a mathmo to explain how it works, instead just reframe it.

Each computer sends a unique ID to the server, then you count the unique IDs.

foo, bar, baz, etc.

But you don't want them to persist, so each time you send a unique ID, you generate a new one and throw away the old. You need to link them together, so the server gets sent both the old and new IDs.

Now on each request the server can pull up the current unique ID, and replace the record with the new one, and so on.

It doesn't actually matter if the IDs are unique, just as long as the server replaces one with the other and leaves the second record alone.

And since it doesn't matter, it doesn't matter if you use a complex algorithm or a counter. In fact, a counter is better, since then the server can infer the next ID itself and you only need to send your current counter to the server and increment.

Obviously you don't increment until you're sure the server got the count, otherwise you'd leave gaps and create odd artifacts in the data.
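The "don't increment until you're sure" rule can be sketched on the client side like this. The names are hypothetical: `CensusClient` is not a real Ubuntu component, and `send` stands in for whatever transport would be used, assumed here to return True only when the server acknowledged the ping.

```python
class CensusClient:
    """Hypothetical client that advances its counter only after an ack."""

    def __init__(self, model, send):
        self.model = model
        self.count = 0      # number of pings the server has acknowledged
        self.send = send    # callable(model, count) -> bool

    def ping(self):
        acknowledged = self.send(self.model, self.count)
        if acknowledged:
            # Safe to move on: the server has recorded this count.
            self.count += 1
        # On failure the same count is sent again next time, so the
        # server never sees a gap in the sequence.
        return acknowledged


# A fake transport for demonstration: the second attempt is lost.
outcomes = iter([True, False, True])
received = []


def flaky_send(model, count):
    ok = next(outcomes)
    if ok:
        received.append((model, count))
    return ok


client = CensusClient("foo", flaky_send)
results = [client.ping() for _ in range(3)]
```

One case this sketch does not handle is a ping that reaches the server while the acknowledgement is lost on the way back; the client would then resend a count the server has already consumed.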

I love Ubuntu, but if I should read sometime in the future that Canonical has supplied a system like the one described in this blog (except when it is opt-in), I will make it a point of honor to never use it again.

Come on, surely you have marketing people among your staff? They should already be crying uncle about all the bad publicity this will get you, regardless of whether it works as you describe or not.

What I think would be cool is if all OEM installs had this, and then maybe Canonical could release stats like:

33% of OEM installs come from Dell
33% of OEM installs come from System76
34% of OEM installs come from ZaReason

Or ya know...whatever it actually is. Then we might see them start vying to be the OEM selling the most Ubuntu machines. While I suspect the latter two probably are already trying that, it'd be incentive for Dell to try to have their Linux sales outstrip the smaller OEMs' Linux sales, and so maybe then they'd start actually advertising the existence of their Linux machines.

Mackenzie,

Why do you believe that Canonical is in a contractual business relationship with all three of those OEMs at the moment? And why do you believe that Dell's sales aren't outstripping the niche OEMs already?

@Jef:

I didn't say they were. I said "would be cool if" -- as in, I'd like a way to see the stats between them. And I was just giving them each 1/3 ;-) But given that Dell hides Ubuntu at a URL you need to have memorized already (not listing it as an option with their usual stuff), I doubt Linux is selling too well there.

Mackenzie: I'm sure Dell sells more than its fair share of _linux_ systems when you look at its full line of products including servers and mobile devices. And I would imagine the _linux_ based Streak will sell its fair share as well even though it's not currently on the homepage. I wonder how many of the _linux_ based unlocked Nokia N900s Dell has sold to date. Wouldn't it be fascinating if they have sold more N900s than System76 has sold netbooks?

Bounce the counts through an anonymizing service, and tracking the source IP of the requests becomes a non-issue. I'm sure Anonymizer would be happy to take Canonical's money to do this (disclaimer: I am a former employee of Anonymizer). Or the tor network would be fine, though I'd guess they'd like a few more nodes added for this.

--- Mad scientist idea disclaimer ---

There's still a signature, and it's important. If there is only one "Bob's Internet Terminal", then it should *not* send its count every day, as this is highly trackable. However, it can send an "i386 machine" count every day. If you feed back a score to the program that indicates how large the crowd it has just claimed to be a part of is, it can add more info. It would go something like:

client->: hi I am a machine, my counter is 0
server->: thank you. You are in a MASSIVE group

client waits 24 hours

client->: hi I am a Dell+OEM installed, my counter is 1
server->: thank you. You are in a LARGE group

client waits 24 hours

client->: hi I am a Dell mini10n OEM installed, my counter is 2
server->: thank you. You are in a MEDIUM group

client will then send Dell mini10n as long as it gets back MEDIUM

scenario 2:

.0 is repeated as above

client->: Hi I am a Generic Ubuntu Box, my counter is 1
server->: Thank you, you are in a MASSIVE group

24 hrs.

client->: Hi I am a Bob's Super Crazy Unique machine, my counter is 2
server->: Thank you, you are in a TINY group

24 hrs.

client->: Hi I am a Generic Ubuntu Box...

and that would continue for a *random* length time of at least 90 days before it feeds back its model string again.
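Here is one way the two scenarios could be turned into a small state machine, written in Python. Everything in it is an assumption made for illustration: the size thresholds behind each label, the 90 to 180 day back-off window, and the names `crowd_label` and `AdaptiveClient`. It steps back to the previous, less specific descriptor after a TINY reply, which is one reading of scenario 2.

```python
import random


def crowd_label(group_size):
    # Thresholds are invented; the comment above does not give any.
    if group_size >= 100000:
        return "MASSIVE"
    if group_size >= 10000:
        return "LARGE"
    if group_size >= 100:
        return "MEDIUM"
    return "TINY"


class AdaptiveClient:
    """Reveals more detail only while the server reports a big crowd."""

    def __init__(self, descriptors, rng=None):
        self.descriptors = list(descriptors)   # generic first, specific last
        self.level = 0
        self.hold_days = 0
        self.rng = rng or random.Random()

    def descriptor(self):
        return self.descriptors[self.level]

    def handle_reply(self, label):
        if label == "TINY":
            # Too identifiable: step back and stay there for a random
            # period of at least 90 days.
            self.level = max(self.level - 1, 0)
            self.hold_days = self.rng.randint(90, 180)
        elif self.hold_days > 0:
            self.hold_days -= 1
        elif label in ("MASSIVE", "LARGE") and self.level + 1 < len(self.descriptors):
            # Plenty of company: try the next, more specific descriptor.
            self.level += 1
        # MEDIUM: keep sending the current descriptor.


dell = AdaptiveClient(["a machine", "Dell OEM installed", "Dell mini10n OEM installed"])
sent = []
for reply in ["MASSIVE", "LARGE", "MEDIUM", "MEDIUM"]:
    sent.append(dell.descriptor())
    dell.handle_reply(reply)
```

Running scenario 1 through it, `sent` ends up as the generic string, then the Dell string, then the mini10n string twice, matching the exchange above.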

This is highly open to abuse, so you can fight that with random challenge and response to aid in at least keeping the abusers honest. Basically, when you get the .0, you feed them back a token that the client should keep. Then in a tiny sampling every day you say "Hey can I get back the token I gave you?" The client will only feed back the token once per year, so there's no chance of the server being able to "track" the user, but the server should have a reasonable chance of getting back valid tokens, and ONLY getting back the tokens it fed to people once per year.

The token might embed the date it was given, so any abusers will be limited to messing with numbers closer to 0, rather than closer to 365, because they'll have to *wait* all year to get those numbers screwed up. If there are abnormal rates of declined token response, then most likely these are abusers and can be removed statistically.

If people are worried about the sample size of the tokens, I suggest that the community runs and audits this service to ensure that it is not being tampered with.

One thing that worries me is that this token could actually be recovered by other means and then tied to the responses, but again, if you anonymize the IP, it's pretty hard to do anything with that other than say that yes, this computer's OS was in fact first booted on day X.

It would be great if there was a way to get accurate statistics on the, let's say, number of Ubuntu installations for a specific country (geoip). This would help tremendously our LoCo efforts as we are now in the dark.

The counting method is simple and obvious (it can even include model info if it's really necessary). To get the number of all-time activations, the server keeps track of the 0's, or activation pings. For each day, the server can track the number of activations (0's), and also the number of normal users active on that day.

This method is not only simpler than the suggested method, it also solves the privacy problems (barring IP address logging, of course). Because the systems do not send a "day #5301" ping, they are not uniquely identifiable. After its "day 0" ping, a system only sends an "I'm alive" 1 ping.
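A rough Python sketch of this two-value scheme follows, assuming the server receives a day label and a ping value of either 0 or 1. The class name `SimpleCensus` and its fields are invented for the example, and counting activations separately from "I'm alive" pings is one possible reading of the description.

```python
from collections import defaultdict


class SimpleCensus:
    """Hypothetical server for the 0 = activation, 1 = alive scheme."""

    def __init__(self):
        self.total_activations = 0
        # day -> {"activations": n, "alive": m}
        self.daily = defaultdict(lambda: {"activations": 0, "alive": 0})

    def receive(self, day, ping):
        if ping == 0:
            # First ever ping from a machine.
            self.total_activations += 1
            self.daily[day]["activations"] += 1
        elif ping == 1:
            # Every later ping looks identical for every machine.
            self.daily[day]["alive"] += 1
        else:
            raise ValueError("ping must be 0 or 1")


census = SimpleCensus()
for ping in [0, 0, 0]:
    census.receive("2010-08-10", ping)
for ping in [1, 1, 0]:
    census.receive("2010-08-11", ping)
```

Compared with the incrementing counter, the server can no longer tell a machine on its second day from one on its five-hundredth, which is the privacy gain the comment is after; the cost is that it cannot tell how long machines stay in use.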