A collection of discussions, links, stories, news and whatever else I find interesting in the fields of computing, information, science, privacy, semantics, mathematics and so on...

Monday, 10 September 2012

Explaining Primary and Secondary Data

One of the confusing aspects of privacy is the notion of whether something is primary or secondary data. These terms emerge from the context of data gathering and processing and are roughly defined thus:

Primary data is the data that is gathered for the purpose of providing a service, or, data about the user gathered directly

Secondary data is the data that is gathered from the provision of that service, ie: not by the user of that service, or, data about the application gathered directly

Pretty poor definitions admittedly and possibly overly broad given all the contexts in which these terms must be applied. In our case we wish to concentrate more on services (and/or applications) that we might find in some mobile device or internet service.

First we need to look more at the architectural context in which data is being gathered. At the highest level of abstraction applications run within some given infrastructure:

Aside: I'll use the term application exclusively here, though the term service or even application and service can be substituted.

Expanding this out more we can visualise the communication channels between the "client-side" and "server-side" of the application. We could subdivide the infrastructure further, but let's leave it as an undivided whole.

In the above we see a single data flow between the client and server via the infrastructure (cf: OSI 7-layer model, and also Tanenbaum). It is this data-flow that we must dissect and examine to understand the primary and secondary classifications.

However the situation is complicated as we can additionally collect information via the infrastructure: this data is the behaviour of the infrastructure itself (in the context of the application). For example, this data is collected via log files such as those found in /var/log on Unix/Linux systems, or the logs from some application hosting environment, eg: Tomcat. In this latter case we have indirect data gathering, and whether this falls under primary or secondary as defined above is unclear, though it can be thought of as secondary if both our previous definitions of primary and secondary can be coerced into a broader "primary" category. (If you're confused, think of the lawyers...)

Let's run through an example: an application which collects your location and friends' names over time. So as you're walking along, when you meet a friend you can type their name into the app and it records the time and location and stores this in the cloud (formerly known as a "centralised" database). Later you can view who you met, where and at what time in all sorts of interesting ways, such as on a map. You can even share this to Facebook, Twitter or one of the many other social networking sites (are there others?).

Each record has the fields userId, friendName, date, time and location, where userId is some identifier that you use to login to the application and later retrieve your data. At some point in time we might have the following data in our database:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
joeBloggs, Jack, 2012-09-10, 12:18, 60°10′24″N 24°56′32″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

This set of data is primary - it is required for the functioning of the application and is all directly provided by the user.
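As a sketch, the primary records above could be modelled like this; the field names are my own illustrative choices, not taken from any real schema:

```python
from dataclasses import dataclass

# Hypothetical structure for one primary-data record; the field names
# are assumptions for illustration only.
@dataclass
class Meeting:
    user_id: str    # login identifier, e.g. "joeBloggs"
    friend: str     # free-text name typed in by the user
    date: str       # date of the meeting
    time: str       # local time of the meeting
    location: str   # coordinates as recorded by the device

record = Meeting("joeBloggs", "Jane", "2012-09-10", "12:15",
                 "60°10′19″N 24°56′29″E")
```

Every field here is supplied by, or directly about, the user: that is what makes the set primary.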

By sending this data we cannot avoid using whatever infrastructure is in place. Let's say there's some nice RESTful interface somewhere (hey, you could code this in Opa!) and by accessing that interface the service gathers information about the transaction, which might be stored in a log file and look something like this:

192.178.212.143, joeBloggs, post, 2012-09-10, 12:14:34.2342
64.172.211.10, arbunkleJr, post, 2012-09-10, 12:16:35.1234
192.178.212.143, joeBloggs, get, 2012-09-10, 12:16:37.0012
126.14.15.16, janeDoe, post, 2012-09-10, 12:17:22.0506

This data is indirectly gathered and contains information that is relevant to the running of the infrastructure.

The two sets of data above are generally covered by the terms and conditions of using that application or service. These T&Cs might include a privacy policy explicitly, or have a separate privacy policy in addition, to cover disclosure and use of the information. Typical uses would cover authority requests, monitoring for misuse, monitoring of infrastructure etc. The consents might also include use of data for marketing and other purposes, for which you will (or should) have an opt-out. The scope of any marketing request can vary but might include possibilities of identification and maybe some forms of anonymisation.

Note, if the service provides a method for you to share via Facebook or Twitter then this is an act you make and the provider of the service is not really responsible for you disclosing your own information publicly.

So that should explain a little about what is directly gathered, primary information and indirectly gathered information. Let's now continue to the meaning of secondary data.

When the application is started, closed or used we can gather information about this. This kind of data is called secondary because it is not directly related to the primary purpose of the application nor to the functioning of the infrastructure. Consent to collect such information needs to be asked for, and good privacy practice suggests that this should be disabled by default. Some applications or services might anonymise the data in the opt-out situation (!). Secondary data collection is often presented as an offer to help with improving the quality of the application or service. The amount of information gathered varies dramatically, but generally application start, stop and abnormal exit (crashes) are gathered, as well as major changes in the functionality, eg: moving between pages or different features. In the extreme we might even obtain a click-by-click data stream including x,y-coördinates, device characteristics and even locations from GPS.
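A minimal sketch of consent-respecting secondary data collection, assuming a hypothetical record_event API and an opt-in flag that, following the good practice above, defaults to off:

```python
import time

# Good privacy practice: secondary collection is disabled by default,
# so the user must explicitly opt in. (Flag and API are illustrative.)
TELEMETRY_OPT_IN = False

events = []

def record_event(kind, detail=""):
    """Append a usage event only if the user has opted in."""
    if not TELEMETRY_OPT_IN:
        return False  # no consent: collect nothing
    events.append({"ts": time.time(), "kind": kind, "detail": detail})
    return True

# With the default opt-out, app-start telemetry is silently dropped:
record_event("app_start", "screen=map")
```

With the default setting nothing is stored; flipping TELEMETRY_OPT_IN to True would start recording start/stop/crash style events like those described above.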

What we can learn from this is how the application is behaving on the device and how the user is actually using that application. From the above we can find out what the status of the device was, the operating system version, type of device, whether the app started correctly in that configuration, from where the user started the app, which screen the app started up in, the accuracy and method of GPS positioning and so on.

So far there is nothing sinister about this, some data is required for the operation of the application and stored "in the cloud" for convenience, some data is collected by the infrastructure as part of its necessary operations and some data we voluntarily give up to help the poor application writers improve their products. And we (the user) consented to all of this.

From a privacy perspective these are all valid uses of data.

Now the problems start in three cases:

exporting to 3rd parties

cross-referencing

"anonymisation"

The above data is fantastic for marketing - a trace of your location over time plus some ideas about your social network (even if we can't directly identify who "Jane" and "Jack" are .... yet!) provides great information for targeted advertising. If you're wondering, the above coördinates are for Helsinki Central Railway Station... plenty of shops and services around there that would like your attention and custom.

How the data is exported to the 3rd party, and at what level of granularity, is critical for trust in the service. Abstracting the GPS coordinates by mapping them to a city area or broader, plus removing personally identifiable information, helps (in this case we remove the userId... hashing may not be enough!). The degree of data minimisation here is critical, especially if we want to reduce the amount of tracking that 3rd parties can do. In the above example probably just sending the location and retrieving an advertisement back is enough, especially if it is handled server-side so even the client device address is hidden.
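A minimal sketch of this kind of data minimisation, assuming decimal coordinates and using simple rounding as a stand-in for mapping points to a city area (the field names and helper are illustrative, not from any real export API):

```python
def coarsen(lat, lon, decimals=1):
    """Round coordinates so they identify only a broad area
    (~11 km at one decimal place) rather than an exact position."""
    return (round(lat, decimals), round(lon, decimals))

def minimise(record):
    """Drop the user identifier and coarsen the location before any
    export to a 3rd party (illustrative field names)."""
    lat, lon = record["location"]
    return {"location": coarsen(lat, lon)}

# Roughly the Helsinki Central Railway Station coordinates from the text:
exported = minimise({"userId": "joeBloggs",
                     "location": (60.1719, 24.9414)})
# exported carries no userId and only a city-level location
```

The design choice here is to minimise server-side, before anything leaves the service, so the 3rd party never sees the precise point or the identifier at all.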

Cross-referencing is the really interesting case here. Given the above data-sets, can we deduce "Joe's" friends? Taking the infrastructure log file entries:

we can see that joeBloggs mentioned a "Jane" and jane123 mentioned a "Funny Joe" at those times. Now we might be very wrong in the next assumption, but I think it is reasonably safe to say that even when we only have the string of characters "Jane" as an identifier, we can make a very reasoned guess that Jane is jane123. Actually the four (ASCII) characters that just happen to spell "Jane" aren't even required, though they do help the semantic matching.
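The guesswork above can be sketched as a toy matcher; the five-minute window and the name-substring rule are my own illustrative assumptions, not a real identification algorithm:

```python
# (userId, mentioned name, time) triples from the primary data above
records = [
    ("joeBloggs", "Jane",      "12:15"),
    ("joeBloggs", "Jack",      "12:18"),
    ("jane123",   "Funny Joe", "12:18"),
]

def minutes(t):
    h, m = map(int, t.split(":"))
    return 60 * h + m

def guess_identities(records, window=5):
    """Toy cross-reference: link a mentioned name to a userId that was
    active within `window` minutes and whose id resembles the name."""
    guesses = {}
    for user_a, name, t_a in records:
        for user_b, _, t_b in records:
            if (user_b != user_a
                    and abs(minutes(t_a) - minutes(t_b)) <= window
                    and name.split()[-1].lower() in user_b.lower()):
                guesses[name] = user_b
    return guesses
```

Even this crude rule links "Jane" to jane123 and "Funny Joe" to joeBloggs; real adversaries have far richer signals (location, repeated co-occurrence) and don't need the name strings at all.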

This kind of matching and cross-referencing is exactly what happened in the AOL Search Data Leak incident. Which neatly takes me to anonymisation, where just because some identifier is obscured doesn't mean that the information doesn't exist.

This we often see with hashing of identifiers: for example, our app designer has been reading about privacy by design and has obscured the identifiers in the secondary data using a suitably randomly-salted hash of sufficient length to be unbreakable for the next few universes - and we've salted the IP address too!
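A sketch of such salted hashing using SHA-256; the salt handling is simplified for illustration, but it shows why the resulting pseudonyms stay consistent:

```python
import hashlib
import os

# One random salt for the whole deployment (illustrative choice):
# unguessable, so the hashes are practically irreversible.
SALT = os.urandom(16)

def pseudonymise(value, salt=SALT):
    """Replace an identifier or IP address with a salted hash.
    Irreversible in practice, but the same input always yields the
    same pseudonym - which is exactly what enables tracking."""
    return hashlib.sha256(salt + value.encode()).hexdigest()

a = pseudonymise("192.178.212.143")
b = pseudonymise("192.178.212.143")
# a and b are identical: the address is hidden, the linkability is not
```

Note the design trap: a fixed salt protects against recovering the value, but does nothing to break the consistency of the pseudonym over time.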

We still have a consistent hash for each IP address and user identifier, so we can continue to track, albeit without being able to recover who made the request or from where it came. Note however the content line of the first entry:

How many Onkia Luna 700 2Gb owners running v1.2.3.4 of WP7 with version 1.1 of our application are there? Take a look at Panopticlick's browser testing to see how unique you are based on web-browser characteristics.
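To get a feel for why such a combination is identifying, here is a back-of-the-envelope estimate; the attribute frequencies are entirely made up for illustration:

```python
# Made-up frequencies of each attribute in a hypothetical user
# population. Multiplying them (assuming rough independence) shows how
# quickly ordinary attributes combine into a near-unique fingerprint.
attribute_freq = {
    "device=Onkia Luna 700 2Gb": 0.01,   # 1% of users
    "os=WP7 v1.2.3.4":           0.05,   # 5% on this exact OS build
    "app=v1.1":                  0.30,   # 30% on this app version
}

population = 1_000_000
expected_matches = float(population)
for freq in attribute_freq.values():
    expected_matches *= freq

# ~150 users out of a million share all three attributes; add a
# timestamp or a coarse location and the set shrinks to one
```

This is the same arithmetic that underlies browser-fingerprinting demonstrations like Panopticlick: each attribute is common, the combination is not.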

And then there are timestamps... let's go back to cross-referencing against our primary data and infrastructure log files, and we can be pretty sure that we can reconstruct who that user is.

We could add additional randomness by regenerating identifiers (or, in some cases, the hash salt) for every session; this way tracking is only possible over a particular period of usage.
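A sketch of per-session regeneration, reusing the salted-hash idea but discarding the salt at each session boundary (the class and its API are illustrative):

```python
import hashlib
import os

class SessionPseudonyms:
    """Regenerate the hash salt for every session, so pseudonyms are
    only linkable within a single session of usage."""

    def __init__(self):
        self.new_session()

    def new_session(self):
        # throwing the old salt away severs the link to past sessions
        self.salt = os.urandom(16)

    def pseudonym(self, identifier):
        return hashlib.sha256(self.salt + identifier.encode()).hexdigest()

p = SessionPseudonyms()
first = p.pseudonym("joeBloggs")
same_session = p.pseudonym("joeBloggs")
p.new_session()
next_session = p.pseudonym("joeBloggs")
# first == same_session, but next_session differs:
# tracking stops at the session boundary
```

Within one session the pseudonym is stable (so the service still works), but across sessions the trail breaks, which is the trade-off the text describes.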

So in conclusion: we have presented what is meant by primary and secondary data, stated the difference between directly gathered data and indirectly gathered data, and explained some of the issues relating to the usage of this data.

Now there are some additional cases such as certain kinds of reporting, for example, music player usage and DRM cases which don't always fall easily into the above categories unless we define some sub-categories to handle this. Maybe more of that later.
