Differential privacy — Everything you need to know!

With iOS 10.3, Apple is hoping to start analyzing user data to improve iCloud. Typically, large internet companies like Google and Facebook want to vacuum all our data up into the cloud so they can feed their search indexes, social graphs, and artificial intelligence projects. They claim to anonymize some of it by stripping off personal identifiers but, at sufficient volume, data still paints patterns precise enough that identity can sometimes still be determined.

Apple's trying to do it in a way that maintains effective anonymity for their users. And one of the ways they're doing that is with a technique called "differential privacy".

Here's what Apple says about it:

Apple would like your help to improve our products and services by using, in a privacy preserving manner, data from your iCloud account.

Analysis of data from your iCloud account is undertaken only after the data has undergone privacy preserving techniques such as differential privacy. Analysis of such data will allow Apple to improve intelligent features and services such as Siri and other similar or related services.

You may choose to turn off iCloud Analytics at any time. To do so, you can open Settings, tap Privacy, tap Analytics and set "Share iCloud Analytics" to off.

How does differential privacy work?

Unveiled at WWDC 2016, differential privacy works by adding a level of noise to data at the point of collection — like when you add a new word to the auto-predict dictionary in the QuickType keyboard.

Once large amounts of data from large numbers of users are collected, statistical analysis is used to "see" the patterns through the noise. It's impossible to tell who was the source of any particular piece of data, but the frequency, popularity, and other meta-information about that data becomes usable by Apple to improve services.

That's... obscure. Example, please!

Let's say Apple wanted to know whether to suggest "Wars" or "Trek" more often as an auto-complete for "Star", but it wants to make sure not to start any family feuds — because sci-fi fans are passionate!

With differential privacy, when collecting the data, Apple flips a virtual coin. If the coin comes up "heads", it records the true answer; if it comes up "tails", it flips again and records a random answer instead — "Wars" on heads, "Trek" on tails. So, out of a million recorded answers, roughly half would be pure noise, and it would be impossible to know whether any individual answer was the truth or a "lie".

By collecting enough responses from enough people, and statistically subtracting out the expected noise, Apple could get really close to the true answer, again without ever knowing exactly which individuals answered which way.
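Apple hasn't published the exact mechanism it uses, but the coin-flip idea above is the classic "randomized response" technique, and it can be sketched in a few lines of Python. Everything here is illustrative — the 60% "true" preference, the function names, and the sample size are all assumptions for the simulation, not Apple's actual parameters.

```python
import random

def randomized_response(true_answer, rng):
    """First coin: heads -> record the truth. Tails: flip a second coin
    and record a random answer ("Wars" = True, "Trek" = False)."""
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

def estimate_true_fraction(responses):
    """Observed fraction = 0.5 * true_fraction + 0.25, so invert that."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

rng = random.Random(42)
true_fraction = 0.6  # assumption: 60% of simulated users prefer "Wars"
true_answers = [rng.random() < true_fraction for _ in range(1_000_000)]
noisy = [randomized_response(a, rng) for a in true_answers]
print(f"estimated: {estimate_true_fraction(noisy):.3f}")
```

Run it and the estimate lands very close to 0.6, even though any single recorded answer is a coin flip away from being a "lie" — which is exactly the trade the technique makes: individual answers are deniable, aggregate patterns survive.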

Less noise could be added in areas of lower popularity and frequency, and more noise in areas of higher popularity and frequency. The system could also automatically opt out anyone who was contributing too much data, to protect their privacy.
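That "dial" between noise and accuracy can be made concrete by replacing the fair coin with a biased one. Below is a hedged sketch, not Apple's implementation: `truth_prob` is a made-up knob for how often the truth is recorded, and the de-biasing formula simply inverts the expected mix of truth and noise.

```python
import random

def noisy_collect(true_answer, truth_prob, rng):
    """Record the truth with probability truth_prob; otherwise record
    a fair coin flip. Lower truth_prob = more noise = more privacy."""
    if rng.random() < truth_prob:
        return true_answer
    return rng.random() < 0.5

def debias(responses, truth_prob):
    """observed = truth_prob * t + (1 - truth_prob) * 0.5; solve for t."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - truth_prob) / 2) / truth_prob

rng = random.Random(7)
true_fraction = 0.6  # assumed underlying preference, for the simulation
estimates = {}
for truth_prob in (0.9, 0.5, 0.1):  # least noisy -> most noisy
    noisy = [noisy_collect(rng.random() < true_fraction, truth_prob, rng)
             for _ in range(100_000)]
    estimates[truth_prob] = debias(noisy, truth_prob)
    print(f"truth_prob={truth_prob}: estimate={estimates[truth_prob]:.3f}")
```

All three settings recover roughly the same answer, but the noisier ones wobble more for the same number of users — which is why, as described above, a system might spend less noise on rare data and more on popular data where the volume can absorb it.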

That's just one example of how noise and statistics can be combined to protect privacy while still providing useful data. Apple wants to start using it to improve the quality of Siri responses and related services as well.

Privacy different

For end users, the benefits of crowd-sourcing are tangible. If millions of people started typing or saying "shway" for "cool", Apple could quickly add it to its dictionaries, without ever needing, or caring, to know whether you personally typed or said it.

In other words, differential privacy lets Apple provide most of the crowd-sourced, artificially intelligent services we want, but without the intrusive data collection practices that creep us out.