Apple keeps a record of Siri queries but says it anonymizes the data.

Remember that time when you asked Siri about the nearest place to find hookers? Or perhaps the time you wanted to know where to find the best burritos at 3am? Whatever you've been asking Siri since its launch in late 2011 is likely still on record with Apple, according to a report Friday from our friends at Wired. Apple spokesperson Trudy Muller told Wired that Apple stores Siri queries on its servers for "up to two years," though the company says it makes efforts to anonymize the data.

"Apple may keep anonymized Siri data for up to two years," Muller said. "Our customers’ privacy is very important to us."

Why does Apple have your Siri queries on record in the first place? Remember, Siri doesn't just operate locally on your iPhone or iPad—when you ask it a question, your voice query is sent to Apple's servers for processing before the answer—a Google search, an answer from Wolfram Alpha, a Yelp result, etc.—is sent back. That's why an Internet connection is required to use Siri; if you have no Wi-Fi or cellular signal, you can't use Siri to perform any actions.

According to Wired, Apple generates "random numbers to represent the user and it associates the voice files with that number" when your Siri data is sent to the server. This string of numbers isn't associated with your Apple ID or e-mail, but it does represent your device when Apple is processing the query. "Once the voice recording is six months old, Apple 'disassociates' your user number from the clip, deleting the number from the voice file. But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes," Wired wrote.
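
Wired's description amounts to a simple retention state machine: tag each clip with a random number, strip the number after six months, delete the clip entirely after two years. As a rough illustration only (Apple's actual implementation is unknown; `QueryStore` and the exact windows are invented here), it might look like this:

```python
SIX_MONTHS = 182 * 24 * 3600   # disassociation window, in seconds (approximate)
TWO_YEARS = 730 * 24 * 3600    # total retention window, in seconds (approximate)

class QueryStore:
    """Toy model of the retention scheme Wired describes."""

    def __init__(self):
        # each record: random user number, the voice clip, and when it was stored
        self.records = []

    def store(self, clip, user_number, now):
        # The clip is tagged with a random number, not an Apple ID or e-mail.
        self.records.append({"user_number": user_number, "clip": clip, "stored_at": now})

    def age_out(self, now):
        kept = []
        for rec in self.records:
            age = now - rec["stored_at"]
            if age >= TWO_YEARS:
                continue                   # deleted outright after two years
            if age >= SIX_MONTHS:
                rec["user_number"] = None  # "disassociated": the number is stripped
            kept.append(rec)
        self.records = kept
```

Run `age_out` on a seven-month-old clip and the user number is gone but the audio remains; run it again past the two-year mark and the record disappears entirely.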

The question came up thanks to pressure from American Civil Liberties Union lawyer Nicole Ozer, who thinks Apple needs to post its Siri privacy policy online so users are fully informed about what's happening to their information. Indeed, although most iOS users are likely only using Siri to set up reminders or send tweets, people should be cautious about using Siri to send or dictate any sensitive information.

Many have been aware of this since Siri first came out thanks to the Internet connection requirement, but Apple's acknowledgment that it keeps the data is a new reminder about the potential privacy risks. After all, our last poll on whether Ars readers would use Siri on OS X showed that 52 percent would at least give it a shot—people tend to conduct even more sensitive business on their computers than their mobile devices, so the data retention aspect is an important one to keep in mind.

Muller pointed out, however, that the identifiers are deleted immediately—"along with any associated data"—when a user turns Siri off on his or her device. (You can do this by going to Settings > General > Siri on a supported iOS device.)

Jacqui Cheng
Jacqui is an Editor at Large at Ars Technica, where she has spent the last eight years writing about Apple culture, gadgets, social networking, privacy, and more. Email: jacqui@arstechnica.com / Twitter: @eJacqui

46 Reader Comments

I think that's somewhat reasonable. They might use it internally for analytics purposes, perhaps to see what kinds of things people try to query that Siri doesn't yet support, and to see where Siri trips up. Storing individual users' data makes sense, because you'd want to know if certain problems were general or user-specific (a heavy accent, for example).

I wonder how difficult it would be to use a voiceprint to match up all the queries by a particular person to a more recent (non-anonymized) voice sample. I could imagine that this anonymized data is not really anonymous. Just harder to search.

Anonymous just means "not identified by name." It does not mean "completely unidentifiable."

In other words, to check that it continues to understand and remain accurate across as many vocal profiles, pronunciations, and situations as possible. Obviously, the speed at which it can process each command is also important.

With more test data, regressions become easier to detect; in particular, this can help in catching edge cases.
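
The regression-testing point can be sketched concretely: keep a set of clips with known-good transcripts, run both the old and new recognizer over them, and flag any clip where accuracy got worse. Everything here is hypothetical (`word_error_rate` below is a crude positional stand-in for the real edit-distance metric), but it shows the shape of the check:

```python
def word_error_rate(reference, hypothesis):
    """Crude word-level error rate: fraction of reference words the hypothesis
    got wrong, compared position by position. Real scoring uses edit distance;
    this simplification keeps the example short."""
    ref, hyp = reference.split(), hypothesis.split()
    errors = sum(1 for i, word in enumerate(ref) if i >= len(hyp) or hyp[i] != word)
    return errors / len(ref)

def find_regressions(test_clips, old_engine, new_engine, tolerance=0.0):
    """Return the clips where the new recognizer does worse than the old one.

    test_clips: list of (clip_id, known_transcript) pairs
    old_engine / new_engine: callables mapping a clip_id to a transcription
    """
    regressions = []
    for clip, transcript in test_clips:
        old_wer = word_error_rate(transcript, old_engine(clip))
        new_wer = word_error_rate(transcript, new_engine(clip))
        if new_wer > old_wer + tolerance:
            regressions.append(clip)
    return regressions
```

With a large pool of retained, anonymized queries, the list fed to `find_regressions` covers far more accents and phrasings than any hand-built test set, which is exactly why edge cases surface.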

"Apple keeps a record of Siri queries but says it anonymizes the data."

Yeah why is this stored at all?

"Best Taco in San Jose" WHY is that worth storing at all?

Just in case the FBI wants to know I ate a Taco in San Jose California + Date + Time + Location?

A. The obvious conclusion is that this sort of data improves the service. Occam's razor.

B. No one said anything about the FBI; let's not have a knee-jerk reaction and turn this into a government conspiracy. Since this is not directly correlated with an Apple ID or IMEI, it would be pretty hard for a third party to link this data back to a person, and even then, all they would have access to is six months of a typical person asking about restaurants and setting reminders to pick up milk on the way home. Anyone involved in illicit activities is not likely to conduct that business through a smartphone of any kind; they will be using a burner.

I wonder how difficult it would be to use a voiceprint to match up all the queries by a particular person to a more recent (non-anonymized) voice sample. I could imagine that this anonymized data is not really anonymous. Just harder to search.

With the sheer amount of data they likely have, I would imagine it would be extremely difficult. While the size of the needle would remain consistent, the haystack gets bigger and bigger. It's one thing to analyze a voice sample to determine words and phrases and even intent.

It's an entirely different problem to analyze samples to create a unique profile of the user to be utilized like a fingerprint. This doesn't even take into account that voices can be drastically affected by environmental factors, social factors, mood, and even that many people are great mimics.

It's an interesting problem, for sure, but more likely to be explored in the academic realm rather than any business domain.

I think his point was "why are people worked up about Apple keeping anonymized stuff for two years when Google keeps non-anonymized stuff forever?"

On a different note, if anyone doesn't understand why Apple (or Google, for that matter) wants to keep these kinds of logs around, I recommend reading In the Plex. It goes into some detail about how Google mines its search log history to improve the product.

"Apple keeps a record of Siri queries but says it anonymizes the data."

Yeah why is this stored at all?

"Best Taco in San Jose" WHY is that worth storing at all?

Just in case the FBI wants to know I ate a Taco in San Jose California + Date + Time + Location?

Minus million.

Apple is keeping the data because they think it's valuable in some way. Pretty simple. Frankly, government employees aren't sophisticated enough to manipulate the entire computer industry in the way you're alluding to.

That said, I don't use Siri and I no longer search with Google, and this data retention is the reason why. If you don't like this just stop using it.

Unless you're driving, I find Siri a complete waste. It's much slower than just scheduling my own appointments, sending texts, etc., due to the slowness of network processing or the multitude of errors it makes.

Now the Google Search app is not that way. Makes me want to switch to Android.

Google Now, Siri's rough equivalent on Android devices, likely keeps a similar history, and Google openly says it's there to help improve voice recognition, as others have already said. Apple probably does the same with that historical data. I'm no tin-foil-hat kinda guy, but I have my privacy concerns as well. Just use common sense (which apparently isn't as common as it used to be) when putting stuff online. Duh.

I would prefer that the processing happened locally but that is not practical at this point.

If the Android voice input can be made to work reliably offline (which, on my Droid 4 with stock JB, it does), then there's no reason Siri can't be made to work offline too. For things like making appointments, writing texts, and stuff that doesn't really need an Internet connection, it makes sense.

"Remember that time when you asked Siri about the nearest place to find hookers?"

You women can't have it both ways: complaining about rape or being treated as meat, and then generalizing the opposite sex as cavemen with cash or an expense account.

Either my sarcasm meter is broken, or you'll need to point out where the hookers are identified as females in the sentence you quoted -- because not all hookers are female, and not all people who buy services from hookers are male.

Anonymous just means "not identified by name." It does not mean "completely unidentifiable."

I'm not sure I agree with your definition of anonymous; strictly by that definition a photograph of a person's face would qualify. Also a phone number, license plate, or other unique identifier.

There is plenty of evidence that "anonymized" rarely means the data cannot be linked back to your name. The term generally means they removed all the parts that make it easy to link, but usually is used with the implication that privacy is being protected. I think it's useful to question that implication.
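
That worry can be made concrete: if a voiceprint-like embedding can be computed from each stored clip, the data is pseudonymous rather than anonymous, because a known voice sample can be matched against the whole archive. A toy sketch (the three-dimensional embeddings and the threshold are invented for illustration; real speaker identification uses learned features):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def reidentify(known_sample, anonymized_clips, threshold=0.95):
    """Link 'anonymized' clips back to a person by matching voice embeddings.

    known_sample: embedding of a non-anonymous voice sample
    anonymized_clips: list of (clip_id, embedding) pairs with no user number
    """
    return [clip_id for clip_id, emb in anonymized_clips
            if cosine(known_sample, emb) >= threshold]
```

The point is not that this is easy at Apple's scale (the comments below argue it isn't), only that stripping the random user number does not, by itself, make the remaining signal unidentifiable.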

"Remember that time when you asked Siri about the nearest place to find hookers?"

You women can't have it both ways complaining about rape or being treated as meat and then generalize the opposite sex like cavemen with cash or an expense account.

Either my sarcasm meter is broken, or you'll need to point out where the hookers are identified as females in the sentence you quoted -- because not all hookers are female, and not all people who buy services from hookers are male.

If you're a woman and need to pay for a hooker, you're doing it wrong.

A. The obvious conclusion is that this sort of data improves the service. Occam's razor.

B. No one said anything about the FBI; let's not have a knee-jerk reaction and turn this into a government conspiracy. Since this is not directly correlated with an Apple ID or IMEI, it would be pretty hard for a third party to link this data back to a person, and even then, all they would have access to is six months of a typical person asking about restaurants and setting reminders to pick up milk on the way home. Anyone involved in illicit activities is not likely to conduct that business through a smartphone of any kind; they will be using a burner.

This. I work in "Big Data" (I HATE that euphemism, BTW), and while my perspective is different than most, I don't really understand much of the paranoia around storing certain data. Hell, I work in retail, and I *suppose* I could go in and see what John Q. Public is buying, but that's a waste of my time, not to mention fairly difficult given the way the minimal personal data we have is kept separate from the transaction data. All of our data is anonymized to a very necessary extent, and any personal data we hold is expressly for trying to match up customers using different forms of ID.

People who work in this field have absolutely no interest in an individual's information, especially those with as much data as Apple or Google. We want aggregated information that gives us trends and patterns to better our services. We have other applications, such as CRMs, that handle sending marketing materials to individuals, where there *might* be personalization. Even then, the data fed into them is analyzed en masse beforehand. IMHO, any paranoia about "Big Data" is rooted partly in self-centeredness and egoism.

Like anything else, it should be opt-in. If you want to provide this information for Apple to store, then you should give explicit instruction to do so.

Siri is an opt-in service, and Apple do provide a plain language privacy disclaimer when you first use it (and at any time in the Siri settings menu). The disclaimer clarifies that turning off Siri will delete your User Data, although they may hold onto an anonymized version of your data for a while.

My main problem with Siri is that it is tied to the keyboard voice recognition. I'd love to be able to use that feature without turning on voice assistant functionality (which I don't have a use for).

This. I work in "Big Data" (I HATE that euphemism, BTW), and while my perspective is different than most, I don't really understand much of the paranoia around storing certain data. Hell, I work in retail, and I *suppose* I could go in and see what John Q. Public is buying, but that's a waste of my time, not to mention fairly difficult given the way the minimal personal data we have is kept separate from the transaction data. All of our data is anonymized to a very necessary extent, and any personal data we hold is expressly for trying to match up customers using different forms of ID.

People who work in this field have absolutely no interest in an individual's information, especially those with as much data as Apple or Google. We want aggregated information that gives us trends and patterns to better our services. We have other applications, such as CRMs, that handle sending marketing materials to individuals, where there *might* be personalization. Even then, the data fed into them is analyzed en masse beforehand. IMHO, any paranoia about "Big Data" is rooted partly in self-centeredness and egoism.

Marketing materials sent to individuals obviously need to include some personal information, or else you couldn't get them to those individuals.

Customer data that is sold ("shared with our partners") is probably not anonymized either, because that would ruin its value. I don't know whether your company does this (or even which company you're at), but I can't imagine you're claiming that no one does it.

This. I work in "Big Data" (I HATE that euphemism, BTW), and while my perspective is different than most, I don't really understand much of the paranoia around storing certain data. Hell, I work in retail, and I *suppose* I could go in and see what John Q. Public is buying, but that's a waste of my time, not to mention fairly difficult given the way the minimal personal data we have is kept separate from the transaction data. All of our data is anonymized to a very necessary extent, and any personal data we hold is expressly for trying to match up customers using different forms of ID.

People who work in this field have absolutely no interest in an individual's information, especially those with as much data as Apple or Google. We want aggregated information that gives us trends and patterns to better our services. We have other applications, such as CRMs, that handle sending marketing materials to individuals, where there *might* be personalization. Even then, the data fed into them is analyzed en masse beforehand. IMHO, any paranoia about "Big Data" is rooted partly in self-centeredness and egoism.

Your privacy may not be important, but there may be some person with control over your life whose privacy is. For example, some search engine figures out which gay bars Mitch McConnell frequents, a person gets that data, and then influences legislation by blackmailing the senator. (BTW, I'm not implying McConnell is gay; I just used him as an example.)