Posted by Soulskill on Friday April 19, 2013 @03:48PM
from the but-i-didn't-give-her-my-data dept.

New submitter LeadSongDog writes with news that Apple has provided information on how long it holds onto voice search data used by its digital assistant software Siri. Speaking to Wired, an Apple representative said the data is kept for two years after the initial query.
"Here’s what happens. Whenever you speak into Apple’s voice activated personal digital assistant, it ships it off to Apple’s data farm for analysis. Apple generates a random numbers to represent the user and it associates the voice files with that number. This number — not your Apple user ID or email address — represents you as far as Siri’s back-end voice analysis system is concerned. Once the voice recording is six months old, Apple “disassociates” your user number from the clip, deleting the number from the voice file. But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes."
This information came in response to requests for clarification of Siri's privacy policy, which was not very clear as written. The director of privacy group Big Brother Watch said, "There needs to be a very high justification for retaining such intrusive data for longer than is absolutely necessary to provide the service."
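The retention schedule described in the quote can be sketched in a few lines of Python. This is only an illustration of the stated policy, not Apple's actual system: the `VoiceClip` type, the function names and the exact day counts are assumptions made up for the example.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed durations: the article says "six months" and "up to 18 more months";
# the exact day counts below are illustrative.
DISASSOCIATE_AFTER = timedelta(days=183)
DELETE_AFTER = timedelta(days=730)


@dataclass
class VoiceClip:
    recorded_at: datetime
    user_number: Optional[str]  # random identifier, not an account ID or email
    audio: bytes


def new_user_number() -> str:
    """Generate a random identifier unrelated to any account detail."""
    return secrets.token_hex(16)


def apply_retention(clips, now):
    """Return the clips that survive at time `now`, stripping old identifiers."""
    kept = []
    for clip in clips:
        age = now - clip.recorded_at
        if age >= DELETE_AFTER:
            continue  # past roughly two years: drop the clip entirely
        if age >= DISASSOCIATE_AFTER:
            # keep the audio, remove the link to the user number
            clip = VoiceClip(clip.recorded_at, None, clip.audio)
        kept.append(clip)
    return kept
```

Note that the sketch only removes the identifier; the audio itself is untouched, which is the part commenters below argue about.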

Well, considering how the DEA is complaining that they can't read encrypted iMessages, and Apple got rid of Google Maps as default partly because Google kept demanding more personally identifying user data, I don't think we should assume Apple always rolls over on stuff like this.

As far as I know, Google wasn't getting much out of the deal. Apple was getting all of Google's map data and Google was getting some data about traffic patterns. Google wanted more branding. Apple wanted turn-by-turn directions, something that Android used to distinguish itself as better than Apple.

Where are you getting this story that Google was demanding more personal info from Apple?

is unfortunately in the eye of the beholder... The US government's reliance on its ability to access private data has helped so much with the Boston suspects, we will wrest these gains into the intrusion of privacy from their cold, dead hands.

Joking I hope? We have no idea how they claim to have found these guys yet.

The Government has relied on people turning in pictures and information "as far as we know" and did not find these guys by spying. I'm not claiming the Govt won't use that as an excuse, I'm saying it's untrue so you should not buy it if they do.

From what I can tell, disabling Google History doesn't seem to come with a promise that Google doesn't keep that data somewhere else. What they say they'll do is stop using your History to present targeted advertising for you across their services, or you can choose to delete individual items from your search history, that way they aren't considered when it comes to determining your interests and the like. What they very carefully seem to avoid saying is that they completely delete your queries from all of their systems, so I wouldn't be surprised if they're still using them in some sort of anonymized form for product improvement purposes, tracking trends, or other things of that sort.

Actually, turning off search history doesn't even do as much as you say. They still use everything you enter into their services, every keystroke, how long you spent looking at a page, when you searched and from where. They use all of that and more to target ads (which many of us never see anyway thanks to Adblock Plus).

Turning off search history hides this data from YOU. They still have it. They still have it associated with your account, and even if you are logged out it's associated with your IP address.

Google ads are whitelisted for me. Plain text. On topic. Unintrusive, and they help out the company giving me good free shit. Every great once in a while I actually click on an ad because it is something I want.

Perfectly reasonable. Myself, I've never seen an advertisement that was legitimately helpful. I'm dubious that there ever could be such a thing because advertising is fundamentally an adversarial relationship between the advertiser and the target of the ad (you): you have money that you want to keep, or get the most value for when you do spend it; they want to give you as little as possible while taking as much of your money as they can. You are fighting each other, you have competing interests. You can see why there's a huge incentive for them to lie, or get as close to lying as they legally can, and emotionally manipulate you in their pursuit of your dollars. I find attempts at such manipulation repugnant, which is probably why I walk around most of the day with a mild nauseated sensation. Still, I'd choose that over the syrupy haze of blissful ignorance.

Google's official ads might be the least intrusive, but their disguised ads are rather pernicious, IMO. For example, every product you are shown when using Google Shopping is a paid product advertisement, every single product. They are ALL ads, and nowhere is this disclosed clearly. They are trying to pass it off as a store like Amazon (which has plenty of hidden ads too, but they at least make a passing nod towards identifying them) but it's more like the yellow pages. You have to pay Google for your product to appear there.

1) Allow absolutely everyone to register for free. Then put up with all the spammers who place fake products and prices in order to get people to their sites.
2) Charge a tiny fee that prevents spammers from overrunning the place.

This was similar to the idea that charging a penny per email would run spammers out of business. Only, there is no way to charge for email, given the way the Internet works. But, there was the opportunity to do it with Shopping.

here's another one to use; I've been using it for about a month and like it. Combines ixquick with Google results, and offers additional goodness, such as SSL, no cookies, proxy. (One search engine I miss is Kartoo - if it was still around it would be great along with this kind of anonymized, trackless search.) It also avoids handing over referrer info - which can be used to track you regardless of IP, depending on your settings.

...and have since 2007. These two great blog posts cover the details: "Taking steps to further improve our privacy practices" http://googleblog.blogspot.co.uk/2007/03/taking-steps-to-further-improve-our.html [blogspot.co.uk] and "How long should Google remember searches?" http://googleblog.blogspot.co.uk/2007/06/how-long-should-google-remember.html [blogspot.co.uk]. An example from it: "By anonymizing our server logs after 18-24 months, we think we’re striking the right balance between two goals: continuing to improve Google’s services for you, while providing more transparency and certainty about our retention practices." Google are surprisingly forthcoming about how and what they do with your data, which clashes sharply with Apple (who pretend they don't) or Microsoft (who run hate campaigns).

While I'm glad that they make that public, what they're NOT saying there is that they delete our data eventually. As such, if I make a query that can be tied back to me, my other queries can likely be tied back to me as well, since they'll share an anonymized ID between them.

Granted, voice data is MUCH more sensitive than plaintext, but I'm still a bit disappointed that Google isn't promising to delete our queries entirely after a period of time, rather than merely anonymizing them. Anonymization is a good

Well, I've been searching since I made the comment, and the best I've found so far is this thread [google.com] where a Google rep confirms that for every image search they keep a thumbnail of the item that was clicked on, as well as the IP address for 9 months (after which it gets anonymized), and identifying information for the cookie associated with you for 18 months (after which it gets anonymized and the IP address gets partially destroyed). What that means is that they never fully destroy the data, and that if the query was self-identifying in some way, someone could still tie all of the queries you made together since they would still be associated with the cookie data, even if that cookie data is no longer associated with you.
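A minimal sketch of why anonymized-but-not-deleted logs stay linkable. The thread does not say how Google anonymizes cookies or "partially destroys" IP addresses, so hashing the cookie ID and zeroing the last IPv4 octet are assumptions picked for illustration, and every name below is hypothetical.

```python
import hashlib
import ipaddress
from collections import defaultdict


def truncate_ip(ip: str) -> str:
    """Zero the last octet of an IPv4 address (one assumed form of partial destruction)."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def pseudonymize_cookie(cookie_id: str) -> str:
    """Replace a cookie ID with a stable token; the same input always gives the same token."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()[:12]


def anonymize_log(records):
    """records: iterable of (cookie_id, ip, query) tuples."""
    return [(pseudonymize_cookie(c), truncate_ip(ip), q) for c, ip, q in records]


def group_by_token(anonymized):
    """Group queries by pseudonymous token, as anyone holding the log could."""
    groups = defaultdict(list)
    for token, _ip, query in anonymized:
        groups[token].append(query)
    return dict(groups)
```

The original cookie ID is gone after `anonymize_log`, but every query from the same cookie still shares one token, so a single self-identifying query drags the rest along with it.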

Take it with a grain of salt, however, since that's from back in 2011. As we all know, these tech companies have made big strides to protect our privacy better since then. Wait, no, I have that backwards.

I know you're angry with Apple and confused right now. You bought an Apple phone and Apple still sold you to Google. You paid a mark-up of 50% on a $650 phone just to be sold for a measly $3.20. Who would have thought you were so cheap?

Yeah, I find myself not minding this so much. I do think electronic records should somehow "sunset" at some point, even if it's after a few years, for various reasons. However, I don't see what the big deal is whether Apple retains the data for 1 month vs. 6 months vs. 2 years.

When I used Siri for the first time and realized it was sending my questions to a datacenter somewhere, I had an immediate reaction of "that's a bit creepy and disconcerting." But once the data is sent out to the datacenter for processing, you've already opened the door for the data to be misused. Once you assume that the data will be stored for some amount of time, you increase the chances for the data to be misused. But if you extend the time that the data is stored for by a few months or a year, I don't feel like you're greatly increasing your exposure.

What holding on to the data actually does is give Apple some time to process and analyze the data, improving the speech recognition and heuristic models. I'd expect them to want to keep it for a couple years, especially since Siri is new and they're probably still developing their methods for analyzing the data. In this sort of situation, having more data means being able to create a more accurate analysis.

It's becoming exceedingly difficult to keep your search history private. All the major search companies keep it, Apple keeps Siri searches, etc.
DuckDuckGo, I believe, keeps things as anonymous as you can get. There are also some hacks you can do if you are careful; privacy mode/incognito is a start, but even then it's easy to tip your hand. If you are truly doing something crazy, use a bootable USB and do your searches from a random public wifi hotspot.

You could say that about using Tor or FreeNet. However, search engines are a commodity, and anyone can use StartPage without a complicated setup; it's just a Web site. It's even in the list of search engines that IE asks you to select from, right out of the box. Install fresh Windows, select the search engine, and you are done. If that is suspicious, you are in good company.

The Siri story in the link was from June 2012. You do know software can be improved? Or do you think Steve Woz was lying?

Well, Siri used Wolfram Alpha before Apple bought it, it uses it now, and there is no indication that it stopped using it any time in between. So even if the problem Woz claims actually existed, it was with Wolfram Alpha, and not with Siri, let alone with Apple "ruining" it.

I am getting tired of Apple's continuing privacy abuses: first they sell their customers to the highest bidder, now this.

Honest question: when did Apple sell anything related to their customers to the highest bidder? I can't find any information about anything along those lines, yet I've seen you repeat it at least twice in here.

Not only would a deal like that have to be disclosed in SEC filings (i.e. it wouldn't stay private), but if Apple had sold their customers' data to Google for that price, it would have bankrupted Google dozens or hundreds of times over, since they have nowhere close to that much money on hand. Google's market cap is only around $250B at the moment, so Google could literally sell itself to Apple (assuming it magically gained control of those shares) and still

That is indeed ridiculous. I would easily accept a $40M deal. A $400M deal would already be very hard to imagine; many *companies* aren't worth that much. Normally a CEO can work with 20-30 million USD with relative ease, such as acquiring small companies or making deals of this sort, but anything beyond that triggers a completely different set of procedures.

"Could A Yahoo-Apple Deal Spell Trouble For Google?" http://www.webpronews.com/iphones-and-ipads-could-soon-get-a-big-dose-of-yahoo-2013-04 [webpronews.com] It's a great article about Yahoo! (who share their data with Microsoft) and Apple, but from the article... although it's common news: "An analyst at Macquarie Capital estimated that Google was making $1.3 billion annually in paid search revenue from iOS devices. Macquarie speculated that Google returned about $1 billion of that to Apple as part of the agreement that made G

And yet, NONE of that says anything about Apple selling their customers' data to Google or anyone else, which is what you've been alleging all along. Google paying Apple for the right to be the default search engine on iOS does not mean that Apple is giving them customer data (Apple's customer is the user, after all, so it's not in their best interests to sell that data), especially so when you consider that the user is fully capable of changing their search engine in the settings for iOS. The only data Goo

Ask Siri "What time is sunset?" and Siri will tell you. Ask Siri "What time was sunrise?" and Siri will say something to the effect that it can't tell you the weather in the past. Ask Siri "What time will sunset be next Tuesday?" and it will say something to the effect that it doesn't know how to get the weather that far ahead.

Huh? What does sunset have to do with weather? Well, Siri gets sunrise/sunset information from the same place as the weather. S

Everyone I've ever spoken to or read about in the field of voice recognition tells me that having samples of people's voices is critical to improving it... and getting those samples (mainly the raw quantity of samples) is the biggest problem they face.

So it doesn’t surprise me at all that anyone keeps a massive archive of samples... the sample data can be critical in improving voice recognition.

As an aside: Google Voice's voice mail feature does more or less the same thing... and the reasoning is the same also: More sample data means better voice recognition.

I can't help but shake my head at the comparison:

Google samples user voices, reads (and transcribes) voice mail, reads your email, your stock information and then feeds it into their advertising engine, and does this for four years and counting; reaction: Meh...

Apple samples voices, anonymizes them, uses them to improve voice recognition over a period of two years; reaction: EVIL! APPLE MUST DIE!

Anonymize means to make anonymous. Not the vocal pattern. Simply, the user ID is not tied to it after 6 months. A corpus (a large body of annotated text and/or voice) is a necessary part of natural language recognition, which is a type of artificial intelligence, but usually you tell people that is what will happen to the data.
The 6-month window during which your ID is tied to your voice record is likely very useful for AI, such as being able to understand a given person's

Anonymized voice sample you say? "Voice Print Identified" I say. Hell, I create my own image and speech recognition software from scratch, and I don't need all those fucking samples. I just need to run the samples through my algorithms at most twice -- Once, then again to test if the changes were beneficial or not. If I have a constant stream of users (new samples), and I'm smart -- read: Not fucking daft -- then I can just run the samples through once, and let the users of the system rate the samples

Voice prints are a real thing, of course; my point isn't that it's not possible to identify people from a voice sample.

My point is that Apple doesn't make its money by selling you, me, and everyone else to the highest bidder, nor does its business have any real advantage in profiling us. Apple's business isn't advertising, it's selling hardware. (The flop that is iAd notwithstanding)

Google, on the other hand, is entirely different: Their entire revenue stream is from collecting our personal information, categorizing and analyzing it, and then selling or otherwise making that data useful to its actual customers, i.e. its advertisers.

Hell, I create my own image and speech recognition software from scratch, and I don't need all those fucking samples. I just need to run the samples through my algorithms at most twice -- Once, then again to test if the changes were beneficial or not

If you honestly believe that, then you've never spent even a minute actually learning the basics of speech recognition, let alone the level of complexity involved in modern algorithms. Signal processing isn't like database programming, where you get a nice result that fits into a box, and can easily reduce unwanted side effects.

Also keep in mind, there's a difference between "automatic speech recognition" - where whole sentences are parsed and understood (such as used with Siri or Google), versus "discrete speech recognition" where very limited actions are understood (like older cell phones when you spoke "dial ").

The problem is that while you might have improved the recognition for one specific sample, you've now made it considerably worse for another... so you have to build up a massive library of samples to do regression testing. One of the biggest challenges in speech recognition over the years is the utter lack of sample data for a wide populace, coupled with computers that are unable to hold enough samples in memory to do any meaningful comparisons.
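The regression-testing idea above can be sketched as a comparison of two recognizer versions over one labeled sample library. This is a toy illustration, not how any real speech system is evaluated: the recognizers are stand-in callables, and `evaluate`, `regression_report` and the corpus layout are hypothetical names chosen for the example.

```python
def evaluate(recognizer, corpus):
    """Return the set of sample IDs the recognizer transcribes correctly.

    corpus maps sample_id -> (audio, expected_transcript).
    """
    return {
        sample_id
        for sample_id, (audio, expected) in corpus.items()
        if recognizer(audio) == expected
    }


def regression_report(old_recognizer, new_recognizer, corpus):
    """Compare two recognizer versions sample by sample."""
    old_ok = evaluate(old_recognizer, corpus)
    new_ok = evaluate(new_recognizer, corpus)
    return {
        "fixed": sorted(new_ok - old_ok),      # wrong before, right now
        "regressed": sorted(old_ok - new_ok),  # right before, wrong now
        "accuracy_old": len(old_ok) / len(corpus),
        "accuracy_new": len(new_ok) / len(corpus),
    }


# Toy data: dictionary lookups stand in for real recognizers.
corpus = {
    "s1": ("audio-1", "call mom"),
    "s2": ("audio-2", "set alarm"),
    "s3": ("audio-3", "play music"),
}
old_version = {"audio-1": "call mom", "audio-2": "set a lamb", "audio-3": "play music"}.get
new_version = {"audio-1": "call tom", "audio-2": "set alarm", "audio-3": "play music"}.get
report = regression_report(old_version, new_version, corpus)
```

In this toy run overall accuracy is unchanged, yet one sample was fixed and another broke, which only shows up because both versions were scored against the same stored samples.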

We've only recently started to see speech recognition of that calibre, and even then, it's accomplished by sending a recording off to a datacenter so fraking huge that it'd easily sit at the top of the TOP500 supercomputer list if their owners bothered to run linpack on it. It's no coincidence that it's also only been in the past couple of years speech recognition has become anything more than a lame joke.

The issue isn't that they retain the voice samples, it's that they store user information for 6 months when they don't need to store user information for longer than it takes to complete the query and return the results.

Well, they probably would want to keep the data as a corpus of text that can be further analyzed or used to guide further searches. It's just that it can be quite abused... and many people these days would rather have the data deleted immediately rather than improve a service that is less than crucial to one's life, so far.

In reference to an earlier question about Google's data retention policies, one of the comments [slashdot.org] provided a great link to a 2007 Google blog post [blogspot.co.uk] that describes why Google holds onto their data for 18 months before they anonymize it. One of the interesting things that was said was:

However, we must point out that future data retention laws may obligate us to raise the retention period to 24 months.

Given that the blog post was written back in 2007, isn't it now possible that 24 months is simply the earliest that a company like Apple is allowed to delete the query, given the various data retention regulations that are in place

If I were in charge of Siri, I'd do the same thing. That kind of real-world data is vital for regression testing. If you don't have a strong corpus of sample data, when you make changes to the code, you've got no idea if what you are doing is improving the situation for some cases, while damaging them for others. You would see people complaining about things like "Well Siri used to work for X query but now it doesn't". When you have this data, you can update the code, run the test suite, and see if it

That kind of real-world data is vital for regression testing. If you don't have a strong corpus of sample data, when you make changes to the code, you've got no idea if what you are doing is improving the situation for some cases, while damaging them for others

Aaaand, unless you run ALL those data samples back through the system in front of a HUMAN, then you STILL have "no idea if what you are doing is improving the situation" at all. So, the point still stands: Keeping a sampling of the data is acceptable. Keeping the lot of it isn't helping anyone you actually want to help -- Least of all the developers. Hell, they could improve the service immensely by simply dropping the data storage requirements!

Aaaand, unless you run ALL those data samples back through the system in front of a HUMAN, then you STILL have "no idea if what you are doing is improving the situation" at all.

Yes, you do. Have you ever used Siri? There are several places where you can reliably determine that recognition was successful, due to manual confirmation or subsequent actions. For instance, if I ask Siri to remind me to do something at 9 o'clock, it might ask me if I mean 9am or 9pm. Anybody who answers either way instead

But are they really victims? We can tell them how much info Apple/Google/etc gathers on them, but if they don't care or think it's worth it, what's the big problem? Most people don't care about such stuff.

It's like a friend eating his favourite fried chicken at his favourite dining place. It's bad for his health but is he a victim?

Speech recognition in the cloud has given companies like Apple and Google a reason/excuse to gather masses of training data. They have put it to good use: speech recognition is much better than it was. If you like speech recognition, use it, meanwhile donating your data and helping the rest of us. If you d

somewhere in a data warehouse with only a few humans, there are millions of disassociated voices crying out to be heard.
"But it keeps these disassociated files for up to 18 more months for testing and product improvement purposes."

Muller pointed out, however, that the identifiers are deleted immediately—"along with any associated data"—when a user turns Siri off on his or her device. (You can do this by going to Settings > General > Siri on a supported iOS device.)

Not a joke. It's just that sometimes I have seen people respond to it and it makes me wonder if it's worth spending the 5 mins to read through it.

But no, as you suggest, every time I see it I scroll past it as I cba. But that people mention hosts files in it, I can't help but wonder if there is anything interesting in it. I can't see how a hosts file could relate to propaganda.

Additionally, simply reading it would answer my question and, possibly, be quicker than asking like this. Problem is, I just really

See my above post. I'm convinced APK is serious, he has got battles raging everywhere, meticulously catalogued, yet he thinks this is proof of his knowledge and experience, not obsessive insanity. And making that point doesn't make him reconsider, it incites him. He also seems to think what looks like many multiples of people saying this are one or a few people who are out to get him. Just read my post and google Alexander Peter Kowalski.t

Alexander Peter Kowalski and anyone arguing with him are insane. I saw their crazy tirades once and googled his name, and HOLY SHIT. This guy has mini battles raging all over many sites for some of the most inane shit you can think of. He meticulously catalogs the people who have crossed him and works to MAKE SURE everyone understands they are fools.

Now, they may well be fools, but by his meticulous and obsessive actions Kowalski (APK) has proved without a shadow of doubt his absolute insanity. I haven't even ar