Do Tech Companies Really Need All That User Data?

Paul Garbett for HBR

The online economy — from search to email to social media — is built in large part on the fact that consumers are willing to give away their data in exchange for products that are free and easy to use. The assumption behind this trade-off is that without giving up all that data, those products either couldn’t be so good or would have to come at a cost.

But a new working paper, released this week by Lesley Chiou of Occidental College and Catherine Tucker of MIT, suggests that the trade-off may not always be necessary. By studying the effects of privacy regulations in the EU, they attempted to measure whether the anonymization and de-identification of search data hurts the quality of search results.

Most search engines capture user data, including IP addresses and other data that can identify a user across multiple visits. This data then allows search companies to improve their algorithms and to personalize results for the user. At least, that’s the idea. To determine whether storage of users’ personal data improves search results, Chiou and Tucker looked at how search results from Bing and Yahoo differed before and after changes in the European Commission’s rules on data retention.

In 2008 the Commission recommended that search engines reduce the period over which they kept user records. In response, Yahoo decided to strengthen its privacy policy by anonymizing user data after 90 days. In 2010 Microsoft changed its policy and began deleting IP addresses associated with searches on Bing after six months, and all data points intended to identify a user across visits after 18 months. In 2011 Yahoo changed its policy again, this time deciding to store personal data longer (for 18 months rather than 90 days), which gave the researchers yet another chance to measure how changes in data storage affected search results. (Google did not change its policies during this period, and so is not included in the study. Some of Tucker’s past research has been funded by Google.)

The researchers then looked at data from UK residents’ web history before and after the changes. To measure search quality, they looked at the number of repeated searches, a signal of dissatisfaction with search results. In all three cases, they found no statistically significant effect on search result quality following changes in data retention policy. In other words, the decision to anonymize or de-identify the data didn’t appear to impair the search experience. “Our results suggest that the costs of privacy may be lower than currently perceived,” the authors write, though they note that previous studies have come to different conclusions.
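To make the repeated-search measure concrete, here is a minimal sketch in Python. It is an illustration only, not the authors’ method: the paper’s actual statistical model is not described here, and the simple two-proportion z-test below is a stand-in. The Search record, the function names, and the toy log are all hypothetical.

```python
from dataclasses import dataclass
from math import erfc, sqrt


@dataclass
class Search:
    user: str           # hypothetical user identifier
    query: str          # raw query text
    after_change: bool  # True if issued after the policy change


def repeat_rate(searches):
    """Count searches that repeat a query the same user already issued.

    Returns (repeats, total). Queries are compared after trimming
    whitespace and lowercasing, which is an assumption of this sketch.
    """
    seen = set()
    repeats = 0
    for s in searches:
        key = (s.user, s.query.strip().lower())
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats, len(searches)


def two_proportion_z(r1, n1, r2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (r1 + r2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no detectable difference
    z = (r1 / n1 - r2 / n2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value


def compare(searches):
    """Compare repeat rates before and after the policy change."""
    before = [s for s in searches if not s.after_change]
    after = [s for s in searches if s.after_change]
    r_b, n_b = repeat_rate(before)
    r_a, n_a = repeat_rate(after)
    return two_proportion_z(r_b, n_b, r_a, n_a)


# Made-up toy log, for illustration only (not data from the study).
toy_log = [
    Search("u1", "cheap flights", False),
    Search("u1", "cheap flights", False),
    Search("u2", "weather london", False),
    Search("u2", "train times", False),
    Search("u1", "cheap flights", True),
    Search("u1", "Cheap Flights ", True),
    Search("u2", "weather london", True),
    Search("u2", "train times", True),
]

z, p_value = compare(toy_log)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A large p-value here would correspond to the kind of result the researchers report: no statistically significant change in repeated searches after the retention policy changed. A real analysis would also need to control for trends unrelated to the policy, which this sketch does not attempt.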

The researchers also contend that their results have implications for antitrust and worries over so-called data monopolies. Their paper, they write, suggests that “possession of historical data confers less of an advantage to firms who own the data than is sometimes supposed.”

That interpretation deserves some caveats. First, the Yahoo changes involved only anonymization, which may help protect users’ privacy but doesn’t necessarily detract from incumbents’ data advantage. Second, the Microsoft switch, which involved the de-identification of users and so speaks more directly to the advantage of incumbents’ large, personalized data sets, was rolled out over a period of months, and its effects may not have been captured in the six-month period the researchers studied. Moreover, even if the long-term storage of large amounts of historical data isn’t an advantage, other aspects of data collection might still benefit incumbents. For instance, it could be that search giants incorporate new data into their algorithms quickly. In that case data would still be valuable to incumbents, but that value wouldn’t be captured by the study.

Nonetheless, the authors’ attempt to actually measure the competitive advantages of data is laudable. Research by Microsoft has found that user data can yield better search results. But just how much data is required to get results good enough to entice users? The answer matters not just for search but, crucially, for the nascent artificial intelligence (AI) industry. If massive data troves are required for any decent AI search solution, then it’s likely that the industry will be dominated by existing tech behemoths, who have the capabilities to gather and analyze that much data. If it’s possible for newcomers to acquire enough data to train intelligent systems, then the sector will be more competitive.

The authors note that while there are reasons to think data can constitute an important competitive advantage for search engines, there are also reasons to be skeptical. Historical data may be less valuable in informing search results than fresher data, and a considerable fraction of searches are so uncommon that collecting sufficient data on them might be impossible, even for larger companies.

The current level of enthusiasm for AI has only added to the rush to collect massive data sets, which continues to present privacy concerns. Inevitably, those collecting the data will suggest that users benefit from giving it up. But Chiou and Tucker’s paper raises doubts about that claim. Yes, people benefit from the many excellent and free tech products out there. Yes, they’ll probably benefit in countless ways from new AI-powered solutions. But they don’t always need to completely give up their privacy to get them.