33 Bits of Entropy
The End of Anonymized Data and What to Do About It
http://33bits.org
One more re-identification demonstration, and then I’m out
March 23, 2015

What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.

One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.

Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.

Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:

How to do machine learning ethically

Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reacts by showing deep discomfort about the power of inference and says it’s a violation of privacy. On the other hand, the company in question protests that there was no wrongdoing — after all, they had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.

This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.

And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.


Good and bad reasons for anonymizing data
July 9, 2014

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced considering that it’s a complete non sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable. Quantitative methods can be a useful heuristic for comparing different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but they can’t be used to justify that decision in the first place.


How to prepare a technical talk
November 26, 2013

I used to suck at giving technical talks. I would usually confuse my audience, and often confuse myself. By the time I became a prof, I sucked a lot less. These days I enjoy giving technical talks and lectures more than non-technical ones, and my students seem to like them better as well.

So something had changed; I’d developed a process. The other day I sat down to see if I could extract what this process was. It turned out to be surprisingly formulaic, like an algorithm, so I’d like to share it with you. I’m guessing this is obvious to most professors who teach technical topics, but I hope it will be helpful to those who’re relatively new to the game.

There are three steps. They’re simple but not easy.

1. Identify the atomic concepts

2. Draw the dependency graph

3. Find a topological ordering of the graph

Identify atomic concepts. The key word here is atomic. The idea is to introduce only one new concept at a time and give the audience time to internalize it before moving on to the next one.

This is hard for two reasons. First, concepts that seem atomic to an expert are often an amalgam of different concepts. Second, it’s audience-specific. You have to have a good mental model of which concepts are already familiar to your audience.

Draw the dependency graph. Occasionally I use a whiteboard for this, but usually it’s in my head. This is a tricky step because it’s easy to miss dependencies. When the topic I’m teaching is the design of a technical system, I ask myself questions like, “what could go wrong in this component?” and “why wasn’t this alternative design used?” This helps me flesh out the internal logic of the system in the form of a graph.

Find a topological ordering. This is just a fancy way of saying we want to order the concepts so that each concept only depends on the ones already introduced. Sometimes this is straightforward, but sometimes the dependency graph has cycles!

Of the topics I’ve taught recently, Bitcoin seems especially difficult in this regard. Each concept is bootstrapped off of the others, but somehow the system magically works when you put everything together. What I do in these cases is introduce intermediate steps that don’t exist in the actual design I’m teaching, and remove them later [1].
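To make the ordering step concrete, here is a minimal sketch in Python (using the standard-library graphlib module, available since Python 3.9) of ordering concepts from a dependency graph and detecting the cycles that force you to introduce intermediate steps. The concept names and dependencies are made up for illustration, not taken from any particular lecture.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency graph: each concept maps to the set of concepts
# that must be introduced before it.
concepts = {
    "hash functions": set(),
    "digital signatures": set(),
    "blocks": {"hash functions"},
    "ledger": {"blocks", "digital signatures"},
}

try:
    # A topological order introduces each concept only after its prerequisites.
    order = list(TopologicalSorter(concepts).static_order())
    print("Presentation order:", order)
except CycleError as err:
    # A cycle means no valid order exists as-is; break it by adding an
    # intermediate, simplified concept and removing an edge (see footnote [1]).
    print("Cycle detected:", err.args[1])
```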

Think of a technical topic as a skyscraper. When it’s presented in a paper, it’s analogous to unveiling a finished building. The audience can admire it and check that it’s stable/correct (say, by verifying theorems or other technical arguments). But just as staring at a building doesn’t help you learn how to build one, the presentation in a typical paper is all but useless for pedagogical purposes. Having dependencies between concepts is perfectly acceptable in papers, because papers are not meant to be read in a single pass.

The instructor’s role, then, is to reverse engineer how the final concept might plausibly be built up step by step. This is analogous to showing the scaffolding of the building and explaining each step in its construction. Talks and lectures, unlike papers, must necessarily have this linear form because the audience can’t keep state in their heads.

[1] This process introduces new nodes in the dependency graph and removes some edges so that it is no longer cyclic.

How to pick your first research project
November 1, 2013

At Princeton I get to advise many gifted graduate and undergraduate students in doing research. Combining my experience as a mentor with reflecting on my experience as a student, I’d like to offer some guidance on how to pick your first research project.

I’m writing this post because selecting a research problem to work on is significantly harder than actually solving it. I mean the previous sentence quite literally and without exaggeration. As an undergraduate and early-stage graduate researcher, I repeatedly spent months at a time working on research problems only to have to abandon my efforts because I found out I was barking up the wrong tree. Scientific research, it turns out, is largely about learning to ask the right questions.

The good news is that three simple criteria will help you avoid most of the common pitfalls.

1. Novelty. Original research is supposed to be, well, original. There are two components to novelty. The first is to make sure the problem you’re trying to solve hasn’t already been solved. This is way trickier than it seems — you might miss previous research because you’re using different names for concepts compared to the standard terminology. But the issue is deeper: two ideas may be equivalent without sounding at all the same at a superficial level. Your advisor’s help will be crucial here.

The other aspect to novelty is that you should have a convincing answer to the question “why has this problem not been solved yet?” Often this might involve a dataset that only recently became available, or some clever insight you’ve come up with that you suspect others missed. In practice, one often has an insight and then looks for a problem to apply it to. This means you have to put in a good bit of creative thinking even to pick a research question, and you must be able to estimate the difficulty level of solving it.

If your answer to the question is, “because the others who tried it weren’t smart enough,” you should probably think twice. It may not be prudent to have the success of your first project ride on your intellectual abilities being truly superlative.

2. Relevance. You must try to ensure that you select a problem that matters, one whose solution will impact the world directly or indirectly (and hopefully for the better). Again, your advisor’s help will be essential. (That said, professional researchers do produce massive volumes of research papers that no one cares about.) I encourage my students to pick subproblems of my ongoing long-term research projects. This is a safe way to pick a problem that’s relevant.

3. Measurable results. This one becomes automatic as you get experienced, but for beginning researchers it can be confusing. The output of your research should be measurable and reproducible; ideally you should be able to formulate your goals as a testable hypothesis. Measurability means that many interesting projects that are novel and make the world better are nevertheless unsuitable for research. (They may be ideal for a startup or a hobby project instead.) “Build a website for illiterate kids in poor countries to learn effectively” is an example of a task that’s hard to frame as a research question.

Irrelevant criteria. Let me also point out what’s not on this list. First, the general life advice you often hear, to do something you’re passionate about, is unfortunately a terrible way to pick a research problem. If you start from something you’re passionate about, the chance that it will meet the three criteria above is pretty slim. Often one has to consider a dozen or more research ideas before settling on one to work on.

You should definitely pick a research area you’re passionate about. But getting emotionally invested in a specific idea or research problem before you’ve done the due diligence is a classic mistake, one that I made a lot as a student.

Second, the scope or importance of the problem is another criterion you shouldn’t fret much about for your first project. Your goal is as much to learn the process of research as to produce results. You probably have a limited amount of time in which you want to evaluate if this whole research thing is the right fit for you. While you should definitely pick a useful and relevant research task, it should be something that you have a reasonable chance of carrying to fruition. Don’t worry about curing cancer just yet.

Note that the last point is at odds with advice given to more experienced researchers. Richard Hamming, in a famous talk titled “You and your research,” advised researchers to pick the most important problem that they have a shot at solving. I’ve written a version of the current post for those who’re in it for the long haul, and my advice there is to embrace risk and go for the big hits.

Academic publishing as (ruinous) competition: Is there a way out?
July 15, 2013

Aaron Johnson invited me to speak as part of a panel on academic publishing at PETS 2013. This is a rough transcript of my talk, written from memory.

Aaron mentioned he was looking for one more speaker for this panel, so that we could hear the view of someone naive and inexperienced, and asked if I was available. I said, “Great, I do that every day!” So that will be the tone of my comments today. I don’t have any concrete proposals that can be implemented next year or in two years. Instead these are blue-sky thoughts on how things could work someday and hopeful suggestions for moving in that direction. [1]

I just finished my first year as a faculty member at Princeton. It’s still a bit surreal. I wasn’t expecting to have an academic career. In fact, back in grad school, especially the latter half, whenever someone asked me what I wanted to do after I graduated, my answer always was, “I don’t know for sure yet, but there’s one career I’m sure I don’t want — academia.”

I won’t go into the story of why that was and how it changed. But it led to some unusual behavior. I ranted a lot about academia on Twitter, as Aaron already mentioned when he introduced me. Also, many times I “published” stuff by putting up a blog post. For instance I had a series of posts on the ability of a malicious website to deanonymize visitors (1, 2, 3, 4, 5, 6). People encouraged me to turn it into a paper, and I could have done that without much extra effort. But I refused, because my primary goal was to quickly disseminate the information, and I felt my blog posts had accomplished that adequately. True, I wouldn’t get academic karma, but why would I care? I wasn’t going to be an academic!

When I eventually decided I wanted to apply for academic positions, I talked to a professor whose opinion I greatly respected. He expressed skepticism that I’d get any interviews, given that I’d been blogging instead of writing papers. I remember thinking, “oh shit, I’ve screwed up my career, haven’t I?” So I feel extremely lucky that my job search turned out successfully.

At this point a sane person would have decided to quit while they were ahead, and start playing the academic game. But I guess sanity has never really been one of my strong points. So in the last year I’ve been thinking a lot about what the process of research collaboration and publishing would look like if we somehow magically didn’t have to worry at all about furthering our individual reputations.

Polymath

Something that’s very close to my ideal model of collaboration is the Polymath project. I was fascinated when I heard about it a few years ago. It was started by mathematician Tim Gowers in a blog post titled “Is massively collaborative mathematics possible?” [2] He and Terry Tao are the leaders of the project. They’re among the world’s top mathematicians. There have been several of these collaborations so far and they’ve been quite successful, solving previously open math problems. So I’ve been telling computer scientists about these efforts and asking if our community could produce something like this. [3]

To me there are three salient aspects of Polymath. The first is that the collaboration happens online, in blog posts and comments, rather than phone or physical meetings. When I tell people this they are usually enthusiastic and willing to try something like that. The second aspect is that it is open, in that there is no vetting of participants. Now people are a bit unsure, and say, “hmm, what’s the third?” Well, the third aspect is that there’s no keeping score of who contributed what. To which they react, “whoa, whoa, wait, what??!!”

I’m sure we can all see the problem here. Gowers and Tao are famous and don’t have to worry about furthering their careers. The other participants who contribute ideas seem to do it partly altruistically and partly because of the novelty of it. But it’s hard to imagine this process being feasible on a bigger scale.

Misaligned incentives

Let’s take a step back and ask why there’s this gap between doing good research and getting credit for it. In almost every industry, every human endeavor, we’ve tried to set things up so that the incentives for individuals and the broader societal goals of the activity align with each other. But sometimes individual incentives get misaligned with the societal goals, and that leads to problems.

Let’s look at a few examples. Individual traders play the stock market with the hope of getting rich. But at the same time, it helps companies hedge against risk and improves overall financial stability. At least that’s the theory. We’ve seen it go wrong. Similarly, copyright is supposed to align the desire of creators to make money with the goal of the maximum number of people enjoying the maximum number of creative works. That’s gotten out of whack because of digital technology.

My claim is that we’re seeing the same problem in academic research. There’s a metaphor that explains what’s going on in research really well, and to me it is the root of all of the ills that I want to talk about. And that metaphor is publishing as competition. What do I mean by that? Well, peer review is a contest. Succeeding at this contest is the immediate incentive that we as researchers have. And we hope that this will somehow lead to science that benefits humanity.

To be clear, I’m far from the first one to make this observation. Let me quote someone who’s much better qualified to talk about this. Oded Goldreich, I’m sure most of you know of him, has a paper titled “On Struggle and Competition in Scientific Fields.” Here’s my favorite quote from the paper. He’s talking about the flagship theory conferences.

Eventually, FOCSTOC may become a pure competition, defined as a competition having no aim but its own existence (i.e., the existence of a competition). That is, pure competitions serve no scientific purpose. Did FOCSTOC reach this point or is close to it? Let me leave this question open, and note that my impression is that things are definitely evolving towards this direction. In any case, I think we should all be worried about the potential of such an evolution.

I don’t know enough about the theory community to have an opinion on how big a problem this is. Still, I’m sure we can agree with the sentiment of the last sentence.

But here’s the very next paragraph. I think it gives us hope.

Other TOC conferences seem to suffer less from the aforementioned phenomena. This is mainly because they “count” less as evidence of importance (i.e., publications in them are either not counted by other competitions or their effect on these competitions is less significant). Thus, the vicious cycle described above is less powerful, and consequently these conferences may still serve the intended scientific purposes.

We see the same thing in the security and privacy community. Something I’ve seen commonly is a situation where you have a neat result, but nothing earth-shattering, and it’s not good enough as it is for a top tier venue. So what do you do? You pad it with bullshit and submit it, and it gets in. Another trend that this encourages is deliberately making a bad or inaccurate model so that you can solve a harder problem. But PETS publications and participants seem to suffer less from these effects. That’s why I’m happy to be discussing this issue with this group of people.

Paper as final output

It seems like we’re at an impasse. We can agree that publishing-as-competition has all these problems, but hiring committees and tenure committees need competitions to identify good research and good researchers. But I claim that publishing as competition fails even at the supposed goal of identifying useful research.

The reason for that is simple. Publishing as competition encourages or even forces viewing the paper as the final output. But it’s not! The hard work begins, not ends when the paper is published. This is unlike the math and theory communities, where the paper is in fact the final output. If publishing-as-competition is so bad for theory, it’s much worse for us.

In security and privacy research, the paper is the starting point. Our goal is not to prove theorems but to more directly impact the world in some way. By creating privacy technologies, for example. For research to have impact, authors have to do a variety of things after publication depending on the nature of the research. Build technology and get people to adopt it. Explain the work to policymakers or to other researchers who are building upon it. Or even just evangelize your ideas. Some people claim that ideas should stand on their own merit and compete with other ideas on a level playing field. I find this quite silly. I lean toward the view expressed in this famous quote you’ve probably heard: “if your ideas are any good you’ll have to shove them down people’s throats.”

The upshot of this is that impact is heavily shortchanged in the publication-as-competition model. This is partly because of what I’ve talked about: we have no incentive to do any more work after getting the paper published. But an equally important reason is that the community can’t judge the impact of research at the point of publication. Deciding who “wins the prizes” at the point of publication, before the ideas have a chance to prove themselves, has disastrous consequences.

So I hope I’ve convinced you that publication-as-competition is at the root of many of our problems. Let me give one more example. Many of us like the publish-then-filter model, where reviews are done in the open on publicly posted papers with anyone being able to comment. One major roadblock to moving to this model is that it screws up the competition aspect. The worry is that papers that receive a lot of popular attention will be reviewed favorably, and so forth. We want papers to be reviewed on a level playing field. But if the worth of a paper can’t be judged at publication time, that means all this fairness is toward an outcome that is meaningless anyway. Do we still want to keep this model at all costs?

A way forward?

So far I’ve done a lot of complaining. Let me offer some suggestions now. I want to give two sets of suggestions that are complementary. The first is targeted at committees, whether tenure committees, hiring committees, award committees, or even program committees to an extent, and at the community in general. The second is targeted at authors.

Here’s my suggestion for committees and the community: we can and should develop ways to incentivize and measure real impact. Let me give you four examples. I have more that I’d be happy to discuss later. First, retrospective awards. That is, “best paper from this conference 10 years ago” or some such. I’ve been hearing more about these of late, and I think that’s good news. The idea is that impact is easier to evaluate 10 years after publication.

Second, overlay journals. These are online journals that are a way of “blessing” papers that have already been published or made public. There is a lag between initial publication and inclusion in the overlay journal, and that’s a good thing. Recently the math community has come up with a technical infrastructure for running overlay journals. I’m very excited about this. [4]

There are two more that are related. These are specific to our research field. For papers that are about a new tool, I think we should look at adoption numbers as an important component of the review process. Finally, such papers should also have an “incentives” section or subsection. Because all too often we write papers that we imagine unspecified parties will implement and deploy, but it turns out there isn’t the slightest economic incentive for any company or organization to do so.

I think we should also find ways to measure contributions through blog posts and sharing data and code in publications. This seems more tricky. I’d be happy to hear suggestions on how to do it.

Next, this is what I want to say to authors: the supposed lack of incentives for nontraditional ways of publishing is greatly exaggerated. I say this from my personal experience. I said earlier that I was very lucky that my job search turned out well. That’s true, but it wasn’t all luck. I found out to my surprise that my increased visibility through blogging and especially the policy work that came out of it made a huge difference to my prospects. If I’d had three times as many publications and no blog, I probably would have had about the same chances. I’m sure some departments didn’t like my style, but there are definitely others that truly value it.

My Bitcoin experiment

I have one other personal experience to share with you. This is an experiment I’ve been doing over the last month or so. I’d been thinking about the possibility of designing a prediction market on top of Bitcoin that doesn’t have a central point of control. Some of you may know the sad story of Intrade. So I tweeted my interest in this problem, and asked if others had put thought into it. Several people responded. I started an email thread for this group, and we went to work.

12,000 words and several conference calls later, we’re very happy with where we are, and we’ve started writing a paper presenting our design. What’s even better is who the participants are — Jeremy Clark at Carleton, Joe Bonneau who did his Ph.D. with Ross Anderson and is currently at Google, and Andrew Miller at UMD who is Jon Katz’s Ph.D. student. All these people are better qualified to write this paper than I am. By being proactive and reaching out online, I was able to assemble and work with this amazing team. [5]

But this experiment didn’t go all the way. While I used Twitter to find the participants and was open to accepting anyone, the actual collaboration is being done through traditional channels. My original intent was to do it in public, but I realized quite early on that we had something publication-worthy and became risk-averse.

I plan to do another experiment, this time with the explicit goal of doing it in public. This is again a Bitcoin-related paper that I want to write. Oddly enough, there is no proper tutorial of Bitcoin, nor is there a survey of the current state of research. I think combining these would make a great paper. The nature of the project makes it ideal to do online. I haven’t figured out the details yet, but I’m going to launch it on my blog and see how it goes. You’re all welcome to join me in this experiment. [6]

So that’s basically what I wanted to share with you today. I think the current model of publication as competition has gone too far, and the consequences are starting to get ruinous. It’s time we put a stop to it. I believe that committees on one hand, and authors on the other both have the incentive to start changing things unilaterally. But if the two are combined, the results can be especially powerful. In fact, I hope that it can lead to a virtuous cycle. Thank you.

[1] Aaron didn’t actually say that, of course. You probably got that. But who knows if nuances come across in transcripts.

[2] At this point I polled the room to see who’d heard of Polymath before. Only three hands went up (!)

[3] There is one example that’s closer to computer science that I’m aware of: this book on homotopy type theory written in a similar spirit as the Polymath project.

Personalized coupons as a vehicle for perfect price discrimination
June 25, 2013

Given the pervasive tracking and profiling of our shopping and browsing habits, one would expect that retailers would be very good at individualized price discrimination — figuring out what you or I would be willing to pay for an item using data mining, and tailoring prices accordingly. But this doesn’t seem to be happening. Why not?

This mystery isn’t new. Mathematician Andrew Odlyzko predicted a decade ago that data-driven price discrimination would become much more common and effective (paper, interview). Back then, he was far ahead of his time. But today, behavioral advertising at least has gotten good enough that it’s often creepy. The technology works; the impediment to price discrimination lies elsewhere. [1]

It looks like consumers’ perception of unfairness of price discrimination is surprisingly strong, which is why firms balk at overt price discrimination, even though covert price discrimination is all too common. But the covert form of price discrimination is not only less efficient, it also (ironically) has significant social costs — see #3 below for an example. Is there a form of pricing that allows for perfect discrimination (i.e., complete tailoring to individuals), in a way that consumers find acceptable? That would be the holy grail.

In this post, I will argue that the humble coupon, reborn in a high-tech form, could be the solution. Here’s why.

1. Coupons tap into shopper psychology. Customers love them.

Coupons, like sales, introduce unpredictability and rewards into shopping, which provides a tiny dopamine spike that gets us hooked. JC Penney’s recent misadventure in trying to eliminate sales and coupons provides an object lesson:

“It may be a decent deal to buy that item for $5. But for someone like me, who’s always looking for a sale or a coupon — seeing that something is marked down 20 percent off, then being able to hand over the coupon to save, it just entices me. It’s a rush.”

Some startups have exploited this to the hilt, introducing “gamification” into commerce. Shopkick is a prime example. I see this as a very important trend.

2. Coupons aren’t perceived as unfair.

Given the above, shoppers have at best a dim perception of coupons as a price discrimination mechanism. Even when they do, however, coupons aren’t perceived as unfair to nearly the same degree as listing different prices for different consumers, even if the result in either case is identical. [2]

3. Traditional coupons are not personalized.

While customers may have different reasons for liking coupons, from firms’ perspective the way in which traditional coupons aid price discrimination is pretty simple: by forcing customers to waste their time. Econ texts tend to lay it out bluntly. For example, R. Preston McAfee:

Individuals generally value their time at approximately their wages, so that people with low wages, who tend to be the most price-sensitive, also have the lowest value of time. … A thrifty shopper may be able to spend an hour sorting through the coupons in the newspaper and save $20 on a $200 shopping expedition … This is a good deal for a consumer who values time at less than $20 per hour, and a bad deal for the consumer that values time in excess of $20 per hour. Thus, relatively poor consumers choose to use coupons, which permits the seller to have a price cut that is approximately targeted at the more price-sensitive group.

Clearly, for this to be effective, coupon redemption must be deliberately made time-consuming.

To the extent that there is coupon personalization, it seems to be for changing shopper behavior (e.g., getting them to try out a new product) rather than a pricing mechanism. The NYT story from last year about Target targeting pregnant women falls into this category. That said, these different forms of personalization aren’t entirely distinct, which is a point I will return to in a later article.

4. The traditional model doesn’t work well any more.

Paper coupons have a limited future. As for digital coupons, there is a natural progression toward interfaces that make it easier to acquire and redeem them. In particular, as more shoppers start to pay using their phones in stores, I anticipate coupon redemption being integrated into payment apps, thus becoming almost frictionless.

An interesting side-effect of smartphone-based coupon redemption is that it gives the shopper more privacy, avoiding the awkwardness of pulling out coupons from a purse or wallet. This will further open up coupons to a wealthier demographic, making them even less effective at discriminating between wealthier shoppers and less affluent ones.

5. The coupon is being reborn in a data-driven, personalized form.

With behavioral profiling, companies can determine how much a consumer will pay for a product, and deliver coupons selectively so that each customer’s discount reflects what they are willing to pay. The key difference is that while in the past customers decided whether or not to look for, collect, and use a coupon, in the new model companies will determine who gets which coupons.

In the extreme, coupons will be available for all purchases, and smart shopping software on our phones or browsers will automatically search, aggregate, manage, and redeem these coupons, showing coupon-adjusted prices when browsing for products. More realistically, the process won’t be completely frictionless, since that would lose the psychological benefit. Coupons will probably also merge with “rewards,” “points,” discounts, and various other incentives.

There have been rumblings of this shift here and there for a few years now, and it seems to be happening gradually. Google’s acquisition of Incentive Targeting a few months ago seems significant, and at the very least demonstrates that tech companies are eyeing this space as well, and not just retailers. As digital feudalism takes root, it could accelerate the trend of individualized shopping experiences.

In summary, personalized coupons offer a vehicle for realizing the full potential of data mining for commerce by tailoring prices in a way that consumers seem to find acceptable. Neither coupons nor price discrimination should be viewed in isolation — together with rewards and various other incentive schemes, they are part of the trend of individualized, data mining-driven commerce that’s here to stay.

Footnotes

[1] Since I’m eschewing some academic terminology in this post, here are a few references and points of clarification. My interest is in first-degree price discrimination. Any price discrimination requires market power; my assumption is that this is the case in practice because competition is always imperfect, and we should expect quite a bit of first-degree price discrimination. The observed level is puzzlingly low.

The impact of technology on the ability to personalize prices is complex, and behavioral profiling is only one aspect. Technology also makes competition less perfect by allowing firms to customize products to a greater degree, so that there are no exact substitutes. Finally, technology hinders first-degree price discrimination to an extent by allowing consumers to compare prices between different retailers more easily. The interaction between these effects is analyzed in this paper.

Technology also increases the incentive to price discriminate. As production becomes more and more automated, marginal costs drop relative to fixed costs. In the extreme, digital goods have essentially zero marginal cost. When marginal production costs are low, firms will try to tailor prices since any sale above marginal cost increases profits.

My use of the terms overt and covert is rooted in the theory of price fairness in psychology and behavioral economics, and relates to the presentation of the transaction. While it is somewhat related to first- vs. second/third-degree price discrimination, it is better understood as a separate axis, one that is not captured by theories of rational firms and consumers.

[2] An exception is when non-coupon customers are made aware that others are getting a better deal. This happens, for example, when there is a prominent coupon-code form field in an online shopping checkout flow. See here for a study.

Thanks to Sebastian Gold for reviewing a draft, and to Justin Brickell for interesting conversations that led me to this line of thinking.

Reidentification as Basic Science
May 27, 2013

This essay originally appeared on the Bill of Health blog as part of a conversation on the law, ethics and science of reidentification demonstrations.

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
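To give a sense of the arithmetic behind the uniqueness question (and behind this blog’s name), here is a back-of-the-envelope sketch in Python. The attribute list and cardinalities are illustrative assumptions, and treating each attribute as uniform and independent only gives upper bounds, but it shows how a few innocuous attributes can approach the roughly 33 bits needed to single out one person among billions.

```python
import math

world_population = 7_000_000_000
bits_needed = math.log2(world_population)  # ~32.7 bits to single out one person

# Assumed attribute cardinalities for illustration; real distributions are not
# uniform, so these figures are upper bounds on the entropy each attribute carries.
attributes = {
    "ZIP code": 40_000,
    "birth date (day of year)": 366,
    "birth year": 100,
    "gender": 2,
}

total = 0.0
for name, n in attributes.items():
    bits = math.log2(n)
    total += bits
    print(f"{name}: up to {bits:.1f} bits")
print(f"Combined: up to {total:.1f} bits (need about {bits_needed:.1f})")
```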

It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
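The core of such a linking step can be quite small. Here is a minimal sketch of the general idea, matching an entity in one dataset against candidates in another by normalizing names and scoring the overlap of an associated set (a discography or a coauthor list). The data, threshold, and scoring weights are hypothetical; this is not the code behind either of the projects above.

```python
def normalize(name: str) -> str:
    """Crude normalization: lowercase and drop non-alphanumeric characters."""
    return "".join(c for c in name.lower() if c.isalnum())

def jaccard(a: set, b: set) -> float:
    """Set-overlap score in [0, 1]."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def link(record, candidates, threshold=0.4):
    """Return the best-matching candidate name, or None if nothing clears the
    threshold. Records are (name, associated_set) pairs."""
    name, items = record
    best, best_score = None, 0.0
    for cand_name, cand_items in candidates:
        score = jaccard(items, cand_items)
        if normalize(name) == normalize(cand_name):
            score += 0.5  # exact name match after normalization is strong evidence
        if score > best_score:
            best, best_score = cand_name, score
    return best if best_score >= threshold else None

# Hypothetical example: link an artist profile to an encyclopedia article
# using discography overlap.
artist = ("The Example Band", {"First Album", "Second Album", "Live at Nowhere"})
articles = [("Example Band (film)", set()),
            ("The Example Band", {"First Album", "Second Album"})]
print(link(artist, articles))  # -> The Example Band
```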

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.

Privacy technologies course roundup: Wiki, student projects, HotPETs
May 23, 2013

In earlier posts about the privacy technologies course I taught at Princeton during Fall 2012, I described how I refuted privacy myths, and presented an annotated syllabus. In this concluding post I will offer some additional tidbits about the course.

Wiki. I referred to a Wiki a few times in my earlier post, and you might wonder what that was about. The course included an online Wiki discussion component, and this was in fact the centerpiece. Students were required to participate in the online discussion of the day’s readings before coming to class. The in-class discussion would use the Wiki discussion as a starting point.

The advantages of this approach are: 1. it gives the instructor a great degree of control in shaping the discussion of each paper; 2. the instructor can more closely monitor individual students’ progress; and 3. class discussion can focus on particularly tricky and/or contentious points, instead of rehashing the obvious.

Student projects. Students picked a variety of final projects for the class, and on the whole exceeded my expectations. Here are two very different projects, in brief.

Nora Taranto, a History of Science major, wrote a policy paper about the privacy implications of the shift to electronic medical records. Nora writes:

I wrote a paper about the privacy implications of patient-care institutions (in the United States) using electronic medical record (EMR) systems more and more frequently. This topic had particular relevance given the huge number of privacy breaches that occurred in 2012 alone. Meanwhile, there is a simultaneous criticism coming from care providers about the usability of such EMR systems. As such, many different communities—in the information privacy sphere, in the medical community, in the general public, etc.—have many different things to say. But, given the several privacy breaches that occurred within a couple of weeks in April 2012 and together implicated over a million individuals, concerns have been raised in particular about how secure EMR systems are. These concerns are especially worrisome given the federal government’s push for their adoption nationwide beginning in 2009, when the American Recovery and Reinvestment Act granted funds to hospitals explicitly for the purpose of EMR implementation.

So I looked into the benefits and costs of such systems, with a particular slant towards the privacy benefits/costs. Overall, these systems do have a number of protective mechanisms at their disposal, some preventative and others reactive. While these protective barriers are all necessary, they are not sufficient to guarantee the patient his or her privacy rights in the modern day. These protective mechanisms—authentication schemes, encryption, and data logs/anomaly-detection—need to be expanded and further developed to provide an adequate amount of protection for personal health information. While the government is, at the moment, encouraging the adoption of EMR systems for maximal penetration, medical institutions ought to use caution in considering which systems to implement and ought to hold themselves to a higher standard. Moreover, greater regulatory oversight of EMR systems on the market would help institutions maintain this cautious approach.

Abu Saparov, Ajay Roopakalu, and Rafi Shamim, also undergraduates, designed and implemented an alternative to centralized key distribution. They write:

Our project for the course was to create and implement a decentralized public key distribution protocol and show how it could be used. One of the initial goals of our project was to experience first-hand some of the things that made the design of a usable and useful privacy application so hard. Early on in the process, we decided to try to build some type of application that used cryptography to enhance the privacy of communication with friends. Some of the reasons that we chose this general topic were the fact that all of us had experience with network programming and that we thought some of the things that cryptography can achieve are uniquely cool. We were also somewhat motivated by the prospect of using our application to talk with each other and our other friends after we graduate. We eventually gravitated towards two ideas: (1) a completely peer-to-peer chat system that is encrypted from end-to-end, and (2) a “dumb” social network that allows users to share posts that only their friends (and not the server) can see. During the semester, our focus shifted to designing and implementing the underlying key distribution mechanism upon which these two systems could be built.

When we began to flesh out the designs for our two ideas, we realized that the act of retrieving a friend’s public cryptographic keys was the first challenge to solve. Certificate authorities are the most common way to obtain public keys, but require a large degree of trust to be placed in a small number of authorities. Web of Trust is another option, and is completely decentralized, but often proves difficult in practice because of the need for manual key signing. We decided to make our own decentralized protocol that exposes an easily usable API for clients to use in order to obtain public keys. Our protocol defines an overlay network that features regular nodes, as well as supernodes that are able to prove their trustworthiness, although the details of this are controllable through a policy delegate. The idea is for supernodes to share the task of remembering and verifying public keys through a majority vote of neighboring supernodes. Users running other nodes can ask the supernodes for a friend’s public key. In order to trick someone, an adversary would have to control over half of the supernodes from which a user requested a key. Our decision to go with an overlay network created a variety of issues such as synchronizing information between supernodes, being able to detect and report malicious supernodes, and getting new nodes incorporated into the network. These and the countless other design problems we faced definitely allowed us to appreciate the difficulty of writing a privacy application, but unfortunately, we were not fully able to test every element of our protocol and its implementation. After creating the protocol, we implemented small, bare-bones applications for our initial ideas of peer-to-peer chat and an encrypted social network.
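To illustrate the central idea the students describe (accepting a public key only when a majority of the queried supernodes agree on it), here is a minimal, hypothetical sketch. It is my own simplification of the description above, not their implementation, and all names and data are made up.

```python
from collections import Counter

def lookup_public_key(friend_id, supernodes, query):
    """Ask every supernode for friend_id's public key and accept an answer only
    if a strict majority of the queried supernodes agree on the same key.
    `query` is a callable (supernode, friend_id) -> key or None; in a real
    system it would be a network call."""
    answers = [query(node, friend_id) for node in supernodes]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    key, votes = Counter(answers).most_common(1)[0]
    # An adversary would need to control more than half of the queried
    # supernodes to make a forged key win this vote.
    return key if votes > len(supernodes) // 2 else None

# In-memory stand-in for three supernodes, one of them malicious or stale.
tables = [{"alice": "KEY_A"}, {"alice": "KEY_A"}, {"alice": "KEY_FORGED"}]
print(lookup_public_key("alice", tables, lambda node, fid: node.get(fid)))
# -> KEY_A
```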

What Happened to the Crypto Dream? Now in a new and improved paper form!
April 29, 2013

Last October I gave a talk titled “What Happened to the Crypto Dream?” where I looked at why crypto seems to have done little for personal privacy. The reaction from the audience (physical and online) was quite encouraging — not that everyone agreed, but they seemed to find it thought provoking — and several people asked me if I’d turn it into a paper. So when Prof. Alessandro Acquisti invited me to contribute an essay to the “On the Horizon” column in IEEE S&P magazine, I jumped at the chance, and suggested this topic.

Thanks to some fantastic feedback from colleagues and many improvements to the prose by the editors, I’m happy with how the essay has turned out. Here it is in two parts: Part 1, Part 2.

While I’m not saying anything earth shaking, I do make a somewhat nuanced argument — I distinguish between “crypto for security” and “crypto for privacy,” and further subdivide the latter into a spectrum between what I call “Cypherpunk Crypto” and “Pragmatic Crypto.” I identify different practical impediments that apply to those two flavors (in the latter case, a complex of related factors), and lay out a few avenues for action that can help privacy-enhancing crypto move in a direction more relevant to practice.

I’m aware that this is a contentious topic, especially since some people feel that the time is ripe for a resurgence of the cypherpunk vision. I’m happy to hear your reactions.

Privacy technologies: An annotated syllabus
http://33bits.org/2013/04/16/privacy-technologies-an-annotated-syllabus/
Tue, 16 Apr 2013 12:02:57 +0000

Last semester I taught a course on privacy technologies here at Princeton. Earlier I discussed how I refuted privacy myths that students brought into class. In this post I’d like to discuss the contents of the course. I hope it will be useful to other instructors who are interested in teaching this topic as well as for students undertaking self-study of privacy technologies. Beware: this post is quite long.

What should be taught in a class on privacy technologies? Before we answer that, let’s take a step back and ask, how does one go about figuring out what should be taught in any class?

I’ve seen two approaches. The traditional, default, overwhelmingly common approach is to think of it in terms of “covering content” without much consideration to what students are getting out of it. The content that’s deemed relevant is often determined by what the fashionable research areas happen to be, or historical accident, or some combination thereof.

A contrasting approach, promoted by authors like Bain, applies a laser focus on the skills that students will acquire and how they will apply them later in life. On teaching orientation day at Princeton, our instructor, who clearly subscribed to this approach, had each professor describe what students would do in the class they were teaching, and then wrote down only the verbs from these descriptions. The point was that our thinking had to be centered around skills that students would take home.

I prefer a middle ground. It should be apparent from my description of the traditional approach above that I’m not a fan. On the other hand, I have to wonder what skills our teaching coach would have suggested for a course on cosmology — avoiding falling into black holes? Alright, I’m exaggerating to make a point. The verbs in question are words like “synthesize” and “evaluate,” so there would be no particular difficulty in applying them to cosmology. But my point is that in a cosmology course, I’m not sure the instructor should start from these verbs.

Sometimes we want students to be exposed to knowledge primarily because it is beautiful, and being able to perceive that beauty inspires us, instills us with a love of further learning, and I dare say satisfies a fundamental need. To me a lot of the crypto “magic” that goes into privacy technologies falls into that category (not that it doesn’t have practical applications).

With that caveat, however, I agree with the emphasis on skills and life impact. I thought of my students primarily as developers of privacy technologies (and more generally, of technological systems that incorporate privacy considerations), but also as users and scholars of privacy technologies.

I organized the course into sections: a short introductory section followed by five sections that alternated in the level of math/technical depth. Every time we studied a technology, we also discussed its social/economic/political aspects. I had a great deal of discretion in guiding where the conversation around the papers went by giving students questions/prompts on the class Wiki. Let us now jump in. The italicized text is from the course page; the rest is my annotation.

0. Intro

Goals of this section: Why are we here? Who cares about privacy? What might the future look like?

In addition to helping flesh out the foundational assumptions of this course that I discussed in the previous post, pairing these opposing views with each other helped make the point that there are few absolutes in this class, that privacy scholars may disagree with each other, and that the instructor doesn’t necessarily agree with the viewpoints in the assigned reading, much less expect students to.

1. Cryptography: power and limitations

Goals. Travel back in time to the 80s and early 90s, understand the often-euphoric vision that many crypto pioneers and hobbyists had for the impact it would have. Understand how cryptographic building blocks were thought to be able to support this restructuring of society. Reason about why it didn’t happen.

Understand the motivations and mathematical underpinnings of the modern research on privacy-preserving computations. Experiment with various encryption tools, discover usability problems and other limitations of crypto.

I think the Chaum paper is a phenomenal and underutilized resource for teaching. My goal was to really immerse students in an alternate reality where the legal underpinnings of commerce were replaced by cryptography, much as Chaum envisioned (and even going beyond that). I created a couple of e-commerce scenarios for Wiki discussion and had them reason about how various functions would be accomplished.

My own views on this topic are set forth in this talk (now a paper; coming soon). In general I aimed to shield students from my viewpoints, and saw my role as helping them discover (and be able to defend) their own. At least in this instance I succeeded. Some students took the position that the cypherpunk dream is just around the corner.

Goals. Jump forward in time to the present day and immerse ourselves in the world of ubiquitous data collection and surveillance. Discover what kinds of data collection and data mining are going on, and why. Discuss how and why the conversation has shifted from Government surveillance to data collection by private companies in the last 20 years.

This section is rather self-explanatory. After the math-y flavor of the first section, this one has a good amount of economics, behavioral economics, and policy. One of the thought exercises was to project current trends into the future and imagine what ubiquitous tracking might lead to in five or ten years.

3. Anonymity and De-anonymization

Important note: communications anonymity (e.g., Tor) and data anonymity/de-anonymization (e.g., identifying people in digital databases) are technically very different, but we will discuss them together because they raise some of the same ethical questions. Also, Bitcoin lies somewhere in between the two.

Tor and Bitcoin (especially the latter) were the hardest but also the most rewarding parts of the class, both for them and for me. Together they took up 4 classes. Bitcoin is extremely challenging to teach because it is technically intricate, the ecosystem is rapidly changing, and a lot of the information is in random blog/forum posts.

In a way, I was betting on Bitcoin by deciding to teach it — if it had died with a whimper, their knowledge of it would be much less relevant. In general I think instructors should choose to make such bets more often; most curricula are very conservative. I’m glad I did.

It was a challenge to figure out which deanonymization paper to assign. I went with the DNA one because I wanted them to see that deanonymization isn’t a fact about data, but a fact about the world. Another thing I liked about this paper is that students would have to extract the not-too-complex statistical methodology from the bioinformatics discussion in which it is embedded. This didn’t go as well as I’d hoped.

I’ve co-authored a few deanonymization papers, but they’re not very well written and/or are poorly suited for pedagogical purposes. The Kaggle paper is one exception, which I made optional.

This is another pair of papers with opposing views. Since the latter paper is optional, knowing that most of them wouldn’t have read it, I used the Wiki prompts to raise many of the issues that the author raises.

4. Lightweight Privacy Technologies and New Approaches to Information Privacy

While cryptography is the mechanism of choice for cypherpunk privacy and anonymity tools like Tor, it is too heavy a weapon in other contexts like social networking. In the latter context, it’s not so much users deploying privacy tools to protect themselves against all-powerful adversaries but rather a service provider attempting to cater to a more nuanced understanding of privacy that users bring to the system. The goal of this section is to consider a diverse spectrum of ideas applicable to this latter scenario that have been proposed in recent years in the fields of CS, HCI, law, and more. The technologies here are “lightweight” in comparison to cryptographic tools like Tor.

This final section doesn’t have a coherent theme (and I admitted as much in class). My goal with the first two papers was to contrast a privacy problem which seems amenable to a purely or primarily technological formulation and solution (statistical queries over databases of sensitive personal information) with one where such attempts have been less successful (the decentralized, own-your-data approach to social networking and e-commerce).

These two essays aren’t directly related to privacy. One of the recurring threads in this course is the debate between purely technological and legal or other approaches to privacy; the theme here is to generalize it to a context broader than privacy. The Barlow essay asserts the exceptionalism of Cyberspace as an unregulable medium, whereas the Grimmelmann paper provides a much more nuanced view of the relationship between the law and new technological frontiers.

I’m making available the entire set of Wiki discussion prompts for the class (HTML/PDF). I consider this integral to the syllabus, for it shapes the discussion very significantly. I really hope other instructors and students find this useful as a teaching/study guide. For reference, each set of prompts (one set per class) took me about three hours to write on average.

There are many more things I want to share about this class: the major take-home ideas, the rationale for the Wiki discussion format, the feedback I got from students, a description of a couple of student projects, some thoughts on the sociology of different communities studying privacy and how that impacted the class, and finally, links to similar courses that are being taught elsewhere. I’ll probably close this series with a round-up post including as many of the above topics as I can.

How I utilized “expectation failure” to refute privacy myths
http://33bits.org/2013/04/11/how-i-utilized-expectation-failure-to-refute-privacy-myths/
Thu, 11 Apr 2013 12:50:18 +0000

Last semester I taught a course on privacy technologies. Since it was a seminar, the class was a small, self-selected group of very motivated students. Based on the feedback, it seems to have been a success; it was certainly quite personally gratifying for me. This is the first in a series of posts on what I learnt from teaching this course. In this post I will discuss some major misconceptions about privacy, how to refute them, and why it is important to do this right at the beginning of the course.

Privacy’s primary pitfalls

Instructors often have to break down the faulty mental models that students bring into class before actual learning can happen. This is especially true of the topic at hand. Luckily, misconceptions about privacy are so pervasive in the media and among the general public that it wasn’t too hard to identify the most common ones before the start of the course. And it didn’t take much class discussion to confirm that my students weren’t somehow exempt from these beliefs.

One cluster of myths is about the supposed lack of importance of privacy:

1. “There is no privacy in the digital age.” This is the most common and perhaps the most grotesquely fallacious of the misconceptions; more on this below.

2. “No one cares about privacy any more” (variant: young people don’t care about privacy).

3. “If you haven’t done anything wrong you have nothing to hide.”

A second cluster of fallacious beliefs is very common among computer scientists and comes from the tendency to reduce everything to a black-and-white technical problem. In this view, privacy maps directly to access control and cryptography is the main technical mechanism for achieving privacy. It’s a view in which the world is full of adversaries and there is no room for obscurity or nontechnical ways of improving privacy.

The first step in learning is to unlearn

Why is it important to spend time confronting faulty mental models? Why not simply teach the “right” ones? In my case, there was a particularly acute reason — to the extent that students believe that privacy is dead and that learning about privacy technologies is unimportant, they are not going to be invested in the class, which would be really bad. But even in the case of misconceptions that don’t lead to students doubting the fundamental premise of the class, there is a surprising reason why unlearning is important.

A famous experiment in the ’80s (I really, really recommend reading the linked text) demonstrated what we now know about the ineffectiveness of the “information transmission” model of teaching. The researchers interviewed students who had completed any of four introductory physics courses, and determined that they hadn’t actually learned what had been taught, such as Newton’s laws of motion; instead they had just learned to pass the tests. When the researchers sat down with students to find out why, here’s what they found:

What they heard astonished them: many of the students still refused to give up their mistaken ideas about motion. Instead, they argued that the experiment they had just witnessed did not exactly apply to the law of motion in question; it was a special case, or it didn’t quite fit the mistaken theory or law that they held as true.

A special case! Ha. What’s going on here? Well, learning new facts is easy. On the other hand, updating mental models is so cognitively expensive that we go to absurd lengths to avoid doing so. The societal-scale analog of this extreme reluctance is well-illustrated by the history of science — we patched the Ptolemaic model of the Universe, with the Earth at the center, for over a millennium before we were forced to accept that the Copernican system fit observations better.

The instructor’s arsenal

The good news is that the instructor can utilize many effective strategies that fall under the umbrella of active learning. Ken Bain’s excellent book (which the preceding text describing the experiment is from) lays out a pattern in which the instructor creates an expectation failure, a situation in which existing mental models of reality will lead to faulty expectations. One of the prerequisites for this to work, according to the book, is to get students to care.

Bain argues that expectation failure, done right, can be so powerful that students might need emotional support to cope. Fortunately, this wasn’t necessary in my class, but I have no doubt of it based on my personal experiences. For instance, back when I was in high school, learning how the Internet actually worked and realizing that my intuitions about the network had to be discarded entirely was such a disturbing experience that I remember my feelings to this day.

Let’s look at an example of expectation failure in my privacy class. To refute the “privacy is dying” myth, I found it useful to talk about Fifty Shades of Grey — specifically, why it succeeded even though publishers initially passed on it. One answer seems to be that since it was first self-published as an e-book, it allowed readers to be discreet and avoid the stigma associated with the genre. (But following its runaway success in that form, the stigma disappeared, and it was released in paper form and flew off the shelves.)

The relative privacy of e-books from prying strangers is one of the many ways in which digital technology affords more privacy for specific activities. Confronting students with an observed phenomenon whose explanation involves a fact that seems starkly contrary to the popular narrative creates an expectation failure. Telling personal stories about how technology has either improved or eroded privacy, and eliciting such stories from students, gets them to care. Once this has been accomplished, it’s productive to get into a nuanced discussion of how to reconcile the two views with each other, different meanings of privacy (e.g., tracking of reading habits), how the Internet has affected each, and how society is adjusting to the changing technological landscape.

I’m quite new to teaching — this is only my second semester at Princeton — but it’s been exciting to internalize the fact that learning is something that can be studied scientifically and teaching is an activity that can vary dramatically in effectiveness. I’m looking forward to getting better at it and experimenting with different methods. In the next post I will share some thoughts on the content of my course and what I tried to get students to take home from it.

Unlikely Outcomes? A Distributed Discussion on Decentralized Personal Data Architectures
http://33bits.org/2013/03/27/unlikely-outcomes-a-distributed-discussion-on-decentralized-personal-data-architectures/
Wed, 27 Mar 2013 15:44:35 +0000

In recent years there has been a mushrooming of decentralized social networks, personal data stores and other such alternatives to the current paradigm of centralized services. In the academic paper A Critical Look at Decentralized Personal Data Architectures last year, my coauthors and I challenged the feasibility and desirability of these alternatives (I also gave a talk about this work). Based on the feedback, we realized it would be useful to explicate some of our assumptions and implicit viewpoints, add context to our work, clarify some points that were unclear, and engage with our critics on some of the more contentious claims.

We found the perfect opportunity to do this via an invitation from Unlike Us Reader, produced by the Institute of Network Cultures — it’s a magazine run by a humanities-oriented group of people, with a long-standing interest in digital culture, but they also attract some politically oriented developers. The Unlike Us conference, from which this edited volume stems, is also very interesting. [1]

Three of the five original authors — Solon, Vincent and I — teamed up with the inimitable Seda Gürses for an interview-style conversation (PDF). Seda is unique among privacy researchers — one of her interests is to understand and reconcile the often maddeningly divergent viewpoints of the different communities that study privacy, so she was the ideal person to play the role of interlocutor. Seda solicited feedback from about two dozen people in the hobbyist, activist and academic communities, and synthesized the responses into major themes. Then the three of us took turns responding to the prompts, which Solon, with Seda’s help, organized into a coherent whole. A majority of the commenters consented to making their feedback public, and Seda has collected the discussion into an online appendix.

This was an unusual opportunity, and I’m grateful to everyone who made it happen, particularly Seda and Solon who put in an immense amount of work. My participation was very enjoyable. Research proceeds at such a pace that we rarely have the opportunity to look back and cogitate about the process; when we do, we’re often surprised by what we find. For example, here’s something I noted with amusement in one of my responses:

My interest in decentralized social networking apparently dates to 2009, as I just discovered by digging through my archives. I’d signed up to give a talk on pitfalls of social networking privacy at a Stanford workshop, and while preparing for it I discovered the rich academic literature and the various hobbyist efforts in the decentralized model. My slides from that talk seem to anticipate several of the points we made about decentralized social networking in the paper (albeit in bullet-point form), along with the conclusion that they were “unlikely to disrupt walled gardens.” Funnily enough, I’d completely forgotten about having given this talk when we were writing the paper.

I would recommend reading this text as a companion to our original paper. Read it for extra context and clarifications, a discussion of controversial points, and as a way of stimulating thinking about the future prospects of alternative architectures. It may also be an interesting read as an example of how people writing an article together can have different views, and as a bit of a behind-the-scenes look at the research process.

[1] In particular, the latest edition of the conference that just concluded had a panel titled “Are you distributed? The Federated Web Show” moderated by Seda, with Vincent as one of the participants. It touched upon many of the same themes as our work.

The job talk is a performance
http://33bits.org/2013/02/09/the-job-talk-is-a-performance/
Sat, 09 Feb 2013 16:43:07 +0000

This is the second in a series of posts with advice for computer science academic job candidates.

One shot, one opportunity

The philosopher Marshall Mathers once asked rhetorically, “Look, if you had one shot, or one opportunity / To seize everything you ever wanted in one moment / Would you capture it or just let it slip?”

He added, “Yo.” [1]

I don’t mean to imply that an academic position is everything you ever wanted, but it’s a pretty good life (although not for these reasons). Like it or not, it’s set up so that your career up until this point comes down to one moment. After years of hard work, your ability as a researcher will be judged primarily based on how you sell yourself in the fleeting span of an hour. Of course, you’ll (hopefully) give your talk at many places, but it’s going to be the same talk!

There’s a reason I’m saying this, and it’s not to stress you out even more. Rather, if at any point the level of preparedness that I suggest seems excessive or disproportionate, remember the wise words quoted above.

Public speaking is a performance

My first piece of advice is to read the book Confessions of a Public Speaker.[2] As in, don’t even think about giving your job talk without having read it. You can read it in a sitting; putting it into practice will of course take longer. I cannot overstate the impact this book had on my talk (and my public speaking in general). There are probably other books that capture much of the wisdom in Confessions, and I’d love to hear other recommendations, but if you’re going to read one book it would have to be this one.

There are numerous very useful little details in the book, but it has one central idea that can be boiled down to the phrase “public speaking is a performance.” Job talks are even more of a performance than public speaking in general, since the audience is specifically there to judge you.[3] This is a generative metaphor — it allows you to predict things about your job talk based on what you know about performing. Fully appreciating the metaphor will require reading the book, but here are two such predictions that might otherwise be surprising.

Your first priority is to entertain

Certainly you must both entertain and inform, but the point is that you don’t really have a shot at the latter if you fail at the former. Sitting in a lecture, as everyone who remembers their student days is surely aware, can be excruciating; it’s an extremely unnatural situation from an evolutionary perspective (again, read Confessions to appreciate why). A chart in the book What’s the Use of Lectures? shows students’ heart rate falling over the course of a lecture. It’s only a drop of a few beats per minute, but it translates to an enormous difference in alertness. If you don’t do anything different in your talk and simply present your material, your audience’s attention level will be greatly diminished by the half-hour mark, and by the end of your talk people will basically be comatose. Anything you can do to break the routine, linear, hyper-boring pattern of a lecture will help jolt the audience out of their stupor. (That includes asking questions — I usually asked two or three in my talk.) Otherwise they won’t be excited about you, nor will they remember much after the talk.

Rehearse, rehearse, rehearse

You may have heard “practice, practice, practice.” I’d rather cast it in the language of performance, as there are some subtle differences. For example, when people tell you to practice, they tell you not to overdo it because you’d lose your spontaneity. I disagree. In a rehearsal, everything is practiced down to the last detail. In fact, the apparently spontaneous things that I said in my talk were the most well-rehearsed parts.

Rehearsal should include videotaping yourself and watching it. Yes, it’s painful and majorly cringe-inducing, but it’s absolutely, absolutely essential. In addition to all the obvious facets of good presentation style that I won’t repeat, one of the subtle but important things you should watch for is nervous tics or other repetitive behaviors — almost everyone has one or more of those, and they can almost derail your talk by distracting your audience.

The reason rehearsal makes such a huge difference is that when you’re delivering a rehearsed talk, your every word and gesture is subconscious, freeing up your mental bandwidth for observing and reacting in real-time to the facial expressions of your audience. There are never more than 40-50 people in these talks, a small enough number that you can instantly notice if someone looks confused, skeptical, or bored. But this won’t be possible if you have to think through your slides instead. The reduction in cognitive load also minimizes the chance of “hitting the wall,” a phenomenon of sudden mental fatigue that’s a serious danger in long-ish talks and can leave you helpless.

Let me close with an example of a little theatrical thing I did that shows the value of rehearsal and the performance metaphor. One of the goals in my location privacy project is to minimize smartphone power consumption. When I got to that part, I’d say, “those of you with Android phones know how bad the battery life is. In fact I usually carry a spare battery around… actually, I think I have it on me.” Then I’d pull a smartphone battery out of my jacket pocket with a bit of a dramatic touch. Somehow the use of a physical prop seemed to reframe their thinking from “yet another academic paper” to “solving a real problem.” It would also usually elicit a laugh and elevate their attention level.

There is so much more to say about job talks, not to mention other aspects of the job interview. I might do follow up posts on a mathematical model of audience behavior and/or an explanation of why slide transitions are (by far) the most important part of your slides.

[1] This post was written while listening to Lose Yourself in a loop.

[2] If it needs to be said, I have no stake in the book, financial or otherwise.

[3] I hasten to add that teaching is very different from public speaking and is emphatically not a performance.

Price Discrimination and the Illusion of Fairness
http://33bits.org/2013/01/22/price-discrimination-and-the-illusion-of-fairness/
Tue, 22 Jan 2013 18:24:26 +0000

In my previous article I pointed out that online price discrimination is suspiciously absent in directly observable form, even though covert price discrimination is everywhere. Now let’s talk about why that might be.

By “covert” I don’t mean that the firm is trying to keep price discrimination a secret. Rather, I mean that the differential treatment isn’t made explicit — e.g., it isn’t based directly on a customer attribute — and thereby avoids triggering the perception of unfairness or discrimination. A common example is selective distribution of coupons instead of listing different prices. Such discounting may be publicized, but it is still covert.

The perception of fairness

The perception of fairness or unfairness, then, is at the heart of what’s going on. Going back to the WSJ piece, I found it interesting to see the reaction of the customer to whom Staples quoted $1.50 more for a stapler based on her ZIP code: “How can they get away with that?” she asks. To which my initial reaction was, “Get away with what, exactly? Supply and demand? Econ 101?”

Even though some of us might not feel the same outrage, I think all of us share at least a vague sense of unease about overt price discrimination. So I decided to dig deeper into the literature in psychology, marketing, and behavioral economics on the topic of price fairness and understand where this perception comes from. What I found surprised me.

First, the fairness heuristic is quite elaborate and complex. In a vast literature spanning several decades, early work such as the “principle of dual entitlement” by Kahneman and coauthors established some basics. Quoting Anderson and Simester: “This theory argues that customers’ have perceived fairness levels for both firm profits and retail prices. Although firms are entitled to earn a fair profit, customers are also entitled to a fair price. Deviations from a fair price can be justified only by the firm’s need to maintain a fair profit. According to this argument, it is fair for retailers to raise the price of snow shovels if the wholesale price increases, but it is not fair to do so if a snowstorm leads to excess demand.”

Much later work has added to and refined that model. A particularly impressive and highly cited 2004 paper reviews the literature and proposes an elaborate framework with four different classes of inputs to explain how people decide whether pricing is fair or unfair in various situations. Some of the findings are quite surprising. For example: in the case of differential pricing to the buyer’s disadvantage, “trust in the seller has a U-shaped effect on price fairness perceptions.”

The illusion of fairness

Sounds like we have a well-honed and sophisticated decision procedure, then? Quite the opposite, actually. The fairness heuristic seems to be rather fragile, even if complex.

Let’s start with an example. Andrew Odlyzko, in his brilliant essay on price discrimination — all the more for the fact that it was published back in 2003 [1] — has this to say about Coca Cola’s ill-fated plans for price-adjusting vending machines: “In retrospect, Coca Cola’s main problem was that news coverage always referred to its work as leading to vending machines that would raise prices in warm weather. Had it managed to control publicity and present its work as leading to machines that would lower prices in cold weather, it might have avoided the entire controversy.”

We know how to explain the public’s reaction to the Coca Cola announcement using behavioral economics — given the way it was presented (or framed), customers take the lower price as the “reference price,” and the price increase seems unfair, whereas Odlyzko’s suggested framing would anchor the higher price as the reference price. Of course, just because we can explain how the fairness heuristic works doesn’t make it logical or consistent, let alone properly grounded in social justice.

More generally, every aspect of our mental price fairness assessment heuristic seems similarly vulnerable to hijacking by tweaking the presentation of the transaction without changing the essence of price discrimination. Companies have of course gotten wise to this; there’s even academic literature on it. One of the techniques proposed in this paper is “reference group signaling” — getting a customer to change the set of other customers to whom they mentally compare themselves. [2]

The perception of fairness, then, can be more properly called the illusion of fairness.

The fragility of the fairness heuristic becomes less surprising considering that we apparently share it with other primates. This hilarious clip from a TED talk shows a capuchin monkey reacting poorly, to put it mildly, to differential treatment in a monkey-commerce setting (although the jury may still be out on the significance of this experiment). If our reaction to pricing schemes is partly or largely due to brain circuitry that evolved millions of years ago, we shouldn’t expect it to fare well when faced with the complexities of modern business.

Lose-lose

Given that the prime impediment to pervasive online price discrimination is a moral principle that is fickle and easily circumventable, one can expect companies to do exactly that, since they can reap most of the benefits of price discrimination without the negative PR. Indeed, it is my belief that more covert price discrimination is going on than is generally recognized, and that it is accelerating due to some technological developments.

This is a problem because price discrimination does raise ethical concerns, and these concerns are every bit as significant when it is covert. [3] However, since it is much less transparent, there’s less of an opportunity for public debate.

There are two directions in which I want to take this series of articles from this point: first a look at how new technology is enabling powerful forms of tailoring and covert price discrimination, and second, a discussion of what can be done to make price discrimination more transparent and how to have an informed policy discussion about its benefits and dangers.

[1] I had the pleasure of sitting next to Professor Odlyzko at a conference dinner once, and I expressed my admiration of the prescience of his article. He replied that he’d worked it all out in his head circa 1996 but took a few years to put it down on paper. I could only stare at him wordlessly.

[2] I’m struck by the similarities between price fairness perceptions and privacy perceptions. The aforementioned 2004 price fairness framework can be seen as serving a roughly analogous function to contextual integrity, which is (in part) a theory of consumer privacy expectations. Both these theories are the result of “reverse engineering,” if you will, of the complex mental models in their respective domains using empirical behavioral evidence. Continuing the analogy, privacy expectations are also fragile, highly susceptible to framing, and liable to be exploited by companies. Acquisti and Grossklags, among others, have done some excellent empirical work on this.

[3] In fact, crude ways of making customers reveal their price sensitivity lead to a much higher social cost than overt price discrimination. I will take this up in more detail in a future post.

Online price discrimination: Conspicuous by its absence
http://33bits.org/2013/01/08/online-price-discrimination-conspicuous-by-its-absence/
Tue, 08 Jan 2013 12:57:56 +0000

The mystery about online price discrimination is why so little of it seems to be happening.

Consumer advocates and journalists among others have been trying to find smoking gun evidence of price discrimination — the overt kind where different customers are charged different prices for identical products based on how much they are willing to pay. (By contrast, examples of covert or concealed price discrimination abound; see, for example, my 2011 article.) Back in 2000 Amazon tried a short-lived experiment where prices of DVDs for new and for regular users were different. But that remains essentially the only example.

This should be surprising. Tailoring prices to individuals is far more technically feasible online than offline, since shoppers are either identified or at least have loads of behavioral data associated with their pseudonymous cookies. The online advertising industry claims that this is highly effective for targeting ads; estimating consumers’ willingness to pay shouldn’t be much harder. Clearly, price discrimination has benefits to firms engaging in it by allowing them to capture more of the “consumer surplus.” (Whether or not it is beneficial to consumers is a more controversial question that I will defer to a future post.) In fact, based on technical feasibility and economic benefits, one might expect the practice to be pervasive.

The evidence (or lack thereof)

A study out of Spain last year took a comprehensive look at online merchants, by far the most thorough analysis of its kind. They created two “personas” with different browsing histories — one of which visited discount sites and the other visited sites for luxury products. Each persona then browsed 200 e-commerce sites as well as search engines to see if they were treated differently. Here’s what the authors found:

There is evidence for search discrimination or steering where the high- and low-income personas are shown ads for high-end and low-end products respectively. In my opinion, the line between this practice and plain old behavioral advertising is very, very slim. [1]

There is no evidence for price discrimination based on personas/browsing histories.

Three of the 200 retailers, including Staples, varied prices based on the user’s location, but not necessarily in a way that can’t be explained by costs of doing business.

Visitors coming from one particular deals site (nextag.com) saw lower prices at various retailers. (Discounting and “deals” are very common forms of concealed price discrimination.)
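For a flavor of the persona-based methodology used in measurement studies like this one, here is a minimal sketch of comparing the price shown to two browsing personas. It is not the authors’ code; the product URL, the cookie contents, and the price-extraction pattern are hypothetical placeholders.

```python
import re
import requests

# Hypothetical cookie jars accumulated by two browsing personas
PERSONAS = {
    "discount": {"tracker_id": "persona-discount-123"},
    "luxury":   {"tracker_id": "persona-luxury-456"},
}

# Placeholder pattern; a real study would parse each retailer's page structure
PRICE_RE = re.compile(r'itemprop="price"\s+content="([\d.]+)"')

def price_seen_by(cookies, product_url):
    """Fetch a product page under one persona's cookies and extract the listed price."""
    html = requests.get(product_url, cookies=cookies, timeout=10).text
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None

def compare_personas(product_url):
    """Report the price each persona sees; differing prices merit a closer look."""
    prices = {name: price_seen_by(cookies, product_url)
              for name, cookies in PERSONAS.items()}
    if len({p for p in prices.values() if p is not None}) > 1:
        print("possible differential pricing at", product_url, prices)
    return prices
```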

A new investigation by the Wall Street Journal analyzes Staples in more detail. While the Spain study found geographic variation in prices, the WSJ study goes further and shows a strong correlation between lower prices and consumers’ ability to drive to competitors’ stores, which is an indicator of willingness to pay. I’m not 100% convinced that they’ve ruled out alternative hypotheses, but it does seem plausible that Staples’ behavior constitutes actual price discrimination, even though geography is a far cry from utilizing behavioral data about individuals.

Other findings in the WSJ piece include websites that offer discounts to mobile users, as well as location-dependent pricing on Lowe’s and Home Depot’s websites, though with little evidence that it is based on anything but costs of doing business.

So there we have it. Both studies are very thorough, and I commend the authors, but I consider their results to be mostly negative — very few companies are varying prices at all and none are utilizing anywhere near the full extent of data available about users. Other price discrimination controversies include steering by Orbitz and a hastily-retracted announcement by Coca Cola for vending machines that would tailor prices to demand. Neither company charged or planned to charge different prices for the same product based on who the consumer was.

In short, despite all the hubbub, I find overt price discrimination conspicuous by its absence. In a follow-up post I will propose an explanation for the mystery and see what we can learn from it.

[1] This is an automatic consequence of collaborative recommendation that suggests products to users based on what similar users have clicked on/purchased in the past. It does not require that any explicit inference of the consumer’s level of affluence be made by the system. In other words, steering, bubbling etc. are inherent features of collaborative filtering algorithms which drive personalization, recommendation and information retrieval on the Internet. This fact greatly complicates attempts to define, detect or regulate unfair discrimination online.

Embracing failure: How research projects are like startups
http://33bits.org/2013/01/02/embracing-failure-how-research-projects-are-like-startups/
Wed, 02 Jan 2013 16:09:27 +0000

As an academic who’s spent time in the startup world, I see strong similarities between the nature of a scientific research project and the nature of a startup. This boils down to the fact that most research projects fail (in a sense that I’ll describe), and even among the successful projects the variance is extremely high — most of the impact is concentrated in a few big winners.

Of course, research projects are clearly unlike startups in some important ways: in research you don’t get to capture the economic benefit of your work; your personal gain from success is not money but academic reputation (unless you commercialize your research and start an actual startup, but that’s not what this post is about at all.) The potential personal downside is also lower for various reasons. But while the differences are obvious, the similarities call for some analysis.

I hope this post is useful to grad students in particular in acquiring a long-term vision for how to approach their research and how to maximize the odds of success. But perhaps others including non-researchers will also find something useful here. There are many aspects of research that may appear confusing or pathological, and at least some of them can be better understood by focusing on the high variance in research impact.

1. Most research projects fail.

To me, publication alone does not constitute success; rather, the goal of a research project is to impact the world, either directly or by influencing future research. Under this definition, the vast majority of research ideas, even if published, are forgotten in a few years. Citation counts estimate impact more accurately [1], but I think they still significantly underestimate the skew.

The fact that most research projects don’t make a meaningful lasting impact is OK — just as the fact that most startups fail is not an indictment of entrepreneurship.

A researcher might choose to take a self-interested view and not care about impact, but even in this view, merely aiming to get papers published is not a good long-term strategy. For example, during my recent interview tour, I got a glimpse into how candidates are evaluated, and I don’t think someone with a slew of meaningless publications would have gotten very far. [2]

2. Grad students: diversify your portfolio!

Given that failure is likely (and for reasons you can’t necessarily control), spending your whole Ph.D. trying to crack one hard problem is a highly risky strategy. Instead, you should work on multiple projects during your Ph.D., at least at the beginning. This can be either sequential or parallel; the former is more similar to the startup paradigm (“fail-fast”).

I achieved diversity by accident. Halfway through my Ph.D. there were at least half a dozen disparate research topics where I’d made some headway (some publications, some works in progress, some promising ideas). Although I felt I was directionless, this turned out to be the right approach in retrospect. I caught a lucky break on one of them — anonymity in sanitized databases — because of the Netflix Prize dataset, and from then on I doubled down to focus on deanonymization. This breadth-then-depth approach paid off.

3. Go for the big hits.

Paul Graham’s fascinating essay Black Swan Farming is about how skewed the returns are in early-stage startup investing. Just two of the several hundred companies that YCombinator has funded are responsible for 75% of the returns, and in each batch one company outshines all the rest.

The returns from research aren’t quite as skewed, but they’re skewed enough to be highly counterintuitive. This means researchers must explicitly account for the skew in selecting problems to work on. Following one’s intuition and/or the crowd is likely to lead to a mediocre career filled with incremental, marginally publishable results. The goal is to do something that’s not just new and interesting, but which people will remember in ten years, and the latter can’t necessarily be predicted based on the amount of buzz a problem is generating in the community right now. Breakthroughs often come from unsexy problems (more on that below).

There’s a bit of a tension between going for the hits and diversifying your portfolio. If you work on too few projects, you incur the risk that none of them will pan out. If you work on too many, you spread yourself too thin and the quality of each one suffers, which lowers the chance that any one of them will be a big hit. Everyone must find their own sweet spot. One piece of advice given to junior professors is to “learn to say no.”

4. Find good ideas that look like bad ideas.

How do you predict if an idea you have is likely to lead to success, especially a big one? Again let’s turn to Paul Graham in Black Swan Farming:

“the best startup ideas seem at first like bad ideas. … if a good idea were obviously good, someone else would already have done it. So the most successful founders tend to work on ideas that few beside them realize are good.”

Something very similar is true in research. There are some problems that everyone realizes are important. If you want to solve such a problem, you have to be smarter than most others working on it and be at least a little bit lucky. Craig Gentry, for example, invented Fully Homomorphic Encryption mostly by being very, very smart.

Then there are research problems that are analogous to Graham’s good ideas that initially look bad. These fall into two categories: (1) research problems that no one has realized are important, and (2) problems that everyone considers prohibitively difficult but which turn out to have a back door.

If you feel you are in a position to take on obviously important problems, more power to you. I try to work on problems that everyone seems to think are bad ideas (either unimportant or too difficult), but where I have some “unfair advantage” that leads me to think otherwise. Of course, a lot of the time they are right, but sometimes they are not. Let me give two examples.

I consider Adnostic (online behavioral advertising without tracking) to be moderately successful: it has had an impact on other research in the area, as well as in policy circles as an existence proof of behavioral-advertising-with-privacy.[3] Now, my coauthors started working on it before I joined them, so I can take none of the credit for problem selection. But it’s a good illustration of the principle. The main reason they decided this problem was important was that privacy advocates were up in arms about online tracking. Almost no one in the computer science community was studying the topic, because they felt that simply blocking trackers was an adequate solution. So this was a case of picking a problem that people didn’t realize was important. Three years later it’s become a very crowded research space.
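To give a flavor of what an existence proof of behavioral advertising without tracking can look like, here is a toy sketch of the general idea: the interest profile stays in the browser, the ad network serves the same set of candidate ads to everyone, and the client picks the best match locally. This is only an illustration of the concept, not Adnostic’s actual design or code; the profile format and scoring are made up.

```python
# The interest profile never leaves the user's machine.
local_profile = {"cycling": 0.9, "cameras": 0.4, "travel": 0.2}

# The ad network sends the same candidate ads to everyone.
candidate_ads = [
    {"id": "ad-1", "tags": {"cameras": 1.0}},
    {"id": "ad-2", "tags": {"cycling": 0.7, "travel": 0.3}},
    {"id": "ad-3", "tags": {"mortgages": 1.0}},
]

def relevance(ad, profile):
    """Dot product between an ad's interest tags and the local profile."""
    return sum(w * profile.get(tag, 0.0) for tag, w in ad["tags"].items())

# The browser chooses which ad to display, so no behavioral data is revealed.
best_ad = max(candidate_ads, key=lambda ad: relevance(ad, local_profile))
print("locally selected ad:", best_ad["id"])
```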

Another example is my work with Shmatikov on deanonymizing social networks by finding a matching between the nodes of two social graphs. Most people I talked to at the time thought this was impossible — after all, it’s a much harder version of graph isomorphism, and we’re talking about graphs with millions of nodes. Here’s the catch: people intuitively think graph isomorphism is “hard,” but it is not known to be NP-complete, and on real-world graphs it is embarrassingly easy. We knew this, and even though the social network matching problem is harder than graph isomorphism, we thought it was still doable. In the end it took months of work, but fortunately it was just within the realm of possibility.
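A toy sketch of why matching real-world graphs is more tractable than the worst-case intuition suggests: many nodes have structural signatures (their degree plus their neighbors’ degrees) that are rare enough to yield initial seed matches, which propagation-style algorithms can then grow. This is just the intuition, not the algorithm from the paper.

```python
from collections import Counter

def signature(graph, node):
    """A node's degree plus the sorted degrees of its neighbors."""
    return (len(graph[node]), tuple(sorted(len(graph[nbr]) for nbr in graph[node])))

def unique_signatures(graph):
    """Map each signature that occurs exactly once to its node."""
    sigs = {v: signature(graph, v) for v in graph}
    counts = Counter(sigs.values())
    return {sig: v for v, sig in sigs.items() if counts[sig] == 1}

def seed_matches(g1, g2):
    """Pair up nodes whose structural signature is unique in both graphs."""
    u1, u2 = unique_signatures(g1), unique_signatures(g2)
    return {u1[s]: u2[s] for s in u1.keys() & u2.keys()}

# Tiny example with graphs given as adjacency dicts
g_anonymous = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
g_named = {"alice": {"bob", "carol"}, "bob": {"alice"}, "carol": {"alice"}}
print(seed_matches(g_anonymous, g_named))   # {'a': 'alice'}
```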

5. Most researchers are known for only one or two things.

Let me end with an interesting side effect of the high-skew theory: a successful researcher may have worked on many successful projects during their career, but the top one or two of those will likely be far better known than the rest. This seems to be borne out empirically, and it is a source of much annoyance for many researchers to be pigeonholed as “the person who did X.” Take Ron Rivest, who has been prolific for several decades not just in cryptography but also in algorithms and lately in voting. Most computer scientists will recall that he’s the R in RSA, but knowledge of his work drops off sharply after that. This is also reflected in the citation counts (the first entry is a textbook, not a research paper). [4]

In summary, if you’re a researcher, think carefully about which projects to work on and what the individual and overall chances of success are. And if you’re someone who’s skeptical about academia because your friend who dropped out of a Ph.D. after their project failed convinced you that all research is useless, I hope this post got you to think twice.

I may do a follow-up post examining whether ideas are as valuable as they are held to be in the research community, or whether research ideas are more similar to startup ideas in that it’s really execution and selling that lead to success.

[1] For example, a quarter of my papers are responsible for over 80% of my citations.
[2] That said, I will get a much better idea in the next few months from the other side of the table :)
[3] Specifically, it undermines the “we can’t stop tracking because it would kill our business model” argument that companies love to make when faced with pressure from privacy advocates and regulators.
[4] To be clear, my point is that Rivest’s citation counts drop off relative to his most well-known works.

New Developments in Deanonymization
http://33bits.org/2012/12/17/new-developments-in-deanonymization/
Mon, 17 Dec 2012 16:59:31 +0000

This post is a roundup of developments in deanonymization in the last few months. Let’s start with two stories relating to how a malicious website can silently discover the identity of a visitor, which is an insidious type of privacy breach that I’ve written about quite a bit (1, 2, 3, 4, 5, 6).

Firefox bug exposed your identity. The first is a vulnerability resulting from a Firefox bug in the implementation of functions like exec and test. The bug allows a website to learn the URL of an embedded iframe from some other domain. How can this lead to uncovering the visitor’s identity? Because twitter.com/lists redirects to twitter.com/<username>/lists. This allows a malicious website to open a hidden iframe pointing to twitter.com/lists, query the URL after redirection, and learn the visitor’s Twitter handle (if they are logged in). [1,2]

This is very similar to a previous bug in Firefox that led to the same type of vulnerability. The URL redirect that was exploited there was google.com/profiles/me → user-specific URL. It would be interesting to find and document all such generic-URL → user-specific-URL redirects in major websites. I have a feeling this won’t be the last time such redirection will be exploited.
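As a minimal sketch of how one might catalog such redirects, the snippet below fetches candidate generic URLs with the session cookies of a logged-in test account and reports those whose final URL differs after redirection. The cookie values are placeholders; the browser-side attacks described above exploit the same redirect from an attacker’s page, without access to the cookie itself.

```python
import requests

# Candidate "generic" URLs suspected of redirecting to user-specific URLs
CANDIDATES = [
    "https://twitter.com/lists",   # historically redirected to /<username>/lists
]

# Placeholder: session cookies for a logged-in test account you control
SESSION_COOKIES = {"auth_token": "REPLACE_ME"}

def find_user_specific_redirects(urls, cookies):
    """Return the URLs whose post-redirect location differs from the generic URL."""
    leaks = {}
    for url in urls:
        resp = requests.get(url, cookies=cookies, allow_redirects=True, timeout=10)
        if resp.url != url:
            leaks[url] = resp.url   # inspect manually: does it embed the username?
    return leaks

print(find_user_specific_redirects(CANDIDATES, SESSION_COOKIES))
```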

Visitor deanonymization in the wild. The second story is an example of visitor deanonymization happening in the wild. It appears that the technique utilizes a tracking cookie from a third-party domain to which the visitor previously gave their email and other info; in other words, #3 in my five-fold categorization of ways in which identity can be attached to browsing logs.

I don’t consider this instance to be particularly significant — I’m sure there are other implementations in the wild — and it’s not technically novel, but as far as I know this is the first time it has gotten significant attention from the public, even if only in tech circles. I see this as a first step in a feedback loop: changing expectations about online anonymity embolden more sites to deanonymize visitors, which further lowers the expectation of privacy.

Deanonymization of mobility traces. Let’s move on to the more traditional scenario of deanonymization of a dataset by combining it with an auxiliary, public dataset which has users’ identities. Srivatsa and Hicks have a new paper with demonstrations of deanonymization of mobility traces, i.e., logs of users’ locations over time. They use public social networks as auxiliary information, based on the insight that pairs of people who are friends are more likely to meet with each other physically. The deanonymization of Bluetooth contact traces of attendees of a conference based on their DBLP co-authorship graph is cute.
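The core scoring idea is that a proposed mapping from anonymous traces to named individuals is more plausible when the people it places together are friends in the auxiliary social graph. Below is a toy sketch of that intuition only; it is not the paper’s algorithm, and the data structures are made up.

```python
# Auxiliary public social graph: pairs of people known to be friends
friends = {("alice", "bob"), ("bob", "carol")}

# Anonymized mobility data: pairs of anonymous IDs observed at the same place and time
colocations = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]

def score(mapping):
    """Count co-location events that the proposed identities explain as friendships.

    Higher scores indicate a more plausible mapping; a search over candidate
    mappings would keep the highest-scoring ones.
    """
    total = 0
    for a, b in colocations:
        pair = tuple(sorted((mapping[a], mapping[b])))
        if pair in friends:
            total += 1
    return total

candidate = {"u1": "alice", "u2": "bob", "u3": "carol"}
print(score(candidate))   # 2 of the 3 co-locations are between friends
```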

This paper adds to the growing body of evidence that anonymization of location traces can be reversed, even if the data is obfuscated by introducing errors (noise).

So many datasets, so little time. Speaking of mobility traces, Jason Baldridge points me to a dataset containing mobility traces (among other things) of 5 million “anonymous” users in the Ivory Coast recently released by telecom operator Orange. A 250-word research proposal is required to get access to the data, which is much better from a privacy perspective than a 1-click download. It introduces some accountability without making it too onerous to get the data.

In general, the incentive for computer science researchers to perform practical demonstrations of deanonymization has diminished drastically. Our goal has always been to showcase new techniques and improve our understanding of what’s possible, and not to name and shame. Even if the Orange dataset were more easily downloadable, I would think that the incentive for deanonymization researchers would be low, now that the Srivatsa and Hicks paper exists and we know for sure that mobility traces can be deanonymized, even though the experiments in the paper are on a far smaller scale.

Head in the sand: rational?! I gave a talk at a privacy workshop recently taking a look back at how companies have reacted to deanonymization research. My main point was that there’s a split between the take-your-data-and-go-home approach (not releasing data because of privacy concerns) and the head-in-the-sand approach (pretending the problem doesn’t exist). Unfortunately but perhaps unsurprisingly, there has been very little willingness to take a middle ground, engaging with data privacy researchers and trying to adopt technically sophisticated solutions.

Interestingly, head-in-the-sand might be rational from companies’ point of view. On the one hand, researchers don’t have the incentive for deanonymization anymore. On the other hand, if malicious entities do it, naturally they won’t talk about it in public, so there will be no PR fallout. Regulators have not been very aggressive in investigating anonymized data releases in the absence of a public outcry, so that may be a negligible risk.

Some have questioned whether deanonymization in the wild is actually happening. I think it’s a bit silly to assume that it isn’t, given the economic incentives. Of course, I can’t prove this and probably never can. No company doing it will publicly talk about it, and the privacy harms are so indirect that tying them to a specific data release is next to impossible. I can only offer anecdotes to explain my position: I have been approached multiple times by organizations who wanted me to deanonymize a database they’d acquired, and I’ve had friends in different industries mention casually that what they do on a daily basis to combine different databases together is essentially deanonymization.

[1] For a discussion of why a social network profile is essentially equivalent to an identity, see here and the epilog here.
[2] Mozilla pulled Firefox 16 as a result and quickly fixed the bug.

Five Surprises from the Computer Science Academic Job Search (October 1, 2012)
http://33bits.org/2012/10/01/five-surprises-from-the-computer-science-academic-job-search/

I’ve just about settled into a rhythm at Princeton — classes started two weeks ago — and next year’s academic job search cycle is already underway! Indeed, I started my job search in earnest almost exactly a year ago. So I guess surprise number zero from this whole process is how time-consuming it was. There’s been no ‘normal’ or ‘routine’ during this year; each month has been unlike the previous one. If you’re starting your academic job search, buckle up, it’s gonna be a wild ride!

There’s lots of advice online about the process; you should read all of it. Instead of duplicating what’s been said, I will focus on the things that surprised me in spite of having prepared as well as I possibly could. So if the rest of this post appears a bit contrarian, it’s just selection bias.

1. You’ll need someone to hold your hand. I can’t overstate how much of a difference it makes to have someone who’s been through the process whom you can talk to on a regular basis during your job search. Whether it’s achieving the right depth-breadth balance in your job talk, or wording emails strategically/diplomatically, or knowing how to best space out your interviews, you can’t figure it out by yourself or by reading online advice.

Typically the person helping you will be your advisor, but if they are busy you should find someone else. The good news is that many people will be willing to help out and pay it forward. It doesn’t have to be one person; you can split it between two or three people. I know that I would have screwed it up many times over if it hadn’t been for my advisor and everyone else who helped me out.

Some candidates networked extensively both to compare notes with other job searchers and to obtain and share privileged information. I avoided this entirely because I didn’t want the stress associated with it, and I’m very happy with my decision. That said, maybe I missed out in some way, I don’t know.

2. It’s not an interview. Perhaps this should have been obvious, but I was taken aback during my first “interview.” People already assume you’re an expert in your subfield, and so they aren’t trying to assess your technical competence. At all. I was tested exactly once in my whole tour — a professor asked me to state and sketch a proof of any theorem (of my choice) from my Netflix paper. Another professor apparently found this egregious, so he later wrote me a rather apologetic email. I found the whole thing rather amusing.

Best as I can tell, what they’re trying to assess is your personality (more bluntly, they want to make sure you’re not an asshole), and whether they can collaborate with you. So everyone was extremely polite to me and never asked adversarial questions or gave me much pushback.

One consequence of this interview style is that no one reads your papers, because they don’t need to. I already knew that no one read my papers in the normal course of things, but before the interview season I thought, “Finally, a few dozen people are going to read my papers!” Didn’t happen. Maybe it has something to do with me, but I think a big part of the reason is that in computer science we don’t seem to have a culture of reading beyond the abstract or introduction of papers (except in reviews, or reading groups, or when directly extending previous work). Knowing this, authors don’t have an incentive to write in a readable manner, and the cycle is self-perpetuating. But I digress.

A happy side-effect of non-interview interviews was that the process wasn’t mentally exhausting. It was sort of like meeting a bunch of people for coffee and chatting for half an hour with each one. Since all the advice I’d gotten suggested that interviews would leave me dead tired, I was initially worried that I might be doing something wrong — maybe I wasn’t having sufficiently technical conversations? I suspect the real difference is that most people get exhausted due to being pumped full of adrenaline; due to a biological luck of the draw I don’t generate any noticeable adrenaline in these situations (including right before talks, which I’ve found a bit surprising).

3. You don’t have to interview them. Everyone else dispensing online advice seems to think, “you’re not just being interviewed; you’re also interviewing them.” I disagree. In one or two cases it was obvious during my interview that the school or department wouldn’t be a good fit for me without even having to ask them specific questions, but absent any obvious issues, I’m skeptical of how much you can determine by asking. Of course you should discuss areas of possible collaboration with people you meet, but to determine things like how good the students are, how effective the administrative staff are, etc., asking directly is not very useful.

The reason is that people will always spin things in a favorable way (this is not a criticism — most of the time they do it because they’ve been there long enough that they’ve adjusted to the situation and they actually see things the way they spin them). And you’re not experienced enough to parse what you’re told to figure out what the reality is. You should definitely ask them questions lest they think you’re uninterested, but receive all information with a skeptical ear.

Instead, what was extremely useful for me was to talk to people who’d previously been at the departments I was considering, ask them what they didn’t like, then present that information back to people in my interview loop and ask them for their take, and finally try to reconcile the two views.

4. The job talk is a strategic piece of communication. There is so much subtext it blew my mind. For one, your job talk is all about telling people how awesome your work is, but of course you can’t state that directly. You’ll need to humblebrag without being obvious. For example, in my closing slide, I put up a collage of 24 faces, and said, “Finally, I’m incredibly grateful to my amazing co-authors without whom none of my work would have been possible.” That statement was certainly true, but also important was the subtext: “I collaborate like it’s going out of style.” This one was balanced precariously on the threshold of being too obvious — I got called out in one of my talks!

But there’s more. You have to consider every single thing that you say from the point of view of someone in your field, someone not in your field but familiar with it, and someone not familiar with your field. It has to make sense at different levels to all of them. Also, you have to consider how each statement will sound to someone who spaced out for a bit and just started paying attention. And so on.

Overall, this isn’t going to be like any talk you’ve given. I think I spent 3-4 weeks working primarily (albeit not exclusively) on my talk, with regular tweaking afterwards.

5. You will fall sick. Airports and airplanes spread germs, plus you’re much more susceptible to infection when your sleep and diet are irregular, as they likely will be during your tour. Assuming you have a moderately busy schedule, falling sick is just a matter of time. I mentioned being sick to about 4-5 people, and each of them recalled how they had fallen sick during their own job search.

Naturally, then, you should treat proper sleep and diet as a priority. You should schedule your interviews so that you have time to recover when it happens. Also, try not to schedule your two most important interviews too close together. Finally, there’s a lot you can do in terms of symptom relief (e.g., benzocaine cough drops instead of menthol) to minimize the impact on your thinking and speaking during your interviews, so be medically prepared ahead of time.

That’s it for now. There are several topics that I’d like to address in more detail in separate posts, time permitting: 1. how to prepare for and deliver the talk; 2. what to say in your 1-on-1 meetings; 3. travel tips; and 4. suggestions for interviewers from the point of view of a candidate. If you’ve been through this process recently, I’d love to hear how your experiences matched or differed from mine.

Finally, Princeton CS is hiring this year, and our searches aren’t targeted by subfield, so if you’re on the market you should apply!

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

It’s official — I’m heading to Princeton! (June 14, 2012)
http://33bits.org/2012/06/14/its-official-im-heading-to-princeton/

After a long, exciting and exhausting interview season, I’m thrilled to be starting at Princeton this fall as a tenure-track faculty member! My appointment will be in computer science with a CITP (Center for Information Technology Policy) affiliation. This is a dream position for me in just about every way. I’m looking forward to joining Ed Felten and other amazing people at Princeton, and to the challenges of research and teaching.

I feel lucky and privileged to be writing this, and thankful for the advice and support of people close to me throughout the process. I’m particularly indebted to Vitaly, my advisor, for his regular doses of wisdom.

I will miss Stanford and sunny California. I’ve thoroughly enjoyed both the academic environment and the entrepreneurial culture of the Bay area, and Dan Boneh has been a great mentor. But I feel mentally ready to explore new directions, both career-wise and in life.

While some things have changed, others will remain the same, such as this blog. I expect to continue to cover the same mix of topics — information privacy, tech policy, and meta-issues of academia and research. And speaking of the blog, I have several pages of notes of my impressions of the interview process, which I’m planning to turn into a series of posts. Stay tuned.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

Tracking Not Required: Behavioral Targeting (June 11, 2012)
http://33bits.org/2012/06/11/tracking-not-required-behavioral-targeting/

In the first installment of the Tracking Not Required series, we discussed a relatively straightforward case: frequency capping. Now let’s get to the 800-pound gorilla, behaviorally targeted advertising, putatively the main driver of online tracking. We will show how to swap a little functionality for a lot of privacy.

Admittedly, implementing behavioral targeting on the client is hard and will require some technical wizardry. It doesn’t come for “free” in that it requires a trade-off in terms of various privacy and deployability desiderata. Fortunately, this has been a fertile topic of research over the past several years, and there are papers describing solutions at a variety of points on the privacy-deployability spectrum. This post will survey these papers, and propose a simplification of the Adnostic approach — along with prototype code — that offers significant privacy and is straightforward to implement.

Goals. Carrying out behavioral advertising without tracking requires several things. First, the user needs to be profiled and categorized based on their browsing history. In nearly all proposed solutions, this happens in the user’s browser. Second, we need an algorithm for selecting targeted ads to display each time the user visits a page. If the profile is stored locally and not shared with the advertising company, this is quite nontrivial. The final component is reporting of ad impressions and clicks. This component must also deal with click fraud, impression fraud and other threats.

Existing approaches

The chart presents an overview of existing and proposed architectures.

“Cookies” refers to the status quo of server-side tracking; all other architectures are presented in research papers summarized in the Do Not Track bibliography page. CoP stands for “Client-only Profiles,” the architecture proposed by Bilenko and Richardson.

Several points of note. First, everything except PrivAd — which uses an anonymizing proxy — reveals the IP address, and typically the User Agent and Referer to the ad company as part of normal HTTP requests. Second, everything except CoP (and the status quo of tracking cookies) requires software installation. Opinions vary on just how much of a barrier this is. Third, we don’t take a stance on whether PrivAd is more deployable than ObliviAd or vice-versa; they both face significant hurdles. Finally, Adnostic can be used in one of two modes, hence it is listed twice.

There is an interesting technological approach, not listed above, that works by exposing more limited referer information. Without the referer header (or an equivalent), the ad server may identify the user but will not learn the first-party URL, and thus will not be able to track. This will be explored in more depth in a future article.

New approach. In the solution we propose here, the server is recruited for profiling, but doesn’t store the profile. This avoids the need for software installation and allows easy deployability. In addition, non-tracking is externally verifiable, to the extent that IP address + User-Agent is not nearly as effective for tracking as cookie-based unique identifiers.[1] Like CoP, and unlike Adnostic, each ad company can only profile users during visits to pages that it has a third-party presence on, rather than all pages.

Profiling algorithm.

1. The user visits a page that has embedded content from the ad company.
2. JavaScript in the ad company’s content sends the top-level URL to a special classifier service run by the ad company. (The classifier is run on a separate domain. It does not have any cookies or other information specific to the user.)
3. The classifier returns a topic classification of the page.
4. The ad company’s JavaScript receives the page classification and uses it to update the user’s behavioral profile in HTML5 storage. The JavaScript may also consider other factors, such as how long the user stayed on the page.

There is a fair degree of flexibility in steps 3 and 4 — essentially any profiling algorithm can be implemented by appropriately splitting it into a server-side component that classifies individual web pages and a client-side component that analyzes the user’s interaction with these pages.
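
Here is a rough sketch of steps 2 through 4. The logic is written in Python for readability even though the real component would be client-side JavaScript, and the classifier mapping and storage key are made up for illustration.

import json

PROFILE_KEY = "behavioral_profile"  # hypothetical key in HTML5 local storage

def classify(url):
    # Stand-in for the ad company's classifier service (steps 2 and 3).
    # In the real design this is a request to a cookie-less classifier domain;
    # here, a toy static mapping of sites to interest categories.
    toy_mapping = {
        "espn.com": {"sports": 1.0},
        "nytimes.com": {"news": 0.9, "politics": 0.5},
    }
    domain = url.split("/")[2] if "://" in url else url
    return toy_mapping.get(domain, {})

def update_profile(storage, url, dwell_seconds, decay=0.95):
    # Step 4: fold the page's categories into the locally stored profile.
    # `storage` stands in for window.localStorage (a str -> str mapping).
    profile = json.loads(storage.get(PROFILE_KEY, "{}"))
    profile = {cat: score * decay for cat, score in profile.items()}  # older interests fade
    weight = min(dwell_seconds / 60.0, 1.0)  # cap the influence of any single visit
    for cat, relevance in classify(url).items():
        profile[cat] = profile.get(cat, 0.0) + weight * relevance
    storage[PROFILE_KEY] = json.dumps(profile)
    return profile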

Ad serving and accounting.

The ad serving process in our proposal is the same as in Adnostic — the server sends a list of ads along with metadata describing each ad, and the client-side component picks the ad that best matches the locally stored profile. To avoid revealing which ad was displayed, the client either downloads all (say, 10) ads in the list while displaying only one, or downloads only the selected ad from a different domain that does not share cookies with the tracking domain. Note the similarity to our frequency capping approach, both in terms of the algorithm and its privacy properties.
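
The client-side selection step can be as simple as scoring each ad’s category metadata against the stored profile; this is only a sketch, and Adnostic’s actual relevance computation is more involved.

def select_ad(ads, profile):
    # Pick the ad whose interest-category metadata best matches the local profile.
    # `ads` is a list of dicts such as {"id": "ad42", "categories": {"sports": 1.0}}.
    def relevance(ad):
        return sum(profile.get(cat, 0.0) * weight for cat, weight in ad["categories"].items())
    return max(ads, key=relevance)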

Accounting, i.e., billing the right advertiser, is also identical to Adnostic for the cost-per-click and cost-per-impression models; we refer the reader there. Discussing the cost-per-action model is deferred to a future post.

Implementation. We implemented our behavioral targeting algorithm using HTML5 local storage. As with our frequency capping implementation, we found performance was exceptionally fast in modern desktop and mobile browsers. For simplicity, our implementation uses a static local database mapping websites to interest segments and a binary threshold for determining interests. In practice, we expect implementers would maintain the mapping server-side and apply more sophisticated logic client-side.

We also present a different work-in-progress implementation that’s broader in scope, encompassing retargeting, behavioral targeting and frequency capping.

Conclusion. Certainly there are costs to our approach — a “thick-client” model will always be slightly more inconvenient to deploy and maintain than a server-based model, and will probably have a lower targeting accuracy. However, we view these costs as minimal compared to the benefits. Some compromise is necessary to get past the current stalemate in web tracking.

Technological feasibility is necessary, but not sufficient, to change the status quo in online tracking. The other key component is incentives. That is why Do Not Track, standards and advocacy are crucial to the online privacy equation.

[1] The engineering and business reasons for this difference in effectiveness will be discussed in a future post.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

Web Privacy Measurement: Genesis of a Community (June 4, 2012)
http://33bits.org/2012/06/04/web-privacy-measurement-genesis-of-a-community/

Last week I participated in the Web Privacy Measurement conference at Berkeley. It was a unique event because the community is quite new and this was our very first gathering. The WSJ Data Transparency hackathon is closely related; the Berkeley conference can be thought of as an academic counterpart. So it was doubly fascinating for me — both for the content and because of my interest in the sociology of research communities.

A year ago I explained that there is an information asymmetry when it comes to online privacy, leading to a “market for lemons.” The asymmetry exists for two main reasons: one is that companies don’t disclose what data they collect about you and what they do with it; the second is that even if they do, end users don’t have the capacity to aggregate and process that information and make decisions on the basis of it.

The Web Privacy Measurement community essentially exists to mitigate this asymmetry. The primary goal is to ferret out what is happening to your data online, and a secondary one is making this information useful by pushing for change, building tools for opt-out and control, comparison of different players, etc. The size of the community is an indication of how big the problem has gotten.

Before anyone starts trotting out the old line, “see, the market can solve everything!”, let me point out that the event schedule demonstrates, if anything, the opposite. The majority of what is produced here is intended wholly or partly for the consumption of regulators. Like many others, I found the “What privacy measurement is useful for policymakers?” panel to be the most interesting one. And let’s not forget that most of this is Government-funded research to begin with.

This community is very different from the others that I’ve belonged to. The mix of backgrounds is extraordinary: researchers mainly from computing and law, and a small number from other disciplines. Most of the researchers are academics, but a few work for industrial research labs, a couple are independent, and one or two work in Government. There were also people from companies that make privacy-focused products/services, lawyers, hobbyists, scholars in the humanities, and ad-industry representatives. Overall, the community has a moderately adversarial relationship with industry, naturally, and a positive relationship with the press, regulators and privacy advocates.

The make-up is somewhat similar to the (looser-knit) group of researchers and developers building decentralized architectures for personal data, a direction that my coauthors and I have taken a skeptical view of in this recent paper. In both cases, the raison d’être of the community is to correct the imbalance of power between corporations and the public. There is even some overlap between the two groups of people.

The big difference is that the decentralization community, typified by Diaspora, mostly tries to mount a direct challenge and overthrow the existing order, whereas our community is content to poke, measure, and expose, and hand over our findings to regulators and other interested parties. So our potential upside is lower — we’re not trying to put a stop to online tracking, for example — but the chance that we’ll succeed in our goals is much higher.

Exciting times. I’m curious to see how things evolve. But this week I’m headed to PLSC, which remains my favorite privacy-related conference.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

Selfish Reasons to do Peer Review, and Other Program Committee Observations (May 2, 2012)
http://33bits.org/2012/05/02/selfish-reasons-to-do-peer-review-and-other-program-committee-observations/

I’ve been on several program committees in the last year and a half. As I’ve written earlier, getting a behind-the-scenes look at how things work significantly improved my perception of research and academia. This post is a more elaborate set of observations based on my experience. It is targeted both at my colleagues, with the hope of starting a discussion, and at outsiders, as a continuation of my series on explaining how the scientific community functions (which began with the post linked above).

Benefits of doing peer review. Peer review is often considered a burden that one grudgingly accepts in order to keep the system working. But in my experience, especially for a junior researcher, the effort is well worth the time.

The most obvious advantage of being on a PC is that it forces you to read papers. Now if you’re the type that never needs external motivation to get things accomplished, this wouldn’t matter to you — you’d do literature study on a regular basis anyway. But many of us aren’t that disciplined; I’m certainly not.

There are also insights you get that you can’t reproduce by having perfect self-discipline. PC work gives you a raw, unfiltered look into the research that people have chosen to work on. This is a 6-month-or-so head start for getting on top of emerging trends compared to only reading published papers. You also get a better idea of common pitfalls to avoid.

Finally, peer review is one of the rare opportunities to read papers critically (it is harder with published work because it doesn’t have as many loopholes). This is not a natural skill for most people — our cognitive biases predispose us to confuse good rhetoric with sound logic.

Which type of meeting? I’ve been on PCs with all three types of discussions: physical meetings, phone meetings and online. I think it’s important to have a meeting, whether physical or phone. I learn a lot, and the outcome feels fairer. Besides, quite often one reviewer is able to point out something the others have missed. Chairs of online-only PCs do try to elicit some interaction between reviewers, but for hard-to-explain but easy-to-understand reasons, the bandwidth in an interactive meeting tends to be much higher.

Phone meetings are suitable for smaller conferences and workshops. In my experience, members mostly tend to go on mute and tune out except when the papers they reviewed are being discussed. I don’t necessarily see a problem with this.

In physical meetings, I’ve found that members often make comments or voice opinions on papers they haven’t really read. I don’t think this is in the best interest of fair reviewing (although I’ve heard a contrary opinion). I wonder if a strategy involving smaller breakout groups would be more effective.

The one advantage of not having a meeting is of course that it saves time. I’ve found that the time commitment for the meeting is about a third of the reviewing time (for both physical and phone meetings), which I don’t consider to be too much of a burden given the improved outcomes.

Overall, my experience from these meetings is that members act professionally for the most part without egos or emotions getting in the way. While there is inevitably some randomness in the process, I believe that the horror stories of careless reviewers — everyone has at least one to narrate — are exaggerated. One possible reason for this misunderstanding is that there is a lot that’s discussed at meetings after the reviews are written, and often this feedback doesn’t make it into the reviews.

Problem areas. Finally, here are some aspects of PCs that I think could be improved. I have deliberately omitted the most common problems (such as an untenable number of submissions and low acceptance rates) that everybody knows and talks about. Instead, these are less frequently discussed but yet (IMO) fairly important issues.

Lost reviews. Since reviewers aren’t perfect, sometimes bad papers with persistent authors manage to get published by being resubmitted to other venues until they hit a relatively sloppy panel of reviewers. The reason this works (when it does) is that past reviews of a recycled paper are “lost”. This is a shame; it wastes reviewer effort and lowers the overall quality of publications.

Community boundaries. As a reviewer I’ve started to realize how difficult it is to publish in other communities’ venues. As an example, at security conferences we often see papers by outsiders that have something useful to say, but are unfortunately inadequately familiar with the “central dogma” of crypto/security research, namely adversarial thinking. [1] While I can see the temptation to reject these papers with a cursory note, I think we should be patient with these people, explain how we do things and if possible offer to work with them to improve the paper.

Unfruitful directions. Sometimes research directions don’t pan out, either because the world has moved on and the underlying assumptions are no longer true, or because the technical challenges are too hard. But researchers naturally resist having to change their research area, and so there are lots of papers written on topics that stopped being relevant years ago. The reason these papers keep getting published is that they are assigned for review to other people working in the same area. I’ve seen program chairs make an effort to push back on this, but the current situation is far from optimal.

In conclusion, my opinion is that peer review in my community is a relatively well-functioning process, albeit with a lot of scope for improvement. I believe this improvement can be accomplished in an evolutionary way without having to change anything too radically.

[1] The crypto/security community essentially derives its identity from adversarial thinking. Incidentally, I feel that it is not always suitable for privacy, which is why I believe we computer scientists who study privacy should stop viewing ourselves as a subset of the security community.

A Critical Look at Decentralized Personal Data Architectures (February 21, 2012)
http://33bits.org/2012/02/21/a-critical-look-at-decentralized-personal-data-architectures/

I have a new paper with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, Helen Nissenbaum and Dan Boneh (the Adnostic gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized personal data architectures” — aren’t quite the panacea that they’ve been made out to be.

The paper is only a synopsis of our work so far — in our notes we have over 80 projects, papers and proposals that we’ve studied, so we intend to follow up with a more complete analysis. For now, our goal is to kick off a discussion and give the community something to think about. The paper was a lot of fun to write, and we hope you will enjoy reading it. We recognize that many of our views and conclusions may be controversial, and we welcome comments.

Abstract:

While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals.

Developers, regulators, and consumer advocates have looked to alternative decentralized architectures as the natural response to threats posed by these centralized services. The result has been a great variety of solutions that include personal data stores (PDS), infomediaries, Vendor Relationship Management (VRM) systems, and federated and distributed social networks. And yet, for all these efforts, decentralized personal data architectures have seen little adoption.

This position paper attempts to account for these failures, challenging the accepted wisdom in the web community on the feasibility and desirability of these approaches. We start with a historical discussion of the development of various categories of decentralized personal data architectures. Then we survey the main ideas to illustrate the common themes among these efforts. We tease apart the design characteristics of these systems from the social values that they (are intended to) promote. We use this understanding to point out numerous drawbacks of the decentralization paradigm, some inherent and others incidental. We end with recommendations for designers of these systems for working towards goals that are achievable, but perhaps more limited in scope and ambition.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

Is Writing Style Sufficient to Deanonymize Material Posted Online? (February 20, 2012)
http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/

I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:

Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
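
As a toy illustration of one such marker (not the feature set from the paper), here is how one might measure an author’s preference between the two near-synonyms:

import re

def marker_ratio(text, word_a="since", word_b="because"):
    # Fraction of uses of word_a among all uses of the two near-synonyms.
    # One marker reveals very little about an author; combining hundreds of
    # them yields a stylistic fingerprint.
    tokens = re.findall(r"[a-z']+", text.lower())
    a, b = tokens.count(word_a), tokens.count(word_b)
    return a / (a + b) if a + b else None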

The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.

Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.[1]

Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?

Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k randomly chosen posts (typically 3) from the corresponding blog, treat those posts as anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.

In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the federalist papers — each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.

We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.

Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.

An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.

We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation also works well in this setting, boosting accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.

We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come more easily in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either one alone, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.

Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.
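
To give a flavor of the feature extraction stage, here is a toy version of a few features of this kind; it is purely illustrative, and the paper’s actual feature set is roughly 1,200-dimensional.

import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "since", "because"]  # tiny illustrative subset

def extract_features(text):
    # Map a post to a small numerical feature vector of stylistic markers.
    words = re.findall(r"\S+", text)
    alpha = [w.strip(".,;:!?\"'()").lower() for w in words]
    counts = Counter(alpha)
    n = max(len(words), 1)
    features = [
        sum(w.isupper() for w in words) / n,            # frequency of words in all caps
        sum(1 for c in counts.values() if c == 1) / n,  # words occurring exactly once
        sum(len(w) for w in alpha) / n,                 # average word length
    ]
    features += [counts[fw] / n for fw in FUNCTION_WORDS]  # function-word frequencies
    return features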

An important component of feature extraction is to ensure that our analysis was purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.

In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
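
A minimal version of this stage, using an off-the-shelf one-vs-rest linear SVM from scikit-learn rather than the higher-powered machinery discussed below, might look like this:

from sklearn.svm import LinearSVC

def train_author_models(feature_vectors, author_labels):
    # Learn one separating hyperplane per author (one-vs-rest).
    clf = LinearSVC()
    clf.fit(feature_vectors, author_labels)
    return clf

def rank_authors(clf, anonymous_vectors):
    # Rank candidate authors for a set of anonymous posts by the average
    # signed distance to each author's hyperplane.
    scores = clf.decision_function(anonymous_vectors).mean(axis=0)
    order = scores.argsort()[::-1]
    return [(clf.classes_[i], float(scores[i])) for i in order]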

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.

The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, though, few people are aware of the existence of these attacks and defenses, and all the sensitive text that has already been written anonymously remains at risk of deanonymization.

[1] A team from Israel has studied authorship recognition with 10,000 authors. While this is interesting and impressive work, and bears some similarities with ours, they do not restrict themselves to stylistic analysis, and therefore the method is comparatively limited in scope. Incidentally, they have been in the news recently for some related work.

[2] Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple blogs. We did not use the full number that was available.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

An Update on Career Plans and Some Observations on the Nature of Research (February 7, 2012)
http://33bits.org/2012/02/07/an-update-on-career-plans-and-some-observations-on-the-nature-of-research/

I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey.

My Ph.D. years in Austin were productive and blissful. When I finished and came West, I knew I enjoyed research tremendously, but there were many aspects of research culture that made me worry if I’d fit in. I hoped my postdoc would give me some clarity.

Happily, that’s exactly what happened, especially after I started being an active participant in program committees and other community activities. It’s been an enlightening and humbling experience. I’ve come to realize that in many cases, there are perfectly good reasons why frequently-criticized aspects of the culture are just the way they are. Certainly there are still facets that are far from ideal, but my overall view of the culture of scientific research and the value of research to society is dramatically more positive than it was when I graduated.

Let me illustrate. One of my major complaints when I was in grad school was that almost nobody does interdisciplinary research (which is true — the percentage of research papers that span different disciplines is tiny). Then I actually tried doing it, and came to the obvious-in-retrospect realization that collaborating with people who don’t speak your language is hard.

Make no mistake, I’m as committed to cross-disciplinary research as I ever was (I just finished writing a grant proposal with Profs. Helen Nissenbaum and Deirdre Mulligan). I’ve gradually been getting better at it and I expect to do a lot of it in my career. But if a researcher makes a decision to stick to their sub-discipline, I can’t really fault them for that.

As another example, consider the lack of a “publish-then-filter” model for research papers, a whole two decades after the Web made it technologically straightforward. Many people find this incomprehensibly backward and inefficient. Academia.edu founder Richard Price wrote an article two days ago arguing that the future of peer review will look like a mix of PageRank and Twitter. Three years ago, that could have been me talking. Today my view is very different.

Science is not a popularity contest; PageRank is irrelevant as a peer-review mechanism. Basically, scientific peer review is the only process that exists for systematically separating truths from untruths. Like democracy, it has its problems, but at least it works. Social media is probably the worst analogy — it seems to be better at amplifying falsehoods than facts. Wikipedia-style crowdsourcing has its strengths, but it can be hit-or-miss.

To be clear, I think peer review is probably going to change; I would like it to be done in public, for one. But even this simple change is fraught with difficulty — how would you ensure that reviewers aren’t influenced by each other’s reviews? This is an important factor in the current system. During my program committee meetings, I came to realize just how many of these little procedures for minimizing bias are built into the system and how seriously people take the spirit of this process. Revamping peer review while keeping what works is going to be slow and challenging.

Moving on, some of my other concerns have been disappearing due to recent events. Restrictive publisher copyrights are a perfect example. I have more of a problem with this than most researchers do — I did my Master’s in India, which means I’ve been on the other side of the paywall. But it looks like that pot may finally have boiled over. I think it’s only a matter of time now before open access becomes the norm in all disciplines.

There are certainly areas where the status quo is not great and not getting any better. Today if a researcher makes a discovery that’s not significant enough to write a paper about, they choose not to share that discovery at all. Unfortunately, this is the rational behavior for a self-interested researcher, because there is no way to get credit for anything other than published papers. Michael Nielsen’s excellent book exploring the future of networked science gives me some hope that change may be on the horizon.

I hope this post has given you a more nuanced appreciation of the nature of scientific research. Misconceptions about research and especially about academia seem to be widespread among the people I talk to both online and offline; I harbored a few myself during my Ph.D., as I said earlier. So I’m thinking of doing posts like this one on a semi-regular basis on this blog or on Google+. But that will probably have to wait until after my job search is done.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

Printer Dots, Pervasive Tracking and the Transparent Society (October 18, 2011)
http://33bits.org/2011/10/18/printer-dotspervasive-tracking-and-the-transparent-society/

So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [1, 2, 3, 4], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance.

Here’s a fascinating example of tracking that’s all around you but that you’re probably unaware of:

Color laser printers and photocopiers print small yellow dots on every page for tracking purposes.

The dots are not normally visible, but can be seen by a variety of methods such as shining a blue LED flashlight, magnification under a microscope or scanning the document with a commodity scanner. The pattern of dots typically encodes the device serial number and a timestamp; some parts of the code are yet unidentified. There are interesting differences between the codes used by different manufacturers. [1] Some examples are shown in the pictures. There’s a lot more information in the presentation.

Pattern of dots from three different printers: Epson, HP LaserJet and Canon.

Schoen says the dots could have been the result of the Secret Service pressuring printer manufacturers to cooperate, going back as far as the 1980s. The EFF’s Freedom of Information Act request on the matter from 2005 has been “mired in bureaucracy.”

The EFF as well as the Seeing Yellow project would like to see these dots gone. The EFF has consistently argued against pervasive tracking. In this article on biometric surveillance, they say:

EFF believes that perfect tracking is inimical to a free society. A society in which everyone’s actions are tracked is not, in principle, free. It may be a livable society, but would not be our society.

Eloquently stated. You don’t have to be a privacy advocate to see that there are problems with mass surveillance, especially by the State. But I’d like to ask the question: can we really hope to stave off a surveillance society forever, or are efforts like the Seeing Yellow project just buying time?

My opinion is that it is impossible to put the genie back into the bottle — the cost of tracking every person, object and activity will continue to drop exponentially. I hope the present series of articles has convinced you that even if privacy advocates are successful in preventing the deployment of explicit tracking mechanisms, just about everything around you is inherently trackable. [2]

And even if we can prevent the State from setting up a surveillance infrastructure, there are undeniable commercial benefits in tracking everything that’s trackable, which means that private actors will deploy this infrastructure, as they’ve done with online tracking. If history is any indication, most people will happily allow themselves to be tracked in exchange for free or discounted services. From there it’s a simple step for the government to obtain the records of any person of interest.

If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is symmetrically distributed, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.

Another important principle is for those who do conduct tracking to be required to be transparent about it, to have social and legal processes in place to determine what uses are acceptable, and to allow opting out in contexts where that makes sense. Because ultimately what matters in terms of societal freedom is not surveillance itself, but how surveillance affects the balance of power. To be sure, the society I describe — pervasive but transparent tracking, accessible to everyone, and with limited opt-outs — would be different from ours, and would take some adjusting to, but that doesn’t make it worse than ours.

I am hardly the first to make this argument. A similar position was first prominently articulated by David Brin in his 1998 book The Transparent Society. What the last decade has shown is just how inevitable pervasive tracking is. For example, Brin focused too much on cameras and assumed that tracking people indoors would always be infeasible. That view seems almost quaint today.

Let me be clear: I have absolutely no beef with efforts to oppose pervasive tracking. Even if being watched all of the time is our eventual destiny, society won’t be ready for it any time soon — these changes take decades if not generations. The pace at which the industry wants to make us switch to “living in public” is far faster than we’re capable of adjusting to. Buying time is therefore extremely valuable.

That said, embracing the Transparent Society view has important consequences for civil libertarians. It suggests working toward an achievable if sub-optimal goal instead of an ideal but impossible one. It also suggests that the “democratization of surveillance” should be encouraged rather than feared.

Let me close by calling out one battle in particular. Throughout this series, we’ve seen that fingerprinting techniques have security-enhancing applications (such as forensics), as well as privacy-infringing ones, but that most research papers on fingerprinting consider only the former question. I believe the primary reason is that funding is for the most part available only for the former type of research and not for the latter. However, we need a culture of research into privacy-infringing technologies, whether funded by federal grants or otherwise, in order to achieve the goals of symmetry and transparency in tracking.

[1] Note that this is just an encoding and not encryption. The current system allows anyone to read the dots; public-key encryption would allow at least nominally restricting the decoding ability to only law-enforcement personnel, but there is no evidence that this is being done.

[2] This is analogous to the cookies-vs-fingerprinting issue in online tracking, and why cookie-blocking alone is not sufficient to escape tracking.

Previous articles in this series looked at fingerprinting of blank paper, digital cameras and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.

To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array as well as idiosyncrasies of the algorithms used in the post-processing pipeline. The former is device-specific whereas the latter is make/model specific.

There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.

A group at Purdue has been very active in this area, as well as in printer identification, which I will discuss later in this article. These two papers are very relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.

To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.
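In code, that collapsing step might look like the following minimal sketch. The Gaussian denoiser is a generic stand-in for whichever denoising filter the papers actually use, and the function name is mine:

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # generic stand-in for the denoising filter


def scanner_noise_vector(scanned_image):
    """Extract a 1-D noise vector from a 2-D grayscale scan: denoise, subtract
    to get the 2-D noise residual, then average all rows into a single row."""
    img = scanned_image.astype(np.float64)
    residual = img - gaussian_filter(img, sigma=2)  # 2-D noise, as in camera fingerprinting
    return residual.mean(axis=0)                    # collapse: one value per sensor element (column)
```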

Dealing with the other problem, the lack of synchronicity, is trickier. There are broadly two approaches: 1. try to synchronize the image by trying various alignments 2. extract fingerprints using statistical features of the image that are robust against desynchronization. The authors use the latter approach, mainly moment-based features of the noise vector.
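To give a flavor of what “moment-based features” means, here is an illustrative and deliberately small feature set; the papers use a richer one, but the idea of position-independent statistics is the same:

```python
import numpy as np
from scipy.stats import skew, kurtosis


def moment_features(noise_vector):
    """Statistical moments of the 1-D noise vector. Because these do not depend on
    where along the vector the values sit, they tolerate the paper being placed at a
    different offset on the scanner bed. Illustrative only, not the authors' exact set."""
    return np.array([
        noise_vector.mean(),
        noise_vector.std(),
        skew(noise_vector),
        kurtosis(noise_vector),
    ])
```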

Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.

The authors claim that a correlation-based approach — searching for the right alignment between two images, and then directly comparing the noise vectors — won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracy and to distinguish between a much larger number of devices.

The privacy implications of scanner fingerprinting are analogous to those of digital camera fingerprinting: a whistleblower exposing scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.

The Purdue group has also worked on printer identification, both laser and inkjet. In laser printers, one prominent type of observable signature arising from printer artifacts is banding — alternating light and dark horizontal bands. The bands are subtle and not noticeable to the human eye, but they are easily detectable algorithmically, constituting a 1–2% deviation from the average intensity.

Fourier Transform of greyscale amplitudes of a background fill (printed with an HP LaserJet)

Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the Fourier Transform of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.
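For the curious, here is roughly what that procedure looks like in code. It assumes you start from a grayscale scan of the printed grey fill, and the peak-picking is deliberately naive:

```python
import numpy as np


def banding_signature(scan_of_grey_fill, dpi, n_peaks=2):
    """Recover banding frequencies (in cycles/inch) from a scanned print of a
    constant grey background, following the row-average-then-FFT procedure above."""
    rows = scan_of_grey_fill.astype(np.float64).mean(axis=1)  # row-wise average intensity
    rows = rows - rows.mean()                                 # drop the DC component
    spectrum = np.abs(np.fft.rfft(rows))
    freqs = np.fft.rfftfreq(len(rows), d=1.0 / dpi)           # samples are 1/dpi inches apart
    strongest = np.argsort(spectrum)[-n_peaks:]               # indices of the largest peaks
    return np.sort(freqs[strongest])                          # e.g. roughly 132 and 150 cycles/inch above
```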

Detecting banding in printed text is difficult because the power of the signal (the text) dominates the power of the noise (the banding). Instead, the authors classify individual letters. By extracting a set of statistical features and applying an SVM classifier, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.

Needless to say, by combining the classification results from all the ‘e’s in a typical document, they were able to match documents to printers 100% of the time in their tests. Presumably the same method would apply for all other characters, but wasn’t tested due to the additional manual effort required for different shapes.
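A sketch of the classify-then-vote idea; scikit-learn’s SVC stands in for the authors’ SVM setup, and the per-letter feature extraction is assumed to have happened already:

```python
import numpy as np
from sklearn.svm import SVC


def identify_printer(train_features, train_labels, document_letter_features):
    """Classify each letter instance (e.g. every 'e' extracted from the questioned
    document), then take a majority vote across the whole document."""
    clf = SVC(kernel="rbf").fit(train_features, train_labels)
    votes = clf.predict(document_letter_features)   # one, possibly noisy, vote per letter
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                # majority vote over all letters
```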

Vertical lines printed by three different inkjet printers

Inkjet printers seem to be even more variable than laser printers; an example is shown in the picture taken from this paper. I found it a bit hard to discern exactly what the state of the art is, but I’m guessing that if it isn’t already possible to detect different printer models with essentially perfect accuracy, it will soon be.

The privacy implications of printer identification, in the context of a whistleblower who wishes to print and mail some documents anonymously, would seem to be minimal. If you’re printing from the office, printer logs (that record a history of print jobs along with user information) would probably be a more realistic threat. If you’re using a home printer, there is typically no known set of documents that came from your printer to compare against, unless law enforcement has physical possession of your printer.

Fingerprinting of RFID Tags and High-Tech Stalking (Tue, 04 Oct 2011)
http://33bits.org/2011/10/04/fingerprinting-of-rfid-tags-and-high-tech-stalking/

Previous articles in this series looked at fingerprinting of blank paper and digital cameras. This article is about fingerprinting of RFID, a domain where research has directly investigated the privacy threat, namely tracking people in public.

The principle behind RFID fingerprinting is the same as with digital cameras: microscopic, uncontrollable randomness in the manufacturing process gives each chip its own physical quirks, which show up as measurable differences in its behavior.

The basics. First let’s get the obvious question out of the way: why are we talking about devious methods of identifying RFID chips, when the primary raison d’être of RFID is to enable unique identification? Why not just use them in the normal way?

The answer is that fingerprinting, which exploits the physical properties of RFID chips rather than their logical behavior, allows identifying them in unintended ways and in unintended contexts, and this is powerful. RFID applications, for example in e-passports or smart cards, can often be cloned at the logical level, either because there is no authentication or because authentication is broken. Fingerprinting can make the system (more) secure, since fingerprints arise from microscopic randomness and there is no known way to create a tag with a given fingerprint.

If sensor patterns in digital cameras are a relatively clean example of fingerprinting, RF (and anything to do with the electromagnetic spectrum in general) is the opposite. First, the data is an arbitrary waveform instead of a fixed-size sequence of bits. This means that a simple point-by-point comparison won’t work for fingerprint verification; the task is conceptually more similar to algorithmically comparing two faces. Second, the probe signal itself is variable. RFID chips are passive: they respond to the signal produced by the reader (and draw power from it).[1] This means that the fingerprinting system is in full control of what kind of signal to interrogate the chip with. It’s a bit like being given a blank canvas to paint on.

Techniques. A group at ETH Zurich has done some impressive work in this area. In their 2009 paper, they report being able to compare an RFID card with a stored fingerprint and determine if they are the same, with an error rate of 2.5%–4.5% depending on settings.[2] They use two types of signals to probe the chip with — “burst” and “sweep” — and extract features from the response based on the spectrum.

Chip response to different signals. Fingerprints are extracted from characteristic features of these responses.
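To give a flavor of the matching step, and only the matching step (the burst/sweep acquisition requires RF hardware and is omitted), here is a heavily simplified sketch in which an already-captured response waveform is reduced to a normalized spectral profile and verified by correlation. The band count and threshold are placeholders of mine, not the ETH group’s values:

```python
import numpy as np


def spectral_profile(response, n_bands=32):
    """Reduce a sampled chip response waveform to a coarse, normalized spectral profile."""
    spectrum = np.abs(np.fft.rfft(response - np.mean(response)))
    bands = np.array_split(spectrum, n_bands)           # coarse frequency bands
    profile = np.array([band.mean() for band in bands])
    return profile / np.linalg.norm(profile)             # normalize away absolute signal strength


def same_tag(stored_profile, new_response, threshold=0.98):
    """Verify a tag against a stored fingerprint by correlating spectral profiles.
    The threshold is a placeholder; in practice it would be set from measured error rates."""
    return float(np.dot(stored_profile, spectral_profile(new_response))) >= threshold
```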

Other papers have demonstrated different ways to generate signals/extract features. A University of Arkansas team exploited the minimum power required to get a response from the tag at various frequencies. The authors achieved a 94% true-positive rate using 50 identical tags, with only a 0.1% false-positive rate. (About 6% of the time, the algorithm didn’t produce an output.)

Yet other techniques, namely the energy and Q factor of higher harmonics, were studied in a couple of papers out of NIST. In the latter work, they experimented with 20 cards consisting of 4 batches of 5 ‘identical’ cards each. The overall identification accuracy was 96%.

It seems safe to say that RFID fingerprinting techniques are still in their infancy, and there is much room for improvement by considering new categories of features, by combining different types of features, or by using different classification algorithms on the extracted features.

Privacy. RF fingerprinting, like other types of fingerprinting, shows a duality between security-enhancing and privacy-infringing applications, but in a less direct way. There are two types of RFID systems: “near-field” based on inductive coupling, used in contactless smartcards and the like, and “far field” based on backscatter, used in vehicle identification, inventory control, etc. The papers discussed so far pertain to near-field systems. There are no real privacy-infringing applications of near-field RF fingerprinting, because you can’t get close enough to extract a fingerprint without the owner of the tag knowing about it. Far-field systems, to which we will now turn, are ideally suited to high-tech stalking.

Fingerprinting provides the ability to enhance the security of near-field RFID systems and to infringe privacy in the context of far-field RFID chips.

In a recent paper, the Zurich team mentioned earlier investigated the possibility of tracking people in a shopping mall based on strategically placed sensors, assuming that shoppers have several (far-field) RFID tags on them. The point is that it is possible to design chips that prevent tracking at the logical level by authenticating the reader, but this is impossible at the physical level.

Why would people have RFID tags on them? Tags used for inventory control in stores and not deactivated at the point of sale are one increasingly common possibility — they would end up in shopping bags (or even on clothes being worn, although that’s less likely). RFID tags in wallets and medical devices are another source; these are tags that the user wants to be present and functional.

What makes the tracking device the authors built powerful is that it is low-cost and can be operated surreptitiously at some distance from the victim: up to 2.75 meters, or 9 feet. They show that 5.4 bits of entropy can be extracted from a single tag, which means that 5 tags on a person give 22 bits, easily enough to distinguish everyone who might be in a particular mall.
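As a back-of-the-envelope check on “easily enough”, taking the reported 22 bits at face value:

```latex
2^{22} \approx 4.2 \times 10^{6}
```

That is millions of distinguishable fingerprint combinations, versus at most a few tens of thousands of people passing through even a very large mall in a day.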

To assess the practical privacy risk, technological feasibility is only one dimension. We also need to ask who the adversary is and what the incentives are. Tracking people, especially shoppers, in physical space has the strongest incentive of all: selling products. While online tracking is pervasive, the majority of shopping dollars are still spent offline, and there’s still no good way to automatically identify people when they are in the vicinity in order to target offers to them. Facial recognition technology is highly error-prone and creeps people out, and that’s where RF fingerprinting comes in.

That said, RF fingerprinting is only one of the many ways of passively tracking people en masse in physical space — unintentional leaks of identifiers from smartphones and logical-layer identification of RFID tags seem more likely — but it’s probably the hardest to defend against. It is possible to disable RFID tags, but this is usually irreversible and it’s difficult to be sure you haven’t missed any. RFID jammers are another option but they are far from easy to use and are probably illegal in the U.S. One of the ETH Zurich researchers suggests tinfoil wrapping when going out shopping :-)

[1] Active RFID chips exist but most commercial systems use passive ones, and that’s what the fingerprinting research has focused on.

[2] They used a population of 50 tags, but this number is largely irrelevant since the experiment was one of binary classification rather than 1-out-of-n identification.

No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise (Mon, 19 Sep 2011)
http://33bits.org/2011/09/19/digital-camera-fingerprinting/

The previous article looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme in another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale.

For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that come out of a manufacturing plant behave slightly differently in characteristic ways, and can therefore be distinguished based on their output or behavior. How could this be? The unifying principle is that no manufacturing process is perfect: microscopic physical imperfections, different for every unit, leave a stable, measurable imprint on the device’s output.

Digital camera identification belongs to a class of techniques that exploits ‘pattern noise’ in the ‘sensor arrays’ that capture images. The same techniques can be used to fingerprint a scanner by analyzing pixel-level patterns in the images scanned by it, but that’ll be the focus of a later article.

A long-exposure dark frame [source]. Three ‘hot pixels’ and some other sensor noise can be seen.

A photo taken in the absence of any light doesn’t look completely black; a variety of factors introduce noise. There is random noise that varies from image to image, but there is also ‘pattern noise’ due to inherent structural defects or irregularities in the physical sensor array. The key property of the latter kind of noise is that it manifests the same way in every image taken by the camera.[1] Thus, the total noise vector produced by a camera is not identical between images, nor is it completely independent.

The pixel-level noise components in images taken by the same camera are correlated with each other.
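To make the two kinds of noise concrete, here is a simplified version of the sensor output model commonly used in this literature (the notation is mine; see also footnote [1]):

```latex
I \;=\; I^{(0)} \;+\; K \cdot I^{(0)} \;+\; D \;+\; \Theta
```

Here I is the recorded image, I^(0) the hypothetical noise-free image, K the device-specific multiplicative pattern noise (photo-response nonuniformity), D the fixed additive dark-signal component, and Θ the random noise that changes from shot to shot. Fingerprinting a camera essentially amounts to estimating K.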

Nevertheless, separating the pattern noise from random noise and the image itself — after all, a good camera will seek to minimize the strength or ‘power’ of the noise in relation to the image — is a very difficult task, and is the primary technical challenge that camera fingerprinting techniques must address.

Security vs. privacy. A quick note about the applications of camera fingerprinting. We saw in the previous article that there are security-enhancing and privacy-infringing applications of document fingerprinting. In fact, this is almost always the case with fingerprinting techniques. [2]

Camera fingerprinting can be used on the one hand for detecting forgeries (e.g., photoshopped images), and to aid criminal investigations by determining who (or rather, which camera) might have taken a picture. On the other hand, it could potentially also be used for unmasking individuals who wish to disseminate photos anonymously online.

Sadly, most papers on fingerprinting study only the former type of application, which is why we’ll have to speculate a bit on the privacy impact, even though the underlying math of fingerprinting is the same.

Most fingerprinting techniques have both security-enhancing and privacy-infringing applications. The underlying principles are the same but they are applied slightly differently.

Another point to note is that because of the focus on forensics, most of the work in this area so far has studied distinguishing different camera models. But there are some preliminary results on distinguishing ‘identical’ cameras, and it appears that the same techniques will work.

In more detail. Let’s look at what I think is the most well-known paper on sensor pattern noise fingerprinting, by Binghamton University researchers Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. [3] Here’s how it works: the first step is to build a reference pattern for a camera from multiple known images taken with it, so that later an unsourced image can be compared against these reference patterns. The authors suggest using at least 50, but for good measure, they use 320 in their experiments. In the forensics context, the investigator probably has physical possession of the camera and therefore can generate an unlimited number of images. We’ll discuss what this requirement means in the privacy-breach context later.

There are two steps to build the reference pattern. First, for each image, a denoising filter is applied, and the denoised image is subtracted from the original to leave only the noise. Next, the noise is averaged across all the reference images — this way the random noise cancels out and leaves the pattern noise.

Comparing a new image to a reference pattern, to test if it came from that camera, is easy: extract the noise from the test image, and compare this noise pixel-by-pixel with the reference noise. The noise from the test image includes random noise, so the match won’t be close to perfect, but nevertheless the correlation between the two noise patterns will be roughly equal to the contribution of pattern noise towards the total noise in the test image. On the other hand, if the test image didn’t come from the same camera, the correlation will be close to zero.
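Here is a minimal sketch of that build-and-compare procedure. The Gaussian filter is a generic stand-in for the wavelet denoiser the paper actually uses, the function names are mine, and all images are assumed to have the same resolution and orientation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # generic stand-in for the paper's wavelet denoiser


def noise_residual(image):
    """Denoise and subtract: what remains is the noise, including the pattern noise."""
    img = image.astype(np.float64)
    return img - gaussian_filter(img, sigma=2)


def reference_pattern(known_images):
    """Average the residuals of many images from the same camera (the paper uses 320):
    random noise cancels out, pattern noise remains."""
    return np.mean([noise_residual(im) for im in known_images], axis=0)


def correlation(test_image, pattern):
    """Pixel-wise correlation between the test image's residual and the reference pattern.
    Roughly zero for a different camera; clearly positive for the same camera."""
    a = noise_residual(test_image).ravel()
    b = pattern.ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```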

The authors experimented with nine cameras, of which two were from the same brand and model (Olympus Camedia C765). In addition, two other cameras had the same type of sensor. There was not a single error in their 2,700 tests, including those involving the two ‘identical’ cameras — in each case, the algorithm correctly identified which of the nine cameras a given image came from. By extrapolating the correlation curves, they conservatively estimate that for a False Accept Rate of 10^-3, their method achieves a False Reject Rate of anywhere between 10^-2 and 10^-10 or even less, depending on the camera model and camera settings.

The takeaway from this seems to be that distinguishing between cameras of different models can be performed with essentially perfect accuracy. Distinguishing between cameras of the same model also seems to have very high accuracy, but it is hard to generalize because of the small sample size.

Improvements. Impressive as the above numbers are, there are at least two major ways in which this result can be, and has been, improved. First, the Binghamton paper focuses on a specific signal, sensor noise. But there are several stages in the image acquisition and processing pipeline in the camera, each of which can leave idiosyncratic effects on the image. This paper out of Turkey incorporates many such effects by considering all patterns of certain types that occur in the lower-order (least significant) bits of the image, which seems like a rather powerful technique.

The effects other than sensor noise seem to help more with identifying the camera model than the specific device, but to the extent that the former is a component of the latter, it is useful. They achieve 97.5% accuracy among 16 test cameras — and that’s with cellphone cameras producing pictures at a resolution of just 640×480.

Second is the effect of the scene itself on the noise. Denoising transformations are not perfect — sharp boundaries look like noise. The Binghamton researchers picked their denoising filter (a wavelet transform) to minimize this problem, but a recent paper by Chang-Tsun Li claims to do it better, and shows even better numerical results: with 6 cameras (all different models), accurate (over 99%) identification for image fragments cropped to just 256×512 pixels.

What does this mean for privacy? I said earlier that there is a duality between security and privacy, but let’s examine the relationship in more detail. In privacy-infringing applications like mass surveillance, the algorithm need not always produce an answer, and it can occasionally be wrong when it does. The penalty for errors is much lower. On the other hand, the matching algorithm in surveillance-like applications needs to handle a far larger number of candidate cameras. The key point is:

The parameters of fingerprinting algorithms can usually be tweaked to handle a larger number of classes (i.e., devices) at the expense of accuracy.

My intuition is that state-of-the-art techniques, configured slightly differently, should allow probabilistic deanonymization from among tens of thousands of different cameras. A Flickr or Picasa profile with a few dozen images should suffice to fingerprint a camera.[4] Combined with metadata such as location, this puts us within striking distance of Internet-scale source-camera identification from anonymous images. I really hope there will be some serious research on this question.

Finally, a word about defenses. If you find yourself in a position where you wish to anonymously publicize a sensitive photograph you took, but your camera is publicly tied to your identity because you’ve previously shared pictures on social networks (and who hasn’t?), how do you protect yourself?

Compressing the image is one possibility, because that destroys the ‘lower-order’ bits that fingerprinting crucially depends on. However, it would have to be way more aggressive than most camera defaults (JPEG quality factor ~60% according to one of the studies, whereas defaults are ~95%). A different strategy is rotating the image slightly in order to ‘desynchronize’ it, throwing off the fingerprint matching. An attack that defeats this will have to be much more sophisticated and will have a far higher error rate.
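For concreteness, a best-effort scrubbing step along these lines might look like the following sketch, using Pillow. The rotation angle and JPEG quality are illustrative choices of mine, not recommendations from the studies:

```python
from PIL import Image


def scrub(in_path, out_path):
    """Best-effort, not foolproof: a slight rotation to desynchronize pixel-level
    fingerprint matching, plus aggressive JPEG recompression to wipe the low-order
    bits. EXIF metadata is not carried over unless explicitly re-attached on save."""
    img = Image.open(in_path).convert("RGB")
    img = img.rotate(0.7, resample=Image.BICUBIC, expand=True)  # small desynchronizing rotation
    img.save(out_path, "JPEG", quality=50)  # far below typical in-camera defaults (~95)
```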

The deanonymization threat here is analogous to writing-style fingerprinting: there are simple defenses, albeit not foolproof, but sadly most users are unaware of the problem, let alone solutions.

[1] That was a bit simplified; mathematically, there is an additive component (dark signal nonuniformity) and a multiplicative component (photoresponse nonuniformity). The former is easy to correct for, and higher-end cameras do, but the latter isn’t.

[2] Much has been said about the tension between security and privacy at a social/legal/political level, but I’m making a relatively uncontroversial technical statement here.

[3] Fridrich is incidentally one of the pioneers of speedcubing, i.e., speed-solving the Rubik’s cube.

[4] The Binghamton paper uses 320 images per camera for building a fingerprint (and recommends at least 50); the Turkey paper uses 100, and Li’s paper 50. I suspect that if more than one image taken from the unknown camera is available, then the number of reference images can be brought down by a corresponding factor.

This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy.

Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like hand geometry, as well as “behavioral biometrics” like voice, handwriting, typing patterns, and even gait analysis. Many techniques for deanonymization, the principal topic of this blog, work by “fingerprinting” people’s preferences, habits, or style.

Viewed under a microscope, paper is far from smooth — it has a rich natural structure. Even so, the state-of-the-art study on fingerprinting of physical documents, by Will Clarkson and colleagues at Princeton, achieves something remarkable: they show how to extract fingerprints from paper using just commodity scanners and no microscopic imaging. The fingerprint survives when the document is printed on, written or scribbled on, or even soaked in water.

A small (10mm tall) region of paper scanned from two different angles — top-to-bottom and left-to-right

The image above, taken from the Princeton paper, shows what the output of a scanner looks like. Not quite the resolution of a microscope view, but a lot of structure is still visible. The key technique: by scanning the paper at different orientations and comparing the images, the height at each point can be estimated, from which a 3-D map of the not-so-flat surface of the paper is constructed.

These 3-D maps can be used as fingerprints, but for efficiency they look at the maps of only about 100 randomly picked small “patches” on the paper. To further compress the extracted information, they do a “dimensionality reduction,” resulting in a 400-byte “feature vector” for each piece of paper, which is the fingerprint.

To verify or compare an observed fingerprint against a stored one, they simply look at the Hamming distance between the two bit-vectors. Why does this simple comparison technique succeed? Comparison of two human fingerprints is a lot more difficult, after all. It’s because a rectangular piece of paper has a nice property that human skin doesn’t: when the objects being fingerprinted have a precise, fixed geometry, fingerprint verification is easy — it is just a pointwise comparison of the corresponding features.

The result of such comparisons is this: two fingerprints from different pieces of paper match in roughly 50% of the bits, almost always in the 45%–55% range. Two fingerprints from the same piece of paper, on the other hand, differ in less than 5% of the bits, and occasionally up to 20% of bits if it has been handled particularly badly, such as by soaking. Therefore it is straightforward to infer whether or not two fingerprints came from the same piece of paper.
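The comparison really is that simple. A sketch, with the decision bands taken from the numbers above and the exact cut-off being my choice:

```python
import numpy as np


def same_sheet(fp_a, fp_b, threshold=0.32):
    """Compare two paper fingerprints (0/1 numpy arrays of equal length, e.g. 3200 bits).

    Different sheets differ on roughly 45-55% of bits; the same sheet differs on under 5%,
    rising to about 20% after rough handling such as soaking. Any threshold between those
    bands works; 0.32 splits the gap."""
    fraction_differing = np.count_nonzero(fp_a != fp_b) / fp_a.size
    return fraction_differing < threshold
```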

Readers familiar with the “33 bits of entropy” concept might notice that the fingerprint here is 400 bytes long, or 3,200 bits, which is ridiculously high. There are surely fewer than 2^50 pieces of paper in the world — over a hundred thousand for every person — which means that these fingerprints should easily be able to uniquely identify every piece of paper in the world. [2] The authors estimate that the chance of an error is no more than 1 in 10^148. In other words, they achieve perfect accuracy.

What are the implications? As the authors point out, document identification “has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods.” On the negative side, it “could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.”

[1] This is often referred to as device fingerprinting, but I find that a poor choice of terminology and will reserve that term for a different concept in this series.

[2] It is hard to estimate entropy exactly in cases like this, but the feature vector is obtained via Principal Component Analysis, which makes it likely that the entropy is close to the maximum value of 3200 bits.