Archive for May, 2008

Metadata is data about your data. A filename is the simplest kind of metadata; it is data that goes with the file but is different from the data in the file itself. Depending on the software you are using, a file’s metadata might include the time it was created and last modified, the registered name of the owner of the computer on which it was created, the name of some other file from which this file was derived, and the name of the software that was used to create it.

A file’s metadata can be revealing once the file gets into circulation. If you are organizing your sister’s birthday party and you send her the RSVP list, forgetting that you named the file “my_stupid_sisters_stupid_party.doc,” sis may draw some inferences from the metadata beyond what she learns from the file itself about who is coming.

In Chapter 3 of Blown to Bits we give some embarrassing examples of this kind. But today’s news brings us a whopper. The foolish computer user in this case seems not to be some hapless birthday-giving brother, but Google. Talk about people who should know better!

The story is set in Australia, where Ebay is planning to shift its payment system to Paypal only, eliminating the credit and debit card option. Ebay owns Paypal, and in Australia, this sort of thing requires public comment. Among the comments received was an anonymous 38-page document, giving all the reasons why Ebay should not be allowed to do this — it would be anticompetitive, etc., etc.

Anonymous, but perhaps not too anonymous. The document was a PDF, but the “Title” property was “Microsoft Word – 204481916_1_ACCC Submission by Google re eBay Public _2_.DOC.” (If you use Acrobat Reader to open a PDF document, then use the “Properties” menu item, you may be able to find this kind of information as part of the “Description.”) I wonder if someone at Google would really use Microsoft Word or put “Google” into the filename. But even if not, the document could still be Google’s — it might have been written by an outside counsel, or consultant, or summer intern even.

Google seems neither to be confirming nor denying that it is the source of the anonymous document. Theoretically, it could be a third party trying to embarrass Google. And Google isn’t currently competing for the Paypal market in Australia. But it does make you wonder if Google is venturing a bit beyond its “You can make money without doing evil” philosophy.

For the whole story, and links to the document itself, check out this item on TechCrunch.

I know the answer about these two Harvard dropouts, because I taught and graded them both. I also had some outside-the-classroom interactions with each of them while they were students. I gave Gates the “pancake problem,” which is the source of his sole publication in a scholarly journal. (Careful; that’s a 5MB file if you download it.) A few months before founding Facebook, Zuckerberg put up a prototype social network in which the edges denoted “being mentioned in the same Crimson story,” and I was at the center.

The answer to the question? Hate to disappoint, but due to professional ethics and FERPA requirements, I’m not telling! I will only say that I have no evidence that anything they say in this interview about their episodic study habits is inaccurate.

After moaning about surveillance and privacy a few days ago, I wanted to acknowledge the other side. The electronic traces we now routinely leave behind during our daily lives are also left by criminals, and the data is now valuable for solving crimes.

Neil Entwistle is the British-born man who allegedly killed his wife and 9-month-old daughter in Hopkinton, Massachusetts with a gun in 2006. As the notorious case moves to trial, some aspects of the prosecution’s evidence are being published. Entwistle Googled how to kill with a “knife in the neck” and also visited service-providing web sites with names such as blondebeautyescorts, halfpriceescorts and hotlocalescorts. Based on previous reporting, it appears that this information was culled from Entwistle’s home computer, rather than from Google. (Check your web browser’s “History” menu on your own computer to see how this information might have been retrieved.)

In other news, the Italians have again proved that they are smarter about tracing digital breadcrumbs than Americans are at hiding them. In Blown to Bits, we explain how an Italian blogger managed to uncover sensitive military information from the official US Army report on the shooting by American troops of an Italian intelligence agent, Nicola Calipari. Today, Italian security experts reveal that they were able to link the CIA to the abduction in Milan of radical Imam Hassan Mustafa Osama Nasr, simply by noting which cell phones were in use in the vicinity of the site of the kidnapping. (Sorry, of the “extraordinary rendition”; that’s the official US term.) The cell phones reported their location to nearby cell phone towers, as cell phones are constantly doing, and the Italians were able to sort through the stored location data after the fact to identify the culprits. The Italians seem almost contemptuous that the CIA would provide so little challenge to their electronic sleuthing abilities.
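To make the idea concrete, here is a minimal sketch of that kind of after-the-fact sorting. Everything in it is made up for illustration: the log format, the phone and tower names, and the times are assumptions, not details from the Italian investigation.

```python
from datetime import datetime

# Hypothetical tower log: (phone_id, tower_id, time the phone checked in).
# All identifiers and timestamps are invented for this sketch.
TOWER_LOG = [
    ("phone-A", "tower-1", datetime(2000, 1, 1, 12, 5)),
    ("phone-B", "tower-2", datetime(2000, 1, 1, 12, 10)),
    ("phone-C", "tower-9", datetime(2000, 1, 1, 12, 7)),
    ("phone-A", "tower-9", datetime(2000, 1, 1, 15, 0)),
    ("phone-D", "tower-1", datetime(2000, 1, 1, 18, 30)),
]


def phones_near(log, towers, start, end):
    """Return the phones seen at any of the given towers between start and end."""
    return sorted(
        {phone for phone, tower, seen in log
         if tower in towers and start <= seen <= end}
    )


# Towers assumed to cover the site, and a window around the event.
nearby = {"tower-1", "tower-2"}
window_start = datetime(2000, 1, 1, 11, 30)
window_end = datetime(2000, 1, 1, 12, 30)

suspects = phones_near(TOWER_LOG, nearby, window_start, window_end)
print(suspects)
```

Nothing here is clever, which is the point. Once the records exist, picking out who was there takes a few lines of filtering.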

Bits changed everything. We are so familiar with the transformation that most of us barely remember the old way. Before bits, only people could transform information, re-arranging it so that it served a different purpose. The phonebook listed names in alphabetical order. Want to know the name that belonged to a number? You were out of luck. Want to know the phone number of the person who lives at a particular address? Out of luck again. Not so now that bits have arrived. Digital information can be rearranged and repurposed. Type a phone number into Google and bingo, the name appears.
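Here is a small sketch of what that rearranging amounts to, assuming the phonebook is just a list of records. The names, addresses, and numbers are invented, and `index_by` is a helper made up for this example.

```python
from collections import defaultdict

# Hypothetical listings; every name, address, and number is made up.
PHONEBOOK = [
    {"name": "Alice Adams", "address": "12 Elm St", "phone": "555-0101"},
    {"name": "Bob Brown", "address": "34 Oak Ave", "phone": "555-0102"},
    {"name": "Carol Clark", "address": "12 Elm St", "phone": "555-0103"},
]


def index_by(records, key):
    """Group the records under the value of the chosen field."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return dict(index)


# The same data, re-sorted by number and by address instead of by name.
by_phone = index_by(PHONEBOOK, "phone")
by_address = index_by(PHONEBOOK, "address")

print([r["name"] for r in by_phone["555-0102"]])
print([r["phone"] for r in by_address["12 Elm St"]])
```

The paper phonebook held exactly the same facts. What changed is that re-sorting them by a different field went from weeks of clerical work to one function call.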

For the most part the ability to manipulate data is wonderful. Sometimes, though, it’s a bit creepy.

We’ve written about the difference between information that is available and information that is accessible. I came across a mashup the other day – the combination of a couple of existing components – that definitely fell into the creepy category. Federal Election Commission data has been available for years, and the tools to search that database have been getting better and better. Combine FEC data and Google Maps and you get Fundrace. Take a look at http://fundrace.huffingtonpost.com

Just like the phonebook example, rearranging the data and presenting it in new ways transformed the experience. No need to search just by name any more. Want to see who your neighbors are supporting – just look at the map. How about searching by employer? Color coding by candidate, dots that correspond to the size of the donation – pretty soon data become information, and what once seemed to be a relatively private activity becomes public and accessible.

Sara Rimer has a nice piece in the Memorial Day New York Times about sustainability houses on college campuses–residences where students time their showers, use the drained water to flush their toilets, and so on. Some reported behaviors, such as not bathing at all for extended periods, remind me of ’60s naturalism. Other activities are timelessly collegiate, and unlikely to last a day beyond graduation–such as plastering a picture of John Edwards to the shower stall ceiling as an encouragement to shorter showers.

But one sentence in this story is strikingly modern. “By next fall, the house’s 24-hour energy-use monitoring system will be fully up and running. Every turn of the faucet, every switch of a light, will be recorded, room by room.”

“It’s not about telling people, ‘You have to do this, you have to do that,’ ” explains one of the students. Not today, at least. I’m betting that the monitoring technology will become more widespread and more coercive–perhaps not through direct government surveillance, but through economic incentives and social pressures. And all the standard problems with bits will arise with that information about faucet turns and light-switch-flips: who has access to the data, what will it be used for, is it deidentified, will it leak?

Today’s New York Times has a lovely account of corporate surveillance that gives a flavor of the sort of thing that can go wrong. Deutsche Telekom, a large German phone service provider, irritated by repeated leaks about layoff plans, decided to use the data at its disposal to figure out if the leaks were coming from its board of directors. So it turned a lot of call records from 2005 and 2006 over to a third party to check for conversations between directors and reporters. (You may recall that almost exactly the same thing happened at HP not long ago.) Happily, the Germans seem not to be taking this privacy violation lightly. But it’s another example of a general fact about bits: Once they are collected for one reason (in this case, billing, or perhaps traffic analysis), it’s easy to hang onto them just in case they might come in handy later. With the passage of time, the odds go up that someone with access to the data will hatch a bright new idea about how to use it.
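A sketch of how little work such a check takes, once the billing records are sitting there. The phone numbers, the two lists, and the `flagged_calls` helper are all hypothetical. Nothing here is drawn from the actual Deutsche Telekom records.

```python
# Hypothetical numbers for the two groups of interest.
DIRECTORS = {"555-0201", "555-0202"}
REPORTERS = {"555-0301"}

# Hypothetical billing records: (caller, callee).
CALLS = [
    ("555-0201", "555-0301"),  # director calls reporter
    ("555-0202", "555-0999"),  # director calls someone else
    ("555-0301", "555-0202"),  # reporter calls director
    ("555-0888", "555-0301"),  # outsider calls reporter
]


def flagged_calls(calls, group_a, group_b):
    """Return the calls that connect the two groups, in either direction."""
    return [
        (caller, callee)
        for caller, callee in calls
        if (caller in group_a and callee in group_b)
        or (caller in group_b and callee in group_a)
    ]


matches = flagged_calls(CALLS, DIRECTORS, REPORTERS)
print(matches)
```

The records were kept for billing. Repurposing them takes one list comprehension, which is why "just in case" retention is worth worrying about.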

Microsoft dropped its book digitization project, stating “Based on our experience, we foresee that the best way for a search engine to make book content available will be by crawling content repositories created by book publishers and libraries. With our investments, the technology to create these repositories is now available at lower costs for those with the commercial interest or public mandate to digitize book content.”

Brewster Kahle of the Internet Archive is “disappointed” and plans to keep up his book-digitizing efforts. But along with Microsoft’s thus far unsuccessful struggles to absorb Yahoo!, the death of Microsoft’s book-digitizing project is another sign that the company that defined the software industry is having a hard time shifting to the new economy defined by bits themselves rather than the computer programs that manipulate bits.

Harvard’s University Librarian, Robert Darnton, has a good piece in the New York Review of Books on the future of research libraries. It begins, “Information is exploding so furiously around us and information technology is changing at such bewildering speed that we face a fundamental problem: How to orient ourselves in the new landscape? What, for example, will become of research libraries in the face of technological marvels such as Google?”

Nice metaphor, Professor Darnton! (Full disclosure: We were far from the first to use it. “Information Explosion” is the title of a paper by Latanya Sweeney, and the image surely wasn’t original with her either.)

While we’re at it, a tip of the hat to my colleague Stuart Shieber, the architect of Harvard’s open-access policy for research papers. He’s just been named head of Harvard’s newly created Office of Scholarly Communications.

Some estimates of the value of Facebook run as high as $15 billion. How can that be? It’s just some software and some people, right?

Wrong. It’s data about who hundreds of millions of people know, and who those people know, and how often they communicate, and what they are interested in. Every time someone agrees to be your Facebook friend, the two of you have established a link in Facebook’s gigantic friendship graph. Even the fact that you asked that person is probably recorded somewhere, even if he or she ignores you.

As far as I know, the connections between Reverend Wright and Barack Obama, and between Reverend Hagee and John McCain, were not discovered by electronic sleuthing. But such connections are going to be easier to discover in the future than in the past. Facebook data would be a gold mine, but it won’t help much if you decide to stay off such social networking sites. It’s easy for computers to connect people whose names appeared together in old newspaper articles. Photos and videos will be subject to face recognition, so it will be possible to build a huge “appears-in-the-same-image-with” graph automatically. Public figures will have to worry more and more about their associations, as it looks like the public interest in their circle of acquaintances will not diminish anytime soon.
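Here is roughly what connecting names from old articles looks like, as a minimal sketch. It assumes the names have already been extracted from each article, which is the hard part in practice. The articles, the names, and the `co_mention_graph` helper are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of names mentioned in each article.
ARTICLES = [
    {"Ann", "Ben", "Cal"},
    {"Ann", "Ben"},
    {"Cal", "Dee"},
]


def co_mention_graph(articles):
    """Count, for every pair of names, how many articles mention both."""
    edges = Counter()
    for names in articles:
        # Sorting gives each pair one canonical order, e.g. ("Ann", "Ben").
        for pair in combinations(sorted(names), 2):
            edges[pair] += 1
    return edges


graph = co_mention_graph(ARTICLES)
for (a, b), weight in sorted(graph.items()):
    print(a, b, weight)
```

Swap "mentioned in the same article" for "appears in the same photo" and the code does not change. Only the source of the name sets does.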

And the power of the government to create such structures of social connections will be even greater than what can be gathered from public sources. The UK may implement a massive data aggregation system, including data on every phone call, email, and instant message in the nation. The fight against terror demands such ubiquitous surveillance, goes the claim.

Would we live our lives differently, fearing that our everyday social contacts, and our adventurous escapades, are all going to wind up in the government’s great social network? How will the world change when clumsy attempts at romantic outreach, phone calls placed to wrong numbers, and group photos snapped at parties all turn into contextless edges in that permanent, all-encompassing social graph?

This gentleman unwisely posted some photos of himself waving $20 bills as part of a Craigslist ad, and now believes that copyright law, as well as criminal fraud statutes, will come to his aid in encouraging Gawker to take them down. Gawker doesn’t seem to agree.

What’s interesting here is the gentleman’s confusion between public and private spaces, the conceit that the photos he posted on Craigslist were still “his” to control. Theoretically, Craigslist might have an argument with Gawker, since the Craigslist terms of service state, “You … agree not to reproduce, duplicate or copy Content from the Service without the express written consent of craigslist.” As a practical matter, Gawker is right: “Craigslist is a public place.”

Also interesting are the gentleman’s threats of legal action to respond to what might kindly be called a personal misjudgment. What people think might be done about the problems they have created for themselves has changed, not only with the litigiousness of society in general, but with the litigiousness about bits in particular. Before the RIAA and the MPAA started going after teenagers for music downloading, people like this might never even have heard of copyright law, much less have thought (however mistakenly) that it could protect their reputation. Another thing for which the recording industries can be thanked, I suppose.