Fri June 21, 2013

Calling It 'Metadata' Doesn't Make Surveillance Less Intrusive

"This is just metadata. There is no content involved." That was how Sen. Dianne Feinstein defended the NSA's blanket surveillance of Americans' phone records and Internet activity. Before those revelations, not many people had heard of metadata, the term librarians and programmers use for the data that describes a particular document or record it's linked to. It's the data you find on a card in a library catalog, or the creation date and size of a file in a folder window. It's the penciled note on the back of a snapshot: "Kathleen and Ashley, Lake Charles, 1963." Or it could be the times, numbers and GPS locations attached to the calls in a phone log.

"Metadata" was bound to break out sooner or later, riding the wave of "data" in all its forms and combinations. "Big data" and "data mining" are the reigning tech buzzwords these days, and university faculties are scrambling to meet the surge in demand for courses in the hot new field of data science. It's as if "data" is usurping "information" as a byword. Up to now, "data" has played a supporting role in the information age. There's a popular definition of data as the raw material that becomes information when it's processed and made meaningful. That puts information at the center of the modern tech world, but it isn't how anybody actually uses the two words. I have this image of somebody working a spreadsheet as a manager leans over and says, "Is it information yet?"

But the shift in focus from information to data reflects a genuine difference between the two. "Information" brings to mind the knowledge that's gathered in libraries, encyclopedias, newspapers and journals — stuff that has an independent existence in the world. "Data" is always connected to particular things and events. It comes from experiments, sensors, official records. Or it's the scuff marks we leave behind as we click on websites, make calls, go through the E-ZPass tollbooths, visit an ATM. It's all out there, accumulating in ginormabytes, overflowing the server farms.

When you're focused on information in that stand-alone sense, metadata plays a subordinate role. In the old days, it was just a tool for getting to the stuff you were really interested in. Think how much metadata you had to wade through back then to find a passage about drunkenness in Alexis de Tocqueville's Democracy in America — looking up the book in the library card catalog, writing down its call number, finding it on the shelves, searching for "drunkenness" in the index, then finally turning to the page you're after. Now that that kind of information is online, metadata can seem almost irrelevant. No need for catalogs or indexes: You just enter a query, and when the book comes up, you barrel in sideways. That's probably why Google was so careless about metadata when they digitized major library collections for Google Books. Literally millions of books are misdated or misclassified: It's not odd to run into a Web browser manual dated 1939 that lists Sigmund Freud as its author or a copy of Madame Bovary attributed to Henry James and filed under "antiques and collectibles." The faulty metadata prompted some grumbles from academics, and Google has been trying to fix it. But it doesn't bother most of the people who use Google Books — they get at its information in other ways.

But metadata gets a lot more respect in other corners of the Google campus, not to mention from its competitors up and down U.S. 101. Their focus isn't on information in the abstract but on collecting specific data about their users, and for that they need to get the metadata right: Who's visited this page, how long did they stay, where did they go next, who did they e-mail, call or text? All so that advertisers can ensure that the seersucker jacket I clicked on yesterday will stalk me to the end of my days.

That's the same kind of metadata that the NSA insists it needs to trawl. Its defenders maintain we have to be willing to trade some privacy for security, and right now we're all arguing about where to put the boundaries. But some advocates of the surveillance have also tried to soft-pedal its intrusiveness. You hear people pronouncing "metadata" as a soothing incantation, as if your right to privacy ends as soon as you lick and seal the envelope. Sifting through the metadata, the president said, involves just "modest encroachments on privacy." James Clapper, the director of National Intelligence, compared the programs to combing through a library with millions of volumes and sorting them by their Dewey decimal numbers, without actually opening and reading them.

That's not quite apt. If you're going to compare this to rummaging around in an old-fashioned library, it's more like opening the back covers of all the books to see whose names are on the borrowers' cards. Whether or not you think the government should be sweeping this stuff up, calling it metadata doesn't make the process any less intrusive. Tell me where you've been and who you've been talking to, and I'll tell you about your politics, your health, your sexual orientation, your finances. Why don't we just let the word "metadata" sink back into the nerdy cubicles it came from? When it comes to privacy, the "meta-" doesn't matter. In the post-information age, it's just data all the way down.

Copyright 2013 NPR. To see more, visit http://www.npr.org/.

Transcript

DAVID BIANCULLI, HOST:

This is FRESH AIR. One effect of the revelations about NSA surveillance has been to bring the unfamiliar word metadata into the spotlight. Our linguist, Geoff Nunberg, explains what metadata is, how it's related to the big data revolution and whether it makes any difference - or, as he puts it: does meta matter?

GEOFF NUNBERG, BYLINE: This is just metadata. There is no content involved. That was how Senator Dianne Feinstein defended the NSA's blanket surveillance of Americans' phone records and Internet activity. Before those revelations, not many people had heard of metadata, the term librarians and programmers use for the data that describes a particular document or record it's linked to. It's the data you find on a card in a library catalog, or the creation date and size of a file in a folder window. It's the penciled note on the back of a snapshot: Kathleen and Ashley, Lake Charles, 1963. Or it could be the times, numbers and GPS locations attached to the calls in a phone log.

Metadata was bound to break out sooner or later, riding the wave of data in all its forms and combinations. Big data and data mining are the reigning tech buzzwords these days, and university faculties are scrambling to meet the surge in demand for courses in the hot new field of data science. It's as if data is usurping information as a byword. Up to now, data has played a supporting role in the information age. There's a popular definition of data as the raw material that becomes information when it's processed and made meaningful. That puts information at the center of the modern tech world, but it isn't how anybody actually uses the two words. I have this image of somebody working on a spreadsheet as a manager leans over and says, is it information yet?

But the shift in focus from information to data reflects a genuine difference between the two. Information brings to mind the knowledge that's gathered in libraries, encyclopedias and journals - stuff that has an independent existence in the world. Data is always connected to particular things and events. It comes from experiments and sensors and official records, or from the scuff marks we leave behind as we click on websites, make calls, go through the E-Z Pass toll booths, visit an ATM. It's all out there, accumulating in ginormabytes, overflowing the server farms.

When you're focused on information in that stand-alone sense, metadata plays a subordinate role. In the old days, it was just a tool for getting to the stuff you were really interested in. Think how much metadata you had to wade through to find a passage about drunkenness in Tocqueville's "Democracy in America."

Looking up the book in the library card catalogue, writing down its call number, finding it on the shelves, searching for drunkenness in the index, then finally turning to the page.

Now that that kind of information is online, metadata can seem almost irrelevant. No need for catalogues or indexes - you just enter a query and when the book comes up, you barrel in sideways. That's probably why Google was so careless about metadata when they digitized major library collections for Google Books. Literally millions of books are mis-dated or misclassified. It's not odd to run into a web browser manual dated 1939 that lists Sigmund Freud as its author.

Or a copy of "Madame Bovary" attributed to Henry James and filed under antiques and collectibles. The faulty metadata prompted some grumbles from academics, and Google's been working on fixing it. But it doesn't bother most of the people who use Google Books. They get at its information in other ways.

But metadata gets a lot more respect in other corners of the Google campus, not to mention from its competitors up and down U.S. 101. Their focus is not on information in the abstract, but on collecting specific data about their users. And for that, they need to get the metadata right. Who's visited this page? How long did they stay? Where did they go next? Who did they email, call, or text?

All so that advertisers can ensure that the seersucker jacket I clicked on yesterday will stalk me to the end of my days. That's the same kind of metadata the NSA has been trawling. Its defenders maintain that we have to be willing to trade some privacy for some security, and right now we're all arguing about where to put the boundaries.

But some advocates of the surveillance have also tried to soft-pedal its intrusiveness. You hear people pronouncing metadata as a soothing incantation, as if your right to privacy ends as soon as you lick and seal the envelope. Sifting through the metadata, the president said, involves just modest encroachments on privacy.

James Clapper, the director of National Intelligence, compared the programs to combing through a library with millions of volumes and sorting them by their Dewey decimal numbers, without actually opening and reading them. But if you're going to compare this to rummaging around in an old-fashioned library, it's more like opening the back covers of all the books to see whose names are on the borrowers' cards.

Whether or not you think the government should be sweeping this stuff up, calling it metadata doesn't make the process any less intrusive. Tell me where you've been and who you've been talking to, and I'll tell you about your politics, your health, your sexual orientation, your finances. So maybe we should let the word sink back into the nerdy cubicles it came from. When it comes to privacy, the meta doesn't matter. In the post-information age, it's just data all the way down.

BIANCULLI: Geoff Nunberg is a linguist who teaches at the University of California, Berkeley, School of Information. Coming up, Ken Tucker reviews "Yeezus," the new album from Kanye West. This is FRESH AIR.