Monthly Archives: September 2009

In America, the enemy is Terrorism. It used to be the Russians, or more generically Communists. We discussed the history of this concept in class today. And then I asked: In the state-controlled Chinese media, who is the enemy today?

Anything that’s hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We’ve all lived with Google for so long that most of us can’t even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the be-all and end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results that are “conceptually similar” to the user’s query even if the relevant documents don’t contain any of the search terms. I also believe that full-scale maps of the online world are important: I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.

The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it’s just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because

We believe the internet should be open to everyone. Currently, only a select few corporations have access to an index of the world wide web. Our intention is to change that.

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages so far crawled, I estimate that they currently have 234 terabytes of data. At today’s storage technology prices of about $100 per terabyte, it would cost roughly $23,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network — it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
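For the curious, the arithmetic above works out like this. A back-of-the-envelope sketch in Python; the per-page size and the prices are just the rough 2009 figures quoted in the text, not DotBot’s official numbers:

```python
# Extrapolating total index size and storage cost from the first DotBot chunk.
# All figures are the approximate ones quoted above.
sample_pages = 600_000        # pages in the first released chunk
sample_bytes = 3.2e9          # ~3.2 GB for that chunk
crawled_pages = 44e9          # total pages crawled so far

bytes_per_page = sample_bytes / sample_pages
total_tb = crawled_pages * bytes_per_page / 1e12

raw_cost = total_tb * 100          # ~$100/TB for bare drives
datacenter_cost = total_tb * 1000  # ~$1000/TB with backups and maintenance

print(f"{total_tb:,.0f} TB; ${raw_cost:,.0f} raw; ${datacenter_cost:,.0f} in a data center")
```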

Full web indices are just too big to play with casually; there will always be a very small number of them.

I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, other companies buy fine-grained access at market rates. We’ve had this model for years in the telecommunications industry, where big companies own the lines and lease access to anyone who is willing to pay.

The key to the whole proposition is a precise definition of access. Google’s keyword “access” is very narrow. Something like SQL queries would expand the space of expressible questions, but you still couldn’t run image comparison algorithms or do the computational linguistics processing necessary for true semantic search. The right way to extract the full potential of a database is to run arbitrary programs on it, and that means the data has to be local.

The only model for open search that works both technologically and financially is to store the web index on a cloud, let your users run their own software against it, and sell the compute cycles.

It is my hope that this is what DotBot is up to. The pieces are all in place already: Amazon and others sell cheap cloud-computing services, and the basic computer science of large-scale parallel data processing is now well understood. To be precise, I want an open search company that sells map-reduce access to their index. Map-reduce is a standard framework for breaking down large computational tasks into small pieces that can be distributed across hundreds or thousands of processors, and Google already uses it internally for all their own applications — but they don’t currently let anyone else run it on their data.
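To make the idea concrete, here is a toy, single-machine sketch of the map-reduce pattern. This is obviously not Google’s actual framework, and the page data is invented; the job it runs is the “count pages per language” query from earlier:

```python
from collections import defaultdict

# Toy in-memory map-reduce. Real frameworks distribute these same two
# phases across thousands of machines; the crawled pages here are made up.

def map_phase(pages, mapper):
    """Run the mapper over every page and group its outputs by key."""
    intermediate = defaultdict(list)
    for page in pages:
        for key, value in mapper(page):
            intermediate[key].append(value)
    return intermediate

def reduce_phase(intermediate, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Example job: how many crawled pages are written in each language?
def mapper(page):
    yield (page["language"], 1)

def reducer(key, values):
    return sum(values)

pages = [
    {"url": "a.com", "language": "en"},
    {"url": "b.cn",  "language": "zh"},
    {"url": "c.cn",  "language": "zh"},
]
counts = reduce_phase(map_phase(pages, mapper), reducer)
print(counts)  # {'en': 1, 'zh': 2}
```

The point of the pattern is that `mapper` and `reducer` are the only pieces a customer would write; the index owner supplies the data and the machines.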

I really think there’s money to be made in providing open search infrastructure, because I really think there’s money to be made in better search. In fact I see an entire category of applications that hasn’t yet been explored outside of a few very well-funded labs (Google, Bellcore, the NSA): “information engineering,” the question of what you can do with all of the world’s data available for processing at high speed. Got an idea for better search? Want to ask new questions of the entire internet? Working on an investigative journalism story that requires specialized data-mining? Code the algorithm in map-reduce, and buy the compute time in tenth-of-a-second chunks on the web index cloud. Suddenly, experimentation is cheap — and anyone who can figure out something valuable to do with a web index can build a business out of it without massive prior investment.

The business landscape will change if web indices do become infrastructure. Most significantly, Google will lose its search monopoly. Competition will probably force them to open up access to their web indices, and this is good. As Google knows, the world’s data is exceedingly valuable — too valuable to leave in the hands of a few large companies. There is an issue of public interest here. Fortunately, there is money to be made in selling open access. Just as energy drives change in physical systems, money drives changes in economic systems. I don’t know who is going to do it or when, but open search infrastructure is probably inevitable. If Google has any sense, they’ll enter the search infrastructure market long before they’re forced (say, before Yahoo and Bing do it first).

Let me know when it happens. There are some things I want to do with the internet.

Digg, YouTube, Slashdot, and many other sites employ user voting to generate collaborative rankings for their content. This is a great idea, but simply counting votes is a horrible way to do it. Fortunately, the fix is simple.

A basic ranking system allows each user to add a vote to the items they like, then builds a “top rated” list by counting votes. The problem with this scheme is that users can only vote on items they’ve seen, and they are far more likely to see items near the top of the list. In fact, anything off the front page may get essentially no views at all — and therefore has virtually no chance of rising to the top.

This is rather serious if the content being rated is serious. It’s fine for Digg to have weird positive-feedback popularity effects, but it’s not fine if we are trying to decide what goes on the front page of a news site. Potentially important stories might never make it to the top simply because they started a little lower in the rankings for whatever reason.

Slightly more sophisticated systems allow users to rate items on a scale, typically 1-5 stars. This seems better, but still introduces weird biases. Adding up the stars assigned by all users to a single item doesn’t work, because users still have to see an item to vote on it. Averaging all the ratings assigned to a single item doesn’t work either, because it can push something permanently to the bottom of the list, if the first user to view it rates it only one star.

There are lots of subtle hacks that one can make to try to fix the system, but it turns out there might actually be a right way to do things.

If every item was rated by every user, there would be no problem with popularity feedback effects.

That’s completely impractical with thousands or even millions of items. But we can actually get close to the same result with much less work, if we take random samples. Like a telephone poll, the opinion of a small group of randomly selected people will be an accurate indicator, to within a few percent, of the result that we would get if we asked everyone.

In practice, this would mean adding a few select “sampling” stories to each front page served, different every time. Items can then be ranked simply by their average rating, with no skewing due to who got to the front page first. (In fact, basic sampling math will tell us which items have the most uncertain ratings and need to be seen with the highest priority.) In effect, we are distributing the work of rating a huge body of items across a huge body of users — true collaborative filtering, using sampling methods to remove the “can’t see it, can’t vote on it” bias.
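Here is a minimal sketch of how that sampling scheme might work. The item names and ratings are invented; the uncertainty measure is just the standard error of the mean, straight out of basic polling statistics:

```python
import math

# Each item accumulates ratings from randomly sampled front-page slots.
# The item with the widest uncertainty gets shown next; ranking uses the
# plain average, so there is no first-to-the-front-page feedback loop.

def stderr(ratings):
    """Standard error of the mean rating; large when an item is under-sampled."""
    n = len(ratings)
    if n < 2:
        return float("inf")  # unrated or once-rated items get top sampling priority
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return math.sqrt(var / n)

def items_to_sample(ratings_by_item, k):
    """Pick the k items whose ratings are most uncertain (smallest -stderr first)."""
    return sorted(ratings_by_item, key=lambda i: -stderr(ratings_by_item[i]))[:k]

def ranking(ratings_by_item):
    """Rank purely by mean rating."""
    return sorted(ratings_by_item,
                  key=lambda i: sum(ratings_by_item[i]) / len(ratings_by_item[i]),
                  reverse=True)

ratings = {
    "story-a": [5, 4, 5, 4, 5],  # well sampled, well liked
    "story-b": [3, 3, 2, 3],     # well sampled, mediocre
    "story-c": [4],              # barely seen: needs sampling, not burial
}
print(items_to_sample(ratings, 1))  # ['story-c']
print(ranking(ratings))             # ['story-a', 'story-c', 'story-b']
```

Note that "story-c" is neither buried by its single rating nor boosted by it; it is simply flagged as the item we know least about.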

This is not an end-all solution to the problem of distributed agenda-setting. User ratings are not necessarily the ideal criterion for measuring “relevance.” One problem is that not every user is going to take the trouble to assign a rating, so you will only be sampling from particularly motivated individuals. Other metrics such as length of time on page might be better — did this person read the whole thing?

Even more fundamentally, it’s not clear that popularity, however defined, is really the right way to set a news agenda in the public interest.

However, any attempt to use user polling for collaborative agenda setting needs to be aware of basic statistical bias issues. Sampling is a simple and very well-developed way to think about such problems.

If we deliver to each person only what they say they want to hear, maybe we end up with a society of narrow-minded individualists. It’s exciting to contemplate news sources that (successfully) predict the sorts of headlines that each user will want to read, but in the extreme case we are reduced to a journalism of the Daily Me: each person isolated inside their own little reflective bubble.

The good news is, specialized maps can show us what we are missing. That’s why I think they need to be standard on all information delivery systems.

For the first time in history, it is possible to map with some accuracy the information that free-range consumers choose for themselves. A famous example is the graph of political booksales produced by orgnet.com:

Here, two books are connected by a line if consumers tended to buy both. What we see is what we always suspected: a stark polarization. For the most part, each person reads either liberal or conservative books. Each of us lives in one information world but not the other. Despite the Enlightenment ideal of free debate, real-world data shows that we do not seek out contradictory viewpoints.

Which was fine, maybe, when the front page brought them to us. When information distribution was monopolized by a small number of newspapers and broadcasters, we had no choice but to be exposed to stories that we might not have picked for ourselves. Whatever charges one can press against biased editors of the past, most of them felt that they had a duty to diversity.

In the age of disaggregation, maybe the money is in giving people what they want. Unfortunately, there is a real possibility that what we want is to have our existing opinions confirmed. You and I and everyone else are going to be far more likely to click through from a headline that confirms what we already believe than from one which challenges us. “I don’t need to read that,” we’ll say, “it’s clearly just biased crap.” The computers will see this, and any sort of recommendation algorithm will quickly end up as a mirror to our preconceptions.

It’s a positive feedback loop that will first split us along existing ideological cleavages, then finer and finer. In the extreme, each of us will be alone in a world that never presents information to the contrary.

We could try to design our systems to recommend a more diverse range of articles (an idea I explored previously) but the problem is, how? Any sort of agenda-setting system that relies on what our friends like will only amplify polarities, while anything based on global criteria is necessarily normative — it makes judgements on what everyone should be seeing. This gets us right back into all the classic problems of ideology and bias — how do we measure diversity of viewpoint? And even if we could agree on a definition of what a “healthy” range of sources is, no one likes to be told what to read.

I think that maps are the way out. Instead of trying to decide what someone “should” see, just make clear to them what they could see.

An information consumption system — an RSS reader, online newspapers, Facebook — could include a map of the infosphere as a standard feature. There are many ways to draw such a map, but the visual metaphor is well-established: each node is an information item (an article, video, etc.) while the links between items indicate their “similarity” in terms of worldview.

This is less abstract than it seems, and with good visual design these sorts of pictures can be immediately obvious. Popular nodes could be drawn larger; closely related nodes could be clustered. The links themselves could be generated from co-consumption data: when one user views two different items, the link between those items gets slightly stronger. There are other ways of classifying items as related — as belonging to similar worldviews — but co-consumption is probably as good a metric as any, and in fact co-purchasing data is at the core of Amazon’s successful recommendation system.
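As a sketch, co-consumption link strengths can be computed from nothing more than per-user viewing histories. The histories and item names below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Every time one user views two items, the edge between those items gets
# slightly stronger. Edge weights like these could then drive the layout
# and clustering of the map described above.

def co_consumption_edges(histories):
    """histories: one set of viewed item ids per user. Returns edge weights."""
    edges = Counter()
    for viewed in histories:
        for a, b in combinations(sorted(viewed), 2):
            edges[(a, b)] += 1
    return edges

histories = [
    {"lib-book-1", "lib-book-2"},                # user who reads liberal books
    {"lib-book-1", "lib-book-2", "con-book-1"},  # a rare crossover reader
    {"con-book-1", "con-book-2"},                # user who reads conservative books
]
edges = co_consumption_edges(histories)
print(edges[("lib-book-1", "lib-book-2")])  # 2: the strongest link in this toy data
```

In a real system the weak crossover edges are exactly the bridges the map should make visible.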

The concepts involved are hardly new, and many maps have been made at the site level where each node is an entire blog, such as the map of the Iranian blogosphere above. However, we have never had a map of individual news items, and never in real-time for everyone to see.

Each map also needs a “you are here” indicator.

This would be nothing more than some way of marking items that the user has personally viewed. Highlight them, center them on the map, and zoom in. But don’t zoom in too much. The whole purpose of the map is to show each of us how small, how narrow and unchallenging our information consumption patterns actually are. We will each discover that we live in a particular city-cluster of information sources, on a particular continent of language, ideology, or culture. A map literally lets you see this at a glance — and you can click on far-away nodes for instant travel to distant worldviews.

Giving people only what they like risks turning journalism into entertainment or narcissism. Forcing people to see things that they are not interested in is a losing strategy, and there isn’t any obvious way to decide what we should see. Showing people a map of the broader world they live in is universally acceptable, and can only encourage curiosity.

Oh Front Page, your days are clearly numbered. For generations all eyes were upon you; you set the public agenda, and advertisers loved you best. In the tumult of the world, your voice carried above all others, and we needed you. You told us when the war ended, and when The Beatles came to town.

But you are in your autumn now.

We know that your children killed you, though they did not mean it. In the age of the scribe, it seemed that anyone could own a printing press. But now, Front Page, we talk online about the monopoly you once claimed. Some will pine for newsprint, but paper is just too expensive, too heavy and static.

But this is not about paper. This is about the way you lived your life, your insistence on a space that you and you alone controlled. You tried to move online, Front Page, but your model would not yield and your children ate your lunch. Google News chooses from the best, while Digg lets us choose for ourselves. There will always be reporters — those who assemble the narratives — but there may not always be editors. Your stubborn insistence on one for all made us question your purpose.

We loved you and you ignored us! Advertisers deserted you first; they were very quick to understand that reader information could be leveraged into relevance. Google itself was built on this model. Meanwhile Amazon and iTunes grasped that efficiencies of delivery had moved the money to the infinite niche. But you admitted none of this, Front Page, and also you did not see that people live in networks, that our friends know what is important to us.

Why would you not give us what we wanted? No one questions your integrity, the standards of journalism you uphold. No one questions that we, the public, need to be told at least as much as we need to be listened to. But suddenly we could talk back, and you weren’t listening. You insisted that we go to you instead of just coming to us. Why did you not use our input to customize the agenda? You could have spawned Facebook applications and iPhone applications and even innovative social RSS readers that determined our interests and automatically delivered ten million personalized headlines! (And their ads.)

You had everything you needed, and this was your unforgivable sin. A hundred years ago you built the Associated Press to feed you, the prototype of distributed journalism. This could have been the beginning, if you had embraced more than the cream of international stories, if you had realized how cheap local reporting could be. Those long tail stories could be vastly cheaper, Front Page, if you embraced more sources, if you fought for transparency instead of access, if you taught citizens to be journalists instead of insisting that they can’t. You could have set the standards and franchised the platform. But instead of finding innovative ways to gather the news and innovative ways to deliver it to us, even now you fight hard to be seen less!

Instead of owning the aggregators and bringing to them the wisdom of an old hand, you scoffed at Digg, at Google, at Memeorandum. Why are there still so many news sites without a panel of “Share This” links beneath each story? Why are we not allowed to speak to the New York Times with user ratings buttons? Your mannerisms are quaint as hoop-skirts, Front Page.

We know also that your less reputable cousin is only slightly younger, and the world will never listen to Television as their parents did. The internet will devour Broadcast too; in only a few more years bandwidth will be cheap enough for anyone to run their own station. We know that upcoming content analysis algorithms will soon make video search a reality, and we know that the RSS future will soon disaggregate Television News just as it only recently disaggregated you.

Front Page, your children are brash, but they are filled with the energy of youth. They have inherited a world you never foresaw, and they are hopeful in a way you are not. It is their world now. You must guide them, but you must let them have it.

In the editorial “New Tweets, Old Needs” experienced journalist Roger Cohen says that Twitter isn’t journalism, and that Iran “has gone opaque” without its mainstream media correspondents. He may be right about the recent paucity of good journalism out of Iran, but he misses some really crucial points about how information flows in the absence of a distribution monopoly (like a printing press.) In particular, he seems to assume that only professional journalists can be capable of producing professional journalism.

It is absolutely true that journalism is much more than random tweeting or blogging. I have been particularly inspired by the notion that “journalism is a discipline of verification,” and a tweet or a blog post neither requires nor endures the fact-checking and truthfulness standards that we expect of our more traditional news media. I also agree that search engines are simply not a substitute for being there. Someone must be a witness. Someone has to feed their experience into the maw of the internet at some point.

However, when Cohen says “the mainstream media — expelled, imprisoned, vilified — is missed” he is implicitly arguing that only the mainstream media can produce good journalism. Traditionally, “journalist” was a distinct, easily defined class: a journalist was someone who worked for a news organization. There weren’t many such organizations, because a distribution monopoly is an expensive thing. All this has changed with the advent of nearly free and truly democratic information distribution, and we are seeing a rapid erosion of the distinction between professional and amateur or “citizen” journalists. The result is confusion, uncertainty and fear — especially on the part of those who have staked their careers or their fortunes on the clarity of this distinction.

But I see a big difference between journalists and journalism, and this is where Cohen and I part ways.

In my view the failure of journalism in Iran was not the failure of the mainstream media to hold their ground (or their funding, or their audiences) but rather the failure of the journalism profession to educate the public about what exactly it does, and how to do it. When Cohen asks questions such as

But who is there to investigate these deaths — or allegations of wholesale rape of hundreds of arrested men and women — and so shed light?

my answer is, the Iranians, of course!

Naturally, a young activist-turned-reporter does not have the experience or connections of an old-school foreign correspondent. But such a person is there, and they care enormously. What they lack is guidance. What is and is not journalism, exactly? What are the expected standards and daily, on-the-ground procedures of verification? Where can someone turn for advice on covering the struggles they are immersed in? And what, actually, differentiates the New York Times from a blogger? We need clear answers, because the newspapers are no longer the only ones declaiming the news.

Perhaps the mainstream media couldn’t be in Iran, but they could have been mentoring and collaborating from afar, and yes, publishing the journalism of non-career journalists. And such a project needs to begin long before times of crisis, in every region, so that those who are there are ready.

If “citizen journalism” has so far been somewhat underwhelming, it is because we have not taught our citizens to be journalists.

The New York Times Magazine and Wired both have major articles this week on recent empirical work in social networks, including significant research on how things like obesity, smoking, and even happiness spread among groups of people. The Wired piece has better pictures, while the NYT piece is more thorough and thoughtful, and covers both the potential and the pitfalls of this kind of analysis.

For decades, sociologists and philosophers have suspected that behaviors can be “contagious.” … Yet the truth is, scientists have never successfully demonstrated that this is really how the world works. None of the case studies directly observed the contagion process in action. They were reverse-engineered later, with sociologists or marketers conducting interviews to try to reconstruct who told whom about what — which meant that people were potentially misrecalling how they were influenced or whom they influenced. And these studies focused on small groups of people, a few dozen or a few hundred at most, which meant they didn’t necessarily indicate much about how a contagious notion spread — if indeed it did — among the broad public. Were superconnectors truly important? How many times did someone need to be exposed to a trend or behavior before they “caught” it? Certainly, scientists knew that a person could influence an immediate peer — but could that influence spread further? Despite our pop-cultural faith in social contagion, no one really knew how it worked.

Representative Joe Wilson yelled “you lie!” at the president, and the papers loved it. Unfortunately, by a count of more than three to one, the major media articles covering the event did not bother to comment on the substance of the issue that provoked Wilson’s outburst: whether or not illegal immigrants would be provided health care under proposed reforms. There is no health care debate in the mainstream American press. There is only political drama.

The president did not lie. All of the proposed health care reform bills contain language excluding those residing illegally in the US from government-subsidized coverage. This single-sentence fact check was entirely absent from 50 of the 70 articles mentioning “wilson” and “lie” on the New York Times and Washington Post websites as of Monday night. Of the 20 which discussed actual policy, only nine articles mentioned it in the first two paragraphs. (Spreadsheet here.)

Wilson’s outburst will be forgotten long after millions of Americans are insured — or not — under Obama’s plan. It’s just noise and heat. Yet some of the most reputable newspapers in the world have led with it for the last five days. In fact, the press has in some cases actively dodged the underlying issue. Consider this exchange from an online Q&A session with Dana Milbank of the Washington Post:

Cincinnati: Are you saying the President wasn’t lying when he said illegal immigrants won’t be covered? Why not look at the House bill and tell us whether or not it allows illegals to be covered? The Congressional Research service issued a report last week saying there was NOTHING in the House bill that excludes illegals from receiving government-run health care. In other words, be a REPORTER instead of a hack for Barack.

Dana Milbank: Actually I wasn’t addressing the factual nature of Obama’s speech. The issue wasn’t that Wilson thought the president wasn’t telling the truth; part of the presidential job description calls for expertise in truth shading. The issue was shouting “you lie!” at the president on the House floor during an address to a joint session of Congress.

(For the record, the CRS report in question notes that HR 3200 says “Nothing in this subtitle shall allow Federal payments for affordability credits on behalf of individuals who are not lawfully present in the United States.” Which has, oddly, been spun as meaning that illegals would be subsidized!)

It should be no surprise that there is actually substance to the question of coverage for illegal immigrants. Only nine of the 70 pieces get into it: yes, a few undocumented workers could end up getting subsidized health care. No, it’s not worth taxpayer money to add an enforcement mechanism.

But even this is one level removed, and only one article grappled with the fundamental question: would it really be so bad if the poorest workers in America got a break? In fact we might even owe it to them. On average, migrant labor is thought to be a small net gain to the American economy.

I get that Wilson’s little moment is a great story, right up there with the guy who threw a shoe at Bush (who was imprisoned for his prank, with far less coverage.) And I do understand the logic of a populist press as the paper ship sinks. What cannot be excused is the omission of any mention of the substantive content of the debate from the majority of coverage — 50 out of 70 articles said nothing at all about anything that will last.

A man was angry and a woman was crying, by the bus stop at 12:15AM near Causeway Bay MTR. I walked right past them (almost didn’t see them, don’t stop don’t get involved) then turned around when I heard blows and realized that no one else was paying attention.

What I saw was this: an angry Chinese man in a suit restraining a slight Chinese woman in a dress by the arm. He was yelling, and then he was banging his head against the ad on the side of the bus shelter. She was crying and she was trying to pull away from him.

I did it, I turned around, walked back, got within a few meters, and said quietly and clearly, “you need to leave her alone.” And that was all I had. Then I just stood there.

He pulled her closer. Her back was to me. He wasn’t looking at me. He put his arms around her waist like he was keeping her from leaving. She pushed him away weakly with her hands on his chest. She pulled him in halfheartedly with an arm around his shoulders. He restrained her. He held her close and smoothed her hair. She was crying. She was apologizing. She was actually in the wrong, or she wasn’t.

His eyes flicked up to see me still standing there, just watching, heart pounding, wondering if I was about to get punched. I had nothing. Calling 999 wouldn’t help. He wouldn’t meet my eyes. He was angry. He was shamed. She had cheated on him, or he had cheated and then he had made it about her. She was staying or she wanted to go. She stood as conflicted as I was confused. I stood very still and alert and watched with I do not know what expression on my face.

How long was I going to stand there, and why?

I wanted to say, “come with me.” I imagined calling the only stable Westerners I knew in Hong Kong, Jessica and her husband, a coffee-shop encounter and traded cards, I barely knew them. I imagined how I would lead her gently into a taxi and say to the driver, “Shuen Wan station, please,” and call to get the exact address, then the conversation as I explained the emergency to a mere acquaintance in front of a crying woman. I wanted to say: you don’t have to put up with this. I wanted to say, there is help. I don’t know where but Jessica is going to look up women’s shelters while you drink this tea. Or milk. Or whatever the hell comforts a weeping Chinese woman gone too far from the village.

She did and she didn’t. Her body language said everything. “It’s okay. He’s my husband. Thank you.” Brave face.

But it wasn’t ok, and I didn’t and still don’t know how not OK, and I didn’t know what to say to her that would allow her to break free, if breaking free is what she needs, if breaking free can even exist for her. Unless of course everything actually really was OK. Unless of course I had just made it worse by shaming her husband in public.