Musings about games, religion, politics, and other forms of entertainment.

Sunday, April 22, 2007

Operation: Help Me With My Thesis - episode 2

Thanks to everyone who responded to my request for Master's Thesis ideas. As I mentioned in the comments section, I'm planning to do some news analysis using sites like Digg.com, reddit.com, and perhaps del.icio.us.

I like to say that this topic is partly inspired by Anna Nicole Smith, since around the time I thought of it, Smith died and for some reason completely monopolized cable news for several weeks. I kept wondering: Why in the world do they think people care about her? People die all the time. As celebrities go, she wasn't particularly interesting. Do people actually read this stuff?

Web 2.0 can give sort of a handle on answering this question. At Digg.com and similar sites, people actually rate the news by voting it up or down. A given news item will get an overall "score" for how many people voted for and against it.

Now suppose you take the average rating of a news story on a given subject -- let's stick with Anna Nicole Smith as the example -- and compare it to the number of times stories on that subject appeared in the news, across all news sites. The first number would tell you what people want to read about. The second number would tell you what is being presented most often as news. We could probably normalize this by what section of the newspaper it appears in -- for example, a story that appears on the front page is considered more important than one that doesn't; a long story may be more important than a short one.
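If I were to sketch those two measurements in code with made-up numbers, it might look something like this. The field names (subject, votes_up, votes_down, section) and the section weights are my own invented placeholders, not anything from Digg's actual data:

```python
# Sketch of the two measurements: average reader rating per subject vs.
# prominence-weighted coverage. All data and field names are hypothetical.

stories = [
    {"subject": "anna nicole smith", "votes_up": 120, "votes_down": 340, "section": "front"},
    {"subject": "anna nicole smith", "votes_up": 80,  "votes_down": 210, "section": "inside"},
    {"subject": "iraq war",          "votes_up": 950, "votes_down": 110, "section": "front"},
]

# Hypothetical prominence weights: front-page stories count double.
SECTION_WEIGHT = {"front": 2.0, "inside": 1.0}

def summarize(stories):
    """Return {subject: (average net score, prominence-weighted coverage)}."""
    totals = {}
    for s in stories:
        net = s["votes_up"] - s["votes_down"]    # what readers want
        weight = SECTION_WEIGHT[s["section"]]    # what editors push
        score_sum, weight_sum, count = totals.get(s["subject"], (0, 0.0, 0))
        totals[s["subject"]] = (score_sum + net, weight_sum + weight, count + 1)
    return {subj: (score_sum / count, weight_sum)
            for subj, (score_sum, weight_sum, count) in totals.items()}

print(summarize(stories))
```

A big gap between the two numbers for a subject would be exactly the Anna Nicole effect: heavy coverage, low reader interest.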

So the question at hand is: how successful are news sources at generating information that people want? Are readers really treating their news as entertainment, or do they recommend hard-hitting investigative reporting much more heavily? And what about media bias, either liberal or conservative?

In theory, it may be possible to quickly identify stories as leaning towards a liberal or conservative position, perhaps by cross-referencing them with the people who recommend them. Then what? Well, suppose it turns out that there are more liberal stories than conservative ones in the media... but suppose also that the liberal stories tend to be rated higher and read by more people than the conservative ones. That might indicate that, for instance, the idea of what "liberal" means is out of sync with the political center. Of course, it could go either way, and I'll be interested to try to come up with a measurement that doesn't bias the results.

There are tons of flaws with this topic, and I'll acknowledge some of them up front. For starters, those who subscribe to Digg almost certainly do not constitute a representative sample of all people in the country who read the news. So there's no way I can think of to justify any claims about all people nationwide. However, just investigating this cross section of people, and seeing what they like, could be useful and interesting in various ways that I haven't thought of yet.

When I talked about this topic with Dr. Ghosh, who will be my adviser, he said I shouldn't get sidetracked by that kind of problem, because it's not unusual for a research paper to be limited in scope. In fact, he recommended that I deliberately limit the scope to around five news sources, so that I have interesting things to say about just articles from those sites. I was thinking of picking three somewhat "mainstream" media sites (for example, NY Times, Washington Post, and CNN); then pick a liberal feed (perhaps Daily Kos) and a conservative feed (Fox News? Washington Times? WorldNet Daily?) to compare against.

3 comments:

Trying to figure out the least biased way of assigning initial bias. So I figure the idea is you hand-tag a couple of sources with a priori bias. Those biases taint users who vote stories from those sources as relevant, and could also taint additional sources.

Is that what you were thinking or is there a less biased way of doing it?

That's a great idea, nephlm. I hadn't thought of it, but I think this is a good way to start.

For example:

1. Pick 3-5 liberal sites (Daily Kos, Huffington Post, etc) and 3-5 conservative sites (Powerline, Town Hall, Fox News, etc). Call those the control group. I could probably even do some mining to see which of those sites are most popular overall.
2. Identify readers who frequently recommend sites of one type and not the other. Those people are my control group of obvious liberals and obvious conservatives.
3. Weight future stories based on which kinds of outliers recommend them. There you have the liberal/conservative axis.
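To make steps 2 and 3 concrete, here's a rough sketch with invented voting histories. The seed lists, the 75% threshold, and the -1..+1 scoring are all assumptions I'd have to tune against real data:

```python
# Sketch of steps 2 and 3: classify users from hand-tagged seed sites,
# then place stories on a liberal/conservative axis. Data is invented.

LIBERAL_SITES = {"dailykos.com", "huffingtonpost.com"}
CONSERVATIVE_SITES = {"foxnews.com", "townhall.com", "powerlineblog.com"}

# Hypothetical histories: user -> domains of stories they recommended.
votes = {
    "alice": ["dailykos.com", "huffingtonpost.com", "dailykos.com"],
    "bob":   ["foxnews.com", "townhall.com"],
    "carol": ["dailykos.com", "foxnews.com"],  # mixed; not an obvious partisan
}

def classify_users(votes, threshold=0.75):
    """Step 2: tag users who recommend one side's seed sites and not the other."""
    labels = {}
    for user, domains in votes.items():
        lib = sum(d in LIBERAL_SITES for d in domains)
        con = sum(d in CONSERVATIVE_SITES for d in domains)
        total = lib + con
        if total == 0:
            continue  # never touched a seed site; can't classify
        if lib / total >= threshold:
            labels[user] = "liberal"
        elif con / total >= threshold:
            labels[user] = "conservative"
    return labels

def score_story(recommenders, labels):
    """Step 3: score a story from -1 (conservative) to +1 (liberal)."""
    lib = sum(labels.get(u) == "liberal" for u in recommenders)
    con = sum(labels.get(u) == "conservative" for u in recommenders)
    return 0.0 if lib + con == 0 else (lib - con) / (lib + con)

labels = classify_users(votes)
print(labels)
print(score_story(["alice", "bob", "carol"], labels))
```

Note that carol drops out at step 2, so she contributes nothing to story scores in step 3 -- which is the point of only using obvious partisans as the measuring stick.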

I would probably have to scan the comments and make sure that "recommend" = "agree with" for these people. Some people may link a site on the grounds that "You should read this because it pisses me off!"

It might be possible, but less fruitful, to identify some sites as "fluff" (e.g. People Magazine) and other sites as especially hard-hitting (The Economist?). That might be a way to see which mainstream sites tend to be serious, but my instinct is that this will be much less clear-cut.

You should probably do a spot check, but I suspect that people posting to bring attention to something objectionable would also post things they agree with, and so step 2 would minimize false identification.

Another minor wrinkle that could probably be safely ignored is people pointing out when biased sources post an article against their bias, in an "even x source says..." sort of way.