March 2, 2010

Real-Time is Prime-Time for Scams

I had a brief panic earlier this week. For a few hours I thought the Internet
was on the verge of collapse. The sudden concern was brought about by trying to
find season 1 episode 1 of "Glee" online somewhere.

My first thought was to go to Hulu.com as this has become my new authority for
legal online television. It turns out that only the last 5 episodes of Glee are
available on Hulu. A fascinating conversation with a Fox executive taught me
that this is the result of a legacy method of licensing television content
that has been shoehorned into the Internet TV age. Shows used to follow a
well-ordered path from brand-new to syndication, which doesn't match the
public's expectation of finding every show archived somewhere on the Internet.
That, however, is tangential to my panic.

Unsuccessful on Hulu, I tried a general Google search and turned up many, many
pages that purported to be Glee Season 1 Episode 1 but were really just gateway
videos to ... you guessed it ... porn.

I immediately related this experience to the aftermath of the Haiti and now
Chile earthquakes. Right after both disasters, Internet sites sprang up that
fraudulently offered to take donations on behalf of victims, or that redirected
you to their own pet issue or product, which was rarely related to the disaster.

A final example of this effect comes from Twitter. In the Twitterverse, whenever
a meme is created, usually with a hashtag, it is not long before the griefers
and scammers show up. They post their VIAGRA ad with the hottest Twitter meme
hashtag and destroy the conversation for everyone else.

My panic peaked at this point. How has the Internet survived so long in the face
of this stuff? Has it just grown to the point where scamming is now economically
feasible? Are we in a new era of the web that looks like the spam era of email?

As part of the work that I've been doing on Information Retrieval, I was able to
consider how powerful the signal from PageRank must be to overcome this, and to
have been overcoming it for so long.

PageRank is a technique in which links from one page to another confer authority
on the destination page. The paths that people can take through the Internet by
following links therefore reveal a great deal about where the good content is
and where the bad content is. The links represent the efforts of human curation
on the Internet. Every link that you put on your web page helps PageRank sift
the garbage from the gold.
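
To make the idea concrete, here is a minimal sketch of PageRank as a simple
power iteration over a toy link graph. It is an illustration only, not how any
real search engine is implemented. The page names, the damping factor of 0.85,
and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    # Collect every page that appears as a source or a destination.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with authority spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its authority among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its authority evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to scam_page, so it earns no authority
# beyond the baseline share.
toy_web = {
    "fan_blog": ["official_site"],
    "news_story": ["official_site", "fan_blog"],
    "official_site": ["news_story"],
    "scam_page": ["official_site"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page that everyone links to ends up on top and the page
that nobody links to ends up at the bottom, which is the human curation effect
described above.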

However, this doesn't work with real-time information because PageRank is pretty
slow. It takes time for people to add those links. It takes time to figure
out the shape of the Internet and it takes time to report the results back to
Internet searchers. Apparently it takes more than half a season of Glee,
because I can only find garbage today.

So PageRank works well for archived data. What can work for real-time data?
Maybe social networks can. If you can leverage social networks to immediately
vote on the content being created by the real-time web, then perhaps the social
network can replace PageRank for ephemeral data. All that remains is a way of
figuring out what people think is good or bad in the same way that looking at a
link tells you whether people think content is good or bad.
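
One way to picture such a signal is a running tally of votes in which newer
votes count for more than older ones. The sketch below is purely hypothetical.
The function name, the +1 / -1 vote encoding, and the one-hour half-life are
all my assumptions, and I am not claiming this is the right answer to the
problem.

```python
def realtime_score(votes, now, half_life_seconds=3600.0):
    """Score a piece of content from timestamped votes.

    votes is an iterable of (timestamp_seconds, value) pairs, where value is
    +1 for "good" and -1 for "bad". The half-life is an assumed parameter.
    """
    score = 0.0
    for timestamp, value in votes:
        age = max(0.0, now - timestamp)
        # A vote loses half of its weight every half_life_seconds.
        weight = 0.5 ** (age / half_life_seconds)
        score += value * weight
    return score


# Hypothetical usage: two fresh "good" votes and one hour-old "bad" vote.
votes = [(7200, 1), (7200, 1), (3600, -1)]
print(realtime_score(votes, now=7200))
```

The hard part, which this sketch skips entirely, is deciding whose votes to
trust.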

So what started out as a panic that the Internet was about to collapse really
gave me a new appreciation for PageRank. In some ways the link structure of the
Internet is the social network that we have been leveraging all along. My panic
also made me realize that we need a new signal for real-time ephemeral data,
like news and tweets, to sift the good from the bad. My panic has subsided now
that I know the shape of the problem a little better. I think the problem is
large, but it would be cool to solve it.