Saturday, November 22, 2008

SVD for You and Me

During Personify’s heyday, we were featured in a Business 2.0 article. It doesn’t seem to exist on the Web anymore, but the main thing I remember was a sentence that talked about Personify’s “algorithm-based software”—a phrase as useless as describing a car engine as “moving-parts-based.”

Certainly no one fed the author that phrase, and I doubt he or she invented it. More likely, the author wrote something that actually described our software, which an editor took the liberty of simplifying—to the point of pointlessness—for Business 2.0’s audience. Such things happen. It generated some smirks around the office, and that was that.

I tell this story because this weekend’s New York Times Magazine has a welcome counterpoint: an article about the Netflix Prize that could easily have hand-waved the details, per “algorithm-based software,” but instead made the details approachable and interesting for an audience even more general than Business 2.0’s.

Ironically, the algorithmic star of the article is singular value decomposition (SVD), a core component of, you guessed it, Personify’s algorithm-based software. Author Clive Thompson and his editor deserve credit for explaining SVD in everyday language, sprinkling in plenty of movie examples from the Netflix contest. I’ll let you read it in the article (link below), but understanding SVD matters to Thompson’s larger questions: how predictable are human tastes, and are there limits to our ability to comprehend why certain predictions work?
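For readers who want more than everyday language: here is a toy sketch of the idea, not anything from the article or from Personify's code. SVD factors a viewer-by-movie ratings matrix into orthogonal "taste" dimensions; keeping only the top few dimensions gives a low-rank approximation that is the basis for prediction. The ratings below are made up.

```python
import numpy as np

# Made-up viewer-by-movie ratings matrix (rows: viewers, cols: movies).
# The first two viewers like the first two movies; the last two viewers
# like the last two movies -- roughly two underlying "tastes."
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Factor R into orthogonal taste dimensions.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the top k = 2 dimensions and reconstruct: a low-rank
# approximation that captures the two dominant taste patterns.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The rows of `Vt[:k]` are the movie "taste" dimensions the Times article describes in prose; a viewer's predicted affinity for an unseen movie is read off `R_approx`.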

A final irony: The New York Times Company may be running SVD to analyze behavior related to, among other things, Thompson’s article. I say that because The Times was a Personify customer, and last I heard (as of mid-2008), they were still running our software at terabyte scale, six years after we discontinued official support. Just goes to show: there are a lot of weird connections out there.

4 comments:

That's a great article. When I first heard of the Netflix Prize I considered resurrecting some old SVD code and throwing it at the problem, but figured "hmm, I don't know much about statistics, probably not a good idea." Funny that now everyone is using it. Yet again, Bruce's application of decades-old technology was ahead of its time =)

I wonder if the Netflix Challenge could benefit from the funky "rotation" post-processing logic that we used at Personify. It made the final segmentation more human-understandable by orthogonally rotating each dimension, turning somewhat opaque results (e.g., x=.5, y=.5 and x=.5, y=-.5) into ones that our brains like more (x=0, y=1 and x=1, y=0).

Netflix doesn't expose their Cinematch results in the same way (the end product is recommendations, not reports), but having a readily understandable segmentation would help with a "more like this / less like this" style of recommendations.
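The rotation trick the comment describes can be sketched with a plain 2-D orthogonal rotation, using the comment's own example points. This is only an illustration of the geometry; Personify's actual post-processing (and the usual factor-analysis choice, something like varimax) would pick the rotation angle from the data rather than hard-coding 45 degrees.

```python
import numpy as np

# The comment's "opaque" pair of component loadings.
pts = np.array([[0.5,  0.5],
                [0.5, -0.5]])

# Rotate by 45 degrees. An orthogonal rotation preserves lengths and
# angles, so the space isn't distorted -- only the axes move.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

rotated = pts @ R.T
# Each point now lies on a single axis: (0, ~0.707) and (~0.707, 0),
# i.e. the comment's (x=0, y=1) and (x=1, y=0) up to length, which a
# rotation necessarily preserves.
```

Because the rotation is orthogonal, nothing about the model's fit changes; only the interpretability of the axes does.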

Rotation is a standard trick, so some of the contestants are almost certainly doing it. However, unlike a typical Personify data set, which had a clean layer of metadata from Beacons, the Netflix data set is metadata-poor. It's arguable whether that's a problem for the algorithms, but it's definitely an obstacle to having the output be readily understandable.

One could scrape some relevant metadata and join it to the Netflix data set. That presents its own set of issues, however.

I didn't realize rotation was a standard trick; all of the stats guys we had look at it (other than Bruce) were pretty weirded out by it. Maybe it was just the particular implementation we chose.

In re: metadata, I'm not sure I agree. Beacons at Personify had two benefits:

1. Transforming unreadable input (URLs) into human-understandable terms (blue sweater info pages).
2. Reducing the cardinality ("width") of the SVD input set by grouping "functionally identical" data into a single Beacon.

Netflix has movie titles, so (1) isn't particularly necessary and (2) doesn't apply. When SVD spits out its (possibly rotated) results, the resulting clusters are bunches of movie titles, which are just as good, if not better, than bunches of Beacons. The NYTimes article referred to clusters in exactly this way.

Just having a cluster of movie titles does not get you to this (as quoted from the article): "You can find things like 'People who like action movies, but only if there's a lot of explosions, and not if there's a lot of blood. And maybe they don't like profanity.'"

The Beacon equivalent is to have movies tagged with metadata that reflects the amount of action, explosions, blood, profanity, and so on. With that, you can profile a cluster, per above, by its intensity on the various metadata elements.