Why we made this change

Visitors are allowed 3 free articles per month (without a subscription), and private browsing prevents us from counting how many stories you've read. We hope you understand, and consider subscribing for unlimited online access.

How iTunes Genius Really Works

Ever since the feature debuted in 2008, there’s been a lot of speculation about how iTunes Genius accomplishes its playlist-building magic. Now an engineer at Apple that works on the iTune Genius team has revealed some tantalizing clues–a rare disclosure for the infamously secretive company.

Recapitulating what Steve Jobs has said previously about iTunes Genius, Apple engineer Erik Goldman writes in his post on Quora that the starting point for the Genius service is a packet of usage data–what songs a user has in his or her library (and, presumably, how often
he or she plays them)–sent from the iTunes application which is “folded into a larger database of users and songs.”

Basically, your library of tracks is compared to all the other Genius users’ libraries of tracks. Apple then runs a set of previously secret algorithms, which Goldman described as straightforward recommendation algorithms similar to those used by other services like Netflix when it suggests movies for a user to watch now or add to his quene, to generate statistics for each song. “These statistics are computed globally at regular intervals and stored in a cache,” notes Goldman, because data on the similarity of any two songs changes slowly–it’s assumed the only reason it changes at all is because of the changing tastes of the listening public, and the introduction of new
tracks and artists.

Goldman jokes that if he told you how Genius works, he’d have to kill you (or at the least, have a squad of police officers raid your brain to retrieve Apple’s rightful property), but he continues to describe how the program works anyway.

To uncover part of how iTunes Genius works, says Goldman, “look at information retrieval algorithms, especially those that leverage the vector-space model.” But before you can compare factors, such as the frequency of a particular artist or genre in a user’s library or
playlists, across iTunes libraries via a Vector-Space model, you need a clever way to define the factor that gives more weight to the things that really matter.

A simple way to properly weight factors for comparison is what’s known as term frequency-inverse document frequency (tf-idf). It’s simply a way to compare how often a particular factor occurs in a single document (or song, or library) to how often that factor occurs in a larger body such as the sum of all iTunes libraries stored by the Genius servers. Thus, a factor that occurs pretty often in a given user’s library–for example, an affinity for an obscure indy band–will tend to be a more powerful determinant, unless it also happens to occur quite often in the total set of data–as would be the case if the factor was an affinity for the Beatles.

Once you’ve got your tf-idf weights sorted, you can represent them in a vector space model as vectors.

In this example (courtesy Wikipedia) two different documents (or songs) have all their various tf-idf weights represented as a single vector (e.g. d1) which can then be compared to a second document / vector (e.g. d2) and a query (q)–such as “which of these two songs is most like the one for which I’ve just clicked the ‘genius’ button.” Whichever one is closer in angle to your query vector is more similar.

Digging deeper into iTune Genius system, Goldman talks about its use of latent-factor algorithms. “Latent-factor algorithms, in particular, tend to work very well on huge data sets with an enormous number of dimensions and a lot of noise,” says Goldman.

Latent factors are what shakes out when you do a particular kind of statistical analysis, called a factor analysis, on a set of data, looking for the hidden, unseen variables that cause the variation in all the different variables you’re examining. Let’s say that the variability in a dozen different variables turns out to be caused by just four or five “hidden” variables–those are your latent factors. They cause many other variables to move in more or less lock-step.

Discovering the hidden or “latent” factors in your data set is a handy way to reduce the size of the problem that you have to compute, and it works because humans are predictable: people who like Emo music are sad, and sad people also like the soundtracks to movie versions of vampire novels that are about yearning, etc. You might think of it as the mathematical expression of a stereotype–only it works.

If you want to go really deep on this subject, Goldman suggests you read the papers that came out of the million-dollar Netflix Prize, which was won by a combination of teams led by engineers from AT&T. Their challenge was to improve Netflix’s recommendation engine, and one of their primary innovations was reducing the computational intensity of the algorithms used in recommendation engines.

Previously, the amount of computation required to do a pairwise comparison of any two items in Netflix’s (and presumably Apple’s) library scaled as a quadratic function of the number of comparisons to be performed. But the AT&T team figured out how to re-write a fundamental algorithm to make the problem scale only linearly with the amount of data involved. So, whatever Apple’s new data center is for, it’s probably not to calculate Genius results.

Tagged

Christopher Mims was a contributing editor at MIT Technology Review between 2011 and 2012. Before that, he blogged about the converging crises of the 21st century for Grist. He… More has also reported for Wired and Scientific American, and worked on various projects for the BBC, The Atlantic and Smithsonian.

You've read
of three
free articles this month.
Subscribe now for unlimited online access.
You've read
of three
free articles this month.
Subscribe now for unlimited online access.
This is your last free article this month.
Subscribe now for unlimited online access.
You've read all your free articles this month.
Subscribe now for unlimited online access.
You've read
of three
free articles this month.
Log in for more, or subscribe now for unlimited online access.
Log in for two more free articles, or subscribe now
for unlimited online access.