How good is Google’s Instant Mix?

This week, Google launched the beta of its music locker service where you can upload all your music to the cloud and listen to it from anywhere. According to Techcrunch, Google’s Paul Joyce revealed that the Music Beta killer feature is ‘Instant Mix,’ Google’s version of Genius playlists, where you can select a song that you like and the music manager will create a playlist based on songs that sound similar. I wondered how good this ‘killer feature’ of Music Beta really was and so I decided to try to evaluate how well Instant Mix works to create playlists.

The Evaluation

Google’s Instant Mix, like many playlisting engines, creates a playlist of songs given a seed song. It tries to find songs that go well with the seed song. Unfortunately, there’s no solid objective measure to evaluate playlists. There’s no algorithm that we can use to say whether one playlist is better than another. A good playlist derived from a single seed will certainly have songs that sound similar to the seed, but there are many other aspects as well: the mix of the familiar and the new, surprise, emotional arc, song order, song transitions, and so on. If you are interested in the perils of playlist evaluation, check out this talk Dr. Ben Fields and I gave at ISMIR 2010: Finding a path through the jukebox. The Playlist tutorial. (Warning, it is a 300 slide deck).

Adding to the difficulty in evaluating the Instant Mix is that since it generates playlists within an individual’s music collection, the universe of music that it can draw from is much smaller than a general playlisting engine such as we see with a system like Pandora. A playlist may appear to be poor because it is filled with songs that are poor matches to the seed, but in fact those songs actually may be the best matches within the individual’s music collection.

Evaluating playlists is hard. However, there is something that we can do that is fairly easy to give us an idea of how well a playlisting engine works compared to others. I call it the WTF test. It is really quite simple. You generate a playlist, and just count the number of head-scratchers in the list. If you look at a song in a playlist and say to yourself ‘How the heck did this song get in this playlist’ you bump the counter for the playlist. The higher the WTF count the worse the playlist. As a first order quality metric, I really like the WTF Test. It is easy to apply, and focuses on a critical aspect of playlist quality. If a playlist is filled with jarring transitions, leaving the listener with iPod whiplash as they are jerked through songs of vastly different styles, it is a bad playlist.
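As a rough illustration, the WTF test boils down to a simple tally. The sketch below assumes a human listener has already flagged the head-scratchers by ear; the `wtf_score` helper and the track names are made up for this example and are not part of any of the systems discussed here.

```python
def wtf_score(playlist, flagged):
    """Count the tracks in `playlist` that a listener flagged as head-scratchers."""
    flagged = set(flagged)
    return sum(1 for track in playlist if track in flagged)


# Hypothetical playlist for a jazz seed; the names are placeholders.
playlist = ["jazz track 1", "jazz track 2", "stadium rock track", "jazz track 3"]

# The flags come from a person's judgement, not from an algorithm.
flagged = {"stadium rock track"}

score = wtf_score(playlist, flagged)
print(f"WTF score: {score}")  # the higher the count, the worse the playlist
```

The only hard part is the human judgement that produces the flags; the counting itself is trivial, which is why the test is so easy to apply.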

For this evaluation, I took my personal collection of music (about 7,800 tracks) and enrolled it into three systems: Google Music, iTunes and The Echo Nest. I then created a set of playlists using each system and counted the WTFs for each playlist. I picked seed songs based on my music taste (it is my collection of music so it seemed like a natural place to start).

The Systems

I compared three systems: iTunes Genius, Google Instant Mix, and The Echo Nest playlisting API. All of them are black box algorithms, but we do know a little bit about them:

iTunes Genius – this system seems to be a collaborative filtering algorithm driven from purchase data acquired via the iTunes music store. It may use play, skip and ratings to steer the playlisting engine. More details about the system can be found in: Smarter than Genius? Human Evaluation of Music Recommender Systems. This is a one button system – there are no user-accessible controls that affect the playlisting algorithm.

Google Instant Mix – there is no data published on how this system works. It appears to be a hybrid system that uses collaborative filtering data along with acoustic similarity data. Since Google Music does give attribution to Gracenote, there is a possibility that some of Gracenote’s data is used in generating playlists. This is a one button system. There are no user-accessible controls that affect the playlisting algorithm.

The Echo Nest playlist engine – this is a hybrid system that uses cultural, collaborative filtering data and acoustic data to build the playlist. The cultural data is gleaned from a deep crawl of the web. The playlisting engine takes into account artist popularity, familiarity, cultural similarity, and acoustic similarity along with a number of other attributes. There are a number of controls that can be set to control the playlists: variety, adventurousness, style, mood, energy. For this evaluation, the playlist engine was configured to create playlists with relatively low variety with songs by mostly mainstream artists. The configuration of the engine was not changed once the test was started.

The Collection

For this evaluation I’ve used my personal iTunes music collection of about 7,800 songs. I think it is a fairly typical music collection. It has music of a wide variety of styles. It contains music of my taste (70s progrock and other dad-core, indie and numetal), music from my kids (radio pop, musicals), some indie, jazz, and a whole bunch of Canadian music from my friend Steve. There’s also a bunch of podcasts. It has the usual set of metadata screwups that you see in real-life collections (3 different spellings of Björk for example). I’ve placed a listing of all the music in the collection at Paul’s Music Collection if you are interested in all of the details.

The Caveats

Although I’ve tried my best to be objective, I clearly have a vested interest in the outcome of this evaluation. I work for a company that has its own playlisting technology. I have friends that work for Google. I like Apple products. So feel free to be skeptical about my results. I will try to do a few things to make it clear that I did not fudge things. I’ll show screenshots of results from the 3 playlisting sources, as opposed to just listing songs. (I’m too lazy to try to fake screenshots). I’ll also give the API command I used for the Echo Nest playlists so you can generate those results yourself. Still, I won’t blame the skeptics. I encourage anyone to try a similar A/B/C evaluation on their own collection so we can compare results.

The Trials

For each trial, I picked a seed song, generated a 25 song playlist using each system, and counted the WTFs in each list. I show the results as screenshots from each system and I mark each WTF that I see with a red dot.

Trial #1 – Miles Davis – Kind of Blue

I don’t have a whole lot of Jazz in my collection, so I thought this would be a good test to see if a playlister could find the Jazz amidst all the other stuff.

First up is iTunes Genius

This looks like an excellent mix. All jazz artists. The closest things to a WTF are the Blood, Sweat and Tears tracks, which are Jazz-Rock fusion, and the Norah Jones tracks, which are more coffee house, but none of these rises to the WTF level. Well done iTunes! WTF score: 0

Next up is The Echo Nest.

As with iTunes, the Echo Nest playlist has no WTFs, all hardcore jazz. I’d be pretty happy with this playlist, especially considering the limited amount of Jazz in my collection. I think this playlist may even be a bit better than the iTunes playlist. It is a bit more hardcore Jazz. If you are listening to Miles Davis, Norah Jones may not be for you. Well done Echo Nest. WTF score: 0

If you want to generate a similar playlist via our api use this API command:

Next up is Google:

I’ve marked the playlist with red dots on the songs that I consider to be WTF songs. There are 18(!) songs on this 25 song playlist that are not justifiable. There’s electronica, rock, folk, Victorian era brass band and Coldplay. Yes, that’s right, there’s Coldplay on a Miles Davis playlist. WTF score: 18

Trial #2 – Lady Gaga – Bad Romance

Now, let’s move away from Jazz into mainstream pop. Again, I don’t have too much pop in my music collection. Mostly it is from my daughter, but we don’t mix our music collections too much any more.

First up is iTunes:

iTunes falls down a bit here. There are 2 WTFs on the playlist. Iron & Wine and Jack Johnson both seem to be particularly bad fits. There are a few others that seem questionable. There’s a Coldplay vibe to the whole list, with U2, Muse, Mute Math on the list. I suspect this strange connection is due to the Twilight soundtracks that may appeal to the Lady Gaga demographic. Since iTunes relates artists based on sales, those that bought Lady Gaga and the Twilight albums would establish a connection between these two somewhat disparate types of music. But this is just a guess. WTF Score: 2

Next up: The Echo Nest

This looks like a good mix of pop music, with some theatrics, some diva, and mostly mainstream radio (I was really surprised to see all this pop music in my collection). I’m not so sure about the Vampire Weekend track, but since I gave VW a pass on the iTunes list, I’ll give it a pass here too. WTF Score: 0

Next up, Google Instant Mix

Google’s Instant Mix for Lady Gaga’s Bad Romance seems filled with non sequiturs. Tracks by Dave Brubeck (cool jazz) and Maynard Ferguson (big band jazz) are mixed in with tracks by Ice Cube and They Might Be Giants. The most appropriate track in the playlist is a 20 year old track by Madonna. I think I was pretty lenient in counting WTFs on this one. Even then, it scores pretty poorly. WTF Score: 13

Trial #3 – The Nice – Rondo

Next up is some good ol’ progressive rock. The Nice was an early progressive rock band fronted by Keith Emerson (of Emerson Lake and Palmer fame). It is hardcore late 60s style progressive rock – keyboard heavy, frequent tempo and time signature changes, high speed, bull whips, damn the vocals stuff. This particular song is a cover of Brubeck’s Blue Rondo a la Turk. It is one of my favorite songs of all time. Really you should have a listen. I’ll wait. I have lots of music like this in my collection. It should be pretty easy to generate playlists that keep me happy with this seed.

First up: iTunes:

That’s a pretty awesome playlist. I’d listen to it. The closest we get to a clunker is a Beach Boys track. I give it a pass since it is from the right era, and the Beach Boys were experimental in their own way. WTF Score: 0

Next up is The Nest:

Another fine playlist. I actually like this one better than the iTunes list since it bubbles up some Rick Wakeman, making the playlist much more keyboard heavy (which is what I like). The Supertramp track is a stretch, but not in WTF territory. WTF Score: 0

Next up is Google Instant Mix:

I would not like to listen to this playlist. It has a number of songs that are just too far out. ABBA and Simon & Garfunkel are WTF enough, but this playlist takes WTF three steps further. First offense: including a song with the same title more than once. This playlist has two versions of ‘Side A-Popcorn’. That’s a no-no in playlisting (except for cover playlists). Next offense is the song ‘I Think I Love You’ by The Partridge Family. This track was not in my collection. It was one of the free tracks that Google gave me when I signed up. 70s bubblegum pop doesn’t belong on this list. However, as bad as The Partridge Family song is, it is not the worst track on the playlist. That award goes to ‘FM 2.0: The future of Internet Radio’. Yep, Instant Mix decided that we should conclude a prog rock playlist with an hour long panel about the future of online music. That’s a big WTF. I can’t imagine what algorithm would have led to that choice. Google really deserves extra WTF points for these gaffes, but I’ll be kind. WTF Score: 11

First up, iTunes.

Next up, The Echo Nest

Another solid playlist, No WTFs. It is a bit more vocal heavy than the iTunes playlist. I think I prefer the iTunes version a bit more because of that. Still, nothing to complain about here: WTF Score: 0

Next Up Google

After listening to this playlist, I am starting to wonder if Google is just messing with us. They could do so much better by selecting songs at random within a top level genre than what they are doing now. This playlist only has 6 songs that can be considered OK, the rest are totally WTF. WTF Score: 18

Trial #5 The Beatles – Polythene Pam

For the last trial I chose the song Polythene Pam by The Beatles. It is at the core of the amazing bit on side two of Abbey Road. The zenith of the Beatles’ music is (IMHO) the opening chords to this song. Let’s see how everyone does:

First up: iTunes

iTunes gets a bit WTF here. They can’t offer any recommendations based upon this song. This is totally puzzling to me since The Beatles have been available in the iTunes store for quite a while now. I tried to generate playlists seeded with many different Beatles songs and was not able to generate one playlist. Totally WTF. I think that not being able to generate a playlist for any Beatles song as seed should be worth at least 10 WTF points. WTF Score: 10

Next Up: The Echo Nest

No worries with The Echo Nest playlist. Probably not the most creative playlist, but quite serviceable. WTF Score: 0

Next up Google

Instant Mix scores better on this playlist than it has on the other four. That’s not because I think they did a better job on this playlist, it is just that since the Beatles cover such a wide range of music styles, it is not hard to make a justification for just about any song. Still, I do like the variety in this playlist. There are just two WTFs on this playlist. WTF Score: 2.

Conclusions

I learned quite a bit during this evaluation. First of all, Apple Genius is actually quite good. The last time I took a close look at iTunes Genius was 3 years ago. It was generating pretty poor recommendations. Today, however, Genius is generating reliable recommendations for just about any track I could throw at it, with the notable exception of Beatles tracks.

I was also quite pleased to see how well The Echo Nest playlister performed. Our playlist engine is designed to work with extremely large collections (10 million tracks) or with personal sized collections. It has lots of options to allow you to control all sorts of aspects of the playlisting. I was glad to see that even when operating in the very constrained situation of a single seed song, with no user feedback, it performed well. I am certainly not an unbiased observer, so I hope that anyone who cares enough about this stuff will try to create their own playlists with The Echo Nest API and make their own judgements. The API docs are here: The Echo Nest Playlist API.

However, the biggest surprise of all in this evaluation is how poorly Google’s Instant Mix performed. Nearly half of all songs in Instant Mix playlists were head scratchers – songs that just didn’t belong in the playlist. These playlists were not usable. It is a bit of a puzzle as to why the playlists are so bad considering all of the smart people at Google. Google does say that this release is a Beta, so we can give them a little leeway here. And I certainly wouldn’t count Google out here. They are data kings, and once the data starts rolling from millions of users, you can bet that their playlists will improve over time, just like Apple’s did. Still, when Paul Joyce said that the Music Beta killer feature is ‘Instant Mix’, I wonder if perhaps what he meant to say was “the feature that kills Google Music is ‘Instant Mix’.”
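For anyone who wants to check the ‘nearly half’ figure, here is the arithmetic as a short sketch. The scores are the Google Instant Mix WTF counts reported in the five trials above, each on a 25 song playlist; the variable names are just for illustration.

```python
# Google Instant Mix WTF scores from trials 1 through 5, as reported above.
google_wtf_scores = [18, 13, 11, 18, 2]
PLAYLIST_LENGTH = 25  # each trial generated a 25 song playlist

total_wtfs = sum(google_wtf_scores)
total_tracks = PLAYLIST_LENGTH * len(google_wtf_scores)
wtf_rate = total_wtfs / total_tracks

print(f"{total_wtfs} of {total_tracks} tracks ({wtf_rate:.1%})")
# prints: 62 of 125 tracks (49.6%)
```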

This entry was posted on May 14, 2011, 11:26 am and is filed under Music, playlist, The Echo Nest. You can follow any responses to this entry through RSS 2.0.
Both comments and pings are currently closed.

Ah, excellent! Thank you for doing this evaluation. Exactly what I wanted to see. Yeah, when I saw the whole Google Music announcement, what struck me was the same comment: Google’s Paul Joyce revealed that the Music Beta killer feature is ‘Instant Mix,’

This was great to hear, because the rest of the music service seemed to be about streaming from the cloud. Which doesn’t really make sense to me. My ipod/touch/iphone is just as portable as any cloud device, works even when I have no signal on the airplane. And while it may not let me play music from any computer, pretty much every computer that I have has speakers with a minijack plug, which I can plug into my ipod in about 10 seconds. Thus, I again can have my music while in front of any device. With that same minijack and a $7 radio shack cord, I can connect into the aux input on my car stereo, too.

And I already have backup, through syncing with my computer. Which itself I can backup even further with an external hard drive, and can also back up to any number of cloud storage providers, should I choose. So I’m not saying the cloud doesn’t make sense. I’m saying cloud streaming doesn’t make sense.

Because I already have extreme portability, flexibility, and reliability.

So Google is an information retrieval, “organize the world’s information” company, right? It would seem to me that their forte would be on the MIR aspects of things, the smart playlisting. And they even claim as much, as you noted with Paul Joyce. So to see almost half of every playlist filled with WTFs really boggles the mind. Like you said:

However, the biggest surprise of all in this evaluation is how poorly Google’s Instant Mix performed. Nearly half of all songs in Instant Mix playlists were head scratchers – songs that just didn’t belong in the playlist. These playlists were not usable. It is a bit of a puzzle as to why the playlists are so bad considering all of the smart people at Google. Google does say that this release is a Beta, so we can give them a little leeway here.

I don’t know. I don’t feel as inclined to give them this leeway. And that is because I have been on them for literally ten years to take this music IR thing seriously, both through talking with senior leadership, to inviting them to sponsor ISMIR (which they finally did for the first time last year, after half a decade). I don’t mean to disparage the ISMIRites that we know at Google, because I think they’re doing the best they can. But are they getting the corporate support they need? The resources? The philosophical mindshare? How much engineering effort, I would ask, is being poured into streaming music from the cloud (a service that arguably most people don’t really need, for the reasons I’ve listed above), versus for doing MIR in support of smart playlisting? I’ll bet at least 80% of the effort has been on the cloud streaming, rather than on the playlisting. And that’s why I don’t want to give them leeway, because that doesn’t make sense to have done that, if Google really is an Information (“organize the world’s information”) Retrieval company. If they’re a cloud technology company, and are more interested in showing their cloud prowess, then sure. That’s the thing to build. But look at their corporate philosophy, their very first, Rule #1:

“Focus on the user and all else will follow. Since the beginning, we’ve focused on providing the best user experience possible. Whether we’re designing a new Internet browser or a new tweak to the look of the homepage [insert: or building a new music information retrieval service], we take great care to ensure that they will ultimately serve you, rather than our own internal goal or bottom line.”

Streaming from the cloud is not focusing on the user. Given that the user already has an Ipod, Zune, or Rio PMP300 (http://en.wikipedia.org/wiki/Rio_PMP300), which are already capable of being used on the subway, in cars, on airplanes, or in front of any computer with minijack speakers (translation: pretty much every one), what the user needs is something that he or she couldn’t do already, aka content-aware, smart playlisting. And that is deliverable without cloud storage and streaming. That you can do by scrobbling and local machine song-IDing, combined with cloud processing (translation: Google buying their own copies of all this music, and extracting features from those.) Nothing needs to be uploaded from the user other than the scrobble. Nothing needs to be downloaded from Google other than the recommended songID.

And yet Google probably spent 80% of their engineering effort on the streaming. Ten years after I started telling them to look at the MIR side of things. No, I can’t grant leeway, if that’s the approach they want to take. Hrmph.

i found this really interesting, and agree the WTF test is the most straightforward high-throughput human-centric test, but unless you’re actually listening to the playlists, i think you’re missing something. actually hearing the transitions reveals a whole new set of jarring transitions that are really important to the listening experience, but most interesting to me, sometimes adds a FTW: a counterintuitive but perfect fit. i mostly do a ratings-weighted random shuffle with foobar2k, sometimes adding a genre sort on top, but i’m always most pleased when some sort of crazy transition (jazz to prog to classical to hip-hop to pop) comes together and makes perfect sense. this is really the best that computer playlists have to offer, when they allow you to discover new connections between songs and new ways of listening to your collection.

Great analysis Paul (as always).
You were a bit kinder to Google than I was (“Polythene Pam” to The Police’s “Walking on the Moon” is a bit of a jump for me, but to each his own), but it is sure nice to see these dudes laid out side by side.

Nice experiments! Unfortunately Google music beta is not yet available in Germany. I saw on the Music Beta YouTube video that one could also organize music by mood. I suspect they are (will be) using Gracenote’s mood technology. I got a confirmation from Gracenote that they are not using the mood technology in the current version.

I’m still using the MusicIP player, even though it’s been basically abandonware for the last 2 years. The music analyzing servers still function somehow and I’m either used to the mixes it makes, or they’re just good mixes.. :)

Great article but your WTF test logic is flawed. You need to listen to the head-scratchers first before deciding that it is a flawed playlist. The whole point of using a recommendation engine is to expose music that you would not otherwise have picked or chosen.

I’m not sure that fully addresses what MK is saying. I suppose it depends on what you believe a suggestion engine is supposed to do.

If you believe it should find music that is similar to a given song, then the WTF test makes perfect sense. But if you believe (as I do) that it should recommend songs that I might like whether or not they are similar, then the WTF test is meaningless.

Michael – The goal of a playlisting engine is not as MK says to “expose music that you would not otherwise have picked or chosen.” The goal of a playlister is to provide a good listening experience. Don’t confuse recommendation with playlisting. They are two different things.

Paul (#9), MK – I don’t think it’s necessarily fair to say the goal of a playlister is or isn’t recommendation, or is only to provide a “good listening experience”.

Really, it boils down to context: is the playlist intended for discovery, or for casual/background listening? In the case of automatic playlisters, this distinction is almost never clear, and services like Pandora tend to blur the line quite a bit. Last.fm does a good job of separating the two with the “Your Library Radio” and “Your Mix Radio”, but this seems pretty uncommon.

My $.02 is that a listening session for “discovery” will involve a high degree of listener attention, meaning that a surprise or “wtf” moment isn’t necessarily a bad thing because the listener is already focused on the music. Of course, if the wtf moment is just due to random crappiness in the algorithm, that’s a problem, but it shouldn’t be considered automatically bad.

Playlists for casual listening, on the other hand, should try to minimize user attention/frustration/abandonment, so that surprise/wtf moments are certainly bad, because the listener is probably disengaged and doesn’t want to be distracted by jarring transitions.

The tl;dr version: it’s not fair to evaluate a playlist completely absent of context, and here it’s not exactly clear what Google’s intended context is.

Brian M. – I’m not saying that a playlist can’t help you find music. I’m saying that what you do to create a playlist is different than what you would do to create a music recommendation. I still say that the listening experience is the central aspect of a playlister. For playlists that are generated within one’s own music collection, I don’t think the intention should be music discovery. I think we can assume that the user is familiar with much of the music in the collection.

I’ve a suggestion to extend the scoring of playlists beyond the WTF-metric. I’m into playlisting myself and one thing that strikes me is the presence of re-occurring artists in the playlists. For example: the playlists for Lady Gaga’s Bad Romance. iTunes Genius generates a playlist with 21 distinct artists. The echonest mechanism generates a playlist with several songs from Goldfrapp, Madonna, Vanessa Carlton, Beyoncé. The echonest playlist contains 12 different artists for less than 2 hours of music. (I’ve not counted them for the Google playlist, because there were only 11 non-WTF songs.) There are similar findings for the other playlists.

Sure, the number of WTF-songs should be minimized. But hearing the same artists all over again is also annoying/boring.

I’m not sure whether this is a suggestion to include an artist-spreading metric in playlist scoring, or rather an invitation to boost the echonest artist-spreading algorithm :)
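A minimal sketch of the artist-spread idea, to sit alongside the WTF count. The `(artist, title)` tuple layout, the helper names and the example tracks are all assumptions for illustration; none of the three systems is known to expose playlists in this form.

```python
def distinct_artists(playlist):
    """Number of different artists in a playlist of (artist, title) pairs."""
    # casefold() so that 'Madonna' and 'madonna' count as one artist
    return len({artist.casefold() for artist, _title in playlist})


def artist_spread(playlist):
    """Distinct artists divided by playlist length (1.0 means no repeats)."""
    return distinct_artists(playlist) / len(playlist) if playlist else 0.0


# Hypothetical four-track playlist; the song titles are placeholders.
playlist = [
    ("Madonna", "Song A"),
    ("Goldfrapp", "Song B"),
    ("Madonna", "Song C"),
    ("Beyoncé", "Song D"),
]

print(distinct_artists(playlist), artist_spread(playlist))
```

A low spread is not automatically bad, since some listeners want less variety, so a number like this is probably best reported next to the WTF score rather than folded into it.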

mathijs.biesmans – There’s an interesting tension between familiarity, variety, listener adventurousness and serendipity. Different listeners have different ideas of what the ideal mix of the new and the familiar is. With the Echo Nest Playlist API you can adjust the number of unique artists appearing in the playlist with the ‘variety’ parameter. For this experiment I chose a ‘low variety’ setting since most listeners fall into the ‘less adventurous’ listening category.

The WTF metric captures the coherence of a playlist. This is an interesting problem, but not the most important. The challenge is to get interesting playlists that score high on the WTF list.

I recently went through a playlist generated by a friend who has both considerable knowledge of music and arguably some knowledge about my taste. The WTF score would have been up in the 80% range. Yet the vast majority of these unknown tracks were ear-openers to me.

—

After years of using Pandora and other “discovery” services, my music diet had become stale. The leap from Feist to Birdy Nam Nam is just too large for such services to work. Yet I happen to believe that both are excellent.

Paul, this is really great work. As someone very interested in this sort of thing, but living beyond the borders of the US, I’ve been itching to see what this service was capable of since it was announced. It appears the answer is not much. Which is too bad really.

And regarding the comments on the WTF measure, it seems to me to be a very good measure for broad judgement, if for no other reason than the fact that it’s much easier to recommend something that’s ‘not bad’ versus ‘good’ or ‘great’. (At this point it basically goes without saying that I treat playlisting as a special case recommender[warning, that link is my thesis], mostly). This is not to say that it wouldn’t be interesting to examine artist/popularity/familiarity diversity plus neighbouring song smoothness or other more complicated measures, but this is clearly the place to start and given Google’s performance, at this point they almost certainly wouldn’t do well in other, more complicated measures. It’ll be interesting to see if the service improves considerably in the next few months, as they start to get user feedback…

Also, I have to admit to having basically written off Genius years ago. In light of these playlists, I think I may have to give it another go…

(Adding: I just heard a bit of an unreliable rumour that this service is based on some content-based sim, in isolation. Might explain the terrible, though it makes me wonder what sort of feature set was involved as some of those lists seem a stretch even for a culture blind content-based process)

Yeah sorry, my aside was poorly executed sarcasm. Though that said, using very little metadata should filter some of the more egregious problems in Paul’s examples, which makes me wonder about just how the combination is being done…

A lot of these comments above seem to re-highlight the need for recommendation transparency. I’m not talking about Pandorian transparency, where it tells you that a song was recommended because of female vocals or heavy syncopation. Nor even an Amazonian transparency in the form of users who listened to this also listened to that, therefore you get to listen to that, too.

No, I’m talking about overall optimization criterion transparency. Is the playlist being put together to satisfy mass expectation? Is it being put together to maximize surprise (“wreckommender?”) And how much drift is involved? Does the whole playlist try to stay true to the original seed, i.e. is it estimating prob(song_x | seed) for every song in the list? Or does it contextually update itself as it goes, i.e. prob(song_2 | seed), then prob(song_3 | seed, song_2), then prob(song_4 | seed, song_2, song_3), etc.?
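To make the two options concrete, here is a toy sketch. It is not how any of the three systems works: the one-dimensional ‘style’ values, the `similarity` function and both selection helpers are invented for illustration, with negative distance standing in for prob(song | context).

```python
def similarity(a, b, styles):
    """Toy stand-in for prob(song | context): closer styles score higher."""
    return -abs(styles[a] - styles[b])


def seed_only(seed, candidates, styles, n):
    """Rank every candidate against the seed alone: prob(song_x | seed)."""
    ranked = sorted(candidates, key=lambda s: similarity(seed, s, styles), reverse=True)
    return ranked[:n]


def contextual(seed, candidates, styles, n):
    """Pick each song against the seed plus everything chosen so far."""
    playlist, pool = [], list(candidates)
    while pool and len(playlist) < n:
        context = [seed] + playlist

        def score(song):
            # average similarity to the seed and all earlier picks
            return sum(similarity(c, song, styles) for c in context) / len(context)

        best = max(pool, key=score)
        playlist.append(best)
        pool.remove(best)
    return playlist


# Hypothetical songs placed on a single made-up style axis.
styles = {"seed": 0.0, "a": 1.0, "b": -1.5, "c": 2.0, "d": 3.0}
candidates = ["a", "b", "c", "d"]

print(seed_only("seed", candidates, styles, 3))   # ['a', 'b', 'c']
print(contextual("seed", candidates, styles, 3))  # ['a', 'c', 'd']
```

With these made-up numbers the seed-only list stays clustered around the seed, while the contextual list drifts steadily in one direction, which is exactly the kind of behaviour a listener would want to know about up front.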

Cool. I really like this comparison and the WTF counter. Of course, the other commenters are right regarding some of their criticisms. However, as Paul pointed out several times, playlisting is always judged very subjectively, and that is the key area where the Echo Nest playlist engine in particular can win. On the one hand, one can offer a preconfigured one button UI. On the other hand, one can offer a semi-expert to professional UI where a user is able to set different parameters as preferred. This always depends on the knowledge and the preferences of the individual user.

Finally, please also remember that all these playlists are generated on Paul’s own music collection. So, if he doesn’t have much music of a certain genre, it cannot recommend many “perfectly” fitting tracks for a seed song of this genre.

I disagree. Google Music Instant Mix is superior to iTunes Genius. I’ve been using iTunes Genius since it was released and I’ve been using Google Music for a couple of months, and I rarely use iTunes Genius anymore. Why Google Music is superior:
1. It surprises me. For some reason iTunes always plays “Keep the Car Running” by Arcade Fire in every mix. No surprises: it is almost the same songs each time (my collection is over 8000 songs).
2. iTunes cannot create a mix from a song at the edge of my collection, by artists that are very different from what I usually listen to. Google Music will slowly evolve the list and many times it makes great choices (Adele->Coldplay->Radiohead).
3. iTunes is unable to create mixes from new music. I got the new The Strokes album and I could not create a mix for weeks (even though I have all The Strokes albums and a lot of similar music).

arturo – thanks for the input. I see from your IP address that you may not be 100% impartial on the matter. If you are praising a system that is built by the company you work for, it is good form to mention that.

I guess it was implicit where I work, although I don’t work on Android. As a heavy user of iTunes Genius, I was bored with it and I prefer Google Music.

Music services should add some randomness to the mix and still work with just released music. iTunes fails in these areas. It is ridiculous that they still suck at it because they have all this data about their users.

I’m confused — google’s “instant mixes” seem pretty much absolutely random to me in almost all cases except very popular artists. And you think this is a good result? You think for the average use case of someone creating an instant mix of a Kraftwerk song they should see a Jack Johnson song? If so, why not just a “generate 25 random songs” button? Why have any intelligence at all?

As a fan of Kraftwerk, I would throw your product in the trash after I heard Jack Johnson in my “instant mix.” Do you really think customers of Music beta by Google want pure randomness when they click “Instant Mix?”

I honestly can’t imagine something farther from Kraftwerk than Jack Johnson. OK, maybe Randy Newman. Wait– no, he has a lot of repetition and also sings about the futility of humanism. Hmm… Erik Satie? No, I’m sure they ganked some of his themes in their later records… You know what, I am pretty sure after 10 minutes of thought that Jack Johnson is the worst possible person to put in a Kraftwerk mix. I challenge anyone to think of something worse.

Though if you really want combined novelty and similarity, I still maintain (as I have been maintaining for about a decade now) that cover songs / similar melodic themes / samples are a good way to blend familiar and novel. Try Senor Coconut’s cover of Autobahn, something that is entirely findable by modern, content-based Music IR techniques:

And in that manner you have a genre-spanning, novel playlist that is still tied back to the original seed song through a combination of audio analysis features and artist discography features. Something that is completely possible using today’s MIR techniques. Techniques that I maintain Google should have been developing for years, were it really a search company. (Are any of these songs in your collection, Paul? *Could* they have been added to the playlist?) There is no excuse for Jack Johnson.

I guess if Kraftwerk and Jack Johnson are in your collection, it is because you like them both. The engine is only playing your music :) The more eclectic and the smaller your collection, the harder it is to predict the music.

Google Music might have WTF moments, but this is better than not finding any songs or playing the same songs over and over again. Most of my Music is full albums and I own multiple albums per artist. Google Music only had problems with a couple of songs where I only had one song for the given artist.

iTunes has big problems that I already outlined. The WTF test is very limited and not enough for people to make a decision. iTunes’ decision engine is very conservative and boring.

Jeremy, there have been many things you have said/written that I have found informative and interesting/illuminating over the years, but this cover of Autobahn is, quite possibly THE BEST THING EVER. (back to playlists…)

My gut feel in this case is that 99% of the “Auto-generated Playlists (from my own collection)” are going to be built because I am working/going for a walk/driving/cooking breakfast and I don’t want to be bothered queuing up every track like a college DJ. I just want a flowing mix of stuff I already know I like to create a non-jarring listening experience.

To be honest the Kraftwerk/Afrika Bambaataa/Coldplay/Senor Coconut playlist would have me diving for the FFWD button. In most of my listening scenarios (from my own collection) I am not looking to broaden my horizons with new adventure, I am trying to make pancakes for my kid or looking for a walk across town without having to dig into my parka to fast-forward my MP3 player with Thinsulate gloves on.

Serendipity plays much less of a role in my auto-playlists than it does in my “Recommendation” explorations (where I’m trying to find something new from outside my collection), so please don’t be too clever with your “The rhythm section of the b-side of this cut was sampled in this other hip-hop groove from 1:45-1:47…LISTEN FOR IT!”…I’m just looking for a good (sensible/easy) listening experience. Please don’t take me to a master class in minor chord sonority.

I’m just trying to make goddam pancakes and can’t fast-forward with batter on my fingers.

To be honest the Kraftwerk/Afrika Bambaataa/Coldplay/Senor Coconut playlist would have me diving for the FFWD button. In most of my listening scenarios (from my own collection) I am not looking to broaden my horizons with new adventure, I am trying to make pancakes for my kid or looking for a walk across town without having to dig into my parka to fast-forward my MP3 player with Thinsulate gloves on.

Right, Zac, so I was talking to Arturo and his information needs. He wanted a degree of “Android” (as opposed to iPod?) whiplash in his listening. I was trying to explain to him how one might have that whiplash that he craves, without simply doing Math.rand(). He could have the novelty/heterogeneity of Kraftwerk, Senor Coconut, Coldplay, and Afrika Bambaataa, while at the same time the familiarity of various hooks and lyrics. One could do the same thing with rhythms.. walk through a variety of musical genres, while keeping the rhythmic style similar. Hold one dimension constant, while varying others. Remember, this is still all (supposedly) on his own music, and so he at least knows all the songs.

You personally might not want this, and that’s fine; I wasn’t speaking to your information need; I was speaking to his. And from what I understand of Echo Nest, it gives you that control that you want, to hold as many dimensions constant (or low variance) as possible, so as to give you your wafflemaking playlist. Already right there, that’s more than you can get from Google Music or from Apple Genius. So an Echo Nest playlist with Paul’s default (low variance) parameters would be perfect for you. Fine.

I’m just trying to explain that it doesn’t have to be all or nothing — completely boring or completely random. You can do an intelligent middle ground.
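
To make the "hold one dimension constant, while varying others" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how Echo Nest, Google, or Apple actually build playlists: the `Track` fields, the use of tempo as a stand-in for "rhythmic style", the `rhythm_anchored_mix` name, and the tolerance value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    title: str
    genre: str
    tempo_bpm: float  # assumed stand-in for "rhythm"; a real system would use richer audio features


def rhythm_anchored_mix(seed, library, length=5, tempo_tolerance=8.0):
    """Hold tempo roughly constant while walking through as many genres as possible."""
    # Dimension held constant: only keep tracks whose tempo is close to the seed's.
    candidates = [
        t for t in library
        if t != seed and abs(t.tempo_bpm - seed.tempo_bpm) <= tempo_tolerance
    ]
    mix = [seed]
    used_genres = {seed.genre}
    while candidates and len(mix) < length:
        # Dimension varied: prefer genres not yet in the mix (False sorts before True),
        # then break ties by closeness to the seed tempo.
        candidates.sort(
            key=lambda t: (t.genre in used_genres, abs(t.tempo_bpm - seed.tempo_bpm))
        )
        pick = candidates.pop(0)
        mix.append(pick)
        used_genres.add(pick.genre)
    return mix
```

Swapping the roles (filter on genre, vary tempo) would give the low-variance "pancake" playlist instead, which is the sense in which the two listening needs are the same machinery with different settings.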

Why the heck didn’t you use a blind in your experiments? All of these results are invalid.

Can someone please repeat this with good experimental design? Get a bunch of people who have similar knowledge in music to the author and get experimenters to give them the lists. Get them to mark the WTF tracks and then get the original track generators (not the experiment conductors) to map them back to the services.

I mean I guess this is more of a review than an experiment or study, but there is room here for a neat experiment.
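
The blinding step proposed above could be sketched roughly as follows. This is a hypothetical helper, not anything used in the original post: the function names, the "Playlist N" labels, and the idea of handing the key to a third party are assumptions for illustration.

```python
import random


def blind_playlists(playlists_by_service, rng=None):
    """Relabel playlists with neutral names; return (blinded playlists, secret key)."""
    rng = rng or random.Random()
    services = list(playlists_by_service)
    rng.shuffle(services)  # randomize which service gets which neutral label
    # The key maps neutral label -> real service and stays with whoever generated the lists.
    key = {f"Playlist {i + 1}": service for i, service in enumerate(services)}
    blinded = {label: list(playlists_by_service[service]) for label, service in key.items()}
    return blinded, key


def unblind_counts(wtf_counts_by_label, key):
    """Map the raters' per-label WTF counts back to the services after rating is finished."""
    return {key[label]: count for label, count in wtf_counts_by_label.items()}
```

Raters only ever see the `blinded` dictionary; the person holding `key` applies `unblind_counts` once all the WTF marks are in.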

All the Beatles albums were released at the same time. I imagine that when it’s compiling its playlists from purchase lists, there would be a lot of multiple Beatles purchases at the same time. And there might be a limit as to how many songs by the same artist are allowed, for variety’s sake. And think of the scale of people buying multiple Beatles albums. That’s how I imagine it crashed. Give it time for The Beatles to work into their system.

The thing is, Google isn’t using musicians or recommendations or anything like that; it’s playing the music users upload through its machine learning system to learn what music sounds like and pick music it thinks sounds similar. It will have started with a training set of chosen music to generate initial rules and will now go through the uploaded user music to apply what it’s learned (the initial mixes), listen to more music to generate new rules (BOC sounds like Cyndi Lauper) and perhaps use user behaviour (every time we put Sinatra after BOC users skip the Sinatra) to refine this. It’s utterly different from a curated system.

It would have been interesting to see a “baseline WTF” based on some sort of shuffle within the given genre. And for those commenters wanting more of the unexpected… that’s what shuffle is for… agree that a playlist generated is trying to add coherence and consistency. (And a good recommendation/discovery service like Pandora needs to go outside of your library.)
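
A "baseline WTF" playlist of the kind suggested here could be generated with something as simple as the sketch below, assuming each track in the collection carries a genre tag. The function name and the `(title, genre)` tuple layout are hypothetical; the WTF counting itself would still be done by a human listener.

```python
import random


def genre_shuffle_baseline(seed, library, length=25, rng=None):
    """Baseline playlist: the seed followed by a random draw from the same genre.

    `seed` and every item of `library` are (title, genre) tuples.
    """
    rng = rng or random.Random()
    seed_genre = seed[1]
    # Restrict the pool to the seed's genre, excluding the seed itself.
    pool = [track for track in library if track[1] == seed_genre and track != seed]
    rng.shuffle(pool)
    return [seed] + pool[: length - 1]
```

If a playlisting engine cannot beat the WTF count of this baseline on the same collection, its "intelligence" is not adding much.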

Honestly, I wish google would spend a little more (no, a lot more) time improving their core search. They really seem to be the new Microsoft jumping into everything while their core product devolves into Vista. Search and the web as google likes it is slowly decaying…

My experience of iTunes Genius is that it’s useful, but seems to be just grabbing a few similar artists based on the selected artist (not the track!) and then grabbing popular songs by them, a simple process with good results helped immensely by the fact that it’s choosing them from your own collection where you presumably like most tracks. Not that getting a workable result with a simple algorithm isn’t cool (quite the opposite), but it does have its weak points, as shown by your first list, which is clearly terrible compared with your own API’s results. It also seems likely, in my experience, that you’d get roughly that same playlist for any song by any of the highly popular artists on it, which is an even worse result. Also, despite having a collection that’s about as eclectic as your own, I feel that it sticks to the safer choices, leaving large sections simply untouched over repeated playlists.

As for Google’s rather bizarre showing, I don’t personally like playlists based on “how songs sound”, perhaps because my tastes in music are varied and I like that variety to be expressed even within a single playlist, but I wouldn’t rule out the fact that it works better on smaller, more homogeneous collections and that Google has the data to show that people actually have that kind of collection. They’ve never seemed afraid of weird corner cases if they can easily and cheaply get 80% of the result in most cases by throwing computers at the data. Again they have the advantage of limiting the amount of space you have which means most people will have small, focused collections which means even truly random playlists are a plausible option for pleasing the user. If they can nudge the randomness down a notch or two then that might suit me better than Apple trying to nudge their algorithm towards more randomness. On the other hand your own API seems to be doing a great job so I might just go and read up on that.

Paul, at the risk of commenting too many times on your post, can I just ask Dave one thing? Dave, if you really feel that way about playlists (that you like variety, not based on how songs sound), and if you think that 80% of people have homogeneous collections and that Google knows this, and you think that the combination of global personal collection homogeneity plus local variety gives the best user experience: Then why does Google Music exist? This is a real question, and I asked it in comment #1 above. What’s the benefit of having Google Music? My iPod is already just as portable, if not MORE portable, than any wireless Android device. It’ll take a $7 cord to plug my Android device into my car’s aux speaker, just like it’ll take a $7 cord to plug my iPod into my car’s aux speaker. Best of all, the iPod has this “killer feature” playlist generation device called… wait for it… shuffle! You can shuffle by song, you can shuffle by artist. And you don’t have to have an expensive data plan to connect you to the cloud in order for it to work. You can even use it on airplanes, and underground. It’s amazing! It gives you the variety that you desire, not based on how songs sound, within the homogeneity of your own collection.

So (1) If there is no need for cloud-streamed music, and (2) no need for smart playlisting, because homogeneous collection (which 80% of users have) shuffle solves that problem, then why did Google even create anything having to do with music in the first place?

Jeremy – all good points, but I don’t think everyone can carry all their music on their device. For most people, their phone is now their music player. It is filled with apps, video, audiobooks, photos, movies, and more apps. People can’t carry 10K songs in their pocket anymore. They are lucky if they can carry 1K songs. This means they have to pick and choose which songs to bring and which to leave home. Keeping one’s music in the cloud does solve a real problem. It lets people access all their music at almost any time. It also can deal with the device shifting issue. I want to talk on my phone, but my family wants to listen to that Katy Perry album. If my apple TV or my Tivo can play my music from the locker, I don’t have to deprive my family of Katy Perry while I order pizza. I do think that in the long term, music subscription services will win, but music lockers like Google’s will serve a useful function for many music fans for a long time.

My apple TV already plays music from my locker.. my computer locker. Not my cloud locker. When I’m at home, I already have a hard drive that’s large enough to hold anything, and my apple TV already streams it. Again, with no need to upload it all the way to the cloud, before being able to stream it to the apple TV in any room. And when I’m on the road, in a hotel room, it’s not like I have my apple TV with me anyway.

But ok, maybe you have a point for people who use their phone as their mp3 player. I don’t, but maybe I’m the minority nowadays. Still, doesn’t storage double every year or two? 64GB, 128GB. Local storage will always outpace network bandwidth, especially wireless network bandwidth. How much money do you want to be spending on your wireless dataplan, anyway? That’s a real cost.

And in the meantime? Seems like the more interesting solution would not be to create a whole energy-sucking, environment-impacting data center to hold trillions of songs. No, the more interesting, more green, solution would be to come up with device sync software that did predictive syncing. Say you can only fit 50% of your collection onto your device (at least for the next 9 months, until the new device comes out). Which 50% should that be? My solution? Use Information Retrieval and predictive analytics to basically create a “50% of collection” pseudo playlist. Based on the last 250 songs that the user has listened to, determine which 50% of the collection should be present on the device. Maybe even do some interesting time-of-year analysis, so that the Christmas songs are more likely to be synched to the device starting around early November, the Creedence Clearwater Revival songs are more likely to be synched to the device starting in summer, and (if I’m, say, George Tzan.) the whale song recordings are never synched, at least as music, simply because they also happen to be mp3s, etc.

I mean, as music Information Retrieval (rather than just music informatics) people, we know that music listening follows a Zipfian distribution. There are some songs that get listened to a lot, a long tail of songs that only have 2 plays, an even longer tail of songs that only have 1 play, and perhaps even a longer tail of songs that you ripped (or Napster-ed), but haven’t even listened to once. The statistics of that distribution, combined with recent plays, combined with time-of-year analysis lets us do “smart” synching that keeps the device full with all the songs that you really would have listened to anyway. And lets you not worry about the 50% that isn’t there (at least until the next sync later tonight when you get home from work), because there is a very high probability that you wouldn’t have listened to them anyway (Christmas songs in July, all those songs with 0 plays, etc.).

This sort of information organization and retrieval is *supposed* to be what Goog is all about. And is more interesting, anyway, as it is not unrelated, broadly, to playlisting. What you learn from smart synching you can apply to smart playlisting. And is greener than building a whole new datacenter, too. Much greener.
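
The "smart synching" idea described above (play counts, recent listens, and time of year deciding which part of the collection goes on the device) could be sketched like this. It is a toy illustration only: the `LibraryTrack` fields, the boost of 10 for recent plays, and the 0.1 out-of-season penalty are arbitrary assumptions, and a real system would fit such weights from listening data.

```python
from dataclasses import dataclass


@dataclass
class LibraryTrack:
    title: str
    play_count: int
    size_mb: float
    recently_played: bool = False           # e.g. among the last 250 listens
    season_months: frozenset = frozenset()  # months when the track is "in season"; empty = all year


def sync_score(track, month):
    """Higher score = more likely to be listened to before the next sync."""
    score = float(track.play_count)
    if track.recently_played:
        score += 10.0  # assumed boost for recent listens
    if track.season_months and month not in track.season_months:
        score *= 0.1   # assumed penalty, e.g. Christmas songs in July
    return score


def choose_tracks_to_sync(library, capacity_mb, month):
    """Greedily fill the device with the highest-scoring tracks that still fit."""
    ranked = sorted(library, key=lambda t: sync_score(t, month), reverse=True)
    chosen, used_mb = [], 0.0
    for track in ranked:
        if used_mb + track.size_mb <= capacity_mb:
            chosen.append(track)
            used_mb += track.size_mb
    return chosen
```

Because plays follow a long-tailed distribution, even this crude ranking tends to put the heavily played head of the collection on the device and leave the zero-play tail at home.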

Unless it’s been said already and I missed it in the comments (not yet enough caffeine…) I can’t help but believe that Google’s Instant Mix will improve significantly over time, in a similar way as did google’s voice recognition, spam blocking, ad placement, etc.

My guess is that Google’s algorithms are tracking and tweaking along the way and eventually, the playlist mixes might get scary good.

How meaningful do you consider your conclusions to be, given that you’re largely comparing a 3+ year old product with one just released in beta? I mean, I get your point, but I just don’t see how it will remain relevant for more than a handful of days, which makes me wonder why you bothered.

First impressions are important for music. Note that although Echo Nest has been around since 2005, we did not release our playlisting APIs until 2010. And they were just as good on launch as they are now. We would never release something this terrible and let it “improve over time.” You can’t treat music like you do any other data platform, it’s very personal and more important to people than you’d think.

The fact that Google would release this — even as a “beta” — is very confusing. I know a lot of the guys on that team, they are all guys I’d kill to work with and I know they love music as much as we do. But the stuff I see in my instant mixes is embarrassing and I can’t imagine how it got released. A team as tiny as ours puts a ton of effort into QA and we take every WTF seriously before releasing anything.

This is not some competitive thing, we don’t tangle with google day on any front. I’m more pissed off as a fan of music and music intelligence approaches to see such a bad first effort from these guys.

The fact that Google would release this — even as a “beta” — is very confusing. I know a lot of the guys on that team, they are all guys I’d kill to work with and I know they love music as much as we do. But the stuff I see in my instant mixes is embarrassing and I can’t imagine how it got released. A team as tiny as ours puts a ton of effort into QA and we take every WTF seriously before releasing anything.

That’s why I was wondering, in comment #1 above: “I don’t mean to disparage the ISMIRites that we know at Google, because I think they’re doing the best they can. But are they getting the corporate support they need? The resources? The philosophical mindshare? How much engineering effort, I would ask, is being poured into streaming music from the cloud (a service that arguably most people don’t really need, for the reasons I’ve listed above), versus for doing MIR in support of smart playlisting?”

That’s at least half of what has me upset about the whole cloud streaming aspect of Google Music. In addition to the fact that I see 9 reasons why a standalone iPod serves the user better for every 1 reason that cloud streaming serves the user better, I get the feeling that the cloud streaming, not the music intelligence (retrieval and organization) was the priority. Almost like they decided to build the cloud streaming, first, and then later came along and said, ok, how do we make this a little bit sexy. Like the smart playlisting was an afterthought, rather than the core priority. Not an afterthought for those ISMIRite Googlers, but for the people responsible for giving them resources and directing the project at a higher level. Cloud streaming and smart playlisting as two separate concepts, both from an MIR as well as from an engineering standpoint. Were music retrieval and organization really a priority, Google could have been doing smart playlisting for years, without having to have a cloud streaming service. It’s a mistake, in my opinion, to conflate the two.

As you say: “This is not some competitive thing, we don’t tangle with google day on any front. I’m more pissed off as a fan of music and music intelligence approaches to see such a bad first effort from these guys.” I feel like the reason there was such a bad first effort is that most of the engineering got put onto cloud streaming. Not onto the music intelligence.

This is interesting work. The one other thing I’d be curious to see (hear?) is how well they do with other languages. I’ve seen other systems work great with American English (or in many of your cases, British English!) and then just give up when you throw a little Japanese at them.

an update– something has changed recently, in the past 12 hours. I tried to make a mix today and things are very different. What’s happening now is popular artists (Led Zeppelin, Beatles) are getting very very narrowly focused playlists. For example, an Instant Mix for “Stairway to Heaven” is almost all Led Zeppelin and Rolling Stones and Aerosmith. And any artist not very popular (I tried a bunch — FS Blumm, Supersilent, Siriusmo, my own music) – all return the “not enough songs to make a playlist” error.

I am guessing Google has turned off acoustic matching and is now only showing results they have CF or metadata (genre) matches for. Can’t confirm that. But if true, it’s clear they have very little metadata / cultural data, which is very weird for these guys. It’s definitely triggered on artist metadata now. They “know” about Keith Fullerton Whitman and give more or less the same results for his more popular earlier stuff as some unreleased tracks I have in my collection that they can’t know about.

Opinions will differ– whereas as of right now Instant Mixes have far less “WTFs” for popular artists, they’ve done something much worse in my opinion, they’ve given up on anything that is not popular to make things less “interesting” (maybe in light of this press? That would be a hoot.) But bubbling up of less popular stuff is the entire moral point of music retrieval and music intelligence, and shutting it off is offensive to anyone that cares about this stuff.

I would like to point out that in my experience, Apple’s Genius is based on more than music sales through the store. I have a copious amount of live recordings and mashups in my collection and tracks from both get placed in Genius results frequently, and correctly. I believe when you allow Genius to look at your collection it truly does look at everything.

IIRC, John Siracusa was complaining a lot at the beginning of Genius that the feature would be useless for collections that weren’t mostly iTMS purchases, and this has not been the case at all.

Great article. Very interesting. I’ve never heard of Echo Nest before and came across this post via Daring Fireball. I’m a web designer that codes HTML, CSS, WordPress, a little Javascript and a little PHP. I know you provide documents for the Echo Nest API, but have you thought of, or do you know of, a simple how-to article using your own iTunes XML file as the catalog? I’d love to play around with this but no matter what I type, when trying to create a catalog, I get a 405: Method Not Allowed error. I would love a quick how-to or beginner’s guide.

Two things:
1. Google has been returning HORRIBLE results for years, especially when it comes to music.

2. They really must be dumbfucks: after having failed completely (and they deserved it) with Google Wave because of their stupid vip invite system.
As soon as the Music Beta invites were open I sent two emails to receive access to it. This week I had to write a long paper for thousands of students, readers, music professionals and friends on the available cloud music players/collections, from browser solutions like mSpot or Joomla to hybrid apps like Spotify… and Music Beta is not going to be part of it as a viable alternative.

How stupid must you be to make the same mistake twice with that much budget?

The Beatles- Polythene Pam
Led Zeppelin- In The Light
The Who- The Song Is Over
The Kinks- Do It Again
The Rolling Stones- 2000 Light Years from Home
Buffalo Springfield- Mr. Soul
Crosby, Stills & Nash (And Young)- 49 Bye-Byes
The Moody Blues- Ride My See-Saw
The Beatles- Fixing a Hole
Pink Floyd- Any Colour You Like
Led Zeppelin- (Karaoke) Out On The Tiles
The Allman Brothers Band- Dreams
The Traveling Wilburys- Rattled
The Rolling Stones- Can’t You Hear Me Knocking
The Who- Bargain
The Kinks- Low Budget
Squeeze- If I Didn’t Love You
Buffalo Springfield- I Am A Child
Lou Reed- Dirty Boulevard
Led Zeppelin- For Your Life
The Rolling Stones- Prodigal Son
Pink Floyd- On The Run
The Beatles- Martha My Dear
The Traveling Wilburys- Margarita
Yes- Perpetual Change

I was reading this with interest and generally agree with the findings presented here. I believe that for most people generating a playlist, they do NOT want to be surprised or challenged; they’re looking for something to generally hold a genre and provide an accompaniment to whatever they might be doing for the next few hours. Any playlist generator should at the very least have some kind of “twist” parameter that allows the amount of WTF-ness to be adjusted. The key point of this article is that Google’s instant mix just doesn’t have it right – Shame on those of you who are telling the author that he doesn’t know how to listen to his own music collection.

With all that said, the best thing I ever did was to have my entire collection (on my mk2 Rio Car) on shuffle and listen to it from beginning to end, plugging it into my home audio system when I wasn’t driving. It took nearly 6 months to get through it, but despite manually picking favourite albums out from time to time I always saved the playlist in the bookmarks so that I could continue. The hard drives are acting up these days but are due to be replaced and I’m sure that I’ll be repeating the experiment.

I would love to try the EchoNest playlisting API on my collection of music in the Rhapsody software. It doesn’t have to generate an actual playlist inside the program, just a purely textual list of songs picked from my library would be fine. Is there a possibility to do this, or might it be coming up? (Unfortunately, I’m not a programmer myself)

Just want to add a small comment: as someone already stated, there are people more focused on listening to their own favorite music area and other people who are willing to explore.

Maybe it’s important to highlight that in some cases you can know in advance the kind of people you are serving; I did a first small experiment on this and I found for example that jazz lovers seem to be much more eclectic than metal lovers. If you are interested you can find the results here: