Creating Shazam in Java

This got me interested in how a program like Shazam works… And more importantly, how hard is it to program something similar in Java?

About Shazam

Shazam is an application which you can use to analyse/match music. When you install it on your phone, and hold the microphone to some music for about 20 to 30 seconds, it will tell you which song it is.

When I first used it, it gave me a magical feeling. “How did it do that!?” And even today, after using it a lot, it still has a bit of a magical feel to it.
Wouldn’t it be great if we could program something of our own that gives that same feeling? That was my goal for the past weekend.

Listen up..!

First things first: to get the music sample to analyse, we need to listen to the microphone in our Java application…! This is something I hadn’t done yet in Java, so I had no idea how hard this was going to be.
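A minimal sketch of what such a capture loop could look like with the standard javax.sound.sampled API. The class name MicCapture and the format values (44.1 kHz, 8-bit, mono, signed) are assumptions picked for illustration; they are not taken from the original program.

```java
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

final class MicCapture {

    // Assumed format: 44.1 kHz, 8-bit, mono, signed PCM.
    static AudioFormat format() {
        float sampleRate = 44100f;
        int sampleSizeInBits = 8;
        int channels = 1;
        boolean signed = true;
        boolean bigEndian = true; // byte order does not matter for 8-bit samples
        return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
    }

    // Records roughly the given number of milliseconds from the default microphone.
    static ByteArrayOutputStream record(long millis) throws LineUnavailableException {
        AudioFormat format = format();
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
        line.open(format);
        line.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        long end = System.currentTimeMillis() + millis;
        try {
            while (System.currentTimeMillis() < end) {
                // read() blocks until the requested number of bytes is available
                int count = line.read(buffer, 0, buffer.length);
                if (count > 0) {
                    out.write(buffer, 0, count);
                }
            }
        } finally {
            line.stop();
            line.close();
        }
        return out;
    }
}
```

Calling MicCapture.record(20000) would block for about 20 seconds and return the raw samples, one signed byte per sample with this format.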

So, now we have the recorded data in a ByteArrayOutputStream, great! Step 1 complete.

Microphone data

The next challenge is analyzing the data. When I printed the contents of my byte array I got a long list of numbers, like this:

0
0
1
2
4
7
6
3
-1
-2
-4
-2
-5
-7
-8
(etc)

Erhm… yes? This is sound?

To see if the data could be visualized I took the output and placed it in Open Office to generate a line graph:

Ah yes! This kind of looks like ‘sound’. It looks like what you see when using for example Windows Sound Recorder.

Data in this form is known as the time domain. But these numbers are basically useless to us right now… if you read the above article on how Shazam works you’ll read that they use a spectrum analysis instead of direct time domain data.
So the next big question is: How do we transform the current data into a spectrum analysis?

Discrete Fourier transform

To turn our data into usable data we need to apply the so-called Discrete Fourier Transform. This turns the data from the time domain into the frequency domain.
There is just one problem: if you transform the data into the frequency domain you lose every bit of information regarding time. So you’ll know what the magnitudes of all the frequencies are, but you have no idea when they appear.

To solve this we need a sliding window. We take chunks of data (in my case 4096 bytes of data) and transform just this bit of information. Then we know the magnitude of all frequencies that occur during just these 4096 bytes.

Implementing this

Instead of worrying about the Fourier Transform I googled a bit and found code for the so-called FFT (Fast Fourier Transform). I’m calling this code with the chunks:
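As a rough sketch of the idea: split the recorded bytes into consecutive chunks and run an FFT on each one. The ChunkedFft class, its small Complex type and the recursive radix-2 FFT below are stand-ins written for this example, not the FFT code that was found online, and the chunks here simply follow each other without overlapping.

```java
final class ChunkedFft {

    // Minimal complex number; a stand-in for whatever Complex class the FFT code provides.
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
        Complex plus(Complex o)  { return new Complex(re + o.re, im + o.im); }
        Complex minus(Complex o) { return new Complex(re - o.re, im - o.im); }
        Complex times(Complex o) { return new Complex(re * o.re - im * o.im, re * o.im + im * o.re); }
        double abs() { return Math.hypot(re, im); }
    }

    // Recursive radix-2 Cooley-Tukey FFT; the input length must be a power of two.
    static Complex[] fft(Complex[] x) {
        int n = x.length;
        if (n == 1) return new Complex[] { x[0] };
        if (n % 2 != 0) throw new IllegalArgumentException("length must be a power of two");
        Complex[] even = new Complex[n / 2];
        Complex[] odd = new Complex[n / 2];
        for (int k = 0; k < n / 2; k++) {
            even[k] = x[2 * k];
            odd[k] = x[2 * k + 1];
        }
        Complex[] e = fft(even);
        Complex[] o = fft(odd);
        Complex[] y = new Complex[n];
        for (int k = 0; k < n / 2; k++) {
            double angle = -2 * Math.PI * k / n;
            Complex t = new Complex(Math.cos(angle), Math.sin(angle)).times(o[k]);
            y[k] = e[k].plus(t);
            y[k + n / 2] = e[k].minus(t);
        }
        return y;
    }

    // Splits the audio into consecutive chunks and transforms each one.
    // Trailing bytes that do not fill a whole chunk are ignored.
    static Complex[][] transform(byte[] audio, int chunkSize) {
        int chunks = audio.length / chunkSize;
        Complex[][] results = new Complex[chunks][];
        for (int c = 0; c < chunks; c++) {
            Complex[] chunk = new Complex[chunkSize];
            for (int i = 0; i < chunkSize; i++) {
                chunk[i] = new Complex(audio[c * chunkSize + i], 0);
            }
            results[c] = fft(chunk);
        }
        return results;
    }
}
```

With the recording in a byte array called audio, ChunkedFft.transform(audio, 4096) gives one row of frequency data per 4096-byte chunk.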

Now we have a two-dimensional array (Complex[][]) containing the transformed chunks. This array contains data about all frequencies. To visualize this data I decided to implement a full spectrum analyzer (just to make sure I got the math right).
To show the data I hacked this together:

for (int i = 0; i < results.length; i++) {
    int freq = 1;
    for (int line = 1; line < size; line++) {
        // To get the magnitude of the sound at a given frequency slice,
        // take the abs() of the complex number.
        // Math.log gives a more manageable number (used for the color).
        double magnitude = Math.log(results[i][freq].abs() + 1);
        // The more blue in the color, the more intensity at this frequency point.
        // Cast after multiplying, and clamp to the valid 0-255 color range.
        g2d.setColor(new Color(0, Math.min(255, (int) (magnitude * 10)), Math.min(255, (int) (magnitude * 20))));
        // Fill:
        g2d.fillRect(i * blockSizeX, (size - line) * blockSizeY, blockSizeX, blockSizeY);
        // I used an improvised logarithmic scale and a normal scale:
        if (logModeEnabled && (Math.log10(line) * Math.log10(line)) > 1) {
            freq += (int) (Math.log10(line) * Math.log10(line));
        } else {
            freq++;
        }
    }
}

Introducing, Aphex Twin

This seems a bit OT (off-topic), but I’d like to tell you about an electronic musician called Aphex Twin (Richard David James). He makes crazy electronic music… but some songs have an interesting feature. His biggest hit, for example, Windowlicker, has a spectrogram image in it.
If you look at the song as spectral image it shows a nice spiral. Another song, called ‘Mathematical Equation’ shows the face of Twin! More information can be found here: Bastwood – Aphex Twin’s face.

When running this song against my spectral analyzer I get the following result:

Not perfect, but it seems to be Twin’s face!

Determining the key music points

The next step in Shazam’s algorithm is to determine some key points in the song, save those points as a hash and then try to match them against their database of over 8 million songs. This is done for speed: looking up a hash is an O(1) operation. That explains a lot of the awesome performance of Shazam!

Because I wanted to have everything working in one weekend (this is my maximum attention span sadly enough, then I need a new project to work on) I kept my algorithm as simple as possible. And to my surprise it worked.

For each line in the spectrum analysis I take the points with the highest magnitude from certain ranges. In my case: 40-80, 80-120, 120-180, 180-300.
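Picking those points could be sketched roughly as follows. The KeyPoints class is hypothetical, it takes the magnitudes of one line as a plain double array, and treating each upper bound as exclusive (so 80 belongs to the second range) is an assumption; the text doesn’t say which range a boundary belongs to.

```java
final class KeyPoints {

    // Start of the first range and the (assumed exclusive) upper bounds of
    // the ranges 40-80, 80-120, 120-180 and 180-300.
    static final int LOWER_LIMIT = 40;
    static final int[] UPPER_BOUNDS = {80, 120, 180, 300};

    // For one line of the spectrum, returns the index with the highest
    // magnitude inside each range.
    static int[] strongestBins(double[] magnitudes) {
        int last = UPPER_BOUNDS[UPPER_BOUNDS.length - 1];
        if (magnitudes.length < last) {
            throw new IllegalArgumentException("need at least " + last + " magnitudes");
        }
        int[] best = new int[UPPER_BOUNDS.length];
        int start = LOWER_LIMIT;
        for (int range = 0; range < UPPER_BOUNDS.length; range++) {
            int end = UPPER_BOUNDS[range];
            int bestBin = start;
            for (int bin = start + 1; bin < end; bin++) {
                if (magnitudes[bin] > magnitudes[bestBin]) {
                    bestBin = bin;
                }
            }
            best[range] = bestBin;
            start = end;
        }
        return best;
    }
}
```

Each line of the spectrum is reduced to four numbers this way, one per range.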

Indexing my own music

With this algorithm in place I decided to index all my 3000 songs. Instead of using the microphone you can just open mp3 files, convert them to the correct format, and read them the same way we did with the microphone, using an AudioInputStream. Converting stereo music into mono-channel audio was a bit trickier than I had hoped. Examples can be found online (it requires a bit too much code to paste here); you have to change the sampling a bit.
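One common way to do the stereo-to-mono step is to average the left and right sample of every frame. The sketch below assumes the decoded audio is interleaved 16-bit signed little-endian PCM; both that input format and the StereoToMono class are assumptions for illustration, and the original conversion code may well have worked differently.

```java
final class StereoToMono {

    // Averages interleaved 16-bit signed little-endian stereo frames
    // (left low, left high, right low, right high) into mono samples.
    // Trailing bytes that do not form a complete frame are ignored.
    static byte[] downmix(byte[] stereo) {
        int frames = stereo.length / 4; // 2 channels x 2 bytes per sample
        byte[] mono = new byte[frames * 2];
        for (int f = 0; f < frames; f++) {
            int left = (short) ((stereo[4 * f] & 0xFF) | (stereo[4 * f + 1] << 8));
            int right = (short) ((stereo[4 * f + 2] & 0xFF) | (stereo[4 * f + 3] << 8));
            int average = (left + right) / 2;
            mono[2 * f] = (byte) average;            // low byte
            mono[2 * f + 1] = (byte) (average >> 8); // high byte
        }
        return mono;
    }
}
```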

Matching!

The most important part of the program is the matching process. According to Shazam’s paper, they use hashing to get matches and then decide which song was the best match.

Instead of using difficult point-groupings in time I decided to use a line of our data (for example “33, 47, 94, 137”) as one hash: 1370944733
(In my tests using 3 or 4 points works best, but tweaking is difficult: I need to re-index my mp3s every time!)
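One way to pack four points into a single number so that the example above comes out as 1370944733 is to give each point its own block of decimal digits. The digit layout below is inferred from that one example and the LineHash class is hypothetical, so treat it as a sketch rather than the exact hash function.

```java
final class LineHash {

    // Packs four points into one number, highest point first:
    // p4, then p3 in three digits, then p2 and p1 in two digits each.
    // Note: a p1 or p2 of 100 or more spills into the neighbouring block,
    // so wider blocks would be needed to rule out accidental collisions.
    static long hash(int p1, int p2, int p3, int p4) {
        return p4 * 10_000_000L + p3 * 10_000L + p2 * 100L + p1;
    }
}
```

LineHash.hash(33, 47, 94, 137) returns 1370944733, the value from the example.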

Now we already have everything in place to do a lookup. First I read all the songs and generate hashes for each point of data. This is put into the hash-database.
The second step is reading the data of the song we need to match. These hashes are retrieved and we look at the matching datapoints.

There is just one problem: for each hash there are some hits, but how do we determine which song is the correct one? By looking at the number of matches? No, this doesn’t work…
The most important thing is timing. We must overlap the timing…! But how can we do this if we don’t know where we are in the song? After all, we could just as easily have recorded the final chords of the song.

By looking at the data I discovered something interesting, because we have the following data:

– A hash of the recording
– A matching hash of the possible match
– A song ID of the possible match
– The current time in our own recording
– The time of the hash in the possible match

Now we can subtract the current time in our recording (for example, line 34) from the time of the hash-match (for example, line 1352). This difference is stored together with the song ID, because this offset tells us where we could possibly be in the song.
When we have gone through all the hashes from our recording we are left with a lot of song IDs and offsets. The cool thing is, if you have a lot of hashes with matching offsets, you’ve found your song.
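The offset counting can be sketched like this. The OffsetMatcher and DataPoint names and the in-memory HashMap database are assumptions for the example; time is simply the line number within a song or within the recording.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OffsetMatcher {

    // One indexed hash occurrence: which song, and at which line (time) of that song.
    static final class DataPoint {
        final int songId;
        final int time;
        DataPoint(int songId, int time) { this.songId = songId; this.time = time; }
    }

    // hash -> every place in every indexed song where that hash occurs
    private final Map<Long, List<DataPoint>> database = new HashMap<>();

    // Indexes a song: hashes[t] is the hash of line t.
    void index(int songId, long[] hashes) {
        for (int t = 0; t < hashes.length; t++) {
            database.computeIfAbsent(hashes[t], k -> new ArrayList<>())
                    .add(new DataPoint(songId, t));
        }
    }

    // Returns the ID of the song with the most hashes agreeing on a single
    // offset, or -1 if no hash of the recording is in the database.
    int bestMatch(long[] recording) {
        // songId -> offset -> number of hashes that agree on that offset
        Map<Integer, Map<Integer, Integer>> votes = new HashMap<>();
        int bestSong = -1;
        int bestCount = 0;
        for (int t = 0; t < recording.length; t++) {
            List<DataPoint> hits = database.get(recording[t]);
            if (hits == null) {
                continue;
            }
            for (DataPoint hit : hits) {
                int offset = hit.time - t; // time in the song minus time in the recording
                int count = votes.computeIfAbsent(hit.songId, k -> new HashMap<>())
                                 .merge(offset, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    bestSong = hit.songId;
                }
            }
        }
        return bestSong;
    }
}
```

A song that merely shares a few hashes with the recording spreads its votes over many different offsets, while the right song piles them up on one.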

The results

For example, when listening to The Kooks – Match Box for just 20 seconds, this is the output of my program:

Very cool stuff Roy! Impressive..
I didn’t expect it to also match live recordings! Because thats a completely different song, with more or less and different instruments.
Compared to the original + background noise at least.

http://www.albert-jandevries.nl Albert-Jan de Vries

Great article, thanks for sharing!

http://milandinic.blogspot.com/ mile

Nice! Thanks for sharing! It would be nice to share the code too :)

http://www.operative.net Hannibal Tabu

So … when will you have a Shazam java build for Maemo?

http://the-music-collective.com Wow

Seriously. Awesome.

Casper Bang

Nice article. Would love to see the source as well! :)

uaretroublesome

Sounds like a gem ! Please share the source code

GeekyCoder

amazing !
looking forward to the source and will like to integrate it into a project …

thx

Robert

very interesting! thanks a lot for sharing!

derek

seriously impressive.

http://baptiste-wicht.com/ Baptiste Wicht

Very good job !

Really interesting.

http://zion-city.blogspot.com/ Luca Garulli

Impressive!

If you’re interested to create a service you need a fast db and space to store hashes of songs. I can give you an instance of a OrientDB database reachable by the Internet via HTTP RESTful.

bye,
Lvc@

RIccardo

Great work man, waiting to see the first release of your code !!

http://gianluigidavassiuk.blogspot.com Muzero

Brilliant! Absolutely awesome!!! i liked the FFT part… =)

Francesco

Great Job!!!
I would like to see the code: where will be available?

Steven Devijver

Great work, thanks for sharing! I suppose you’ve used this FFT implementation:

Great Work !
How about sharing the code on SourceForge and watch it as people who are interested make it faster and more perfect :)

Ian McGowan

I would pay for something that could successfully de-duplicate my large mp3 collection. I like the idea of an open source version of the program, and then sell the value add of a pre-compiled list of fingerprints. Kind of like musicbrainz.org, but it works..

F.Baube

Yes please share the code if you would.
Automated content detection is a pretty
hairy subject, and it’s great that your
weekend hack came together into something
that does something!

http://www.twitter.com/prairiedog2k prairiedog2k

Great article. Nice to see some old school reverse engineering going down. Keep up the great work.

Felipe

Pretty amazing work, and in one weekend… just WOW.
Congratz, Is very nice too see smart ppl doing smart things
just for the love of it.

Please share your code even if is spaghetti or hacked code, I think
most of us could just look into it to make it work in our environments and somebody can work on it to make it more usable and maybe add it in some open-source project.

:)

stucco

set it free!

hurry, before the lawyers come!

http://www.averagejackal.net Mark

I’d also love to take a gander at that code, even, as others have said, if it’s “not cleaned up for company”. I’d even help clean it up.

Cheers,
-Mark

Francesco

Yes, the code would be interesting even if still “dirty”…

http://ivonet.nl Ivo

Very cool…
I just love magic :-)

Amit

This is marvelous. This is not a hack. You have implemented your own version of Shazam / Nabbit /Midomi. Shazam might be too ahead but your implementation of the few steps to match the song track is beyond spell.
we will definitely help build your code to release for mass.

Cheers,
Amit

http://blog.zurka.us/ nasal

Awesome project and very well presented!

I might even try to study some Java just to try and repeat this experiment by myself!

Joe

I would kill for this source, I’m currently doing a DIY on a in-car computer and could really use this!