I have many AAC music files on many different computers and hard drives, mostly from iTunes with m4a and m4p extensions. Aside from comparing file names and metadata, I'm looking for a way to determine whether two AAC files are the same song. BTW, I've accumulated all the files on a Mac and would prefer to do the processing there, but I also have access to a PC.

My first thought was to take an MD5 hash of the files and compare them, but it turns out that differences in the metadata can cause the MD5 hash to differ between 2 copies of the same song. I also tried to delete all metadata (used AtomicParsley --metaEnema) and then compare the MD5 hashes, but other info stuffed into the AAC media container also causes different MD5 hashes.

So if I could somehow extract the raw AAC audio data from the file, write it to a temp file, and take the MD5 hash of that, it should allow me to detect identical files.

I've used a similar approach to compare JPEG files, but I can't seem to find a way to get at the raw AAC audio data.

AFAIK it tracks music files down by their "digital signature" (or something like that) checking them up against a database (Gracenote?). There are also some other programs for PCs which work in a similar fashion, but I don't recall any names ATM.

Thanks, I will check it out... I wonder if it will work on files that are not in an iTunes library... meaning over the years, the files have been scattered here and there and lots of them are not even in iTunes any more.

Hmmm... $50 is a little pricey for my blood. I was hoping to do the job using some shell scripting and some open source / freeware libraries.

Since m4p is DRM protected, there may not be many options for you, and I cannot tell you how to deal with those files. For non-DRM-protected files, you can use something like the following to extract the raw AAC bitstream:

CODE

ffmpeg -i input.m4a -c:a copy -f s8 output.raw.aac

This is a bit of a tricky job, since ffmpeg adds ADTS headers to AAC output by default. ADTS headers are usually fine, but since you want raw AAC, "-f s8" is set to fake the output format as signed 8-bit raw PCM. In combination with -c:a copy, it seems that ffmpeg successfully writes the raw AAC bitstream as intended. If you want to listen to the raw AAC file, you will probably have to add an ADTS header with the following:
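For the original goal of finding duplicates (as opposed to the ADTS step just mentioned), the "-f s8" command above could be wrapped in a small shell script. This is only a rough sketch: it assumes ffmpeg is on the PATH and that either md5sum (Linux) or md5 (Mac) is available, and the function names hash_stream, audio_md5 and list_dupes are made up for illustration. It writes the copied stream to stdout instead of a temp file and hashes it on the fly.

```shell
#!/bin/sh
# Sketch only: hash_stream, audio_md5 and list_dupes are hypothetical helper names.

# Print the bare MD5 hex digest of whatever arrives on stdin.
# Uses md5sum where present (Linux), otherwise falls back to md5 -q (macOS).
hash_stream() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum | cut -d ' ' -f 1
    else
        md5 -q
    fi
}

# Copy the audio stream of "$1" without re-encoding, using the "-f s8" trick
# from the post above, and hash the bytes instead of saving them.
# -nostdin keeps ffmpeg from swallowing the file list inside a read loop.
audio_md5() {
    ffmpeg -nostdin -v error -i "$1" -c:a copy -f s8 - | hash_stream
}

# Read "<hash>  <path>" lines on stdin and print only the lines whose hash
# occurs more than once, sorted so that duplicates end up next to each other.
list_dupes() {
    sort | awk '
        { count[$1]++; line[NR] = $0; key[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[key[i]] > 1) print line[i]
        }
    '
}

# Usage (the directory is just an example):
#   find ~/Music -name '*.m4a' | while IFS= read -r f; do
#       printf '%s  %s\n' "$(audio_md5 "$f")" "$f"
#   done | list_dupes
```

Two files only land in the same group if their AAC payload is byte-identical, so the same song encoded twice (or at different bitrates) will still show up as different.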

That's interesting, but I wonder whether it is appropriate to hash the decoded PCM of lossy coders that are floating-point based (and therefore not guaranteed to be bit-exact). Of course, comparing decoded PCM should be enough if the OP just wants to compare A with B right NOW.

How many? More than can be imported to a media player and sorted by length?

That said, maybe someone can give input on the following: suppose I do ffmpeg -i infile.m4a -acodec copy outfile.m4a, is there any possibility that the infile might contain headers (say, for gapless playback) which will be lost, so that a player sees the two files as having e.g. different lengths? mp3 files might have misleading length information ...

Although I don't understand why you remux to m4a here, your guess is correct. The amounts of delay and padding for gapless playback are usually stored in a special tag named "iTunSMPB", and it is lost in that remux process (actually, ffmpeg will copy most of the major tags in infile.m4a, but not iTunSMPB).

As long as the same decoder is used it should work. Changing decoders will possibly change the hashes for some formats, particularly for 24 bit audio or fixed point arithmetic.

Pretty much. And I doubt the AAC decoder would change, unless it's based on FFmpeg and they discover some things upstream that need fixing or something.

The component uses FB2K's input services to decode the files to raw PCM data, which is then sent directly to the hashing functions. There is still room for improvement, like multithreading, a proper output hash dialog, and different selectable hash routines (right now it uses SHA-1).