Converting music into pictures and back

The objective I had in mind was to convert music into a picture and back again without too much quality loss. The simplest way is of course to use a lossless format like .png and simply use the rgb values of each pixel to store 3 bytes of, for instance, an mp3 file. But that is not very interesting. So I wanted to make it possible to use a lossy format for the picture, of which the most famous is jpeg.

Without looking at how jpeg compresses things, I started by simply writing spectrum data to a picture file. It seemed logical, as spectrum data already looks a bit 2d and not too random.

So, the process is as follows:

Use a Hann window of double the step size, so that each sample is processed twice.
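
A minimal sketch of why a window of double the step size works, assuming a periodic Hann window and a plain overlap-add on the way back. The function names and the toy signal are made up for illustration, and no spectrum processing happens between the two steps here. Two Hann windows shifted by half their length add up to exactly 1, so every sample that is covered twice comes back unchanged:

```python
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def frames(signal, step):
    """Cut the signal into windowed frames of length 2*step, hopping by step."""
    window = hann(2 * step)
    out = []
    for start in range(0, len(signal) - 2 * step + 1, step):
        out.append([signal[start + i] * window[i] for i in range(2 * step)])
    return out

def overlap_add(frame_list, step):
    """Sum the frames back at their original positions."""
    total = [0.0] * (step * (len(frame_list) + 1))
    for k, frame in enumerate(frame_list):
        for i, value in enumerate(frame):
            total[k * step + i] += value
    return total

step = 4                                      # hypothetical, tiny for readability
signal = [float(i % 7) for i in range(40)]    # toy signal
rebuilt = overlap_add(frames(signal, step), step)
```

Only the first and last `step` samples are covered by a single window, so those are the only ones that do not come back at full amplitude.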

If you look at the closeup you’ll see that the lowest frequencies (bottom of the image) change continuously in phase (color in the image). Now if you compress this, all the quickly changing colors will be smoothed out and especially the phase information will become highly distorted. So, to remove the changing colors, instead of storing the absolute phase, we’ll store the phase relative to the previous window. Now the result looks a lot smoother in the x-direction:
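
The relative-phase idea can be sketched as follows. This is an illustration under my own assumptions (phases in radians, one list of bin phases per window, the first window stored relative to zero), not the exact code used for the images:

```python
import math

def wrap(angle):
    """Wrap an angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def to_relative(phases):
    """phases[t][k] is the phase of bin k in window t; returns per-window differences."""
    previous = [0.0] * len(phases[0])
    deltas = []
    for row in phases:
        deltas.append([wrap(p - q) for p, q in zip(row, previous)])
        previous = row
    return deltas

def to_absolute(deltas):
    """Undo to_relative by accumulating the differences window by window."""
    previous = [0.0] * len(deltas[0])
    phases = []
    for row in deltas:
        current = [wrap(q + d) for q, d in zip(previous, row)]
        phases.append(current)
        previous = current
    return phases

# A steady tone in one bin: its phase advances by 0.3 rad every window.
phases = [[wrap(0.3 * t)] for t in range(50)]
deltas = to_relative(phases)      # constant 0.3 after the first window
restored = to_absolute(deltas)
```

The absolute phase keeps wrapping around, while the stored difference is the same value in every window, which is what makes the picture smooth in the x-direction.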

Compression and decompression perform okay on bmp. But when the image is converted to jpg and back, large problems arise in the phase information. Because only relative phase is stored, every small error is carried over into all following windows, so the absolute phase wanders a lot and after a few seconds it is completely off.

To correct this, I thought about storing absolute phase information in the leftover green channel, spreading the value over a couple of frames for higher accuracy. But it might be a better idea to look at the insides of jpeg compression first.

Jpeg compresses using 8×8 blocks and looks at the horizontal and vertical frequencies present in each block using a discrete cosine transform. For higher compression, coarser quantization of the possible amplitudes of each frequency is used, thereby mostly reducing higher frequency features in the image.
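
To make this concrete, here is a small sketch of an 8×8 DCT followed by quantization. The DCT is the standard orthonormal one, but the quantization table (step size growing with u + v) is made up for illustration and is not the standard jpeg table, and the test block is hypothetical:

```python
import math

N = 8  # jpeg block size

def dct_1d(v):
    """Orthonormal 8-point DCT-II."""
    out = []
    for u in range(N):
        scale = math.sqrt((1 if u == 0 else 2) / N)
        out.append(scale * sum(v[x] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                               for x in range(N)))
    return out

def idct_1d(c):
    """Inverse of dct_1d."""
    out = []
    for x in range(N):
        out.append(sum(math.sqrt((1 if u == 0 else 2) / N) * c[u]
                       * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                       for u in range(N)))
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def apply_2d(block, transform):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform(r) for r in block]
    return transpose([transform(c) for c in transpose(rows)])

def quantize(coeffs, base_step):
    """Round each coefficient to an integer level; the step grows with u + v.
    Illustrative table only, not the standard jpeg quantization table."""
    return [[round(coeffs[v][u] / (base_step * (1 + u + v))) for u in range(N)]
            for v in range(N)]

def count_nonzero(levels):
    return sum(1 for row in levels for q in row if q != 0)

# Smooth horizontal gradient plus a fine checkerboard (the high-frequency part).
block = [[128 + 8 * x + 4 * (-1) ** (x + y) for x in range(N)] for y in range(N)]
coeffs = apply_2d(block, dct_1d)
fine = quantize(coeffs, 1)      # gentle compression
coarse = quantize(coeffs, 16)   # strong compression
```

With the gentle setting the checkerboard still has a nonzero level at the highest frequency, `fine[7][7]`. With the strong setting that level rounds to zero while the low-frequency gradient survives.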

To get the same effect in audio compression (losing high frequency accuracy first at higher compression), we can store the lowest audio frequencies inside the lowest image frequencies, and the higher audio frequencies inside the higher image frequencies. As I’m using a window size of 1024, the number of bands does not fit inside a single 8×8=64 frequency jpeg block. So the solution I use is to spread the audio frequency information over multiple jpeg blocks. If there are, for instance, 16 jpeg blocks per audio spectrum, then the lowest 16 audio frequencies go into the first frequency of the 16 blocks, the next 16 audio frequencies go into the second frequency, and so on until all 64 frequencies of all 16 blocks are filled, which corresponds to 16×64=1024 audio frequencies. I’ve again put the real values into the red channel and the imaginary values into the blue channel, except this time I used absolute phase information. The result is the following:
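
The index arithmetic of this layout can be sketched like this. The names are made up, and the coefficient index simply runs from 0 (lowest image frequency) to 63 (highest). How that index maps onto the 8×8 grid of a block, for example in zigzag order, is an assumption left out here:

```python
BLOCKS_PER_SPECTRUM = 16   # jpeg blocks used for one audio window
COEFFS_PER_BLOCK = 64      # 8x8 image frequencies per block

def bin_to_slot(k):
    """Audio bin k -> (block index, coefficient index inside that block)."""
    return k % BLOCKS_PER_SPECTRUM, k // BLOCKS_PER_SPECTRUM

def scatter(spectrum):
    """Spread 1024 audio bins over 16 blocks of 64 coefficients."""
    blocks = [[0.0] * COEFFS_PER_BLOCK for _ in range(BLOCKS_PER_SPECTRUM)]
    for k, value in enumerate(spectrum):
        b, c = bin_to_slot(k)
        blocks[b][c] = value
    return blocks

def gather(blocks):
    """Inverse of scatter: read the bins back in audio frequency order."""
    total = BLOCKS_PER_SPECTRUM * COEFFS_PER_BLOCK
    return [blocks[k % BLOCKS_PER_SPECTRUM][k // BLOCKS_PER_SPECTRUM]
            for k in range(total)]

spectrum = [float(k) for k in range(1024)]   # toy data: value equals bin number
blocks = scatter(spectrum)
```

Bins 0 to 15 land in coefficient 0 of blocks 0 to 15, bin 16 starts coefficient 1 in block 0, and bin 1023 ends up in coefficient 63 of block 15.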

Now the problem is that jpeg has a simple pre-processing step that uses the knowledge that the human eye is more sensitive to changes in brightness than to changes in color. Basically, all color information is reduced to a quarter of its size (x/2 and y/2). This means that even at the highest quality jpg compression, a lot of the phase information is gone, which results in clicks in the recording wherever the phase information in succeeding audio windows mismatches.
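
A rough sketch of what this color reduction does to fine detail, assuming the encoder simply averages each 2×2 group of a color plane and the decoder repeats each value. Real encoders may filter differently, and the test pattern is made up:

```python
def subsample_2x2(plane):
    """Average each 2x2 group of a color plane (half the width, half the height)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def upsample_2x2(small):
    """Repeat every value twice in x and y to get back to the original size."""
    out = []
    for row in small:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

# Worst case: neighbouring pixels carry alternating values.
plane = [[100 if (x + y) % 2 == 0 else 200 for x in range(8)] for y in range(8)]
restored = upsample_2x2(subsample_2x2(plane))
```

Every pixel of `restored` is 150: the per-pixel differences are gone before any quality setting comes into play. Much of what sits in the red and blue channels ends up in these color planes, which is why the phase suffers.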