The current version is still mostly single-threaded for single-file encoding but scales well for multiple-file encoding.

The multicore tweaks used in this version are:
- parallel/asynchronous conversion/scaling of PCM data to float samples
- parallel/asynchronous computation of replay gain
- I/O ordering (WAV files are read into memory first before MP3 files are written to disk)
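The first tweak can be sketched as a block-wise conversion farmed out to a thread pool. This is a hypothetical Python stand-in for the framework's tasks, not the actual fpMP3Enc code:

```python
from concurrent.futures import ThreadPoolExecutor

def pcm_block_to_float(block):
    # Scale 16-bit PCM samples to floats in [-1.0, 1.0)
    return [s / 32768.0 for s in block]

def convert_parallel(pcm, block_size=4096, workers=4):
    # Blocks are independent, so the conversion parallelizes trivially;
    # pool.map preserves submission order, so the output stays in order.
    blocks = [pcm[i:i + block_size] for i in range(0, len(pcm), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        converted = pool.map(pcm_block_to_float, blocks)
    return [sample for block in converted for sample in block]
```

Replay gain can be handled the same way, as an independent task running over the same in-memory blocks.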

Single-file encoding is slower than the original LAME because I haven't SSE-optimized the code. The scale factor in multiple-file encoding is 3.3 compared to LAME and 3.4 compared to "fpMP3Enc single", which is quite good for a first version.

The next steps will be to perform the psychoacoustics computation for each frame in parallel, then the MDCT, and so on.

Compile and usage:
- You need Visual Studio 2008 (and obviously Windows) to compile the project.
- SSE4 is enabled by default. To disable: #undef USE_SSE4
- Memory control is disabled by default in this version. This can lead to pagefile swapping if too many files are to be encoded. You can enable memory control by using a different 'MEMORY_MODE' macro (experimental).
- Win32 version: The total WAV file size sum must not exceed 400MiB.
- Win64 version: The total WAV file size sum must not exceed 400GiB.
- Input: WAV 16-bit stereo, 44.1kHz
- Output: MP3 Stereo/Joint-Stereo, 44.1kHz

I'd like to know what you think about this project, so contact me if you have any comments or feedback.

George


I think it's great. Nice to see how multi-threaded encoding is finally taking off here and there. That there's not one single effort for one codec is good, too, IMHO. Different approaches, mostly in the form of different MT libraries, it seems, need to be tried out and tweaked to see which one is ultimately superior in certain situations.


Doesn't foobar2000 already do this if you have more than one core in your computer? I was getting similar numbers on a quad-core machine when I had to do some encoding for my dad. My Core Duo goes about 43x if it's not running warm.

foobar2000 does not use multi-threaded encoders, but it could. (Actually, out of the box it uses no encoders at all; the point is that most users will use single-threaded encoders, simply because those are the ones available.) nazgulord is correct.

A multi-threaded encoder would be better if you have fewer files to encode than you have cores, or if your hard drive becomes the bottleneck when encoding multiple files at the same time; and there might be other, more code-related advantages that can make a sequential file encode with an MT-enabled encoder faster than several instances of a single-threaded encoder. For example, when multiple instances of the same executable are running, the exact same code is executed four times in parallel, even code that might only need to run once in a real multi-threaded encoder. Modern CPUs share their cache, so there is some automatic internal optimisation, but a less generic MT library can still do better than that, or even trigger hardware optimizations more effectively.

But IMHO, the most apparent drawbacks of using multiple instances of single-threaded encoders show up when you process very big files (whole CD images instead of track-based files): towards the end of your batch encode, as fewer files are left than you have cores, efficiency drops because not all cores are fully used anymore, yet finishing the big CD rips that remain might still take some minutes. It's even more apparent when you only want to transcode a single CD image; that is as (in)efficient as on a single-core system. And when you use very fast encoding settings, usually on systems with more than two cores but a normal non-RAID disk setup, the disk access of, for instance, four I/O streams may be more than a single hard disk can handle, so the CPU load drops far below 98-100%. And you want your CPU to be the bottleneck, not your hard drive, because more CPU-intense settings usually mean higher quality or better compression.

For example, because of the latter I started to use the extra modes of the WavPack encoder in foobar2000. Transcoding four lossless audio files at once using -hh is simply more than my hard drive can handle.
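The tail-of-batch effect described above is easy to quantify with a toy model (my own sketch, nothing from the encoder): files are assigned greedily to whichever core frees up first, as happens when you run one single-threaded encoder instance per core.

```python
import heapq

def batch_time(file_minutes, cores):
    # Longest-file-first greedy schedule: each file is one single-threaded
    # encode, assigned to whichever core becomes free first.
    loads = [0.0] * cores
    heapq.heapify(loads)
    for t in sorted(file_minutes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)
```

One 40-minute CD image plus four 5-minute tracks on a quad core takes 40 minutes, because the big file dominates while three cores sit idle at the end, although the total work is only 60 core-minutes, i.e. 15 minutes at perfect efficiency.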

Otherwise, running two encodes in parallel will almost certainly be faster due to less overhead from threading, synchronization, etc.

Are you sure?

Running two encodes in parallel results in essentially perfect parallelization. The only overhead comes from disk contention, which is still a problem for the multithreaded single process case anyway.

Running one process will encounter additional overhead due to thread synchronization, lack of granularity in parallelism, overhead for inter-thread communication, etc. In order to make up for this, there would have to be additional work saved by running in one process and I don't see what that would be for MP3.

Of course, if you have 4 cores and only 3 files, for instance, the second approach may still be faster, simply because it finds a use for the fourth core. For instance, when I multithreaded libmad, it gave a ~90% speed-up, so it was actually slower than just running two files in parallel. Of course in my case I only had one file to decode, so it ended up being the best solution.

Doesn't foobar2000 already do this if you have more than one core in your computer? I was getting similar numbers on a quad-core machine when I had to do some encoding for my dad. My Core Duo goes about 43x if it's not running warm.

I cannot confirm this on my quad core. Converting the test set with foobar2000 took me about 6 minutes, which gives a speed of 47.4x. (BTW, the test was encoding to CBR 128.)

QUOTE (Mike Giacomelli @ Aug 2 2009, 18:38)

Running two encodes in parallel results in essentially perfect parallelization. The only overhead comes from disk contention, which is still a problem for the multithreaded single process case anyway.

This is not a problem in "fpMP3Enc". I/O processing is a separate task and is controlled by a file I/O scheduler that sorts and serializes I/O operations on the same drive. The encoder tasks work only on memory.

That's the reason why foobar2000 gets a 47.4x performance while "fpMP3Enc" gets 80.2x. I think on an i7 it's possible to get 150x and above... unfortunately I don't have such a system to test it.

QUOTE

Running one process will encounter additional overhead due to thread synchronization, lack of granularity in parallelism, overhead for inter-thread communication, etc. In order to make up for this, there would have to be additional work saved by running in one process and I don't see what that would be for MP3.

As you mentioned above, concurrent disk access is a problem when you execute multiple encoder processes in parallel. If you run them as tasks in one process you can use a file I/O scheduler that takes care of it, as I did.


Why is there a speedup for scheduling sequential reads vs. letting the OS's file caching schedule?

Why is there a speedup for scheduling sequential reads vs. letting the OS's file caching schedule?

For example, if you have four processes each trying to read a file sequentially from the same disk, you won't get sequential reads: the OS will split the I/O operations into pieces in order to feed each process.

My scheduler does not interrupt I/O operations. For example, if you want to read two 50 MiB files that reside on the same drive in fifty 1 MiB chunks each, first the 50 I/O operations of the first file are performed and then the 50 I/O operations of the second file.
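The behaviour described here can be modelled in a few lines (a hypothetical sketch under my own naming, not the actual scheduler): all pending chunks of one file are issued before the next file is touched, so the drive sees long sequential runs instead of interleaved reads from competing tasks.

```python
from collections import deque

class SerializingIOScheduler:
    # Toy per-drive scheduler: whole-file read requests are queued, and all
    # chunks of one file are issued before the next file starts.
    def __init__(self):
        self.queue = deque()

    def submit(self, name, chunks):
        self.queue.append((name, chunks))

    def run(self):
        ops = []
        while self.queue:
            name, chunks = self.queue.popleft()
            ops.extend((name, i) for i in range(chunks))  # sequential chunk reads
        return ops
```

Two 50 MiB files queued as 50 chunks each would come out as chunks 0-49 of the first file followed by chunks 0-49 of the second, unlike the interleaving the OS produces for competing processes.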

Of course, my library supports setting a memory limit.

I wrote an article about what I call "Parallel File Processing" (unfortunately it's only in German) on my blog. You can find the article here. This is the theory behind it.

In the more general case, an optimal scheduler would read just enough from one file before it needs to switch to reading for another file. This would lead to pretty big buffers, but not necessarily whole-file buffering.

No idea yet if this encoder is implementing something like that, but if it is, that sort of scheduler would be extremely valuable for other open-source applications.

In the more general case, an optimal scheduler would read just enough from one file before it needs to switch to reading for another file. This would lead to pretty big buffers, but not necessarily whole-file buffering.

This is possible with my framework (see SequentialVirtualMemory classes).

In "fpMP3Enc" I've not determined the optimal memory strategy yet because I'm still working on parallelizing the encoder. It will depend on the final performance.

QUOTE

No idea yet if this encoder is implementing something like that, but if it is, that sort of scheduler would be extremely valuable for other open-source applications.

The framework can be used for free in (totally) non-commercial applications.

In "fpMP3Enc" I've not determined the optimal memory strategy yet because I'm still working on parallelizing the encoder. It will depend on the final performance.

Will it be optimised for fpMP3Enc or will it be more adaptable to different encoders or even decoders?

There are codec settings that decode very slowly, Monkey's Audio's "extra high" for example. The encoding tasks might run idle if the I/O scheduler reads too much data from such files. Maybe when the scheduler adjusts the buffer size dynamically it should look in both directions, and if decoding takes longer than encoding, even give the decoding tasks higher priority?

I know fpMP3Enc currently just supports WAV, but that might change in the future, or not be the case in other projects that might want to use your library.

Anyway, it's an interesting (and logical) approach to optimize an encoder for multi-cores without parallelizing that much of the original code, I guess.

Will it be optimised for fpMP3Enc or will it be more adaptable to different encoders or even decoders?

The memory strategy is not set by the framework, it's set by the application and each memory object in the application can use a different strategy.

QUOTE

There are codec settings that decode very slowly, Monkey's Audio's extra high for example. The encoding tasks might run idle if the I/O scheduler reads too much data from such files. Maybe when the scheduler adjusts the buffer size dynamically it should look in both directions, and if decoding takes longer than encoding even give the decoding tasks higher priority?

This can be done by having a small buffer for the input files and a large buffer for the output files. Then the scheduler would fill the buffer for the first file first, switch to the next and so on and return to the first file to read the next chunk.
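That refill strategy might look roughly like this (a hedged sketch with my own naming, not the framework's API):

```python
from collections import deque

def round_robin_fill(files, buffer_chunks):
    # Fill one buffer's worth of chunks for the first file, move on to the
    # next, and circle back for the following chunk of each file, so a
    # slowly decoded input never monopolizes the disk.
    pending = deque((name, 0, size) for name, size in files)
    ops = []
    while pending:
        name, offset, size = pending.popleft()
        take = min(buffer_chunks, size - offset)
        ops.extend((name, offset + i) for i in range(take))
        if offset + take < size:
            pending.append((name, offset + take, size))
    return ops
```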

QUOTE

I know fpMP3Enc currently just supports WAV, but that might change in the future, or not be the case in other projects that might want to use your library.

For me, fpMP3Enc is just a case study or proof of concept for my multicore framework, so you cannot expect new features from me. But the fpMP3Enc source code will become available under an open-source license (probably the GPL) for other developers to extend.

I've updated the application and added three command-line switches that show how it scales under different memory conditions:

--mem-use-ignore (default)
This switch should be used if you know that you have enough physical memory free for the task. If more memory is needed than expected, page file swapping may occur.

--mem-use-file <size in MiB>
With this switch you can specify the maximum buffer size to use for each file.

--mem-use-system-free <size in MiB>
This switch can be used to specify the physical memory size that should be kept free for other applications. fpMP3Enc will stop committing memory if the available physical memory falls below this value. To avoid a deadlock, the minimum committed memory for each memory object is 128 KiB. Page file swapping will not occur unless another application needs more memory than specified.
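The policy behind "--mem-use-system-free" can be modelled in a few lines (a simplified sketch of the rule as described above, not the actual implementation):

```python
MIN_COMMIT = 128 * 1024  # 128 KiB floor per memory object, to avoid deadlock

def commit_grant(requested, free_physical, keep_free):
    # Grant as much of the request as fits without pushing free physical
    # memory below keep_free; never grant less than MIN_COMMIT so every
    # task can still make progress.
    headroom = free_physical - keep_free
    return min(requested, max(MIN_COMMIT, headroom))
```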

So, on my system I could use a 3 MiB buffer per file to get the best results. These values will vary on other systems. I recommend using the "--mem-use-system-free" switch with a value that's OK for your system.

This means that fpMP3Enc is about 87% or 1.87x faster than LAME in single file encoding, while the speedup is 3.4x in multi file encoding.

The following optimizations were performed:
- A frame buffer (20 MiB) is used to hold the work data for about 1000 frames
- Psychoacoustics are computed frame by frame in a separate task without data dependencies
- The MDCT is computed frame by frame almost in parallel to psychoacoustics (right after attack detection, which is at the very beginning)
- The MDCT (left channel) and MDCT (right channel) are computed in separate tasks (for details see here; English translation will follow)
- A small part of the "VBR_encode_frame" function is split into four tasks
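The per-frame task split can be sketched schematically like this (psy and mdct are placeholders for the real per-frame computations; this is my own illustration, not LAME's code):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_frames(frames, psy, mdct, workers=3):
    # Per frame: psychoacoustics runs as one task while the MDCTs of the
    # left and right channels run as two further independent tasks.
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for left, right in frames:
            psy_task = pool.submit(psy, left, right)
            mdct_left = pool.submit(mdct, left)
            mdct_right = pool.submit(mdct, right)
            results.append((psy_task.result(),
                            mdct_left.result(), mdct_right.result()))
    return results
```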

ABR and vbr-old should also benefit from these changes. CBR is disabled in this version because this mode has some data dependencies that I have not handled yet.

To get the best results you should not use more than 3 threads on quad cores or higher for single-file encoding. The value can be set with the "--threads" option.

There is still so much to optimize in the LAME code. Let's see how far it can get...

An x80 encoding speed equates to reading 14MB a second (from an uncompressed file; if lossless, then half that). A modern HDD should be able to do 3x that without breaking a sweat, so I am not sure where the speed differentials come from when buffering to memory. Windows can have radically different read speeds depending on the transfer buffer size selected; off the top of my head, on XP 64KB was the optimum size.
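A quick sanity check of the 14MB-a-second figure (assuming CD-format WAV input: 44.1kHz, stereo, 16-bit):

```python
def wav_read_rate_mb(speed, sample_rate=44100, channels=2, bytes_per_sample=2):
    # MB/s of uncompressed WAV consumed at `speed` times realtime
    return speed * sample_rate * channels * bytes_per_sample / 1e6
```

At 80x realtime that is about 14.1 MB/s, matching the figure above.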

This means that fpMP3Enc is about 87% or 1.87x faster than LAME in single file encoding, while the speedup is 3.4x in multi file encoding.

Thanks for sharing your work. I'm no expert on programming but would like some things to be cleared up.

How many threads did you use for the presented fpMP3Enc results? I suspect the multi-file encoding was done with 4 cores. If so, LAME encoding 4 or more files with one file per core would result in about 4 x 32.3 = 129.2 times encoding speed.

Also, should a well-designed native 64-bit encoder be nearly twice as fast as its 32-bit counterpart, or do the Core2Duo chips handle this well through some kind of emulation?

I'm trying to understand what is measured here and where the speed gain is coming from.

An x80 encoding speed equates to reading 14MB a second (from an uncompressed file; if lossless, then half that). A modern HDD should be able to do 3x that without breaking a sweat, so I am not sure where the speed differentials come from when buffering to memory. Windows can have radically different read speeds depending on the transfer buffer size selected; off the top of my head, on XP 64KB was the optimum size.

If you consider that 8 files (4x read, 4x write) are being processed in parallel, then 14MB/s is pretty good on a 100MB/s drive.

With the I/O strategy used in fpMP3Enc you could use 8 threads processing 16 files in parallel and get almost the same throughput on a quad-core. On an 8-core the throughput should be far above 20MB/s.

In contrast, if you execute 4 LAME processes for parallel encoding, the concurrent disk access will lead to an I/O performance below 14MB/s.

How many threads did you use for the presented fpMP3Enc results? I suspect the multi-file encoding was done with 4 cores. If so, LAME encoding 4 or more files with one file per core would result in about 4 x 32.3 = 129.2 times encoding speed.

I used 3 threads in single-file and 4 threads in multi-file encoding for the CPU-bound tasks. I/O always needs two threads: one for the I/O scheduler and one for listening on the I/O completion port.

For the reason why it's not possible to get a 129.2x speed with 4xLAME, see my previous post.

QUOTE

Also, should a well-designed native 64-bit encoder be nearly twice as fast as its 32-bit counterpart, or do the Core2Duo chips handle this well through some kind of emulation?

AFAIK, the CPU-intensive parts of LAME are SSE-optimized using 64- and 128-bit registers on both systems, x32 and x64. So I don't think that an x64 encoder would be much faster than an x32 encoder.