So far I wasn't much interested in the small compression gain, but the V0.4 speed gain is impressive. Anyway, I won't buy an Nvidia card just for CUDA... sadly I own a GeForce 7600GS, and my next graphics card will be integrated into my future Core i3 530... so I guess I'll never use FlaCuda ;(

It makes me wonder how fast a multi-threaded FlaCuda -4 encode could run on a Sandy Bridge octo-core with a GeForce 300... more than 16x faster than my old Athlon XP 3000+ (Barton), I guess.

The sad thing for FlaCuda is that in the near future cheap GPUs will be integrated into low-end CPUs as soon as 2010 (Clarkdale: 2 cores + a 45nm GPU) and into mid-range CPUs as soon as 2011 (Sandy Bridge: 4 cores + a 32nm GPU) for Intel, and AMD will follow (a year late, as always). All these integrated GPUs will have hardware acceleration for Blu-ray video codecs, so unless you're a die-hard gamer, buying an Nvidia card will be a pure waste of money.

The coming years will be hard for nvidia. I am not even sure it will survive.

I wouldn't kill nVidia just yet. AFAIK, as of now, theirs are the only cards that support GPU video transcoding, which is heavily used in newer encoding applications, as well as in the new Photoshop for calculating some effects. While we are on the subject, where is this multithreaded FLAC encoder, BINARY, so I can test it?

IMHO audio or video encoding will not help Nvidia survive long, because if the only reason to buy a GPU becomes accelerating encoding, then you'd be better off buying a higher-end CPU. Being manufactured on a smaller process than GPUs, CPUs will always have the advantage in brute encoding force vs. power consumption and heat.

As for a multithreaded FLAC encoder, AFAIK there is none. I think I recall reading about some very experimental proof-of-concept code on some mailing list, but nothing serious.

Maybe we should start a donation drive to buy a quad-core for Josh; it can't be more useless than buying a PC for Klemm, after all.

The simplest way to use multithreading with any encoder is to run multiple encoder instances simultaneously (foobar2000 can do that). The number of usable threads depends on when the hard disk becomes the bottleneck.


Sounds fun, though I'm afraid we'd bump into a strong bottleneck because of disk head positioning. Even converting with 2 threads, one HDD seeks like crazy - but it's still a lot faster than 1 thread. NCQ in AHCI mode should help a lot with more threads, but it didn't when I tested it a while ago. Physically different source/target drives can alleviate this bottleneck quite a bit. Fast SSDs are worth a try too. This CUDA encoder can be a different solution: as a single instance it's faster than the reference encoder running on one core of my CPU (and converting one file at a time is the least disk-bottlenecked way to do it). A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.

Is there anybody here who knows the math behind the Cholesky decomposition used in ffmpeg as an alternative method of LPC coefficient search? This method is too slow for the CPU, but I thought I'd give it a shot on the GPU. The problem is, GPUs don't do double precision very well, and the lls code from ffmpeg doesn't work in single-precision floats due to overflows. My first idea was to scale down the signal to avoid the overflows, but the results were poor. There's something I don't understand about this algorithm: in theory, LPC coeffs shouldn't depend on the scale of the signal - after all, they are linear. I have a suspicion that in practice this algorithm depends on the scale of the signal a lot. I don't pretend to understand this math, but the first suspicious piece of code is this (from av_solve_lls):

When the signal is multiplied by 10, covar[i][j] is multiplied by 100, and both factor[i][k] and factor[j][k] are multiplied by 100, so factor[i][k]*factor[j][k] is multiplied by 10000. So this sum doesn't scale in any predictable fashion.

I also don't understand this magic 'threshold' business.

CODE

if(sum < threshold) sum= 1.0;

How should the threshold scale with the signal? Should the sum always be set to 1.0 if it's below the threshold, or to some value depending on the scale of the signal? Or am I on the wrong track completely?

I also found this old post from Josh:

QUOTE (jcoalson @ Jul 24 2006, 10:04)

I have actually been doing experiments solving the full prediction linear system with SVD; this should give a lower bound on the compression achievable by the FLAC filter.

Is there any working code left from those experiments, and how successful were they?

I must add that when the computations are done in double precision, the lls coeffs do not depend much on the scale of the signal, so the algorithm works despite the non-linear scaling of the intermediate values. But in single precision they start to drift much more. Which is weird, because in the literature Cholesky decomposition is said to be more stable than Levinson-Durbin recursion with regard to rounding errors.

Sounds fun, though I'm afraid we'd bump into a strong bottleneck because of disk head positioning. Even converting with 2 threads, one HDD seeks like crazy - but it's still a lot faster than 1 thread. [ ] A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.

Ideally you would run multiple instances of a single-threaded encoder (one track per CPU core) and one instance of the CUDA encoder per GPU at the same time - it's just a matter of making sure that all instances are kept busy.

When the number of remaining tracks gets lower than the number of available cores, you prioritize the GPU instance (since it's faster than a single-threaded encoder on a single CPU core), but also run (if available) a multi-threaded encoder; one MT encoder over two cores is likely to be slower than two instances of a ST encoder over the same number of cores (see the Lancer builds of the Ogg Vorbis encoder). In other words, an MT encoder is particularly useful for keeping CPU cores busy when the workload dries up.

In short, the priorities go like this (if you have a multi-core CPU, that is):

ST * n CPU cores > GPU > MT

As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).

I already use all available CPU cores when I encode my rips to FLAC or any other codec (one track per core); what I could really use, even before a MT FLAC encoder comes up, is a simple, command-line, multi-threaded Replay Gain utility. As I've said in the past, computing RG values on an album now takes longer than encoding it in the first place (because the former uses only one core while the latter uses all 4 cores on my quadcore CPU).

As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).

You're absolutely right, I don't know how I could forget about RAMdisks. I used them all the time back when 8MiB felt like plenty of RAM, but somehow I haven't thought about them since we've had multiple GiBs at our disposal... talk about contradictions...

I've gotten flacuda to work with the old but still handy Flac Frontend. The only little issue is that flacuda doesn't recognize the -V option as verify like the flac.exe does, so I can't use the verify checkbox in the Frontend. It's a tiny thing, but it would be cool if, maybe along with a future update, -V was added to flacuda. If not, I'll just go about setting it up to work with Foobar.


Thank you again, Gregory. Very cool stuff.

Since you can't use ReplayGain with Flac Frontend and FlaCuda, and you still want its simple layout, just try Multi-Frontend from the same author. There you can define your command line with --verify. I even resurrected Frontah for mirroring old files to a new folder with FlaCuda and tags in one click. Its ini is simple to adjust to make it work. Too sad Frontah development was stopped.

Edit: When anyone recommends foobar now, please tell me how you can simply mirror (re-encode) folders + copy tags + ReplayGain info in one go. I didn't manage to do it that simply, but I read here and there "use foobar" with no detailed info on how. Maybe I misunderstand its functionality.

Edit 2: Finished the re-encode of my collection. Since I used flac 1.1.0-1.2.1, Flake and some other builds over the years, I suppose it is of no use to report my space savings as a guiding value. On some albums there were big savings. A few albums came out bigger, mainly very quiet music or music with many quiet parts. I can imagine that on some collections with special kinds of music it won't save as much space as expected.

I really can't tell if your FlaCuda became any faster, because it was damn fast before. All I can say is that it's kind of fun having the GPU do its job without noticing your system being under heavy stress. So while encoding with FlaCuda you can still do heavy tasks in front. I love it.

I second that, it's ridiculous. Now FlaCuda 0.6 at -6 is almost as fast as Flac 1.2.1 -8 running on two threads... and this is a stock 8600GT standing up against a pretty heavily overclocked Core 2 Duo... If I give the GeForce a little overclock, it comes out faster than the 2 instances of Flac 1.2.1 together... The file sizes are even a bit smaller than with the CPU encoder, and there are 'more hardcore' settings... It's true that heavier compression takes a toll on decoding speed too, so I stick with the original -8-ish compression when I use FLAC. TAK is somewhat slower to decode, but it compresses better than even FlaCuda does at -11, and that's beyond the speed crossover point: the -11 FLAC is slower to decode than the -p2m TAK (which is 18 kbps smaller in the case of my test material). No, that's not a TAK marketing remark; I'm just testing that too - it's interesting for me to compare these codecs.