In my opinion the answer to the question is: not yet, as far as current general-purpose (non-real-time) operating systems and their APIs are concerned. Most of them appear to overdo the idea of playing back a sound buffer. Rightly so, since even the fastest CPUs will hiccup on playback if it is done with a bad design. But as time has progressed, single-thread performance has risen, and multi-thread performance tends to be even better if used correctly; there are just too many ways to actually do it.

We have always considered multi-threaded designs superior for audio playback and processing for the simple reason that we can do more in a given amount of time. Plain playback has been fluid for years on generally available CPUs, even on phones and other small devices. Playback with live DSP is a whole different problem. Vrok (older versions) tried to answer it with a two-thread model: a decoder thread and a playback thread. DSP was done on the decoder thread; the design had little reason for doing so, other than "why not?". It worked well, but on weaker CPUs the buffer had to be made bigger to reduce artifacts, which added a noticeable amount of latency to the audio.
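To make the two-thread model concrete, here is a minimal sketch of a decoder thread feeding a playback thread through a bounded queue. This is not the Vrok code, just an illustration under assumed names (`BoundedQueue`, `run_pipeline`); a real player would call a decoder and an OS audio API where the comments indicate:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Bounded queue shared between the decoder and playback threads.
// A bigger capacity absorbs decoder hiccups but adds latency.
class BoundedQueue {
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}
    void push(std::vector<float> buf) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(buf));
        not_empty_.notify_one();
    }
    std::vector<float> pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        std::vector<float> buf = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return buf;
    }
private:
    size_t cap_;
    std::queue<std::vector<float>> q_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

// Decoder thread produces `n` blocks (DSP would also run here in the old
// design); playback thread consumes them. Returns total frames consumed.
size_t run_pipeline(size_t n) {
    BoundedQueue q(4);
    size_t consumed = 0;
    std::thread decoder([&] {
        for (size_t i = 0; i < n; ++i)
            q.push(std::vector<float>(512, 0.0f)); // "decoded" PCM block
        q.push({}); // empty buffer marks end of stream
    });
    std::thread playback([&] {
        for (;;) {
            std::vector<float> buf = q.pop();
            if (buf.empty()) break;
            consumed += buf.size(); // a real sink hands this to the OS API
        }
    });
    decoder.join();
    playback.join();
    return consumed;
}
```

The queue capacity plays the role of the buffer described above: growing it hides a slow decoder at the cost of latency.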

Even with years of DSP on audio, almost everyone in the audio business has been doing it linearly: one effect after another. There is very little reason for this choice; right now I can't think of one other than ease of implementation. Non-linear DSP should be great, at least in theory. However, this is not the first time someone has thought of it: JACK (JACK Audio Connection Kit) offers a great implementation on supported operating systems. But JACK is a bulky system, recommended for people with some knowledge of sound servers and patch bays; it is a no-go for someone who only wants to listen to some music in their free time. So why not a mini-JACK-like system that is faster, portable and extendable?

Below is a sketch of how VrokNext would work. Each node will run on a runnable unit (let's call it a thread for now; the differences are detailed further below).

Communication between threads is done by two queues which are built to be fast (preferably lock-free). Plain memory allocation and deallocation could be used for buffer management, but two things would add overhead: every allocation and deallocation would need to be synchronized, and a large number of allocations and deallocations over a long period has an adverse effect on the allocator because of fragmentation. (You may think of this as over-engineering, but on a system with low resources these can be significant.) Preallocating everything before use and reusing what you have allocated is always the good approach, so that is what I chose to implement.
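The preallocation idea can be sketched as a simple buffer pool. This is an illustration, not the VrokNext implementation: all buffers are allocated once up front, and the audio path only recycles pointers, never touching the heap allocator.

```cpp
#include <cstddef>
#include <vector>

// Preallocated pool of audio buffers. acquire()/release() only move
// pointers around, so no allocation happens on the audio path.
class BufferPool {
public:
    BufferPool(size_t count, size_t frames)
        : storage_(count, std::vector<float>(frames)) {
        for (auto& b : storage_) free_.push_back(&b);
    }
    // Returns nullptr when the pool is exhausted instead of allocating;
    // the caller decides whether that is an underrun or a design bug.
    std::vector<float>* acquire() {
        if (free_.empty()) return nullptr;
        std::vector<float>* b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(std::vector<float>* b) { free_.push_back(b); }
    size_t available() const { return free_.size(); }
private:
    std::vector<std::vector<float>> storage_; // owns all buffers, fixed size
    std::vector<std::vector<float>*> free_;   // buffers not currently in use
};
```

In a multi-threaded build the free list would itself be one of the fast (preferably lock-free) queues mentioned above.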

If every node runs on its own thread, at some point there will be a lot of threads, and their own contention is what will slow things down. On an N-core system, even with Hyper-Threading enabled, there is no way you are going to get more than 2N threads running at the same time. Therefore it is useful to have a mini scheduler that schedules the work of every node without the overhead of a lot of threads: each thread runs a preset number of processing functions. This is in no way a finalized design, but it has the potential to use whatever hardware there is to its maximum without forcing the DSP or the playback to be of low quality.
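A toy version of such a mini scheduler might look like the following. It is only a sketch under assumed names (`run_graph`, nodes as plain callbacks): a fixed pool of workers, sized to the hardware, shares all the node processing functions instead of giving each node its own thread. A real scheduler would also have to respect the data dependencies between nodes, which this sketch ignores.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Runs every node's processing callback `rounds` times (one round standing
// in for one buffer pushed through the graph), using only as many threads
// as the hardware can actually run at once.
void run_graph(const std::vector<std::function<void()>>& nodes, int rounds) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<size_t> next{0};
    const size_t total = nodes.size() * static_cast<size_t>(rounds);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Workers claim jobs from a shared atomic counter; one job is
            // one invocation of one node's processing function.
            for (size_t i = next.fetch_add(1); i < total;
                 i = next.fetch_add(1))
                nodes[i % nodes.size()]();
        });
    }
    for (auto& t : pool) t.join();
}
```

The point of the design is visible even in the toy: the thread count is bounded by the hardware, while the number of nodes can grow freely.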

So people (including me) struggle with FFmpeg. It is very powerful, and also a bit hard to understand for the average developer who hasn't had a background in media decoding/encoding. I struggled with playback first, then found a good implementation on Stack Overflow and built on top of that. This post details that.

Soon after we get playback working, we need to get seeking working. If you are a Vrok (<=3) user, you might have experienced its buggy seeking. As I dug into FFmpeg this vacation, things became clear with the help of dranger's good ol' tutorial, which details PTS/DTS well within the context of FFmpeg. The documentation of FFmpeg tells half the story of its design, which is based on multiple streams. Each stream has its own context, and each context has its own time base. What's a time base? It's just a number that decides how fine the time scale is: a finer time base lets you store fine-grained time intervals, but this has to be balanced against the media duration and the storage size of the variable the timestamps are stored in.

FFmpeg has a separate context for the whole file (container in the code); this context has a time base defined as AV_TIME_BASE. It is this time base that is used to store durations concerning the whole file, namely the total duration. So the following code gives you the total duration in seconds.

duration_in_seconds = container->duration / AV_TIME_BASE;
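One detail worth noting: AV_TIME_BASE is 1000000 in FFmpeg (the container duration is in microseconds), and `container->duration` is an `int64_t`, so the plain division above truncates to whole seconds. A self-contained sketch of the fractional version, with the constant inlined so it compiles without the FFmpeg headers:

```cpp
#include <cstdint>

// AV_TIME_BASE is 1000000 in FFmpeg: container-level durations are
// expressed in microseconds.
static const int64_t kAvTimeBase = 1000000;

// Cast before dividing to keep the fractional part, e.g. for a seek bar.
double duration_seconds(int64_t container_duration) {
    return static_cast<double>(container_duration) / kAvTimeBase;
}
```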

The audio stream has a different time base, which you can get by accessing:

container->streams[audio_stream_id]->time_base

This is expressed as an AVRational; all you need to know about it is that every rational number can be evaluated as a decimal (numerator/denominator). So you can again convert any PTS (Presentation Time Stamp) to seconds using this time base, which can be used to show the current playback position.
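The conversion is just one multiplication. Here is a self-contained sketch using a stand-in for FFmpeg's AVRational so it compiles without the library headers:

```cpp
#include <cstdint>

// Minimal stand-in for FFmpeg's AVRational {num, den}.
struct Rational { int num, den; };

// A PTS counts ticks of the stream's time base, so
// seconds = pts * (num / den).
double pts_to_seconds(int64_t pts, Rational time_base) {
    return static_cast<double>(pts) * time_base.num / time_base.den;
}
```

For example, with a time base of 1/44100 (one tick per sample at 44.1 kHz), a PTS of 44100 is one second into the stream.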

A seek call such as av_seek_frame() is used to seek to the correct frame (it is not exact, but does the job; FFmpeg only seeks to key frames). Here, seek_to is stored in seconds and is converted to the time-base version of time by dividing it by the time base (yes, the convention is a bit different from the AVFormatContext's time base). Hope this clears up FFmpeg seeking! Refer to the Vrok source for sample code.

For now it only includes string functions such as split, to_string, from_string, to_upper, to_lower, etc. More will be added in the future. Functions related to an std class, such as std::string, are added to the std namespace; the rest go in the cpplib namespace.
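As an illustration of the kind of helper involved, here is one plausible shape for split(). The signature is an assumption, not the library's actual API, and it is placed in the cpplib namespace here for the sketch:

```cpp
#include <string>
#include <vector>

namespace cpplib {

// Cuts `s` at every occurrence of `delim`; empty fields are preserved,
// so "a,,b" yields three elements.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string::size_type start = 0, pos;
    while ((pos = s.find(delim, start)) != std::string::npos) {
        out.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    out.push_back(s.substr(start));
    return out;
}

} // namespace cpplib
```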