I'm extending my Mumble library. I just realized that Murmur doesn't mix audio from different users, so the client has to do it by itself. I'm having trouble pinpointing how this problem is solved in the original Mumble client.

Let's assume I have decoded audio packets from several users. The packets can have different lengths (10, 20, or 60 ms). What are the optimal strategies to mix them in real time? How does Mumble do it?

We have one jitter buffer (we use the one from speexdsp) per speaking user, into which the packets from the network are inserted (AudioOutputSpeech::addFrameToBuffer). The important information for this is the sequence number as well as the number of samples in the packet. The sequence number is part of the framing and tells us where in time this packet belongs in that user's stream. The number of samples tells us how long the audio in that packet is. This jitter buffer also handles reordering.
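To illustrate the role the sequence number plays, here is a deliberately simplified per-user buffer keyed by sequence number (a hypothetical sketch, not the speexdsp implementation the client actually uses; the real jitter buffer also adapts its depth and tracks timing):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Simplified model of a per-user jitter buffer. Packets are keyed by their
// sequence number, which orders them within the user's stream; out-of-order
// arrivals are therefore reordered automatically.
struct SimpleJitterBuffer {
    std::map<uint32_t, std::vector<float>> packets; // seq -> decoded samples
    uint32_t nextSeq = 0;                           // next packet we expect

    // Insert a packet; packets older than the read position are dropped.
    void put(uint32_t seq, std::vector<float> samples) {
        if (seq >= nextSeq)
            packets[seq] = std::move(samples);
    }

    // Pop the next packet in sequence order. An empty result means the
    // packet is missing or late, so the caller would run the codec's
    // packet loss concealment instead.
    std::vector<float> get() {
        auto it = packets.find(nextSeq);
        ++nextSeq;
        if (it == packets.end())
            return {};                              // underrun: trigger PLC
        std::vector<float> out = std::move(it->second);
        packets.erase(it);
        return out;
    }
};
```

The real buffer additionally resizes itself based on observed jitter, which this sketch omits.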

Mumble's output then simply relies on the output device trying to keep its output buffer filled. We have a mix function (AudioOutput::mix) which you tell how many samples you plan to output; it then goes to all currently active speaker objects and asks each of them for that amount of samples (AudioOutputSpeech::needSamples). The needSamples function decodes as many samples from the jitter buffer as needed to fulfill that request. If the jitter buffer does not have the right packets (missing or late), this is recognized, the packet loss correction of the codec is used, and the jitter buffer size might get adjusted to prevent future underruns or misses. End of speech is signaled by a terminator flag in the packet framing.
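The pull-based mixing step itself can be sketched like this (a hypothetical, simplified version of what AudioOutput::mix does; the `Speaker` callback stands in for AudioOutputSpeech::needSamples, and names here are illustrative, not Mumble's API). Each speaker fills a scratch buffer with however many samples the device asked for, the buffers are summed per sample, and the sum is clamped to avoid clipping wrap-around:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// A "speaker" fills the given buffer with n decoded samples, pulling from
// its own jitter buffer (and running PLC on misses) as needed.
using Speaker = std::function<void(std::vector<float>&, std::size_t)>;

// Pull-based mix: the output device asks for nsamples; we ask every active
// speaker for that many samples and sum them per sample.
std::vector<float> mix(const std::vector<Speaker>& speakers, std::size_t nsamples) {
    std::vector<float> out(nsamples, 0.0f);
    std::vector<float> scratch(nsamples, 0.0f);
    for (const Speaker& s : speakers) {
        std::fill(scratch.begin(), scratch.end(), 0.0f);
        s(scratch, nsamples);            // speaker decodes/fills its samples
        for (std::size_t i = 0; i < nsamples; ++i)
            out[i] += scratch[i];        // plain per-sample sum
    }
    for (float& v : out)                 // clamp the sum into [-1, 1]
        v = std::clamp(v, -1.0f, 1.0f);
    return out;
}
```

Note that because each speaker is asked for exactly `nsamples`, the differing packet durations (10/20/60 ms) never show up at this layer; they are absorbed entirely by the per-speaker decode step.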

TL;DR: Thanks to the dynamically sized jitter buffer, we don't really care how long the packets are. We assume the client streams enough of them, and we decode them when we need to mix them. No absolute alignment is attempted.