One thing became very clear to me during my trip to the Linux Audio Conference 2010
in Utrecht: even many pro audio folks are not sure what Jack does that PulseAudio doesn't do and what
PulseAudio does that Jack doesn't do; why they are not competing, why
you cannot replace one by the other, and why merging them (at least in
the short term) might not make immediate sense. In other words, why
millions of phones on this world run PulseAudio and not Jack, and why
a music studio running PulseAudio is crack.

To light this up a bit and for future reference I'll try to explain in the
following text why there is this seperation between the two systems and why this isn't
necessarily bad. This is mostly a written up version of (parts of) my slides
from LAC, so if you attended that event you might find little new, but I hope
it is interesting nonetheless.

This is mostly written from my perspective as a hacker working on
consumer audio stuff (more specifically having written most of
PulseAudio), but I am sure most pro audio folks would agree with the
points I raise here, and have more things to add. What I explain below
is in no way comprehensive, just a list of a couple of points I think
are the most important, as they touch the very core of both
systems (and we ignore all the toppings here, i.e. sound effects, yadda, yadda).

First of all let's clear up the background of the sound server use cases here:

Consumer Audio (i.e. PulseAudio)

Pro Audio (i.e. Jack)

Reducing power usage is a defining requirement, most systems are battery powered (Laptops, Cell Phones).

Power usage usually not an issue, power comes out of the wall.

Must support latencies low enough for telephony and
games. Also covers high latency uses, such as movie and music playback
(2s of latency is a good choice).

Minimal latencies are a
definining requirement.

System is highly dynamic, with applications starting/stopping, hardware added and removed all the time.

System is usually static in its configuration during operation.

User is usually not proficient in the technologies used.[1]

User is usually a professional and knows audio technology and computers well.

User is not necessarily the administrator of his machine, might have limited access.

User usually administrates his own machines, has root privileges.

Audio is just one use of the system among many, and often just a background job.

Audio is the primary purpose of the system.

Hardware tends to have limited resources and be crappy and cheap.

Hardware is powerful, expensive and high quality.

Of course, things are often not as black and white like this, there are uses
that fall in the middle of these two areas.

From the table above a few conclusions may be drawn:

A consumer sound system must support both low and high latency operation.
Since low latencies mean high CPU load and hence high power
consumption[2] (Heisenberg...), a system should always run with the
highest latency latency possible, but the lowest latency necessary.

Since the consumer system is highly dynamic in its use latencies must be
adjusted dynamically too. That makes a design such as PulseAudio's timer-based scheduling important.

A pro audio system's primary optimization target is low latency. Low
power usage, dynamic changeble configuration (i.e. a short drop-out while you
change your pipeline is acceptable) and user-friendliness may be sacrificed for
that.

For large buffer sizes a zero-copy design suggests itself: since data
blocks are large the cache pressure can be considerably reduced by zero-copy
designs. Only for large buffers the cost of passing pointers around is
considerable smaller than the cost of passing around the data itself (or the
other way round: if your audio data has the same size as your pointers, then
passing pointers around is useless extra work).

On a resource constrained system the ideal audio pipeline does not touch
and convert the data passed along it unnecessarily. That makes it important to
support natively the sample types and interleaving modes of the audio source or
destination.

A consumer system needs to simplify the view on the hardware, hide the its
complexity: hide redundant mixer elements, or merge them while making use of
the hardware capabilities, and extending it in software so that the same
functionality is provided on all hardware. A production system should not hide
or simplify the hardware functionality.

A consumer system should not drop-out when a client misbehaves or the
configuration changes (OTOH if it happens in exceptions it is not disastrous
either). A synchronous pipeline is hence not advisable, clients need to supply
their data asynchronously.

In a pro audio system a drop-out during reconfiguration is acceptable,
during operation unacceptable.

In consumer audio we need to make compromises on resource usage,
which pro audio does not have to commit to. Example: a pro audio
system can issue memlock() with little limitations since the
hardware is powerful (i.e. a lot of RAM available) and audio is the
primary purpose. A consumer audio system cannot do that because that
call practically makes memory unavailable to other applications,
increasing their swap pressure. And since audio is not the primary
purpose of the system and resources limited we hence need to find a
different way.

Jack has been designed for low latencies, where synchronous
operation is advisable, meaning that a misbehaving client call stall
the entire pipeline. Changes of the pipeline or latencies usually
result in drop-outs in one way or the other, since the entire pipeline
is reconfigured, from the hardware to the various clients. Jack only
supports FLOAT32 samples and non-interleaved audio channels (and that
is a good thing). Jack does not employ reference-counted zero-copy
buffers. It does not try to simplify the hardware mixer in any
way.

PulseAudio OTOH can deal with varying latancies, dynamically
adjusting to the lowest latencies any of the connected clients
needs. Client communication is fully asynchronous, a single client
cannot stall the entire pipeline. PulseAudio supports a variety of PCM
formats and channel setups. PulseAudio's design is heavily based on
reference-counted zero-copy buffers that are passed around, even
between processes, instead of the audio data itself. PulseAudio tries
to simplify the hardware mixer as suggested above.

Now, the two paragraphs above hopefully show how Jack is more
suitable for the pro audio use case and PulseAudio more for the
consumer audio use case. One question asks itself though: can we marry
the two approaches? Yes, we probably can, MacOS has a unified approach
for both uses. However, it is not clear this would be a good
idea. First of all, a system with the complexities introduced by
sample format/channel mapping conversion, as well as dynamically
changing latencies and pipelines, and asynchronous behaviour would
certainly be much less attractive to pro audio developers. In fact,
that Jack limits itself to synchronous, FLOAT32-only,
non-interleaved-only audio streams is one of the big features of its
design. Marrying the two approaches would corrupt that. A merged
solution would probably not have a good stand in the community.

But it goes even further than this: what would the use case for
this be? After all, most of the time, you don't want your event
sounds, your Youtube, your VoIP and your Rhythmbox mixed into the new
record you are producing. Hence a clear seperation between the two
worlds might even be handy?

Also, let's not forget that we lack the manpower to even create
such an audio chimera.

So, where to from here? Well, I think we should put the focus on
cooperation instead of amalgamation: teach PulseAudio to go out of the
way as soon as Jack needs access to the device, and optionally make
PulseAudio a normal JACK client while both are running. That way, the
user has the option to use the PulseAudio supplied streams, but
normally does not see them in his pipeline. The first part of this has
already been implemented: Jack2 and PulseAudio do not fight for the
audio device, a friendly handover takes place. Jack takes precedence,
PulseAudio takes the back seat. The second part is still missing: you
still have to manually hookup PulseAudio to Jack if you are interested
in its streams. If both are implemented starting Jack basically has
the effect of replacing PulseAudio's core with the Jack core, while
still providing full compatibility with PulseAudio clients.

And that I guess is all I have to say on the entire Jack and
PulseAudio story.

Oh, one more thing, while we are at clearing things up: some news
sites claim that PulseAudio's not necessarily stellar reputation in
some parts of the community comes from Ubuntu and other distributions
having integrated it too early. Well, let me stress here explicitly,
that while they might have made a mistake or two in packaging
PulseAudio and I publicly pointed that out (and probably not in a too
friendly way), I do believe that the point in time they adopted it was
right. Why? Basically, it's a chicken and egg problem. If it is not
used in the distributions it is not tested, and there is no pressure
to get fixed what then turns out to be broken: in PulseAudio itself,
and in both the layers on top and below of it. Don't forget that
pushing a new layer into an existing stack will break a lot of
assumptions that the neighboring layers made. Doing this must
break things. Most Free Software projects could probably use more
developers, and that is particularly true for Audio on Linux. And
given that that is how it is, pushing the feature in at that point in
time was the right thing to do. Or in other words, if the features are
right, and things do work correctly as far as the limited test base
the developers control shows, then one day you need to push into the
distributions, even if this might break setups and software that
previously has not been tested, unless you want to stay stuck in your
development indefinitely. So yes, Ubuntu, I think you did well with
adopting PulseAudio when you did.

Footnotes

[1] Side note: yes, consumers tend not to know what dB is, and expect
volume settings in "percentages", a mostly meaningless unit in
audio. This even spills into projects like VLC or Amarok which expose
linear volume controls (which is a really bad idea).

[2] In case you are wondering why that is the case: if the latency is
low the buffers must be sized smaller. And if the buffers are sized smaller
then the CPU will have to wake up more often to fill them up for the same
playback time. This drives up the CPU load since less actual payload can be
processed for the amount of housekeeping that the CPU has to do during each
buffer iteration. Also, frequent wake-ups make it impossible for the CPU to go
to deeper sleep states. Sleep states are the primary way for modern CPUs
to save power.