Issue 3-36, September 9, 1998

Be Engineering Insights: Changes in the BeOS Driver API

By Cyril Meurillon

Oh no, you think, yet another article about drivers. Are they crazy about
drivers at Be, or what? Ouaire iz ze beauty in driverz? The truth is that
I would have loved to write about another (hotter) topic, one that has
kept me very busy for the past few months, but my boss said I couldn't
(flame him at cyrilsboss@be.com ;-). I guess I'll have to wait until it
becomes public information. In the meantime, please be a good audience,
and continue reading my article.

Before I get on with the meat of the subject, I'd like to stress that the
following information pertains to our next release, BeOS Release 4.
Because R4 is still in the making, most of what you read here is subject
to change in its details, or even in broad strokes. Don't write code
today based on the following. It is provided to you mostly as a hint of
what R4 will contain, and where we're going after that.

Introduction of Version Control

That's it. We finally realized that our driver API was not perfect, and
that there was room for future improvements, or "additions." That's why
we'll introduce version control in the driver API for R4. Every driver
built then and thereafter will contain a version number that tells which
API the driver complies with.

In concrete terms, the version number is a driver global variable that's
exported and checked by the device file system at load time. In Drivers.h
you'll find the following declarations:

#define B_CUR_DRIVER_API_VERSION 2
extern _EXPORT int32 api_version;

In your driver code, you'll need to add the following definition:

#include <Drivers.h>
...
int32 api_version = B_CUR_DRIVER_API_VERSION;

Driver API version 2 refers to the new (R4) API. Version 1 is the R3
API. If the driver API changes again, we'll bump the version number to 3.
Newly built drivers will have to comply with the new API and declare 3 as
their API version number. Old driver binaries will still declare an old
version (1 or 2), forcing the device file system to translate their calls
to the newer API (3). This incurs only a negligible overhead when loading
drivers.

But wait, you say. What about pre-R4 drivers, which don't declare which
driver API they comply with? Well, devfs treats drivers without a version
number as complying with the first version of the API—the one documented
today in the Be Book. Et voilà.

New Entries in the device_hooks Structure

I know you're all dying to learn what's new in the R4 driver API... Here
it is, revealed to you exclusively! We'll introduce scatter-gather and (a
real) select in R4, and add a few entries in the device_hooks structure
to let drivers deal with the new calls.

The new readv() and writev() calls let you read and write multiple
buffers to/from a file or a device in a single operation. They initiate
an IO on the device pointed to by fd, starting at position pos, using the
count buffers described in the array vec.

One may think this is equivalent to issuing multiple simple reads and
writes to the same file descriptor—and, from a semantic standpoint, it
is. But not when you look at performance!
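From user space, the calls have the familiar POSIX vectored-IO shape. Here is a minimal sketch using plain POSIX writev() as a stand-in; the exact BeOS prototypes, and whether they take an explicit pos argument, may differ in the final R4 headers:

```cpp
#include <sys/uio.h>   // writev(), struct iovec
#include <unistd.h>
#include <cstring>
#include <cassert>

// Gather two scattered buffers into one IO on fd.
// Returns the number of bytes written, or -1 on error.
ssize_t write_scattered(int fd)
{
    static const char part1[] = "scatter-";
    static const char part2[] = "gather";

    iovec vec[2];
    vec[0].iov_base = const_cast<char *>(part1);
    vec[0].iov_len  = sizeof(part1) - 1;   // don't write the '\0'
    vec[1].iov_base = const_cast<char *>(part2);
    vec[1].iov_len  = sizeof(part2) - 1;

    // One call, two buffers: the kernel (and, on capable hardware,
    // the DMA engine) sees the whole vector at once.
    return writev(fd, vec, 2);
}
```

readv() works the same way in the other direction, scattering a single incoming stream across the buffers in the vector.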

Most devices that use DMA are capable of "scatter-gather." This means
that the DMA controller can be programmed to handle, in one shot, buffers
that are scattered throughout memory. Instead of programming N separate
IOs, each pointing to a single buffer, only one IO needs to be
programmed, with a vector of pointers that describes the scattered
buffers. The result is higher bandwidth.

At a lower level, we've added two entries in the device_hooks structure:
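Since the R4 headers weren't final at press time, the following is only a sketch of what the two entries might look like. The hook names, the signatures, and the stand-in typedefs (int32, status_t, iovec_t) are all assumptions replacing the real kernel types from Drivers.h; the toy readv hook walks the vector over a RAM buffer just to show the shape:

```cpp
#include <sys/types.h>  // off_t
#include <cstddef>
#include <cstring>
#include <cassert>

// Stand-ins for the kernel types normally found in <Drivers.h> and
// <SupportDefs.h>; a real driver would include those headers instead.
typedef int int32;
typedef int32 status_t;
const status_t B_OK = 0;

struct iovec_t { void *iov_base; size_t iov_len; };

// Hypothetical shape of the two new device_hooks entries (the real
// names and signatures may differ in the final R4 headers).
typedef status_t (*device_readv_hook)(void *cookie, off_t position,
                                      const iovec_t *vec, size_t count,
                                      size_t *numBytes);
typedef status_t (*device_writev_hook)(void *cookie, off_t position,
                                       const iovec_t *vec, size_t count,
                                       size_t *numBytes);

// A toy readv hook for a driver whose "device" is a RAM buffer:
// it fills each scattered buffer in turn from the device contents.
static char ram_device[64] = "0123456789abcdef";

static status_t ram_readv(void *cookie, off_t position,
                          const iovec_t *vec, size_t count,
                          size_t *numBytes)
{
    (void)cookie;
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        memcpy(vec[i].iov_base, ram_device + position + total,
               vec[i].iov_len);
        total += vec[i].iov_len;
    }
    *numBytes = total;
    return B_OK;
}
```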

Devices that can take advantage of scatter-gather should implement these
hooks. Other drivers can simply declare them NULL. When a readv() or
writev() call is issued to a driver that does not handle scatter-gather,
the IO is broken down into smaller IOs using the individual buffers. Of
course, R3 drivers don't know about scatter-gather, and are treated
accordingly.

Select

I'm not breaking any news with this one either. Trey announced the
coming of select() in his article last week. This is another call that is
very familiar to UNIX programmers. Its arguments rbits, wbits and ebits
are bit vectors; each bit represents a file descriptor to watch for a
particular event:

rbits: wait for input to be available (read returns something
immediately without blocking)

wbits: wait for output to drain (write of 1 byte does not block)

ebits: wait for exceptions.

select() returns when at least one event has occurred, or when it times
out. Upon exit, select() returns (in the different bit vectors) the file
descriptors that are ready for the corresponding event.

select() is very convenient because it allows a single thread to deal
with multiple streams of data. The current alternative is to spawn one
thread for every file descriptor you want to control. This might be
overkill in certain situations, especially if you deal with a lot of
streams.
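select() has the standard POSIX shape. As a minimal sketch, here is a helper that watches a single descriptor for input (rbits only), with a timeout; the helper name and the single-descriptor simplification are mine, not part of the BeOS API:

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <cassert>

// Wait up to timeout_sec seconds for fd to become readable.
// Returns true if input is available, false on timeout.
bool wait_readable(int fd, int timeout_sec)
{
    fd_set rbits;
    FD_ZERO(&rbits);
    FD_SET(fd, &rbits);          // watch this descriptor for input

    timeval timeout;
    timeout.tv_sec  = timeout_sec;
    timeout.tv_usec = 0;

    // select() blocks until a watched descriptor is ready or the
    // timeout expires; on return, rbits holds the ready descriptors.
    int n = select(fd + 1, &rbits, NULL, NULL, &timeout);
    return n > 0 && FD_ISSET(fd, &rbits);
}
```

A real event loop would of course fill all three bit vectors with every descriptor it cares about and dispatch on whichever come back set.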

select() is broken down into two calls at the driver API level: one hook
to ask the driver to start watching a given file descriptor, and another
hook to stop watching.
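Here is a sketch of how a driver might fit these two hooks, with stand-ins for the kernel types and for notify_select_event() itself. The hook signatures, the B_SELECT_READ value, and the "toy" device are all assumptions, not the final R4 declarations:

```cpp
#include <cstddef>
#include <cassert>

// Stand-ins for kernel types a real driver would get from <Drivers.h>.
typedef int int32;
typedef unsigned int uint32;
typedef unsigned char uint8;
typedef int32 status_t;
const status_t B_OK = 0;
const uint8 B_SELECT_READ = 1;   // assumed event value

struct selectsync;   // opaque token handed to the driver by devfs

// devfs would provide the real notify_select_event(); this stub just
// records that the notification was delivered.
static bool event_delivered = false;
void notify_select_event(selectsync *sync, uint32 ref)
{
    (void)sync; (void)ref;
    event_delivered = true;
}

// A toy driver that always has input pending: its select hook fires
// the notification immediately, as the article suggests doing when
// the condition is already met.
struct ToyDevice {
    bool input_pending;
    selectsync *sync;
    uint32 ref;
};

status_t toy_select(void *cookie, uint8 event, uint32 ref, selectsync *sync)
{
    ToyDevice *dev = static_cast<ToyDevice *>(cookie);
    dev->sync = sync;            // remember where to notify later
    dev->ref  = ref;
    if (event == B_SELECT_READ && dev->input_pending)
        notify_select_event(sync, ref);   // condition already met
    return B_OK;
}

status_t toy_deselect(void *cookie, uint8 event, selectsync *sync)
{
    // After this returns, the driver must never call
    // notify_select_event() with this sync again.
    ToyDevice *dev = static_cast<ToyDevice *>(cookie);
    (void)event; (void)sync;
    dev->sync = NULL;
    return B_OK;
}
```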

cookie represents the file descriptor to watch. event tells what kind of
event we're waiting on for that file descriptor. If the event happens
before the deselect hook is invoked, then the driver has to call:

extern void notify_select_event(selectsync *sync, uint32 ref);

with the sync and ref it was passed in the select hook. This typically
happens at interrupt time, when input buffers fill or when output buffers
drain. Another place where notify_select_event() is likely to be called
is in your select hook itself, in case the condition is already met
there.

The deselect hook is called to indicate that the file descriptor
shouldn't be watched any more, as the result of one or more events on a
watched file descriptor, or of a timeout. It is a serious mistake to call
notify_select_event() after your deselect hook has been invoked.

Drivers that don't implement select() should declare these hooks NULL.
select(), when invoked on such drivers, will return an error.

Bus Managers

In R3, the PCI-related calls were part of the core kernel API. Now,
they're encapsulated in the PCI bus manager. The same happened to the
ISA, SCSI, and IDE bus-related calls, and more busses will come. This
makes the kernel a lot more modular and lightweight, as only the code
handling the busses actually present is loaded in memory.

A New Organization for the Drivers Directory

In R3, /boot/beos/system/add-ons/kernel/drivers/ and
/boot/home/config/add-ons/kernel/drivers/ contained the drivers. This
flat organization worked fine, but it scaled poorly as drivers were
added to the system, because there is no direct relation between the name
of a device you open and the name of the driver that serves it.
Potentially, every driver has to be searched when an unknown device is
opened.

That's why we've broken down these directories into subdirectories that
help the device file system locate drivers when new devices are opened.

../add-ons/kernel/drivers/dev/ mirrors the devfs name space, using
symlinks and directories

../add-ons/kernel/drivers/bin/ contains the driver binaries

For example, the serial driver publishes the following devices:

ports/serial1
ports/serial2

It lives under ../add-ons/kernel/drivers/bin/ as serial, and has the
following symbolic link set up:

../add-ons/kernel/drivers/dev/ports/serial -> ../../bin/serial

If "fred", a driver, wishes to publish a ports/XYZ device, then it
should set up this symbolic link:

../add-ons/kernel/drivers/dev/ports/fred -> ../../bin/fred

If a driver publishes devices in more than one directory, then it must
set up a symbolic link in every directory it publishes in: one link per
directory, all pointing to the same driver binary under
../add-ons/kernel/drivers/bin/.

This new organization speeds up device name resolution a lot. Imagine
that we're trying to find the driver that serves the device
/dev/fred/bar/machin.
In R3, we have to ask all the drivers known to
the system, one at a time, until we find the right one. In R4, we only
have to ask the drivers pointed to by the links in
../add-ons/kernel/drivers/dev/fred/bar/.

Future Directions

You see that the driver world has undergone many changes in BeOS Release
4. All this is nice, but there are other features that did not make it
in, which we'd like to implement in future releases. Perhaps the most
important one is asynchronous IO. The asynchronous read() and write()
calls don't block—they return immediately instead of waiting for the
IO to complete. Like select(), asynchronous IO makes it possible for a
single thread to handle several IOs simultaneously, which is sometimes a
better option than spawning one thread for each IO you want to do
concurrently. This is true especially if there are a lot of them.

Thanks to driver API versioning, we'll have no problem adding the
necessary hooks to the device_hooks structure while remaining backward
compatible with existing drivers.

Be Engineering Insights: Higher-Performance Display

By Jean-Baptiste Quéru

When you write an application, the Interface Kit (and the Application
Server, which runs underneath the Kit) is responsible for all the display
that finally goes on screen. Together they provide a nice, reasonably
fast way to develop a good GUI for your application.

Sometimes, however, they aren't fast enough, especially for game writing.
Using a windowed-mode BDirectWindow sometimes helps (or doesn't slow
things down, in any case), but you still have to cooperate with other
applications whose windows can suddenly overlap yours or want to use the
graphics accelerator exactly when you need it. Switching to a full-screen
BDirectWindow improves things a little more, but you may still want even
higher performance. What you need is a BWindowScreen.

The BWindowScreen basically allows you to establish an (almost) direct
connection to the graphics driver, bypassing (almost) the whole
Application Server. Its great advantage over BDirectWindow is that it
allows you to manipulate all the memory from the graphics card, instead
of just having a simple frame buffer. Welcome to the world of double- (or
triple-) buffering, of high-speed blitting, of 60+ fps performance.

Looks quite exciting, hey? Unfortunately, all is not perfect.
BWindowScreen is a low-level API. This means that you'll have to do many
things by hand that you were used to having the Application Server do for
you. BWindowScreen is also affected by some hardware and software bugs,
which can make things harder than they should be.

BWindowScreen reflects the R3 graphics architecture. That architecture is
going away in R4, since it was becoming dated. The architecture that
replaces it will allow some really cool things in later releases.
BWindowScreen is still the best way to get high-performance full screen
display in R4, though it too will be replaced by something even better in
a later release.

There are some traps to be aware of before you begin playing with the
BWindowScreen:

About BWindowScreen(), SetSpace() and SetFrameBuffer():

The constructor does not completely initialize the BWindowScreen
internal data.

You should call Show(), SetSpace() and SetFrameBuffer() *in that order*
if you want the structures returned by CardInfo() and FrameBufferInfo()
to be valid.

You should call Show() just after constructing the BWindowScreen object,
and call SetSpace() and SetFrameBuffer() in ScreenConnected() *each time*
your BWindowScreen is connected (not just the first time).

You should neither call SetSpace() without SetFrameBuffer() nor call
SetFrameBuffer() without SetSpace(). Always call SetSpace() *then*
SetFrameBuffer() for the best results.
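The call order above can be sketched like this, using a stand-in stub in place of the real Game Kit BWindowScreen. The argument lists are omitted, and the stub fakes the connection notification that the app_server would normally deliver; only the ordering is the point:

```cpp
#include <string>
#include <vector>
#include <cassert>

// Stand-in stub: the real BWindowScreen lives in the BeOS Game Kit.
// Each method just records its name so the call order can be checked.
struct StubWindowScreen {
    std::vector<std::string> calls;
    virtual ~StubWindowScreen() {}
    void Show()           { calls.push_back("Show"); Connected(); }
    void SetSpace()       { calls.push_back("SetSpace"); }
    void SetFrameBuffer() { calls.push_back("SetFrameBuffer"); }
    // In the real API, the system calls ScreenConnected(true) every
    // time the window gains the screen; the stub fakes one connection
    // as a direct result of Show().
    void Connected()      { ScreenConnected(true); }
    virtual void ScreenConnected(bool connected) { (void)connected; }
};

// The recommended pattern: Show() right after construction, then
// SetSpace() followed by SetFrameBuffer() on *every* connection.
struct GameScreen : StubWindowScreen {
    void ScreenConnected(bool connected) override {
        if (connected) {
            SetSpace();          // always in this order...
            SetFrameBuffer();    // ...and always both of them
        }
    }
};
```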

Choosing a good color_space and a good framebuffer size:

You should be aware that in R3.x some drivers do not support 16 bpp,
and some others do not support 32 bpp. You should also know that some
graphics cards do not allow you to choose an arbitrary framebuffer size;
some will not accept a framebuffer wider than 1600 or 2048, or taller
than 2048, and some will only be able to use a small set of widths.

I recommend not using a framebuffer wider than the display area (except
for temporary development reasons or if you don't care about
compatibility issues). It's also a good idea not to use the full graphics
card memory but to leave 1kB to 4kB unused (for the hardware cursor).

Here are some height limits you should not break if you want your
program to be compatible with the mentioned cards:

Although the Be Book says that MoveDisplayArea() can be used for hardware
scrolling, you shouldn't try to use it that way. Some graphics cards are
known to not implement hardware scrolling properly. You should try to use
MoveDisplayArea() only with x=0, and only for page-flipping (not for real
hardware scrolling).

CardHookAt(10) ("sync"):

The graphics card hooks are one of the keys to high performance, and
they must be treated with special attention. If there is a sync function
(hook number 10), all the other hooks can be asynchronous. Be careful to
call the sync hook when it's needed (e.g., to synchronize hardware
acceleration with direct framebuffer access, or to finish all hardware
acceleration before page-flipping or before being disconnected from the
screen).

ScreenConnected() and multiple monitors:

While R3 does not support any form of multiple monitors, future releases
will. You should keep in mind that a BWindowScreen might be disconnected
from one screen and reconnected to another one. Consequently, you must
refresh the card hooks each time your BWindowScreen is connected, as well
as any variable that could be affected by a change in CardInfo() or
FrameBufferInfo().

MoveDisplayArea() and the R3 Matrox driver:

In R3.x, MoveDisplayArea() returns immediately but the display area is
not effective until the next vertical retrace, except for the Matrox
driver. The default Matrox driver actually waits until the next vertical
retrace before returning (and sometimes misses a retrace and has to wait
until the next one). There is an alternate Matrox driver at
ftp://ftp.be.com/pub/beos_updates/r3/intel/Matrox.zip which returns
immediately, but the display area is effective immediately as well. Seen
from the program, this driver has the same behaviour as all other
drivers, at the cost of a little tearing. It's advisable to use that
driver when developing BWindowScreen applications under R3. (All drivers
will have the same behaviour in R4.)

About 15/16bpp:

We have discovered the bugs in the R3 drivers that affected 15/16bpp
BWindowScreens with ViRGE and Matrox cards. There are some updated drivers
available at:
ftp://ftp.be.com/pub/beos_updates/r3/intel/Matrox.zip and
ftp://ftp.be.com/pub/beos_updates/r3/intel/virge.zip

Also be aware that some drivers do not support both 15bpp and 16bpp. Even
worse, the old Matrox driver would use a 15bpp screen when asked for
16bpp. Update your drivers!

Developers Workshop: Yet Another Locking Article

By Stephen Beaulieu

It is funny, but somewhat fitting that many times the Newsletter article
you intend to write is not really the Newsletter article you end up
writing. With the best of intentions, I chose to follow a recent trend in
articles and talk about multithreaded programming and locking down
critical sections of code and resources. The vehicle for my discussion
was to be a Multiple-Reader Single-Writer locking class in the mold of
BLocker, complete with Lock(),
Unlock(), IsLocked() and an
Autolocker-style utility class. Needless to say, the class I was
expecting is a far cry from what I will present today.

In the hopes of this being my first short Newsletter article, I will
leave the details of the class to the sample code. For once it was
carefully prepared ahead of time and is reasonably commented. I will
briefly point out two neat features of the class before heading into a
short discussion of locking styles. The first function to look at is the
IsWriteLocked() function, as it shows a way to cache the index of a
thread's stack in memory, and use it to help identify a thread faster
than the usual method, find_thread(NULL).

The stack_base method is not infallible, and needs to be backed up by
find_thread(NULL) when there is no match, but it is considerably faster
when a match is found. This is kind of like the benaphore technique of
speeding up semaphores.
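For readers who haven't met it, the benaphore trick alluded to here looks roughly like this. A POSIX semaphore stands in for Be's create_sem()/acquire_sem()/release_sem(), and the class name is mine; the idea is that an atomic counter keeps the uncontended case entirely out of the kernel:

```cpp
#include <atomic>
#include <semaphore.h>
#include <thread>
#include <cassert>

// A benaphore: most Lock()/Unlock() pairs touch only the atomic
// counter; the (slow) kernel semaphore is used only under contention.
class Benaphore {
public:
    Benaphore()  { fCount.store(0); sem_init(&fSem, 0, 0); }
    ~Benaphore() { sem_destroy(&fSem); }

    void Lock() {
        // First owner: fetch_add returns 0, no kernel call needed.
        if (fCount.fetch_add(1) > 0)
            sem_wait(&fSem);         // contended: block in the kernel
    }
    void Unlock() {
        // If someone else is waiting, wake them through the kernel.
        if (fCount.fetch_sub(1) > 1)
            sem_post(&fSem);
    }
private:
    std::atomic<int> fCount;
    sem_t fSem;
};

// Two threads hammering a shared counter under the benaphore.
long contended_count()
{
    Benaphore lock;
    long total = 0;
    auto worker = [&] {
        for (int i = 0; i < 100000; i++) {
            lock.Lock();
            ++total;
            lock.Unlock();
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return total;
}
```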

The other functions to look at are the register_thread() and
unregister_thread() functions. These are debug functions that keep state
about threads holding a read lock by creating a state array with room
for every possible thread. Each thread gets its own slot, found by
computing thread_id % max_possible_threads. Again, the code itself lists
these in good detail.
I hope you find the class useful. A few of the design decisions I made
are detailed in the discussion below.
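The slot scheme can be sketched as follows. The class name, the array size, and the exact bookkeeping here are illustrative guesses, not the code from the MultiLock archive:

```cpp
#include <cstddef>
#include <cassert>

// Debug-only bookkeeping in the spirit of register_thread() /
// unregister_thread(): one slot per possible thread, indexed by
// thread_id % max_possible_threads. The constant is an assumption;
// the real class would size the array from the system's thread limit.
const size_t kMaxPossibleThreads = 4096;

class ReadLockRegistry {
public:
    ReadLockRegistry() {
        for (size_t i = 0; i < kMaxPossibleThreads; i++)
            fSlots[i] = 0;
    }
    void RegisterThread(int thread_id) {
        fSlots[Slot(thread_id)]++;   // this thread now holds a read lock
    }
    void UnregisterThread(int thread_id) {
        fSlots[Slot(thread_id)]--;
    }
    bool IsReading(int thread_id) const {
        return fSlots[Slot(thread_id)] > 0;
    }
private:
    static size_t Slot(int thread_id) {
        // The modulo maps any thread id into the fixed-size array.
        return static_cast<size_t>(thread_id) % kMaxPossibleThreads;
    }
    int fSlots[kMaxPossibleThreads];
};
```

The price of the modulo is that two ids a multiple of the array size apart share a slot, which is acceptable for a debug aid.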

I want to take a little space to discuss locking philosophies and their
trade-offs. The two opposing views can be presented briefly as "Lock
Early, Lock Often" and "Lock Only When and Where Necessary." These
philosophies sit on opposite ends of the spectrum of ease of use and
efficiency, and both have their adherents in the company (understanding
that most engineers here fall comfortably in the middle ground).

The "Lock Early, Lock Often" view rests on the idea that if you are
uncertain exactly where you need to lock, it is better to be extra sure
that you lock your resources. It advises that all locking classes should
support "nested" calls to Lock(); in other words if a thread holds a lock
and calls Lock() again, it should be allowed to continue without
deadlocking waiting for itself to release the lock. This increases the
safety of the lock, by allowing you to wrap all of your functions in
Lock() / Unlock() pairs
and allowing the lock to take care of knowing if
the lock needs to be acquired or not. An extension of this is the
autolocking class, which acquires a lock in its constructor and releases
it in its destructor. By allocating one of these on the stack you can be
certain that you will safely hold the lock for the duration of your
function.
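A minimal autolocker in the C++ RAII style can be sketched like this, with std::recursive_mutex standing in for a nestable, BLocker-style lock (the class and function names are illustrative):

```cpp
#include <mutex>
#include <cassert>

// The autolocking idea: acquire in the constructor, release in the
// destructor, so a stack-allocated instance holds the lock for exactly
// the duration of the enclosing scope.
class Autolock {
public:
    explicit Autolock(std::recursive_mutex &lock) : fLock(lock) {
        fLock.lock();
    }
    ~Autolock() {
        fLock.unlock();
    }
    // Non-copyable: exactly one unlock per lock.
    Autolock(const Autolock &) = delete;
    Autolock &operator=(const Autolock &) = delete;
private:
    std::recursive_mutex &fLock;
};

static std::recursive_mutex gLock;
static int gValue = 0;

// Because the lock is recursive, Increment() can be called both on its
// own and from other functions that already hold the lock.
void Increment()
{
    Autolock lock(gLock);
    gValue++;
}

void IncrementTwice()
{
    Autolock lock(gLock);   // nested: the same thread locks again
    Increment();            // without deadlocking
    Increment();
}
```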

The main advantage of the "Lock Early, Lock Often" strategy is its
simplicity. It is very easy to add locking to your applications: create
an Autolock at the top of all your functions and be assured that it will
do its magic. The downside of this philosophy is that the lock itself
needs to get smarter and to hold onto state information, which can cause
some inefficiencies in space and speed.

At the other end of the spectrum is "Lock Only When and Where
Necessary." This philosophy asserts that programmers using the "Lock
Early, Lock Often" strategy do not understand the locking requirements of
their applications, and that this is essentially a bug just waiting to
happen. In addition, the overhead added to applications by locking when
it is unnecessary (say, in a function that is only called from within
another function that already holds the lock) and by using an additional
class to manage the lock makes the application larger and less efficient.
This
view instead requires programmers to really design their applications and
to fully understand the implications of the locking mechanisms chosen.

So, which is correct? I think it often depends on the tradeoffs you are
willing to make. With single-owner locks, the state information needed is
very small, and the lock's system for determining whether a thread holds
the lock is usually fairly efficient (see the stack_base trick mentioned
above for making it a bit faster). Another
consideration is how important speed and size are when dealing with the
lock. In a very crucial area of an important, busy program, like the
app_server, increasing efficiency can be paramount. In that case it is
much, much better to take the extra time to really understand the locking
necessary and to reduce the overhead. Even better would be to design a
global application architecture that makes the flow of information clear,
and correspondingly makes the locking mechanisms much better (along with
everything else.)

The MultiLocker sample code provided leans far to the efficiency side.
The class itself allows multiple readers to acquire the lock, but does
not allow those readers to make nested ReadLock() calls. The overhead of
keeping state for each reader (storage space, and stomping through that
storage space every time a ReadLock() or ReadUnlock() call is made) was
simply too great. Writers, on the other hand, have complete control over
the lock, and may make ReadLock() or
additional WriteLock() calls after
the lock has been acquired. This allows a little bit of design
flexibility so that functions that read information protected by the lock
can be safely called by a writer without code duplication.
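The reader/writer semantics described here map fairly closely onto a modern std::shared_mutex, as in the sketch below. This is not the MultiLocker from the archive; in particular, unlike the real class, a thread holding the write lock here must not also call ReadLock():

```cpp
#include <shared_mutex>
#include <cassert>

// A MultiLocker-flavoured wrapper over std::shared_mutex: many readers
// may hold the lock at once, a writer gets it exclusively. Like the
// class in the article, nested ReadLock() calls are NOT supported.
class MultiLocker {
public:
    bool ReadLock()    { fLock.lock_shared();   return true; }
    bool ReadUnlock()  { fLock.unlock_shared(); return true; }
    bool WriteLock()   { fLock.lock();          return true; }
    bool WriteUnlock() { fLock.unlock();        return true; }
private:
    std::shared_mutex fLock;
};

// The usage pattern from the timing tests: the writer increments a
// counter, readers simply look at it.
MultiLocker gCounterLock;
int gCounter = 0;

int ReadCounter()
{
    gCounterLock.ReadLock();
    int value = gCounter;
    gCounterLock.ReadUnlock();
    return value;
}

void IncrementCounter()
{
    gCounterLock.WriteLock();
    gCounter++;
    gCounterLock.WriteUnlock();
}
```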

The class does have a debug mode where state information is kept about
readers so you can be sure that you are not performing nested
ReadLock()s. The class also has timing functions so that you can see how
long each call takes in DEBUG mode and, with slight modifications to the
class, the benefits of the stack-based caching noted above. I have
included some extensive timing information from my computers that you can
look at, or you can run your own tests with the test app included. Note
that the numbers listed are pretty close to the raw numbers of the
locking overhead, as writers only increment a counter, and readers simply
access that counter.

The sample code can be found at:

ftp://ftp.be.com/pub/samples/support_kit/MultiLock.zip

The class should be pretty efficient, and you are free to use it and make
adjustments as necessary. My thanks go out to Pierre and George from the
app_server team, for the original lock on which this is based, and for
their assistance with (and insistence on) the efficiency concerns.

Is the A/V Space a Niche Market?

By Jean-Louis Gassée

And, if it is, are we wrong to focus on it? Can we pace off enough
running room to launch the virtuous ascending spiral of developers
begetting users begetting developers? Is the A/V space large enough to
swing a cat and ignite a platform?

Perhaps there's another way to look at the platform question, one that's
brought to mind by the latest turn of Apple's fortunes. Back in 1985,
Apple had a bad episode: The founders were gone, the new Mac wasn't
taking off and the establishment was dissing Apple as a toy company with
a toy computer. The advice kept pouring in: reposition the company,
refocus, go back to your roots, find a niche where you have a distinctive
advantage. One seer wanted to position Apple as a supplier of
Graphics-Based Business Systems, another wanted to make the company the
Education Computer Company. Steve Jobs, before taking a twelve year
sabbatical, convinced Apple to buy 20% of Adobe, and thus began the era
of desktop publishing and the Gang of Four (Apple, Adobe, Aldus and
Canon).

Apple focused on publishing, and is still focused on publishing (as
evidenced by the other Steve—Ballmer—ardently promoting NT as *the*
publishing platform). Does that make Apple a publishing niche player? Not
really. iMac buyers are not snapping up the "beetle" Mac for publishing,
they just want a nice general-purpose computer. Although Apple is still
thrown into the publishing bin, the Mac has always strived to be an
everyday personal computer, and the numbers show that this isn't mere
delusion: For example, Macs outsell Photoshop ten to one. But let's
assume that at the company's zenith, publishing made up as much as 25% of
Apple sales. Even then, with higher margin CPUs, Apple couldn't live on
publishing alone, hence the importance of a more consumer-oriented
product such as the iMac and hence, not so incidentally, the importance
of keeping Microsoft Office on the platform.

The question of the viability of an A/V strategy stems from us being
thrown into the same sort of bin as our noble hardware predecessor. But
at Be we have an entirely different business model. A hardware company
such as Apple can't survive on a million units per year. Once upon a time
it could, but those were the salad days of expensive computers and 66%
gross margins. We, on the other hand, have a software-only business model
and will do extremely well with less than a million units per year—and
so will our developers. As a result, the virtuous spiral will ignite
(grab a cat).

More important—and here we share Apple's "niche-yet-general"
duality—the question may be one that never needs to be answered: While BeOS
shows its unique abilities in A/V, we're also starting to see
applications for everyday personal computing. I'm writing this column on
Gobe Productive and e-mailing it to the prose-thetic surgeon using
Mail-It, both purchased using NetPositive and SoftwareValet.