[haiku-appserver] Re: drawing thread

From: "Rudolf" <drivers.be-hold@xxxxxxxxxxxx>

To: haiku-appserver@xxxxxxxxxxxxx

Date: Thu, 21 Oct 2004 10:44:18 +0200 CEST

Hi Adi,
> We use ViewDriver because our instruction clipping code (when in an
> update you are allowed to draw only in a specified region) is not in
> place and we use BView::ConstrainClippingRegion(&reg) for that.
Nice! I didn't even know that such an option existed.. I guess that's
because I've never really written an app yet :-/
> When Gabe finishes the clipping code (read: DisplayDriver, it's done)
> we'll
> be able to use the full acceleration an accelerant can give us.
So, you can use DD and AccelerantDriver. Does AccelerantDriver already
have the engine stuff in now? Or will it do software execution only at
the beginning? (just curious)
> > -->I am assuming here that these buffers reside on the
> > Graphicscard.
> In R1, no.
OK.
> > The sentence also implies that both the source and destination of
> > the
> > drawing actions are residing on the graphicscards RAM, or acc would
> > not
> > be possible (ATM at least).
>
> HW cursor is possible. The rest, no. It's no problem, we'll use
> MMX,
> SSE. That is until you provide us with drivers that can use pixel
> shaders. :-))))
OK, so you are indeed confirming here that you know you won't be using
the acc engine in the driver. You are using MMX and SSE, which in my
book is software drawing, so no acceleration from the card's engine.
It's indeed the best you can do for now, so it sounds OK to me :)
> > OK, here's my 'warning':
> > While you are correct in saying you can draw 'parallel' in those
> > regions inside the buffer, there may be important performance
> > penalties
> > if you don't serialize the access after all. The 'burst mode' of
> > writing across the PCI/AGP bus depends on serialized access (at
> > least
> > beyond those say 32bytes blocks, within these blocks it doesn't
> > matter). Burst mode (fully automatically generated by the system's
> > hardware) works in PCI mode, and in AGP mode. In AGP mode bursts
> > are
> > the ones being accelerated with the 'fastwrites' feature.
>
> How do we serialize access?
Locking, semaphores. Only one thread draws in the graphics RAM at a
time. This ensures you have the best chance that its memory is
accessed in a more or less serial way. Of course, in the end it depends
on what every thread is doing. If you are updating some piece of a
background window because it just became foreground (some other window
moved away), you will be doing it serially as well: every line of
screen is being updated (filled) from left to right (or vice versa).
You only make a jump in memory when you reach the end of such an
updated part of a line: the jump will be "bytes_per_row". After that,
serial access happens again.
So this kind of double buffering would be fast (bursts, FW) and sounds
like a good plan (the double-buffering 'source' bitmaps are in main
mem), provided locking is done. But of course you could just benchmark
both options for yourself and see if I am correct ;-)
If you _don't_ use double buffering, then suddenly you can't
'guarantee' that every thread does serialized access optimally: now the
app's behaviour will determine this. Video will of course be
'serialized' in the app, but if some figure is to be drawn on screen (a
transparent box or whatever), then this can't be done in a serial
fashion (because the content of the box is not touched).
> If we're doing a blit(copy line by line) from main mem to on-screen
> video mem, won't the HW generated burst step into action?
Yep. OK, you could call this HW acceleration too. In my book, however,
I would not call it that. I guess I still have to read you app_server
guys' book of definitions yet.. :-/ (or vice versa of course ;-)
I wouldn't even know what to call it per se, but something like 'bus
acceleration' would be a much better name for it here..
Anyway: indeed, burst and FW will work optimally, as stated above 8-)
It's a good idea to have (at least) 32-bit word alignment in place for
the source bitmaps in main mem, BTW.
> > If indeed both the source and destination of the drawing action
> > reside
> > in graphicsRAM, this performance issue only exists if you use non-
> > accelerated drawing actions, because the CPU would need to read the
> > data across the PCI/AGP bus, and then write it back, while the
> > acceleration engine keeps it local on the graphicscard RAM.
>
> I don't think we'll be using videoMem that way, and I'm speaking
> about
> R2 here. R3, who knows, if we can use those pixel shader units... :-)
Good.
But we have to talk about definitions again. What do you mean by pixel
shader units? They have nothing to do with it (I can tell you that
without knowing exactly what they are... :)
The only thing we need to get 2D acc from main mem is me instructing
the engine to fetch the source from main mem instead of from local
graphics mem (on Matrox a single flag! Plus addressing, of course). The
engine instructions themselves are identical (in theory) to what they
are now.
This is a step I am going to investigate at some point, hopefully in
the not (too) distant future. The main problem here, I expect, is
getting cache coherency working OK. There's a second step as well (GART
and aperture), but that only improves speed if you are going to fetch
multiple bitmaps 'simultaneously' (which is one reason why this stuff
is normally only used for 3D acc; the second reason is stability). If
you want to use 2D acc later on, you should consider real AGP
transfers, and so acc itself (by letting the engine fetch from main
mem), as an option. You should take into account that only FW is used,
and in some cases even only standard PCI. (Of course we'll have to see
what PCIe brings us as well..)
If you mean using real 3D for the desktop already (by mentioning pixel
shader units), then of course AGP transfers are much less of an option,
or speed would probably get unworkably low. Although I guess it would
still be a good plan to make it usable without 3D acc and without acc
engine fetches from main mem (software mode via Mesa without 3D acc
drivers). (Quake2 on my laptop already runs at 3.9fps with Mesa 6.1 in
OpenGL mode, compared to those 92fps in internal software rendering
mode I once talked about.)
> After the talks we've had, I thought good and wanted to ask your
> opinion
> on this:
> Do you think it's good to draw(with the CPU) in vidMem off-screen
> surfaces(I mean double buffering in videoMem)? I say no because of
> the
> PCI bus; writing _and_ reading is expensive.
I say no too, because of just this reason.
> I think a better solution
> is to do triple buffering. Have an off-screen surface in main mem
> into which a window will draw. When drawing is done, blit this into
> an _off-screen surface in video memory_, validate a flag that it's OK
> to use that surface, and when a portion of that window is needed we
> use the 2D(/3D) engine to blit on-screen.
> What do you think?
Agreed. Although I did not mention this setup, it did cross my mind :)
Of course, in the end there may be more to consider. Say you are
running 3D apps in a window at the same time: card memory you use for
triple buffering the desktop is no longer available for the 3D app,
lowering its speed (by making it fetch stuff from main mem more
frequently: assuming the bus stays a bottleneck).
BTW, talking about definitions again: I find it interesting to see you
talk about blits when you actually mean copying. In my book the word
'blit' is reserved for acc engine mem copying (so acceleration). I
mean, I don't want 'my book' to necessarily be right, but this _is_ a
potential problem: we could easily misunderstand each other by not
having these terms clearly defined...
======
Hey, I think I still miss something else that's interesting as well. I
understand what you mean by double/triple buffering, and I understand
why you want to do that. But still, I am wondering about another of
BeOS's shortcomings: updating the screen in such a way that you won't
see distortions if you drag a window (for instance). I mean: tearing.
Some people talk about double buffering in this context: having a copy
of the entire screen in card RAM that is switched to during retraces.
This of course requires both buffers to be updated with everything, and
you can talk and think a lot about strategies to get this done with
minimal overhead.
But the goal it serves would be perfect, undistorted screen output at
all times, which would be nice to have as well at some point.. (Just
updating the single screen buffer during retrace is much too expensive,
I guess, as it leaves very little time to update the buffer: the
retrace is a relatively short slice of time compared to the 'full-time
acc' used now.)
Greetings!
Rudolf.