Re: [Dri-devel] Busmastering docs

Dear developers!
I have been debugging XFree86 ATI driver because video data transfer while
doing Xv causes too much load on the CPU. It is a single memcpy that causes X
to eat 35% CPU both on mach64 and r128 (tdfx, mga and nvidia use the same
memcpy but probably can eat the data faster and don't choke because they don't
have this problem). After talking to XFree86, gatos and kernel developers about
how to fix it, they suggested using busmastering instead of memcpy. I already
tried mmx-optimized-memcpy-in-assembly and it didn't help.
I searched for docs but can't find anything. Can someone give me a
crash-course to using busmastering? Like
----------------
#include "fsck.h";
init_something();
lock_something();
do_dma_transfer();
unlock_something();
----------------
I'd be very grateful. Oh and please CC, I'm not on the list.
Bye,
Peter Surda (Shurdeek) <shurdeek@...>, ICQ 10236103, +436505122023
--
Failure is not an option. It comes bundled with your Microsoft product.

On Monday 27 August 2001 17:18, Peter Surda wrote:
> I have been debugging XFree86 ATI driver because video data transfer while
> doing Xv causes too much load on the CPU. It is a single memcpy that causes
> X to eat 35% CPU both on mach64 and r128 (tdfx, mga and nvidia use the same
> memcpy but probably can eat the data faster and don't choke because they
> don't have this problem). After talking to XFree86, gatos and kernel
> developers about how to fix it, they suggested using busmastering instead
> of memcpy. I already tried mmx-optimized-memcpy-in-assembly and it didn't
> help.
That would require documentation from ATI- and it differs from chipset to
chipset, so you'd need different code for RagePRO and Rage128(PRO) chips.
It's not something in the kernel. Bus-mastering operation is where you tell
a peripheral device to write directly to a given space in memory, leaving the
CPU to do other tasks. As for being able to eat the data faster, I doubt it-
they're likely doing a bus-master operation on the other adapters because the
memcpy operation is a cycle hog that doesn't change much because it's limited
to the bus bandwidth of your CPU.
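To make the distinction concrete, here's a rough sketch in plain C, nothing
ATI-specific. Everything in it is made up for illustration: `struct dma_desc`,
`submit_copy()` and `fake_engine_run()` are hypothetical names, and the
"engine" is just a function standing in for hardware that would run on its own
in a real system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: everything the device needs to do the copy itself. */
struct dma_desc {
    const unsigned char *src;  /* source in system memory */
    unsigned char *dst;        /* destination, e.g. the frame buffer */
    size_t len;                /* bytes to move */
    int done;                  /* set by the "device" when it has finished */
};

/* CPU side: filling in the descriptor is constant work, whatever len is. */
static void submit_copy(struct dma_desc *d, const unsigned char *src,
                        unsigned char *dst, size_t len)
{
    assert(d != NULL && src != NULL && dst != NULL);
    d->src = src;
    d->dst = dst;
    d->len = len;
    d->done = 0;
}

/*
 * Stand-in for the hardware engine. On a real card this part would happen
 * without the CPU; here it is a memcpy only so the sketch can run.
 */
static void fake_engine_run(struct dma_desc *d)
{
    memcpy(d->dst, d->src, d->len);
    d->done = 1;
}
```

The point is that the CPU's share ends at submit_copy(); with a plain memcpy
the CPU does the work of fake_engine_run() itself.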
--
Frank Earl

On Mon, Aug 27, 2001 at 08:37:06PM -0400, Frank Earl wrote:
> That would require documentation from ATI- and it differs from chipset to
> chipset, so you'd need different code for RagePRO and Rage128(PRO) chips.
As r128 and radeon both have DRI, BusMastering already IS there for them (I
see when I play q3 that X doesn't eat any cpu time and the amount of data that
has to be transferred is surely comparable to that with Xv). Just Xv functions
are currently implemented in a way that doesn't use busmastering. Besides I have
close contact to at least 2 developers who have access to these docs and they
surely would help me if it was necessary.
And as for mach64, there isn't a usable DRI driver yet, but Utah-GLX DOES
support BusMastering, so theoretically there is sample code for all needed
cards already. No need for extra docs from ATI. Only some programming.
> It's not something in the kernel. Bus-mastering operation is where you tell
> a peripheral device to write directly to a given space in memory, leaving
> the CPU to do other tasks.
I thought it is not only for writing but also for reading, or am I wrong? How
else could textures be transferred to the card then?
> As for being able to eat the data faster, I doubt it-
It isn't about the data being transferred faster and the function returning
sooner, but about the CPU not being blocked in between. A typical Xv client, a
video player, is usually multithreaded, which means that even if the video
drawing thread is blocked, the video decoder thread can profit from the extra
saved CPU time.
Currently watching DVD-sized divx videos on r128 and mach64 simply sucks
because X eats between 35% and 60% CPU and video decoder starves and frames
have to be dropped.
> they're likely doing a bus-master operation on the other adapters because
> the memcpy operation is a cycle hog that doesn't change much because it's
> limited to the bus bandwidth of your CPU.
Look at the {drivername}CopyData422 functions in the source code. It is
EXACTLY THE SAME CODE for all 4 drivers I checked (mach64, r128, mga and
tdfx). Except mga and tdfx seem to be able to handle faster transfers so the
problem isn't that visible there.
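Roughly, and assuming I'm reading it right, the function has this shape. This
is a sketch rather than a paste from the tree, so the name `copy_data_422` and
the exact parameter list are my own; the assumption is packed 4:2:2 data at 2
bytes per pixel, copied one scanline at a time.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy h scanlines of w pixels from src to dst. The pitches are the
 * distances in bytes between the starts of consecutive lines.
 */
static void copy_data_422(const unsigned char *src, unsigned char *dst,
                          int src_pitch, int dst_pitch, int h, int w)
{
    int line_bytes = w * 2;   /* packed 4:2:2 is 2 bytes per pixel */

    assert(line_bytes <= src_pitch && line_bytes <= dst_pitch);
    while (h-- > 0) {
        /* The CPU is busy for every byte of every line. */
        memcpy(dst, src, (size_t)line_bytes);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

Nothing in there depends on the card, which is why the only difference between
drivers is how fast the destination accepts the writes.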
I just need (as I said) a simple kick in the right direction, like "you have
to include dma.h and use the do_the_fscking_transfer() to actually transfer
the data".
I think that adding a busmastered version of Xv drawing for systems that
support it is "the right thing (TM)". Why not use all the advantages the
hardware offers?
> Frank Earl
Bye,
Peter Surda (Shurdeek) <shurdeek@...>, ICQ 10236103, +436505122023
--
The dark ages were caused by the Y1K problem.

On Monday 27 August 2001 22:26, Peter Surda wrote:
> As r128 and radeon both have DRI, BusMastering already IS there for them (I
> see when I play q3 that X doesn't eat any cpu time and the amount of data
> that has to be transferred is surely comparable to that with Xv). Just Xv
> functions are currently implemented in a way that doesn't use busmastering.
> Besides I have close contact to at least 2 developers who have access to
> these docs and they surely would help me if it was necessary.
That's cool... I didn't think that the DRMs for those cards exposed the full
bus-master functionality. I'll check and see what it'd take for you to get
at it if it's there. It'd be nice to have at least the R128 and Radeon being
driven to peak performance by Xv. No promises, most of the bus-master code
I've seen has been geared to handling a rendering pipeline, not raw blits
from system memory.
> And as for mach64, there isn't a usable DRI driver yet, but Utah-GLX DOES
> support BusMastering, so theoretically there is sample code for all
> needed cards already. No need for extra docs from ATI. Only some
> programming.
Funny you should mention that... We're trying to make DRI drivers for the
RagePRO happen. And, yes, you DO need docs from ATI (or wait for us to get a
"complete" DRI driver done for you...). The bus-master operation for GUI
pipeline operations (incl. 3D stuff) is _different_ from the bus-master
operations for blitting. Utah-GLX only has the first one which is of little
use to you.
> I thought it is not only for writing but also for reading, or am I wrong?
> How else could textures be transferred to the card then?
On the RagePRO, it's for moving commands to the engine from system memory,
moving data from system memory to the aperture (which is the frame buffer and
on-card texture store) and for moving data from the aperture to the aperture.
For 3D operations, the engine gets its textures either from somewhere within
the AGP aperture or from off of the card's memory.
> Look at the {drivername}CopyData422 functions in the source code. It is
> EXACTLY THE SAME CODE for all 4 drivers I checked (mach64, r128, mga and
> tdfx). Except mga and tdfx seem to be able to handle faster transfers so
> the problem isn't that visible there.
If it's doing the same memcpy, it's because the memory access (and the memory
itself) is faster on those cards. The CPU is still blocking- it's just not
having to work as hard to get it copied in the number of cycles that it's got
allocated to the effort to push the video to the framebuffer.
> I just need (as I said) a simple kick in the right direction, like "you
> have to include dma.h and use the do_the_fscking_transfer() to actually
> transfer the data".
Since there's no code in _anything_ to give you a kick in the right direction
for the RagePRO, I can't directly help you in that manner for that chip- yet.
> I think that adding a busmastered version of Xv drawing for systems that
> support it is "the right thing (TM)". Why not use all the advantages the
> hardware offers?
Uh, you're preaching to the choir here. Please don't misconstrue my comments
to you- I'm on your side (and would definitely LOVE to help), but my time's
limited so it may take a little while for the DRI support to happen for you.
--
Frank Earl

On Tue, Aug 28, 2001 at 09:21:39AM -0400, Frank Earl wrote:
> That's cool... I didn't think that the DRMs for those cards exposed the full
> bus-master functionality.
Well, as far as I know, busmastering data transfers work. I just can't find
the part of the code that actually does it. Or am I really wrong?
> I'll check and see what it'd take for you to get at it if it's there. It'd
> be nice to have at least the R128 and Radeon being driven to peak
> performance by Xv.
Sure, that's why I'm doing it :-)
> No promises, most of the bus-master code I've seen has been geared to
> handling a rendering pipeline, not raw blits from system memory.
Hmm I really don't see so deep into the code...
> Funny you should mention that... We're trying to make DRI drivers for the
> RagePRO happen.
Oh that's great!
> And, yes, you DO need docs from ATI (or wait for us to get a "complete" DRI
> driver done for you...). The bus-master operation for GUI pipeline
> operations (incl. 3D stuff) is _different_ from the bus-master operations
> for blitting. Utah-GLX only has the first one which is of little use to
> you.
Hmm but as far as I can guess, transfer of textures is basically the same thing
as transfer of video frames?
> > I thought it is not only for writing but also for reading, or am I wrong?
> > How else could textures be transferred to the card then?
> On the RagePRO, it's for moving commands to the engine from system memory,
> moving data from system memory to the aperture (which is the frame buffer and
> on-card texture store) and for moving data from the aperture to the aperture.
Yes, the "from system memory to the aperture" is probably the thing I need ...
> For 3D operations, the engine gets its textures either from somewhere within
> the AGP aperture or from off of the card's memory.
... or not? Sorry, I'm not that knowledgeable...
And I really want to concentrate on r128 as it has working DRI code and I
need it much more than mach64 support. I have both mach64 and r128, but I only
watch videos on the computer with mach64 if I can't use the one with r128.
> If it's doing the same memcpy, it's because the memory access (and the memory
> itself) is faster on those cards. The CPU is still blocking- it's just not
> having to work as hard to get it copied in the number of cycles that it's got
> allocated to the effort to push the video to the framebuffer.
Exactly. But the point is that because "top" lies to you when the memcpy takes
less than 10ms (in reality it's more complicated but you get the idea), it
actually eats even more CPU time, only it's "hidden" and usual monitoring
tools are unable to measure it. So basically it is "very bad thing (TM)".
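A toy model shows what I mean. It assumes a 10 ms timer tick and that the
whole tick is charged to whatever happens to be running at the moment the tick
fires; `simulate()` and its parameters are made up for the illustration, and
real accounting is more involved.

```c
#include <assert.h>

#define TICK_MS 10   /* assumed timer tick length */

struct usage {
    long actual_ms;   /* time the process really spent running */
    long charged_ms;  /* time tick-sampled accounting attributes to it */
};

/*
 * The process runs for burst_ms in every period_ms, starting offset_ms into
 * the period. Time is stepped in 1 ms units for total_ms.
 */
static struct usage simulate(int period_ms, int offset_ms, int burst_ms,
                             int total_ms)
{
    struct usage u = { 0, 0 };
    int t;

    assert(period_ms > 0 && offset_ms >= 0 && burst_ms >= 0);
    assert(offset_ms + burst_ms <= period_ms);

    for (t = 0; t < total_ms; t++) {
        int phase = t % period_ms;
        int running = phase >= offset_ms && phase < offset_ms + burst_ms;

        if (running)
            u.actual_ms++;
        if (t % TICK_MS == 0 && running)   /* sampled only at the tick */
            u.charged_ms += TICK_MS;
    }
    return u;
}
```

With simulate(10, 1, 8, 1000) the process is busy for 800 of 1000 ms but is
charged 0, because it is never running at the instant of a tick. Shift the
burst to start on the tick (offset 0) and the same work is charged as 1000 ms.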
> Since there's no code in _anything_ to give you a kick in the right direction
> for the RagePRO, I can't directly help you in that manner for that chip- yet.
That's ok, I don't really need ragepro, only r128 currently. If rpro dri is
ready, I'll be happy to port the BM-transfers to its Xv functions.
I just believe that with current code it IS POSSIBLE TO DO THIS WITH R128, I
just don't know how, I have too little experience with both busmastering and
video drivers. I think I just need to paste (and slightly modify) something
into the CopyData function and it will start working (that's how I managed to
force my AIW to use the undocumented TV-Out feature a couple of days ago).
> Uh, you're preaching to the choir here. Please don't misconstrue my comments
> to you- I'm on your side (and would definitely LOVE to help), but my time's
> limited so it may take a little while for the DRI support to happen for you.
Hehe ok.
> Frank Earl
Bye,
Peter Surda (Shurdeek) <shurdeek@...>, ICQ 10236103, +436505122023
--
The computer revolution is over. The computers won.

Peter Surda wrote:
> I just believe that with current code it IS POSSIBLE TO DO THIS WITH R128, I
> just don't know how, I have too little experience with both busmastering and
> video drivers. I think I just need to paste (and slightly modify) something
> into the CopyData function and it will start working (that's how I managed
> to force my AIW to use the undocumented TV-Out feature a couple of days
> ago).
Is it just me, or does the drmDMA() function in
xc/programs/Xserver/hw/xfree86/os-support/linux/drm/xf86drm.c look promising?
:)
--
Earthling Michel Dänzer (MrCooper) \ Debian GNU/Linux (powerpc) developer
CS student, Free Software enthusiast \ XFree86 and DRI project member

On Tue, Aug 28, 2001 at 05:33:39PM +0200, Michel Dänzer wrote:
> Is it just me, or does the drmDMA() function in
> xc/programs/Xserver/hw/xfree86/os-support/linux/drm/xf86drm.c look promising?
> :)
THANK YOU SOOO MUCH!
After looking at the mentioned file and afterwards in DRI of r128 (2.4.9
kernel), I found that this ioctl is indeed implemented for r128 and can
transfer data in both directions.
OTOH, the DRM part of the r128 driver doesn't seem to take advantage of this
at all. Radeon does, but only a little as far as I understand the code. I see
a lot of potential in this.
Now I only need to learn how to allocate these buffers properly and initialize
the structures and it should work.
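As a note to self, the flow I expect looks something like this. It is a mock,
not the real interface: `fake_pool`, `pool_request()` and `pool_send()` are
hypothetical stand-ins, and I haven't checked them against the actual drmDMA()
request structure. The idea is just: ask for a DMA buffer by index, fill it,
then hand the index and byte count back for the transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUF 4          /* made-up pool size */
#define BUF_SIZE 4096   /* made-up buffer size in bytes */

/* Hypothetical pool of DMA-able buffers, addressed by index. */
struct fake_pool {
    unsigned char buf[NBUF][BUF_SIZE];
    int in_use[NBUF];
};

/* Request a free buffer; returns its index, or -1 if all are busy. */
static int pool_request(struct fake_pool *p)
{
    int i;

    assert(p != NULL);
    for (i = 0; i < NBUF; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/*
 * "Send" the first used bytes of buffer idx to dst and release the buffer.
 * The memcpy stands in for the transfer the hardware would do.
 * Returns 0 on success, -1 on a bad index, size, or unowned buffer.
 */
static int pool_send(struct fake_pool *p, int idx, size_t used,
                     unsigned char *dst)
{
    assert(p != NULL && dst != NULL);
    if (idx < 0 || idx >= NBUF || !p->in_use[idx] || used > BUF_SIZE)
        return -1;
    memcpy(dst, p->buf[idx], used);
    p->in_use[idx] = 0;
    return 0;
}
```

In the Xv case the fill step would be the frame data going into the requested
buffer instead of straight into the frame buffer.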
Thanks to all who replied.
Bye,
Peter Surda (Shurdeek) <shurdeek@...>, ICQ 10236103, +436505122023
--
The three Rs of Microsoft support: Retry, Reboot, Reinstall.

On Tue, Aug 28, 2001 at 04:26:51AM +0200, Peter Surda wrote:
> On Mon, Aug 27, 2001 at 08:37:06PM -0400, Frank Earl wrote:
> > That would require documentation from ATI- and it differs from chipset to
> > chipset, so you'd need different code for RagePRO and Rage128(PRO) chips.
>
> As r128 and radeon both have DRI, BusMastering already IS there for them (I
> see when I play q3 that X doesn't eat any cpu time and the amount of data that
> has to be transferred is surely comparable to that with Xv). Just Xv functions
> are currently implemented in a way that doesn't use busmastering. Besides I have
> close contact to at least 2 developers who have access to these docs and they
> surely would help me if it was necessary.
This is a false deduction. Just because the X server isn't involved
when you're running an OpenGL program that uses the DRI, doesn't mean
that there is busmastering involved.
When doing Xv, the client sends pixels to the server (either using XShm
or with plain old X calls), and the server writes them to the card's
frame buffer memory. So you see both the client and the X server using
CPU time.
When running a DRI program, the client maps the card's frame buffer
directly into its own address space, and does the appropriate writes
itself.
(Well, that's simplified -- depending on the card, the kernel module may
do some or all of the work in kernel space after verifying that the
write requested is valid.)
So, in the DRI case the X server isn't doing any work, but there is
still no bus-master transaction necessary.
-andy

On Tue, Aug 28, 2001 at 09:53:53AM -0500, Andy Isaacson wrote:
> When running a DRI program, the client maps the card's frame buffer
> directly into its own address space, and does the appropriate writes
> itself.
Finally I'm starting to understand this. Thanks for enlightenment!
> So, in the DRI case the X server isn't doing any work, but there is
> still no bus-master transaction necessary.
So does the r128 DRI driver actually support busmastering transfers or not? If
yes, where can I find sample code that does a simple BM-DMA transfer?
> -andy
Bye,
Peter Surda (Shurdeek) <shurdeek@...>, ICQ 10236103, +436505122023
--
Reboot America.