BitBlitter
A blog (mostly) about my programming endeavours. For more info on the things I write about, check out <a href="http://www.bitblit.org/index.shtml">BitBlit.Org</a>.
http://bitblitter.blogspot.com/
noreply@blogger.com (Younes Manton)
Sun, 05 Oct 2014 08:29:13 +0000. Label: g3dvl

Yes I'm still decoding video using shaders
Mon, 19 Jan 2009 04:19:00 +0000 (updated 2009-01-18T23:19:22.018-05:00). Label: g3dvl

It's been a while since I've said much about my video decoding efforts, but there are two pieces of good news to share. Both are improvements to Nouveau in general, not specific to video decoding.<br /><br />First, we can now load 1080p clips. Thanks to a very small addition to Gallium and a few lines of code in the Nouveau winsys, a lot of brittle code was removed from the state tracker, and memory allocations for incoming data are now dynamic and only done as necessary. The basic situation is that we allocate a frame-sized buffer, map it, fill it, unmap it, and use it. On the next frame we map it again, fill it again, and so on. But what if the GPU is still processing the first frame? The second time we attempt to map it, the driver will have to stall and wait until the GPU is done before it can let us overwrite the contents of the buffer.<br /><br />But do we have to wait? Not really: we don't need the previous contents of the buffer, since we're going to overwrite the whole thing anyway, so we just need a buffer that we can map immediately. To get around this we were allocating N buffers at startup and rotating between them, filling buffer 0, then 1, and so on, which reduced the likelihood of hitting a busy buffer. The problem with that is obvious: for high-res video we need a ton of extra space, most of it not being used most of the time. 
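<br /><br />As a rough illustration of that older rotation scheme, here is a minimal Python model. It is only a sketch: the class names, the "busy" flag, and the buffer size are hypothetical stand-ins, not the actual state tracker or Nouveau code.

```python
from dataclasses import dataclass


@dataclass
class FrameBuffer:
    size_bytes: int
    busy: bool = False  # True while the (simulated) GPU is still reading it


class RotatingPool:
    """Hypothetical model: N frame-sized buffers allocated up front."""

    def __init__(self, count: int, size_bytes: int):
        self.buffers = [FrameBuffer(size_bytes) for _ in range(count)]
        self.next_index = 0
        self.stalls = 0

    def map_next(self) -> FrameBuffer:
        buf = self.buffers[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.buffers)
        if buf.busy:
            # A real driver would block here until the GPU is done.
            self.stalls += 1
            buf.busy = False
        return buf

    def submit(self, buf: FrameBuffer) -> None:
        buf.busy = True  # handed to the GPU; this model never completes it

    def total_bytes(self) -> int:
        return sum(b.size_bytes for b in self.buffers)


# Assumed, illustrative size: one 16-bit sample per pixel of a 1080p frame.
pool = RotatingPool(count=4, size_bytes=1920 * 1080 * 2)
for _ in range(5):
    pool.submit(pool.map_next())
print(pool.stalls, pool.total_bytes())
```

In this model the memory cost is paid for all N buffers whether or not the GPU ever falls behind, and a stall still happens as soon as the rotation wraps around to a buffer that is in use.<br /><br />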
Now if we try to map a busy buffer, the driver will allocate a new buffer under the covers if possible and point our buffer to it, deleting the old buffer when the GPU is done with it. If the GPU is fast enough and processes buffers before you attempt to map them again, everything is good and you'll have the minimum number of buffers at any given time. If not, you'll get new buffers as necessary, in the worst case until you run out of memory, in which case you'll get stalls when mapping. The best of both worlds.<br /><br />The second bit of good news is that we've managed to figure out how to use swizzled surfaces, which gave a very large performance boost. Up to now we've been using linear surfaces everywhere, which are not very cache or prefetch friendly. Rendering to swizzled surfaces during the motion compensation stage lets my modest Athlon XP 1.5 GHz + GeForce 6200 machine handle 720p with plenty of CPU to spare. 1080p still bogs the GPU down, but the reason for that is pretty clear: we still render to a linear back buffer and copy to a linear front buffer. We can't swizzle our back or front buffers, so the next step will be to figure out how to get tiled surfaces working, which are similar but can be used for back and front buffers. Hopefully soon we can tile the X front buffer and DRI back buffers and get a good speed boost everywhere, but because of the way tiled surfaces seem to work (on NV40 at least) I suspect it will require a complete memory manager to do it neatly.<br /><br />Beyond that there are still a few big optimizations that we can implement for video decoding (conditional tex fetching, optimized block copying, smarter vertex pos/texcoord generation, etc.), but the big boost we got from swizzling gives me a lot of optimism that using shaders for at least part of the decoding process can be a big win. 
It probably won't beat dedicated hardware, but for formats not supported by hardware, or for decoding more than one stream at a time, we can probably do a lot of neat things in time.<br /><br />I've also been looking at VDPAU, which seems like a nice API but will require a lot of work to support on cards that don't have dedicated hardware. More on that later maybe.

http://bitblitter.blogspot.com/2009/01/yes-im-still-decoding-video-using.html
noreply@blogger.com (Younes Manton)

GSoC is over, how did Generic GPU-Accelerated Video Decoding do?
Mon, 15 Sep 2008 00:20:00 +0000 (updated 2008-09-14T23:30:55.652-04:00). Label: g3dvl

So GSoC has come to a close, and this project was successful, in that there is a working XvMC implementation sitting in nouveau/mesa's gallium-0.1 branch. Currently the NV40 Gallium driver is the only one complete enough to run XvMC, and there are still a few missing features (no support for interlaced video, subpictures aren't implemented yet, only motion compensation is currently accelerated).<br /><br />In my last entry I mentioned that I was hoping to spend the last part of GSoC getting IDCT working, but I came to realize that this would probably require more work than I initially estimated, due to the limited render target formats GPUs support. We decided that we may also want to take advantage of fixed function IDCT hardware if it is available, and one of the other Nouveau contributors had been looking into this on NV40, so I'm hoping we can take advantage of his efforts and get that into the Gallium NV40 driver in some fashion. Instead I spent the last two weeks of GSoC and the first two weeks of the rest of my life focusing on performance and cleaning up a few bugs here and there.<br /><br />As far as performance goes, we managed to grab most of the low-hanging fruit.<br /><br /><ul><li>We buffer an entire frame of content and fire that off with a few draw calls. 
Most frames, depending on their content, can be done in two draw calls.<br /><br /></li><li>Because we have to fill buffers with new content each frame, we don't necessarily want to wait until the GPU is done with those buffers before we map and update them. Since we don't need their old contents we can just allocate a set of buffers and rotate them, double buffer style.<br /><br /></li><li>For P and B frames many blocks are composed entirely of pixels from the reference frame(s), so we don't technically need to upload any new data.<br /><br />Previously we would clear that block of the source texture to black, so that it didn't contribute anything to the destination block. However, for most P and B frames a significant number of blocks fall into this category, and most frames are P or B frames, so that's a lot of useless clearing on the CPU side and texel fetching on the GPU side.<br /><br />To get around this we clear the first such zero block of each frame for the luma and two chroma channels, and for subsequent zero blocks we texture from that first block. This saves a nice chunk of CPU time, but doesn't do much for GPU texture bandwidth.<br /><br />Once I figure out how TGSI expresses flow control constructs I'm hoping we can just set the texcoords for zero blocks to the negative range and conditionally tex fetch, but for older hardware which doesn't support conditional execution the current path should be good.</li></ul>Having said all that, however, 720p24 decoding is still not done in real time. It's kind of a mystery actually, because while the profiler seems to indicate that we are GPU limited rather than CPU limited, the numbers don't seem to add up. A 1280x720 video is composed of 80x45 macroblocks. Each macroblock is composed of 4 blocks, and each block is rendered as two triangles, so that's 8 triangles per macroblock, or ~29K triangles per frame. At 24 fps that's ~696K tris/sec or ~2M vertices/sec. 
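<br /><br />The geometry estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch; the constant names are mine, and the three-vertices-per-triangle figure assumes plain triangle lists with no vertex sharing.

```python
WIDTH, HEIGHT = 1280, 720
MACROBLOCK_SIZE = 16          # a macroblock covers 16x16 pixels
BLOCKS_PER_MACROBLOCK = 4     # as counted in the text above
TRIANGLES_PER_BLOCK = 2
VERTICES_PER_TRIANGLE = 3     # assumption: unshared vertices (triangle list)
FPS = 24

macroblocks = (WIDTH // MACROBLOCK_SIZE) * (HEIGHT // MACROBLOCK_SIZE)
triangles_per_frame = macroblocks * BLOCKS_PER_MACROBLOCK * TRIANGLES_PER_BLOCK
triangles_per_second = triangles_per_frame * FPS
vertices_per_second = triangles_per_second * VERTICES_PER_TRIANGLE

print(macroblocks, triangles_per_frame, triangles_per_second, vertices_per_second)
```

Without intermediate rounding the per-second triangle count comes out slightly lower than the figure quoted above, which multiplies the already rounded ~29K by 24. Either way the vertex load is about 2M per second.<br /><br />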
<a href="http://www.nvidia.com/page/geforce6200_agp.html">Nvidia quotes a GeForce 6200's vertex processing rate at 225M/sec</a>. Our vertex shaders are very simple: we use screen-aligned tris in normalized coords, so we don't have to do any significant transforming, just move inputs to outputs.<br /><br />Similarly, a 1280x720 video is composed of ~922K pixels. At 24 fps we're rendering ~22M pixels/sec. In the worst case, each pixel requires 5 texel fetches (3 2-byte fetches, 2 4-byte fetches) and one 4-byte write to the frame buffer, so that brings us to 308M bytes/sec read and 88M bytes/sec write. The color conversion pass adds another 352M bytes/sec for read and 88M bytes/sec for write. Nvidia quotes a 6200's fillrate as 1.2-1.4B texels/sec, and assuming those texels are 32-bit, that works out to 4.8-5.6B bytes/sec. Again, our pixel shaders are not really complicated, mostly TEX2Ds, MULs, and ADDs. Omitting the tex fetching doesn't change much, and neither does disabling color writes to the frame buffer. Regardless of how Nvidia calculates its marketing numbers we seem to be well below them, so it probably doesn't matter how optimistic they are.<br /><br />All in all it seems very odd that, given the above, an 854x480 clip renders in real time, but the same clip at 1280x720 takes 4x longer, despite only being 2.25x larger. I suspect that either there is a very non-obvious bug in the state tracker, or that we are doing something odd in the driver, possibly in the way we set up the 3D state, submit commands and data, or with memory management, or possibly with how we get our frame buffer onto the X window.<br /><br />Either way, I hope to continue working on this now that GSoC is over, and anyone who is interested in contributing is free to do so. I'm hoping to move things over to Mesa's git sooner or later, and I'm curious to see how it does on Intel's hardware. 
I don't know how well the current rendering process fits with what Intel supports, but if single component signed 16-bit textures aren't a problem it should be very easy to get things up and running. At best all it needs is some minor changes in the winsys layer.

http://bitblitter.blogspot.com/2008/09/gsoc-is-over-how-did-generic-gpu.html
noreply@blogger.com (Younes Manton)

IDCT vs. the GPU
Sat, 02 Aug 2008 14:36:00 +0000 (updated 2008-08-02T10:50:23.474-04:00). Label: g3dvl

I've come to understand a few things while talking to Stephane (marcheu) and trying to come up with a (hopefully fast) way of performing IDCT on a typical GPU: 1) It doesn't fit nearly as well as motion compensation does, and 2) it wouldn't necessarily take a radical departure from current designs to make it fit, just a few adjustments here and there. I think the second point is the more frustrating of the two.<br /><br />The problems so far are as follows:<br /><br /><ol><li>The input format, signed 12-bit integers, isn't amongst your GPU's favourite texture formats. We're pretty fortunate that signed 16-bit integers are available at least, even though we have to renormalize to 12 bits. Unfortunately signed 16-bit integers are usually not available with four components, which leads to...<br /><br /></li><li>Hefty texel fetch requirements. If we had signed 16-bit RGBA textures we could do some packing and cut down on the number of fetches by a factor of 4, but we don't. Therefore, a naive 2D IDCT would require 128 texel fetches and 64 MADDs per pixel. Even if we had the ability to pack our DCT coefficients into 4 components we would still be left with 32 texel fetches, 16 DP4s, and 15 additions. There are algorithms such as AAN, which trade multiplications for additions, but they are not suited to being calculated a pixel at a time or to taking advantage of our vector arithmetic units. 
Instead of a 2D IDCT we could opt for two 1D IDCTs, which would give us 16 texel fetches and 8 MADDs per pixel per pass, or 4 texel fetches, 2 DP4s, and 1 addition if we could do full packing. However, we really can't do 1D IDCTs efficiently because...<br /><br /></li><li>We can't render to signed 16-bit buffers of any sort, so we have to find something to do with the intermediate result. The only alternative at the moment would be to render to floating point buffers, but then we lose a whole group of GPUs that can't render to FP buffers, or can but do it very slowly. But even if we manage to do a respectable IDCT, we have one more issue to deal with...<br /><br /></li><li>The output of our IDCT has to be easily consumable during our motion compensation pass. Currently motion compensation is a very good fit for the GPU model; however, if we have to jump through hoops to fit IDCT into the picture we don't want to make the output too cumbersome to use during motion compensation, which from an acceleration POV is the more important stage to offload.<br /></li></ol><br />Having said all of that, what we need to be able to do a fast IDCT isn't much.<br /><br /><ul><li>I think simply having support for signed 16-bit RGBA textures would go a long way to making IDCT fit. We do have support for signed 16-bit two component textures, so at least we're halfway there.<br /><br /></li><li>Signed 16-bit render targets would also be helpful, although going forward FP16 and FP32 support are probably better to target.<br /><br /></li><li>The dream of every programmer who has ever used SIMD instructions, the horizontal addition, would also be very useful.<br /><br /></li><li>Being able to swizzle and mask components as part of a texel fetch would also help, since we receive planar data and have to pack it at some point.<br /></li></ul><br />However, we don't have any of that, as far as I know. 
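<br /><br />To make the per-pixel cost comparison in the problem list above concrete, here is a minimal CPU-side Python sketch of the two formulations: a naive 2D IDCT that sums all 64 coefficients for every output pixel, and a separable version made of two 1D passes that each sum 8. It uses the standard orthonormal 8-point IDCT basis; the function names are my own, and nothing here is shader code or taken from the project.

```python
import math

N = 8  # 8x8 blocks


def basis(k: int, n: int) -> float:
    """Orthonormal 8-point IDCT basis term for frequency k at sample n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def idct_2d_naive(coeffs):
    """Every output pixel sums all 64 coefficients (64 multiply-adds)."""
    return [[sum(coeffs[v][u] * basis(v, y) * basis(u, x)
                 for v in range(N) for u in range(N))
             for x in range(N)]
            for y in range(N)]


def idct_1d_rows(block):
    """One pass: every output sample sums the 8 inputs of its own row."""
    return [[sum(row[u] * basis(u, x) for u in range(N)) for x in range(N)]
            for row in block]


def transpose(block):
    return [list(col) for col in zip(*block)]


def idct_2d_separable(coeffs):
    """Row pass, then column pass (as a row pass on the transposed data)."""
    intermediate = idct_1d_rows(coeffs)
    return transpose(idct_1d_rows(transpose(intermediate)))


# Arbitrary test coefficients; both formulations should agree.
coeffs = [[float((3 * v + 5 * u) % 7 - 3) for u in range(N)] for v in range(N)]
naive = idct_2d_naive(coeffs)
separable = idct_2d_separable(coeffs)
print(max(abs(naive[y][x] - separable[y][x])
          for y in range(N) for x in range(N)))
```

The `intermediate` block between the two passes is the value that would have to live in a render target on the GPU, which is where the lack of signed 16-bit render targets becomes a problem.<br /><br />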
What we do have is an MC implementation that currently fits very well and still has room for improvement, so at least that's a bright spot. I also have some promising ideas to tackle IDCT and its issues, and 3-4 weeks to figure it out.

http://bitblitter.blogspot.com/2008/08/idct-vs-gpu.html
noreply@blogger.com (Younes Manton)

Up and running on real hardware
Mon, 21 Jul 2008 04:08:00 +0000 (updated 2008-07-21T00:53:16.131-04:00). Label: g3dvl

<p>I reached a nice milestone today: working playback on my GeForce 6200. Most of the work went into the winsys layer, with some bug fixes and workarounds in other places, but everything is up and running now. Unfortunately the output isn't perfect; there is some slight corruption here and there. I'm guessing it has to do with some dodgy assumptions I made about shader arithmetic (rounding, saturation, etc.) that SoftPipe went along with but the GPU didn't. The other issue is that there is some severe slowdown when any 2D drawing happens on the rest of the desktop. I'm guessing this may be due to locking when copying the backbuffer to the window, or maybe I'm completely soaking up the CPU.</p> <p>Currently nothing is optimized, I'm not even turning on compiler optimization, and I have a really slow prototype IDCT implementation performed on the CPU in place of the hardware version, so I'm sure I'm eating up a lot more CPU time than I will be by the end of the summer. I have a lot of different ideas on optimization that will target CPU usage and GPU fillrate usage, but given that I get almost full speed playback currently, I'm pretty confident that I'll be able to get HD playback by the end of SoC.</p> <p>As far as the winsys goes, I was able to use most of the current Nouveau winsys. Unfortunately the DRI stuff is buried within Mesa right now, so I had to extract a lot of things and create a standalone library to handle screens, drawables, the SAREA, etc. 
to be able to use DRI without including and linking with half of Mesa. The winsys interface is also simpler than Mesa's; there are only a few client calls, the backbuffer is handled in the state tracker, and the winsys doesn't have to create or call into the state tracker. It took me a while to realize why the Mesa winsys was set up the way it was, and that I could simplify things on my end.</p> <p>Here are some screen grabs:</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img9.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img9_thumb.jpg" alt="Coffee mug containing two pens and a feather." height="135" width="181" /></a> <a href="http://www.bitblit.org/gsoc/g3dvl/img10.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img10_thumb.jpg" alt="Woman on the phone." height="159" width="182" /></a> <a href="http://www.bitblit.org/gsoc/g3dvl/img11.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img11_thumb.jpg" alt="Windmill in the middle of a field of yellow flowers." height="135" width="181" /></a> <br /> <a href="http://www.bitblit.org/gsoc/g3dvl/img12.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img12_thumb.jpg" alt="Two cartoon characters (desktop visible)." height="225" width="360" /></a> <a href="http://www.bitblit.org/gsoc/g3dvl/img13.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img13_thumb.jpg" alt="Fighter jet flying through a clear sky (desktop visible)." height="225" width="360" /></a>

http://bitblitter.blogspot.com/2008/07/up-and-running-on-real-hardware.html
noreply@blogger.com (Younes Manton)

One hurdle down, many more to go
Sun, 29 Jun 2008 05:41:00 +0000 (updated 2008-07-07T22:14:12.299-04:00). Label: g3dvl

<p>Just a quick update: after some re-reading of the MPEG2 spec, debugging, and clean up I've finally got correct output for the MC stage for progressive video clips that use frame-based and field-based motion prediction. 
There are two other motion prediction methods, 16x8 and dual-prime, but they don't seem to be too common and shouldn't be too hard to implement anyway. It took a bit of tweaking, but comparing the output to that of other media players I see no difference, which means one hurdle down. Next steps are to revisit IDCT and start working with real hardware.</p> <p>Here are some screen grabs from various test clips:</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img5.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img5_thumb.jpg" alt="Construction site on a field." height="152" width="183" /></a><br /><a href="http://www.bitblit.org/gsoc/g3dvl/img6.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img6_thumb.jpg" alt="Windmill in the middle of a field of yellow flowers." height="134" width="181" /></a><br /><a href="http://www.bitblit.org/gsoc/g3dvl/img7.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img7_thumb.jpg" alt="Coffee mug containing two pens and a feather." height="135" width="181" /></a><br /><a href="http://www.bitblit.org/gsoc/g3dvl/img8.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img8_thumb.jpg" alt="Woman on the phone." height="159" width="181" /></a>

http://bitblitter.blogspot.com/2008/06/one-hurdle-down-many-more-to-go.html
noreply@blogger.com (Younes Manton)

Progress
Fri, 27 Jun 2008 03:56:00 +0000 (updated 2008-07-07T22:11:44.188-04:00). Label: g3dvl

<p>I put some work into getting field-based prediction working, and I think I have it mostly right. I ran into what I think is a bug in SoftPipe, which has to do with locking and updating textures. For some reason the surface and texture cache does not get invalidated in such cases, leading to stale texels being read and displayed. I manually flush the texture cache after mapping textures, and that seems to take care of it. It took a lot of debugging to track that one down and it is probably fixed upstream, but at least it's another issue out of the way. 
At the moment some macroblocks are still not rendered correctly, but I'm hoping to get those out of the way.</p><p>The one thing I really can't stand is writing shader code for Gallium. The amount of C code you need to write to generate a token stream for even a simple shader is obscene. Currently I have 12 shaders and each is about 200-300 lines of code for 10-15 shader instructions, so most of that code is noise. On more than one occasion I've made changes to the wrong shader just because it's so hard to wade through the code. What I wouldn't do for a simple TGSI assembler right about now. I'll have to do something about that; it's a huge eyesore.</p> <p>It's not surprising that I'm a little behind on my schedule. I started on IDCT a while back but put that code down to focus on MC. Luckily IDCT isn't strictly necessary as XvMC allows for MC-only acceleration, so I can test things and move forward on MC without having to worry about IDCT. I'm hoping the next step of the project, getting things running on real hardware, will be as painless as possible, allowing me to get IDCT working. However, considering all the little unforeseen issues that have cropped up with SoftPipe I wouldn't be surprised if I ran into more of the same with the Nouveau driver.</p>

http://bitblitter.blogspot.com/2008/06/progress.html
noreply@blogger.com (Younes Manton)

Moving along
Mon, 09 Jun 2008 15:05:00 +0000 (updated 2008-07-07T22:08:44.010-04:00). Label: g3dvl

Things are moving along in the right direction. I finally got a chance to push my work to date to Nouveau's mesa git; you can check it out <a href="http://gitweb.freedesktop.org/?p=nouveau/mesa.git;a=shortlog;h=gallium-0.1">here</a>. I have I, P, and B macroblocks working correctly when rendering frame pictures and using frame-based motion compensation. 
All that's left to implement is field-based motion compensation (which is surprisingly very common, even in progressive content), and rendering field-based pictures (i.e. interlaced content). I think I've figured out a way to efficiently render macroblocks that use field-based prediction in one pass. Frame-based prediction works by grabbing a macroblock from a previously rendered surface and adding a difference to form the new macroblock. Field-based prediction works the same way, but references two macroblocks on the previously rendered surface, one for even scanlines and the other for odd. My plan is to read from both reference macroblocks every scanline and choose which one to keep based on whether the scanline is even or odd. This can easily be done with a lerp(). It would be preferable to avoid the unnecessary texture read, but it's simple and works in a single pass. Other alternatives include rendering the macroblock twice (once with even scanlines only, then with odd scanlines, using texkill to discard alternating scanlines), and rendering even and odd scanlines using line lists (which I understand makes sub-optimal usage of various caches in the pixel pipeline).

http://bitblitter.blogspot.com/2008/06/moving-along.html
noreply@blogger.com (Younes Manton)

Something to show
Mon, 26 May 2008 16:13:00 +0000 (updated 2008-07-07T22:06:01.020-04:00). Label: g3dvl

<p>I've put some more work into getting P and B frames rendering correctly and things are proceeding very well. Currently texturing from the reference frame works for P frames. All I have to do is add the differentials, which is a little tricky. The problem is that differentials are 9 bits, which means that in an A8L8 texture we get 8 bits in the L channel and 1 bit in the A channel. This shouldn't be too hard, just a bit of arithmetic in the pixel shader code. 
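</p>
<p>As a rough idea of what that arithmetic could look like, here is a small Python sketch. It rests on an assumption that is mine, not something stated above: the differential is stored as a 16-bit two's complement value with the low byte landing in L and the high byte (all zeros or all ones, effectively the sign) landing in A, and the shader sees both channels normalized to [0, 1]. The function names are hypothetical.</p>

```python
import struct


def pack_differential(value: int) -> tuple:
    """Split a signed 9-bit differential into (L, A) bytes.

    Assumed layout: little-endian 16-bit two's complement,
    low byte -> L channel, high byte -> A channel.
    """
    if not -256 <= value <= 255:
        raise ValueError("differential does not fit in 9 bits")
    low, high = struct.pack("<h", value)
    return low, high


def unpack_in_shader(l_norm: float, a_norm: float) -> int:
    """The arithmetic a pixel shader could do on normalized channels.

    L contributes its 8-bit magnitude; A is 0.0 or 1.0 under the assumed
    layout and subtracts 256 for negative values.
    """
    return round(l_norm * 255.0 - a_norm * 256.0)


for value in (-256, -1, 0, 1, 255):
    l_byte, a_byte = pack_differential(value)
    print(value, (l_byte, a_byte), unpack_in_shader(l_byte / 255.0, a_byte / 255.0))
```

<p>On real hardware the rounding step would have to be replaced by whatever precision the shader arithmetic actually provides, which this sketch does not model.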
A more interesting problem is dealing with field-based surfaces, both when rendering and when using them in motion prediction. There's no straightforward way to render to even/odd scanlines on conventional hardware, so this will require some special attention. Currently I'm thinking I will have to render line lists instead of triangle lists when a macroblock uses field-based motion prediction and for rendering even/odd scanlines.</p> <p>Here are some images from mpeg2play_accel, which I've been using as a test program:</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img1.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img1_thumb.jpg" alt="Initial I-frame of the video." height="151" width="183" /></a> <p>Initial I-frame of the video.</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img2.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img2_thumb.jpg" alt="Next frame, only P macroblocks using frame-based motion prediction are currently displayed, the rest are skipped." height="151" width="183" /></a> <p>Next frame, only P macroblocks using frame-based motion prediction are currently displayed, the rest are skipped.</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img3.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img3_thumb.jpg" alt="Next frame, more macroblocks are rendered, and it looks mostly correct, except for the fine details. This is because the differentials are not taken into account yet." height="151" width="183" /></a> <p>Next frame, more macroblocks are rendered, and it looks mostly correct, except for the fine details. This is because the differentials are not taken into account yet.</p> <a href="http://www.bitblit.org/gsoc/g3dvl/img4.jpg"><img src="http://www.bitblit.org/gsoc/g3dvl/img4_thumb.jpg" alt="Next frame, a few more unhandled macroblocks in this one." 
height="151" width="183" /></a> <p>Next frame, a few more unhandled macroblocks in this one.</p>

http://bitblitter.blogspot.com/2008/05/something-to-show.html
noreply@blogger.com (Younes Manton)

Intra-coded macroblocks? Check
Sun, 18 May 2008 05:48:00 +0000 (updated 2008-07-07T22:03:31.314-04:00). Label: g3dvl

After a few weeks of work I've made some good progress. Basic rendering of intra-coded macroblocks is working. What this means is that if you view a video you'll see the occasional full frame displayed correctly, and some macroblocks from the frames in between displayed correctly. Intra-coded macroblocks are the simplest to deal with, since they don't depend on motion compensation; all the data is present and you just have to render it. Every nth frame of an MPEG2 stream is composed entirely of intra-coded macroblocks. It's these frames that are currently being displayed correctly. Other frames are composed of some intra-coded macroblocks, but mostly inter-coded macroblocks. Inter-coded macroblocks depend on motion compensation and their samples are usually differentials. These I haven't gotten yet.<br /><br />I've also cleaned things up a bit, added some error checking, and added some more tests. It's taken a lot of stepping through Gallium code to get things right, in lieu of documentation, but thanks to GDB, and even more to Insight, I've gotten this far. Stephane has answered my questions, mostly on how to efficiently do things, and even the folks in #mplayerdev have been helpful on XvMC and general decoding matters, so all in all I would say things are going smoothly.<br /><br />One thing I'm sure of is that no one reads this thing currently. The X.Org folks have asked that their students keep a blog and also submit it to <a href="http://planet.freedesktop.org/">planet.freedesktop.org</a> but I was told it only accepts RSS feeds. 
Currently this entire web site is maintained using a text editor, so I'll have to work out something more sophisticated in the near future. :-/ (<span style="font-weight: bold; font-style: italic;">Update</span><span style="font-style: italic;">: Since then I've been using BlogSpot and you're probably reading this post there instead of the old page</span>.)

http://bitblitter.blogspot.com/2008/05/intra-coded-macroblocks-check.html
noreply@blogger.com (Younes Manton)

Up and running
Fri, 02 May 2008 03:04:00 +0000 (updated 2008-07-07T22:00:16.995-04:00). Label: g3dvl

Today I managed to get the basic color conversion step up and running using SoftPipe. Most of the difficulty came in understanding Gallium more than implementing the color conversion stuff. I spent many hours trying to figure out why I couldn't get any geometry to show up in my window. Copying surfaces to the frame buffer worked fine, but rendering a triangle left me staring at a black screen. It turns out that you have to set the pipe_blend_state.colormask bits for the channels you want to write to. First, I didn't even consider that state because I disabled blending. Second, setting the mask to allow writes was the opposite of what I would assume. It took several hours of stepping through Gallium to find that everything was OK until we got to the fragment shader, where it skipped the frame buffer write back.<br /><br />Other issues included getting a handle on writing TGSI shader code and figuring out how to get Gallium and XvMC APIs to agree. At the moment generating TGSI isn't a pretty process; you can look in <a href="http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=blob;h=5f8d12191d49a8d4c4c4ecc11a11f6ad69fe5810;hb=gallium-0.1;f=src/gallium/auxiliary/util/u_simple_shaders.c">gallium/auxiliary/util/u_simple_shaders.c</a> for an example. As for Gallium and XvMC agreeing, most of the problem came from the fact that XvMC functions all accept a Display*. 
What do you do if the client creates an XvMC context with one Display*, creates a surface with another Display*, and so on? Well, hopefully no one will do that, but one has to wonder why it's even allowed. Then there's the issue of some calls only taking an XvMCSurface*, and not the associated context. Unfortunately the context is where I keep the Gallium pipe context, so every surface has to have a reference to the context it was created with. Luckily this works out, since some functions that do take a surface and context require that we check that they match, so at least it makes that simple.

http://bitblitter.blogspot.com/2008/05/up-and-running.html
noreply@blogger.com (Younes Manton)

Accepted, digging through code
Sat, 26 Apr 2008 02:31:00 +0000 (updated 2008-07-07T21:58:29.607-04:00). Label: g3dvl

After a long interim period I now know that the proposal has been accepted. Rather than sit around I've been working on getting things up and running, so I'm glad I got a head start on things. It took a lot of digging through Mesa code and some questions to the dri-devel mailing list and IRC channel, but I've managed to get some basic initialization out of the way. I've also implemented enough functionality and stubs to get some basic test cases compiling and running successfully. I found a port of mpeg2play on Mark Vojkovich's <a href="http://www.xfree86.org/%7Emvojkovi/">web site</a> that uses XvMC and have managed to get that compiling and running. By running I mean not crashing; it doesn't display anything as of yet, but at least I'm heading in the right direction.<br /><br />Now I'll need to figure out how to get XvMC surfaces onto X drawables with Gallium3D. For some reason most of the XvMC functions don't take the XvMCContext as an argument, so I have to store that along with each surface, and yet they all take a pointer to Display, which I don't see a use for. 
A headache more than anything else, but it seems counter-intuitive to me. Also, the Gallium3D API is new to me and it will take some time to figure out. Thankfully, Keith Whitwell provided me with an in-depth explanation of how to start on the state tracker and winsys. I'm hoping by the end of this weekend I'll have something on screen, even if it's garbage (i.e. the video frames before IDCT). I'm also hoping to get started on writing shader code to do the color conversion. Stephane Marchesin, my mentor for this project, was kind enough to point me to the current Xv implementation for the Nouveau driver, which does color conversion and bicubic interpolation in shaders currently.

http://bitblitter.blogspot.com/2008/04/accepted-digging-through-code.html
noreply@blogger.com (Younes Manton)

Submission day redux
Tue, 08 Apr 2008 03:07:00 +0000 (updated 2008-07-07T21:56:46.379-04:00). Label: g3dvl

So after last week's deadline extension today became the deadline for the proposal submission. It hasn't really affected me, but it would have been nice to know by now if this project had been accepted. Instead the accepted proposals will be announced April 21st.<br /><br />In the interim I've been looking through the libXvMC and Mesa sources. The source to libXvMC is a little confusing, partly because of the wrapper library that comes with it, but after a little reading, grep-ing, and peeking at the openChrome XvMC driver I think I've got a handle on how things work. As far as I can see, the libXvMC module provides implementations for all the hardware-agnostic XvMC API calls, and leaves the rest to the driver. It also exports some functions that the driver can use to interface with X. The wrapper module, libXvMCW, is intended for clients to link against and exports all the XvMC functions the client expects. 
The wrapper doesn't contain any implementation, but instead attempts to dynamically load the libXvMC module for the hardware-agnostic functions, and a hardware-specific driver (e.g. libXvMCNvidia) for the rest of the functions. The driver is left to implement the surface/block/rendering related functions. So with that I think it's pretty clear which functions I would have to provide in terms of Gallium3D to complete the implementation.<br /><br />In addition to that I've gotten back into using Matlab for some prototyping. Matlab is a great tool for this sort of thing because it allows you to easily visualize your data, and I've been using it to test some CSC and IDCT routines.http://bitblitter.blogspot.com/2008/04/submittion-day-redux.htmlnoreply@blogger.com (Younes Manton)0tag:blogger.com,1999:blog-4999358188713912074.post-5997645188005728659Mon, 31 Mar 2008 17:27:00 +00002008-07-07T21:55:39.243-04:00g3dvlSubmittion dayToday is the deadline for the proposal submission. I hope it is accepted; I think it's a perfect fit for GSoC. I had some trouble trimming it down to 7500 characters because I initially misread the guidelines and thought it said 7500 words. If it weren't for someone on the dri-devel mailing list reminding me, I'd have submitted all 8K characters. I'm feeling lucky already!http://bitblitter.blogspot.com/2008/03/submittion-day.htmlnoreply@blogger.com (Younes Manton)0tag:blogger.com,1999:blog-4999358188713912074.post-5223661769178557401Sat, 29 Mar 2008 19:24:00 +00002008-07-07T21:53:27.182-04:00g3dvlPotential benefits of accelerated video decoding?I've started considering some of the potential benefits of hardware accelerated video decoding from a user's perspective. The biggest one for me is being able to play back HD streams in real-time. 
I have a modest machine and it does struggle with HD streams, but having read <a href="http://research.microsoft.com/%7Ejackysh/publications/Accelerate%20video%20decoding%20with%20generic%20GPU.pdf">this paper</a> I'm encouraged by one statement in particular. Testing on a machine equipped with a Pentium III @ 667 MHz, 256 MB of memory, and a GeForce 3 Ti200, they state that they were able to play back a 720p ~24-frame/s stream encoded with WMV at 5 Mb/s. That hardware is pretty ancient by today's standards, and yet with the GPU handling MC and CSC they get a 3.16x speed-up over the CPU w/ MMX implementation. That's pretty encouraging. I'm sure all the folks out there who use their machines as HTPCs would really appreciate that sort of performance.<br /><br />Another benefit would be that implementing this in terms of a Gallium3D front-end allows it to be used on all hardware that has a Gallium3D back-end. Currently I believe certain Intel GPUs are supported very well, as well as Nvidia through Nouveau, but I'm thinking specifically of AMD/ATI. As far as I know their Linux drivers have never supported any sort of video decoding acceleration, even though the hardware is very capable and the functionality is implemented on Windows. Recently they released hardware specs for some of their GPUs, but this did not include any of the dedicated video decoding hardware if I recall correctly. However, with specs for the newer GPUs and various reverse-engineered drivers for older GPUs already existing, comprehensive Gallium3D support for ATI GPUs will probably happen. 
I think the point is obvious by now: finally, hardware-accelerated video decoding on ATI GPUs.http://bitblitter.blogspot.com/2008/03/potential-benefits-of-accelerated-video.htmlnoreply@blogger.com (Younes Manton)0tag:blogger.com,1999:blog-4999358188713912074.post-5481642082965122582Tue, 25 Mar 2008 01:03:00 +00002008-07-07T21:48:10.280-04:00g3dvlJumping right inTo get myself familiar with how a Gallium3D front-end works, I downloaded the source to Mesa and built the library. I had a little trouble trying to figure out which make target built Mesa using SoftPipe, but someone on #dri-devel was kind enough to tell me. I also downloaded openChrome's libXvMC source, but it's not immediately clear to me how the library works. It appears to do some work in terms of Xlib, Xext, and others, but expects a few functions (the ones that actually touch the hardware) to be provided and linked in. Odds are this is where the Gallium3D calls will end up.http://bitblitter.blogspot.com/2008/03/jumping-right-in.htmlnoreply@blogger.com (Younes Manton)0
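The split described in that last post, a hardware-agnostic layer that expects the hardware-touching functions to be supplied by someone else, can be illustrated with a minimal C sketch. Everything here is an assumption made for illustration: the names (`hw_hooks`, `generic_create_surface`, `stub_driver`, and so on) are hypothetical and are not real libXvMC or Gallium3D symbols, and the sketch uses a table of function pointers registered at run time rather than the link-time arrangement the post describes, purely so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none are real libXvMC or Gallium3D symbols. */

/* The functions the hardware-specific side is expected to provide. */
struct hw_hooks {
    int (*create_surface)(int width, int height);
    int (*render_surface)(int surface_id);
};

/* The hooks currently plugged into the generic layer (none until registered). */
static const struct hw_hooks *active_hooks = NULL;

/* Called once by the hardware-specific side to plug itself in. */
void generic_register_hooks(const struct hw_hooks *hooks)
{
    active_hooks = hooks;
}

/* Generic entry point: validates arguments itself, then defers to the hooks. */
int generic_create_surface(int width, int height)
{
    if (width <= 0 || height <= 0)
        return -1; /* rejected without ever reaching the hardware side */
    assert(active_hooks != NULL && active_hooks->create_surface != NULL);
    return active_hooks->create_surface(width, height);
}

/* Generic entry point with no logic of its own: a pure pass-through. */
int generic_render_surface(int surface_id)
{
    assert(active_hooks != NULL && active_hooks->render_surface != NULL);
    return active_hooks->render_surface(surface_id);
}

/* A stand-in "driver" that only hands out ids; a real one would talk to the GPU. */
static int next_surface_id = 1;

static int stub_create_surface(int width, int height)
{
    (void)width;
    (void)height;
    return next_surface_id++;
}

static int stub_render_surface(int surface_id)
{
    return surface_id > 0 ? 0 : -1;
}

const struct hw_hooks stub_driver = { stub_create_surface, stub_render_surface };
```

A caller would first run `generic_register_hooks(&stub_driver)` and then use only the `generic_*` functions. In this arrangement, the stub functions mark the spot where the Gallium3D calls would presumably go.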