Created attachment 32942
intel_gpu_dump output
The following command is pretty good at locking up the GPU on this i845 machine:
x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1
It doesn't always freeze on the same test, just somewhere random within this range. However, I've run this several times, and so far it hasn't made it to the end without freezing.
GPU is the following:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
I observed this on Ubuntu Karmic. Jaunty is unaffected.

Hmm, that test is surviving in a continuous loop on my i845. I do note one substantial difference between our environments (besides my system using the current tip) is that I have a rev 03 compared to your rev 01. Could this be another notorious gen2 h/w bug? Given that the batchbuffer appears perfectly innocent, we don't have enough information to deduce how the h/w got itself into this state.
Thanks for reducing this to a small test case, though it is still baffling. The biggest change between Jaunty and Karmic would be that the i8xx was blacklisted and acceleration disabled for Jaunty due to the severe number of bugs (some extremely nasty cache coherency issues in particular). It would still be useful to know if the bug is reproducible on the latest stack -- in particular, there have been a couple of fence flushing issues spotted (though I don't think the tests you performed would have hit those paths) and we now upload images differently (which may stress the h/w differently, for better or for worse). But it will help to narrow the differences between our systems.

Created attachment 32947
dmesg with newest stuff
Also freezes on a Lucid install with kernel 2.6.33-rc6 and the xorg-edgers PPA, which has the current xserver-xorg-video-intel and a very recent libdrm and mesa.
What do you make of this stuff in dmesg?
[ 2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 2.142832] render error detected, EIR: 0x00000010
[ 2.142838] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 2.142852] render error detected, EIR: 0x00000010

(In reply to comment #2)
> What do you make of this stuff in dmesg?
> [ 2.043712] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor
> 0
> [ 2.142832] render error detected, EIR: 0x00000010
That's an unrelated and apparently harmless error. The most serious side-effect it has at the moment is it prevents i915_error_state from capturing the later hang.
I'm now close to 6 hours of runtime with 'x11perf -range copywinpix10,comppixwin500 -time 1 -repeat 1', still no hang on this i845. :(

I've been running a PPA to try to bisect random freezing here:
https://bugs.launchpad.net/bugs/456902
In fact, that's what led me to find this test case. Anyway, I just asked people to report results with x11perf, and someone just reported that it freezes with rev 03 hardware:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)

Created attachment 33268
dmesg on 2.6.33-rc8 with drm.debug=0x06 and tests srect10-srect500
The stippled rectangle tests can also trigger a freeze. Here's a dmesg from 2.6.33-rc8 with xorg-edgers. It took eight cycles of this, on a freshly started Gnome desktop with nothing else going on:
while ! ( dmesg | grep 'render error' ); do x11perf -range srect10,srect500 -time 1 -repeat 1; done
I set DISPLAY=:0 and ran this test from an ssh login, because I think updating a terminal on the same screen interferes with the test.

Created attachment 33269
i915_error_state with srect tests
With the console initialization bug fixed, I can now get an i915_error_state corresponding to this freeze.
Let me know if you need any more debug info.

I'm finding ASCII characters in the IPEHR register sometimes, as if the wrong data is being sent to the graphics card. For example, one run looked like this:
EIR: 0x00000000
PGTBL_ER: 0x00000000
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x45494c43
INSTDONE: 0x00ffffc1
ACTHD: 0x01a33008
With byte order swapped, the value of IPEHR corresponds to the string 'CLIE'.
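To make the byte-swap step explicit: the register is a 32-bit word, and on a little-endian machine the least significant byte is the one that sat first in memory, so reading the bytes from low to high recovers the text in its original order. A minimal sketch in C; the helper name `reg_to_ascii` is made up for this illustration and is not part of any tool mentioned in this report.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: write the four bytes of a 32-bit register value,
 * least significant byte first, into out[] as a NUL-terminated string.
 * Non-printable bytes are shown as '.'. out must hold at least 5 bytes. */
static void reg_to_ascii(uint32_t value, char out[5])
{
    for (size_t i = 0; i < 4; i++) {
        unsigned char byte = (value >> (8 * i)) & 0xff;
        out[i] = (byte >= 0x20 && byte <= 0x7e) ? (char)byte : '.';
    }
    out[4] = '\0';
}
```

Passing the IPEHR value from the dump above, 0x45494c43, fills the buffer with "CLIE".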
I just applied Chris Wilson's latest batch buffer reporting patch. I now have the system running the x11perf test on bootup until it freezes, then saving /sys/kernel/debug/dri/0 and rebooting to do the test again.
So far I see suspicious values of IPEHR in 3 out of 13 runs. I don't see anything strange like that in the batchbuffer dumps so far, though. I'll leave this test running overnight.
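Sorting a pile of saved dumps into "suspicious" and "ordinary" could be automated with a rough heuristic: flag an IPEHR line when all four bytes of the value are printable ASCII or a newline. This is only a sketch under that assumption. The function names are hypothetical, the line format is taken from the register listing above, and a legitimate instruction word could in principle match by coincidence.

```c
#include <stdint.h>
#include <stdio.h>

/* Return 1 if every byte of value is printable ASCII or a newline. */
static int all_bytes_texty(uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        unsigned byte = (value >> (8 * i)) & 0xff;
        if (!((byte >= 0x20 && byte <= 0x7e) || byte == '\n'))
            return 0;
    }
    return 1;
}

/* Hypothetical check for one line of a register dump, e.g.
 * "IPEHR: 0x45494c43". Returns 1 if it is an IPEHR line whose value
 * looks like text, 0 otherwise (including lines for other registers). */
static int ipehr_line_is_suspicious(const char *line)
{
    unsigned int value;

    if (sscanf(line, " IPEHR: 0x%x", &value) != 1)
        return 0;
    return all_bytes_texty(value);
}
```

Feeding each line of a dump through `ipehr_line_is_suspicious()` and counting the hits per file would give the same kind of tally as the manual count above.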

Created attachment 33331
i915_error_state with batchbuffer dump
Kernel is the current linux master branch (v2.6.33-rc8-26-g0813e22) with the batchbuffer dumping patch added.
$ apt-cache policy libdrm2 xserver-xorg-video-intel | grep Installed
Installed: 2.4.17+git20100210.4f0f8717-0ubuntu0sarvatt
Installed: 2:2.10.0+git20100211.00e7312d-0ubuntu0sarvatt
This is a dump that does not have recognizable string data in IPEHR. It was made with a couple runs of x11perf while running a program that repeatedly forked and waited on its children to stress the CPU, and no drm debug messages turned on. The x11perf command was this:
x11perf -range srect10,srect500 -time 1 -repeat 1

Created attachment 33345
i915_error_state with IPEHR = wtf
I figured out exactly where the string data in IPEHR is coming from. In fact, I was able to plant my own data into that register. Here's a dump where
IPEHR = 0x0a667477 = "wtf" (followed by newline)
All it took was "yes wtf > wtf" during the x11perf run, which writes lines of 'wtf' continuously to a file.
I guess the graphics card is getting data being sent to the hard drive. That's why dmesg fragments wound up in the register: that's something that gets logged.

Finally found the missing module and tracked down the broken patch that was preventing my branch from booting on my i845... And now I cannot reproduce the freeze with "x11perf -range srect10,srect500 -time 1 -repeat 1".

First successful workaround:
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f1fcc97..0dcf761 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3530,6 +3530,9 @@ i915_dispatch_gem_execbuffer(struct drm_device *dev,
if (exec->flags & I915_GEM_NO_DISPATCH)
return 0;
+ msleep(10);
---
*screams*
So it appears that the memory barriers are having little effect and the uncached write to the ringbuffer by the CPU and subsequent read of the batch buffer by the GPU is occurring before main memory has been flushed.
Adding extra memory barriers or invalidations, or writes from the GPU to memory were insufficient.

That appears to stabilize it, however I've noticed at least one scenario where it causes severe slowdown. When wine is in charge of drawing its own windows because the "emulate a virtual desktop" option is turned on, the gradient on the window's title bar takes a good couple of seconds to draw. I can see it being filled in left to right.
Also, I did manage to cause a freeze in one scenario. In my normal test, I run Xorg, then xclock, then I run x11perf over and over. I can't get that test to freeze with this patch. But if I skip xclock, then Xorg resets the display after every run of x11perf because the last client has disconnected, and in this case I caused a GPU hang once. I'll do some more testing and gather data on this. Maybe it's just a different bug.
I put a kernel with this patch in an Ubuntu PPA, and I'm going to get some feedback from users experiencing random hangs. Let's see if this patch affects that issue...

So I had the bright idea of using a GTT mapping to avoid the chipset flushing (and associated delay) which restores performance... and the original problem. Conclusion: not even GTT mappings are coherent.

This bug does not duplicate bug 24789, as I can't even boot to Xorg in enough time to run the testcase; furthermore, if I use the suggested msleep(10) patch I get crashes almost instantly.
So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for me, and this cannot be a dupe of 24789; it might be the other way round instead.
Please test these patches and say if bug is (almost) fixed for you:
- http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
jbarnes
- http://bugzilla.kernel.org/attachment.cgi?id=25084 -
drm-intel-big-hammer.patch from FC13 kernel patches
I said "almost" because the bug can still be triggered under heavy CPU/GPU load, such as when watching a video, but the system is definitely usable (I used it to post these comments).

(In reply to comment #26)
> this bug does not duplicate bug 24789 as I can't even boot to Xorg in enough
> time to run the testcase; furthermore, if I use the suggested msleep(10) patch I
> get crashes almost instantly.
Sounds like plymouth is doing something just as funky to write to the framebuffer. Looks like a GTT map [ http://cgit.freedesktop.org/plymouth/tree/src/plugins/renderers/drm/ply-renderer-i915-driver.c ], which as pointed out earlier also suffers from exactly the same coherency issues, but is not covered by the AGP chipset flush.
In short, you have the same bug and the wbinvd() just happens to cause sufficient delay on all batchbuffers that it happens to work most of the time.
> So patches drm-intel-big-hammer.patch and 855nolid.patch are both necessary for
> me, and this cannot be a dupe of 24789; it might be the other way round
> instead.
>
> Please test these patches and say if bug is (almost) fixed for you:
>
> - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> jbarnes
Upstream, no effect (obviously).
> - http://bugzilla.kernel.org/attachment.cgi?id=25084 -
> drm-intel-big-hammer.patch from FC13 kernel patches
As mentioned much earlier, no effect.

Thanks for testing those patches, Chris. So the conclusion is that the bugs are different, not duplicates.
I also believe that a patch for this bug could also fix bug 24789, but that's mere supposition; we have no evidence until such a patch exists. Until then we have two different bugs triggered in two different ways.

(In reply to comment #27)
> (In reply to comment #26)
> Upstream, no effect (obviously).
>
> > - http://bugzilla.kernel.org/attachment.cgi?id=25084 -
> > drm-intel-big-hammer.patch from FC13 kernel patches
>
> As mentioned much earlier, no effect.
>
Are you saying that this patch does not fix the bug for you?
So we have your hardware, on which the patch in attachment 33593 works, and my hardware (855GM rev02), on which it doesn't. Correct?
I really don't see why the bugs should be merged considering different triggering conditions and different workaround patches

(In reply to comment #30)
> Are you saying that this patch does not fix the bug for you?
> So we have your hardware with patch in attachment 33593 which works, while my
> hardware (855GM rev02) which doesn't. Correct?
>
> I really don't see why the bugs should be merged considering different
> triggering conditions and different workaround patches
The two bugs in question are both due to the GPU executing the command stream prior to GMCH completing its write, thus hanging on illegal instructions that do not match the batch buffer dumped.
The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents and purposes, this simply adds a delay since the caches are flushed later anyway. However as is demonstrated by your own statements, and I confirm, this is insufficient to ensure that all writes are completed prior to the GPU performing its DMA to main memory. The reason why the msleep() hack does not solve everything is that it is limited to the AGP chipset flush which is only performed on invalidating the CPU domain. The truly astonishing thing about this bug is that the GTT domain appears to be similarly affected. Hence why the wbinvd() patch appears to be more successful in some scenarios than the msleep(), but is still fundamentally flawed.

(In reply to comment #31)
> (In reply to comment #30)
> > Are you saying that this patch does not fix the bug for you?
> > So we have your hardware with patch in attachment 33593 which works, while my
> > hardware (855GM rev02) which doesn't. Correct?
> >
> > I really don't see why the bugs should be merged considering different
> > triggering conditions and different workaround patches
>
> The two bugs in question are both due to the GPU executing the command stream
> prior to GMCH completing its write, thus hanging on illegal instructions that
> do not match the batch buffer dumped.
>
[offtopic]I know that this ends up with something broken being identified in the hardware/firmware, I am just waiting for somebody to clearly say it...but still believing that some magic (read hack) can eventually fix this up[/offtopic]
> The drm-intel-big-hammer.patch adds a wbinvd() [write-back invalidate to flush
> all levels of CPU cache] instruction to i915_gem_execbuffer(). For all intents
> and purposes, this simply adds a delay since the caches are flushed later
> anyway. However as is demonstrated by your own statements, and I confirm, this
> is insufficient to ensure that all writes are completed prior to the GPU
> performing its DMA to main memory. The reason why the msleep() hack does not
> solve everything is that it is limited to the AGP chipset flush which is only
> performed on invalidating the CPU domain. The truly astonishing thing about
> this bug is that the GTT domain appears to be similarly affected. Hence why the
> wbinvd() patch appears to be more successful in some scenarios than the
> msleep(), but is still fundamentally flawed.
>
I fully agree with your statements, taking as true also those I have no knowledge to judge. In my testing I have found a qualitative difference between the two patches: the msleep() workaround works a lot worse for me and is rarely distinguishable from the vanilla kernel, while the wbinvd() approach "makes it usable", although I have to use it in the certainty that it will crash sooner or later. So right now I have a quick testcase that invalidates the msleep() patch, while the wbinvd() patch works for longer and is not tied to a magic number (10) that possibly depends on the load of my machine.
If you want I can calibrate that magic number for my box, but that would just be experimentation without any usable feedback.
Also, there was a user (M.Nowak) saying that FC13 is totally free of this bug. If that were true, then FC13 would have some other interesting patch to look at (which I couldn't identify up to now); otherwise it's just harder to trigger the bug with FC13's patched kernel (this is what I believe), which brings no news to us.

(In reply to comment #32)
> Also there was a user (M.Nowak) saying that FC13 is totally free of this bug;
> if this fact was true, then FC13 has some other interesting patch to look at
> (which I couldn't identify up to now), otherwise it's just harder for the bug
> to be triggered with FC13's patched kernel (I believe this), bringing no news
> to us.
>
Forgot to add: for me FC13 is broken as any other kernel, so I asked M.Nowak to test vanilla kernel + drm-intel-big-hammer.patch, but he has not yet provided results.
Some other test results from people with this hardware would be very welcome.

Chris, I've thought a bit about what failure-mode could possibly explain all these different corruptions. Could it be that the GTT _table_ contains stale entries? Yes I know, this sounds crazy but I haven't yet found another failure mode that could nicely explain what's going on ...
For the gtt corruption case:
- map new bo into gtt
- start writing
- new gtt mappings become effective
- further writes
gtt cpu writes are wc, i.e. the cpu can send out only 4 byte sized writes, agp doesn't cache them. This would explain a single "wtf " in the command buffer (stale data that's been in the page that's just been assigned to this new bo).
For the non-gtt write
- write stuff to mem
- map bo into gtt
- gpu starts using them
- new gtt mappings become effective
- gpu reads crap
Of course, in the case of gtt writes, the write should end up somewhere else in system memory. But we're mapping a dummy page for all empty gtt entries, right? So it's quite likely they end up in there. If this dummy page never gets corrupted, I'm obviously wrong.
<crazymode />
I can't test this theory though, because I can't reproduce the bug on my i855.
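The non-gtt write sequence in this theory can be pictured with a toy model: one GTT entry, a copy of that entry cached on the GPU side, and a rebind that may or may not refresh the cached copy. Everything here (struct, function names, page size) is invented for illustration and says nothing about how the real hardware or driver behaves; it only shows why a stale entry would make the GPU read whatever was in the old page.

```c
#define NPAGES 4
#define PAGE_BYTES 8   /* tiny pages keep the toy readable */

struct toy_gtt {
    char phys[NPAGES][PAGE_BYTES]; /* stand-in for system memory pages */
    int pte;                       /* the single GTT entry: index into phys */
    int cached_pte;                /* GPU-side prefetched copy of that entry */
};

/* Point the GTT entry at a different page. If flush is zero the GPU's
 * cached copy is left stale, which is the failure mode being guessed at. */
static void toy_rebind(struct toy_gtt *g, int new_page, int flush)
{
    g->pte = new_page;
    if (flush)
        g->cached_pte = new_page;
}

/* In this model the GPU always translates through its cached entry. */
static const char *toy_gpu_read(const struct toy_gtt *g)
{
    return g->phys[g->cached_pte];
}
```

With page 0 holding leftover file data and page 1 holding the freshly written commands, a rebind to page 1 without the flush leaves `toy_gpu_read()` returning the leftover data, while a rebind with the flush returns the commands.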
btw, the reason I came up with this: Just yesterday I experienced a strangely corrupted pixmap on my i855GM: 8 pixels high (TILE_X height) and about half the screen wide (around 512 pixels, i.e. 4 pages of TILE_X tiles, as wide as the pixmap) with nice colorful garbage. This got me thinking that maybe we're dealing with corruptions in GTT_PAGE_SIZE quantities. This was after about an hour of hitting the box with x11perf and filling the disk with wtfs, and also the first time I've ever seen something like this.
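The arithmetic behind "512 pixels, i.e. 4 pages" can be checked under a few assumptions that are not stated in the comment: each tile fills exactly one 4096-byte page, a tile is 8 rows high as described, and the pixmap uses 4 bytes per pixel. The function name is hypothetical.

```c
/* Width in pixels covered by ntiles side-by-side tiles, assuming each
 * tile fills exactly one page of page_bytes and is tile_rows rows high. */
static unsigned tiles_width_pixels(unsigned page_bytes, unsigned tile_rows,
                                   unsigned bytes_per_pixel, unsigned ntiles)
{
    unsigned row_bytes = page_bytes / tile_rows;   /* bytes in one tile row */

    return ntiles * (row_bytes / bytes_per_pixel);
}
```

With those assumptions one tile row is 4096 / 8 = 512 bytes, or 128 pixels, so four tiles side by side span 512 pixels, which matches the size of the corrupted strip.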

Yes, I had thought it possible that this could be a missing flush after updating the PTEs. I added a few more flushes along those paths, just in case. (Though that does not conclusively rule that out.)

(In reply to comment #34)
> I can't test this theory though, because I can't reproduce the bug on my i855.
Daniel, do you have a patch that would enable testing your theory?
I could apply it and see if my system keeps the GPU alive for more than a few hours (it wedges at least once a day, usually more often, and half of the time X exits).
Here it wedges with normal desktop usage (Enlightenment + GTK applications)

Created attachment 33750
i915_error_state from my i855
I'm not really fluent in reading these dumps, but the block of zeros right before ACTHD looks very fishy: 64 bytes in length and nicely size-aligned, i.e. a cache-line gone wrong.
The hang occurred right after a fresh bootup, i.e. the memory has been cleared by the power reset. That might explain the zeros instead of some other random crap.
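Spotting such a block by eye is error-prone, so a dump could be scanned for it mechanically. The sketch below assumes the batchbuffer has already been parsed into an array of dwords together with the GPU address of the first one, and that a cache line is 64 bytes as stated above; the function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u

/* Hypothetical scan: words[] holds n dwords dumped starting at GPU
 * address base. Returns the address of the first 64-byte-aligned block
 * that is entirely zero, or UINT32_MAX if there is none. */
static uint32_t first_zero_cache_line(const uint32_t *words, size_t n,
                                      uint32_t base)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint32_t); /* 16 */

    for (size_t i = 0; i + per_line <= n; i++) {
        uint32_t addr = base + (uint32_t)(i * sizeof(uint32_t));
        size_t j = 0;

        if (addr % CACHE_LINE_BYTES != 0)
            continue;               /* only consider aligned starts */
        while (j < per_line && words[i + j] == 0)
            j++;
        if (j == per_line)
            return addr;
    }
    return UINT32_MAX;
}
```

A run of zeros that straddles two cache lines is deliberately not reported, since only a whole aligned line fits the "cache-line gone wrong" reading.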

(In reply to comment #37)
> Chris, I've tried to prove my theory one way or another by massively increasing
> the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list
> with the following patch:
Sadly that also forces the domain change to CPU, so stresses the CPU flushing paths as well; not quite clear cut.
The code to poke around in is drivers/char/agp/intel-agp.c. In particular, intel_i8xx_tlbflush(). But there do seem to be some other odd discrepancies between gen2 and later.

(In reply to comment #40)
> (In reply to comment #27)
> > (In reply to comment #26)
> > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > jbarnes
> > Upstream, no effect (obviously).
> >
> Where is this upstream? I am using linus' tree and patch is not there; I need
> this patch for a different bug: without it the screen is OFF
Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a couple of weeks now - surprisingly since it is marked for stable.

(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #27)
> > > (In reply to comment #26)
> > > > - http://bugzilla.kernel.org/attachment.cgi?id=25019 - 855nolid.patch by
> > > > jbarnes
> > > Upstream, no effect (obviously).
> > >
> > Where is this upstream? I am using linus' tree and patch is not there; I need
> > this patch for a different bug: without it the screen is OFF
>
> Still waiting for Linus to pull, apparently. It's been in drm-intel-next for a
> couple of weeks now - surprisingly since it is marked for stable.
>
Yes because on these laptops it's really necessary, otherwise the display is not even recognized.
Perhaps I should also use drm-intel-next kernel for these tests?
I am also available to try other patches and run other tests; apparently my i855GM (rev 02) is very quick at crashing. Unfortunately I am not following the reasoning of you both very well, but there seems to be some light.
The worst thing is that the bug shows up in different ways and development efforts are seemingly scattered around (all distros' bugtrackers are filled up with reports about i8xx devices).

> --- Comment #39 from Chris Wilson <chris@chris-wilson.co.uk> 2010-03-04 01:03:40 PST ---
> (In reply to comment #37)
> > Chris, I've tried to prove my theory one way or another by massively increasing
> > the gtt map/unmap: I simply unmap every bo as soon as it hits the inactive list
> > with the following patch:
>
> Sadly that also forces the domain change to CPU, so stresses the CPU flushing
> paths as well; not quite clear cut.
Of course, you're right, this test only strongly points at a problem in
either object_unbind and/or object_bind. But I think we can rule out cpu
flushing with a high probability:
- We only clflush on moving away from the cpu domain.
- Most access is via gtt, no intermediate cpu access. So the clflush
should be a no-op (for the hw).
> The code to poke around in is drivers/char/agp/intel-agp.c. In particular,
> intel_i8xx_tlbflush(). But there do seem to be some other odd discrepancies
> between gen2 and later.
My next step is to check whether these gtt writes end up someplace else
(i.e. most likely on the agp scratch page). If they do, we have mixed up
gtt entries somewhere, if they don't the problem is definitely somewhere
else.

Created attachment 33751
Flush the GTT by disabling/enabling it.
The Broadwater errata notes that PTE entries that have been prefetched are not correctly invalidated when the GTT is updated. It goes on to note that in these situations: (1) don't do that, (2) flush the GTT - but helpfully forgets to mention how to actually enact the flush.
Instead, zap the GTT by disabling it and re-enabling it after every update. Fortunately, this does not appear to impact on throughput too much.

Created attachment 33752
Xorg 1.7.5 log with patched kernel
(In reply to comment #44)
> Created an attachment (id=33751)
> Flush the GTT by disabling/enabling it.
>
Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as always, however I noted no font glitches (apparently) and only a minor glitch (a couple of lines missing) on a desktop icon.
Xorg lasted less than 2 minutes, and starting firefox most probably nuked it

(In reply to comment #45)
> Xorg lasted less than 2 minutes, and starting firefox most probably nuked it
*sigh* it was doing so well here, surviving the x11perf test using both CPU and GTT mappings. Did you manage to grab an i915_error_state, so that we can see what manner of corruption remains?

(In reply to comment #45)
> Created an attachment (id=33752)
> Xorg 1.7.5 log with patched kernel
>
> Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as
> always, however I noted no font glitches (apparently) and only a minor glitch
> (a couple of lines missing) on a desktop icon.
With the patch on top of the latest intel-drm-next kernel you can grab <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful for Chris and Daniel. http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git

Created attachment 33756
i915_error_state with GTT enable/disable patch
Can confirm said crash appearing reliably a few seconds into X. My chipset is also 82852/855GM (rev 02), so this should probably be relevant to the bug at hand.

(In reply to comment #47)
> (In reply to comment #45)
> > Created an attachment (id=33752)
> > Xorg 1.7.5 log with patched kernel
> >
> > Just tried this one on vanilla linus' tree, it crashes (see Xorg log) as
> > always, however I noted no font glitches (apparently) and only a minor glitch
> > (a couple of lines missing) on a desktop icon.
>
> With the patch on top of the latest intel-drm-next kernel you can grab
> <debugfs>/dri/0/i915_error_state after the hang. That would probably be useful
> for Chris and Daniel.
> http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git
>
It will take about 15 minutes to git-clone it, I had deleted it some days ago out of frustration; next I'll grab i915_error_state after the crash and attach it here.
I have the same hardware as 2points so you can already look at his attachment 33756; however, we'll later check that they show the same bug, as a double-check.

As usual, something is strange in that dump. The strange part is that it looks perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The odd part is that the last loaded instruction (IPEHR) corresponds to several instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently grabbing the next QWord from) - presuming what the CPU read back is consistent with what is being read by the GPU. Hmm.
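The comparison described here, where the instruction held in IPEHR sits relative to ACTHD, can be sketched as a backwards search through the dumped batch. This assumes the batch is available as an array of dwords with a known starting GPU address; the function name is made up, and a value that happens to repeat in the batch would make the answer ambiguous.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: batch[] holds n dwords starting at GPU address
 * base. Searching backwards from ACTHD, return how many dwords before
 * ACTHD the value in IPEHR last appears, or -1 if it does not appear
 * or ACTHD lies outside the dump. */
static long dwords_from_ipehr_to_acthd(const uint32_t *batch, size_t n,
                                       uint32_t base, uint32_t acthd,
                                       uint32_t ipehr)
{
    size_t head;

    if (acthd < base || (acthd - base) / 4 >= n)
        return -1;
    head = (acthd - base) / 4;      /* index of the dword at ACTHD */

    for (size_t back = 0; back <= head; back++) {
        if (batch[head - back] == ipehr)
            return (long)back;
    }
    return -1;
}
```

A result of 0 means IPEHR matches the dword at ACTHD itself; a larger result means the last loaded instruction lies that many dwords behind the active head.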

(In reply to comment #50)
> As usual, something is strange in that dump. The strange part is that it looks
> perfectly fine, even the IPEHR shouldn't have been a trigger for a hang. The
> odd part is that the last loaded instruction (IPEHR) corresponds to several
> instructions prior to ACTHD (ACTive HeaD, where the DMA engine is currently
> grabbing the next QWord from) - presuming what the CPU read back is consistent
> with what is being read by the GPU. Hmm.
>
Maybe this is another bug (evil twin bug 24789)? That needs another patch?
My laptop has one of the early intel centrino (single-core of course) CPU (1.6 Ghz), with the wicked i8042 controller, but I wouldn't infer that it is on the CPU side either...

Created attachment 33760
dri debugfs dumps for i855GM + Xorg.0.log
OK, I built drm-intel with Chris' patch and rebooted; first time I forgot "nomodeset" active and it *detonated* back to the boot screen (it must be able to trigger some kernel/CPU/BIOS failsafe reboot somehow); this must be a unique feature of the most recent 2.6.33 kernels.
Now back on topic: I successfully started Xorg, and it looked great (as if everything was fixed, but more probably I didn't have time to find any glitch); however, when I opened some directory windows (XFCE) it crashed. The mouse cursor was changing when I tried to drag a window, so the underlying system was still breathing (this always happens with bug 24789). I switched to the VT where Xorg was started and hit Ctrl+C, since the crash had already lasted 3-4 seconds and the characteristic I/O error lines were already printed.
In the attachment you can find /sys/kernel/debugfs/dri/{1,64}; does anybody know if it is normal to have entries 1 and 64? With the old intel driver I have only dri/1

I have uploaded the compiled drm-intel (kernel, initramfs and modules) with Chris' patch:
http://www.iragan.com/linux/i855GM/
you can find it under the *kernel directory.
This was built with my .config so might not work properly on all laptops, but surely should boot

@legolas558 - I booted the kernel you provided, and it booted, but no better than the one I compiled myself with the lid and big hammer patches. Still froze after several minutes of flipping between VT's and running graphics-intensive programs (inkscape, tuxpaint, etc).
My hardware is:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
I'm afraid I can't follow the conversation here at its level, but I'm willing to compile and test kernels on my hardware :) I'm now looking up how to collect the info you need with intel-gpu-tools. Let me know what I can do to help.
Scott

(In reply to comment #55)
> @legolas558 - I booted the kernel you provided, and it booted, but no better
> than the one I compiled myself with the lid and big hammer patches. Still froze
> after several minutes of flipping between VT's and running graphics-intensive
> programs (inkscape, tuxpaint, etc).
That was drm-intel-next with Chris' patch; in my case it doesn't even last 1 minute, while the kernel patched with the big hammer lasts for longer.
> My hardware is:
> 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE
> Chipset Integrated Graphics Device (rev 01)
>
Mine is:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
So we can infer that Chris' patch works (a bit) with rev01 but not with rev02.
Thanks for taking time to test this.

Created attachment 33835
Debug logs from unpatched drm-intel-next kernel freeze
I compiled the drm-intel-next kernel from git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git (I hope that's correct for drm-intel-next). I didn't use any patches initially for testing. Also, I would have been unable to apply Chris Wilson's msleep patch from comment #14, as the code has changed and I couldn't see where it should go. The kernel booted and ran fine for about 15 minutes under load from running XFCE with movie trailers, switching VT's and using tuxpaint, all of which are typically what will crash it the quickest.
I've enclosed dmesg output as well as the contents of /sys/kernel/debug/dri/0 -- hopefully that's all you need for now. I didn't patch with the big-hammer patch because the last kernels I tried with that patch didn't seem to improve things much, but I can do that and provide the logs if it would help.
Let me know what I can test next.
Thanks!
Scott

Created attachment 34016
Hack that prevents freezing
At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is this command being sent out properly?
Obviously, this patch causes fairly garbled graphics.

(In reply to comment #58)
> At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a
> patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is
> this command being sent out properly?
>
> Obviously, this patch causes fairly garbled graphics.
>
I am using a freedesktop git development stack, created with this script:
http://bit.ly/b2sJVO
I have applied your patch before building xf86-video-intel and recompiled my drm-intel kernel to be modular (CONFIG_DRM=m,CONFIG_DRM_i915=m and also CONFIG_FB=m) so that the freedesktop compiled modules can be used instead.
However, I can't boot with drm being modular! It simply shows a black screen (turned OFF) and there's no way to get a working console... (I can log in and send commands but I am blind without a screen)
Does anybody have any hint? I have never been able to boot with KMS enabled and drm modular; the mkinitcpio configuration facilities (for customization of the initramfs) of this Arch Linux box do not seem to do anything.

Created attachment 34194
(hopefully) fix gtt cache coherency
This patch seems to fix any gtt related cache coherency problems, at least for my i855GM.
It's quite large, but that's just due to me first needing to clean up intel-agp.c before I could see clearly what's going on. This patch is also not yet polished, so don't look at it and expect beauty ;)
It contains a totally paranoid cache coherency checker with the absolutely minimal set of memory barriers and cache flush before/after the chipset flush. If it detects any inconsistency, it prints a warning + backtrace in the dmesg. Ratelimited to one warning per direction per minute. Don't use this cache coherency checker for unpatched kernels. On my i855GM, cpu->gtt transfers fail with > 1% chance, gtt->cpu transfers fail with > 50% chance on unpatched kernels. So it'll only spam your dmesg.
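The "one warning per direction per minute" behaviour can be illustrated with a small rate limiter. This is a userspace sketch of the idea only, not the code in the attached patch: the type and function names are invented, and the caller is assumed to supply the current time in seconds.

```c
#include <stdint.h>

enum flush_dir { CPU_TO_GTT = 0, GTT_TO_CPU = 1, NUM_DIRS = 2 };

#define WARN_INTERVAL_SECS 60

struct warn_limiter {
    int64_t last_warn[NUM_DIRS];   /* time of last warning, or -1 if none */
};

static void warn_limiter_init(struct warn_limiter *w)
{
    for (int d = 0; d < NUM_DIRS; d++)
        w->last_warn[d] = -1;
}

/* Returns 1 if a warning for this direction may be printed at time now
 * (in seconds) and records it; returns 0 if one was already printed for
 * the same direction less than a minute earlier. */
static int warn_allowed(struct warn_limiter *w, enum flush_dir dir,
                        int64_t now)
{
    if (w->last_warn[dir] >= 0 &&
        now - w->last_warn[dir] < WARN_INTERVAL_SECS)
        return 0;
    w->last_warn[dir] = now;
    return 1;
}
```

Keeping one timestamp per direction means a burst of cpu->gtt failures cannot suppress the first gtt->cpu warning, which is the property the description above asks for.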
This patch also includes the unbind-inactive-objects patch to really trash on the gtt stuff. Also trashes performance, so expect a sluggish feel when testing this.
Patch also prints out the number of completed chipset flushes in regular intervals. If you test this, wait until at least 1 million chipset flushes have been done (or a chipset flush failed) before declaring that it works. Even better is 10 million. On my i855 that's about two hours of glxgears & openarena.
Suspend/resume only lightly tested. It might break the cache coherency checker (but should not).
In case this patch doesn't work and you get backtraces about failed flushes, please attach your full dmesg. If it works, please report on what hw (lspci -nn) and how many chipset flushes (more than 10M would be great) have been done.
Patch is against -rc1 but should apply to latest drm-intel, too. Don't use any other patches when testing this.
Thanks, Daniel

Created attachment 34220[details]
dmesg with gtt cache coherency patch
Nearly got to one million flushes before X froze. Noticed a few warnings about failed flushes before, but apparently to no critical effect (yet).

> --- Comment #63 from 2points@gmx.org 2010-03-18 14:58:51 PST ---
> Created an attachment (id=34220)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34220)
> dmesg with gtt cache coherency patch
>
> Nearly got to one million flushes before X froze. Noticed a few warnings about
> failed flushes before, but apparently to no critical effect (yet).
Thanks for testing this. Looks like it doesn't work as advertised, given
that you have an i855GM, too. I need to get back to the drawing board.

> Thanks for testing this. Looks like it doesn't work as advertised, given
> that you have a i855GM, too. I need to get back to the drawing board.
Would you still like the patch to be tested on other hardware, or should we consider it obsolete now?

> --- Comment #65 from Geir Ove Myhr <gomyhr@gmail.com> 2010-03-18 15:38:34 PST ---
> > Thanks for testing this. Looks like it doesn't work as advertised, given
> > that you have a i855GM, too. I need to get back to the drawing board.
>
> Would you still like the patch to be tested on other hardware, or should we
> consider it obsolete now?
Well, Chris tested it on his i845 and it had no effect there. If you have
something else around and some time to waste, testing wouldn't hurt.
Perhaps my cache coherency checker uncovers some other stuff. Anyway, I
have the feeling that the i845 and the i855GM bugs are two different
things, so I've created a new bug to keep track of my crusade to fix the
i855 here:
bug # 27187
So if you test, please report your findings there.

Created attachment 34233[details]
section of dmesg with GTT flush failures
(In reply to comment #61)
> Created an attachment (id=34194) [details]
> (hopefullyy) fix gtt cache coherency
>
> This patch seems to fix any gtt related cache coherency problems, at least for
> my i855GM.
>
Hi Daniel, I finally took time to test this.
I also have your hardware and with this patch I finally can use Xorg 1.7.5 and the modern intel driver!
So this patch obsoletes the DRM big hammer patch that I was previously using and that gave me about 5 minutes of working Xorg.
With your patch Xorg can be used for a long time (more than 1 hour now and not crashed yet), but I have seen GTT flush failures in dmesg; please see the attachment.
Did you see that? Flush failures at flush number 16384 and 32768! Am I just lucky or is there a reason for such numbers being powers of two?
So I will be using your patch since even with some failures it doesn't crash Xorg as the linus/drm-intel trees do; I'd even propose it for submission, because it makes the hardware usable!
Thanks

It finally crashed when playing a video, but I think this is a totally separate bug, perhaps related to the hangcheck timer; by the way, can somebody check if bug 26723 is a duplicate of bug 24789 or of bug 26345 (this one)?
I am asking because with this patch I get:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
after the crash triggered by playing a video.
Also I can see a pattern in the failed flushes:
~$ dmesg|grep -F "flush no"
[ 30.890838] chipset flush no. 0
[ 245.758686] chipset flush no. 16384
[ 403.685617] chipset flush no. 32768
[ 594.005756] chipset flush no. 49152
Do we have a crazy carry bit somewhere?

> --- Comment #68 from legolas558 <legolas558@email.it> 2010-03-19 04:38:05 PST ---
> Also I can see a pattern in the failed flushes:
>
> ~$ dmesg|grep -F "flush no"
> [ 30.890838] chipset flush no. 0
> [ 245.758686] chipset flush no. 16384
> [ 403.685617] chipset flush no. 32768
> [ 594.005756] chipset flush no. 49152
>
> Do we have a crazy carry bit somewhere?
Nothing crazy is going on. This just prints out the number of chipset
flushes done every 16*1024 flushes, so that we know how reliably the
thing works. The cache coherency problems caught by my checker print out
"chipset flushed failed". So you have to count these (plus the ones
suppressed by the ratelimiting code; watch out for "xx callbacks suppressed"
in your dmesg). Then divide them by the number of flushes (as printed in
your dmesg snippet above) and you have a ballpark figure for how reliable
the chipset flushing is. Obviously anything bigger than zero is
unacceptable.
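
The counting described above can be sketched in a few lines of Python. This is only an illustration, not the script attached to this bug. The message patterns are assumptions taken from the text in this thread ("chipset flush no. N", "chipset flush failed" or "chipset flushed failed", and the ratelimiter's "N callbacks suppressed"), and the function name is made up.

```python
import re

def flush_failure_ratio(dmesg_text):
    """Return (failures, flushes, ratio) estimated from dmesg output.

    Assumed message formats (taken from this thread, not from the patch):
      "chipset flush no. N"       progress line, printed every 16*1024 flushes
      "chipset flush(ed) failed"  one warning from the coherency checker
      "N callbacks suppressed"    warnings dropped by the ratelimiter
    Note: "callbacks suppressed" lines from unrelated ratelimited messages
    would be counted too, so the result is an upper-bound style estimate.
    """
    flushes = 0
    failures = 0
    for line in dmesg_text.splitlines():
        m = re.search(r"chipset flush no\. (\d+)", line)
        if m:
            # Keep the highest flush count seen so far.
            flushes = max(flushes, int(m.group(1)))
            continue
        if re.search(r"chipset flush(ed)? failed", line):
            failures += 1
            continue
        m = re.search(r"(\d+) callbacks suppressed", line)
        if m:
            failures += int(m.group(1))
    ratio = failures / flushes if flushes else None
    return failures, flushes, ratio
```

Feeding it the saved output of dmesg gives the number of failures, the last printed flush count, and their quotient (or None if no flush count was printed yet).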

(In reply to comment #58)
> Created an attachment (id=34016) [details]
> Hack that prevents freezing
>
> At least some of the hangs seem to be related to XY_COLOR_BLT, so I made a
> patch to disable that in the DDX. And now my testcase doesn't hang the GPU. Is
> this command being sent out properly?
Gah, missed this. Sorry Brian. Yes, it does seem that we could emit a solid fill that exceeds the surface bounds. Not quite sure what is generating such nonsense, but it will at least be resolved by:
commit 0c47195ca805881e3fbd5b9224be5c930feeeb8c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Mar 24 17:37:39 2010 +0000
i830: Clip solid fills to surface.
There is a reasonable surfeit of evidence to support this error,
for instance: http://bugs.freedesktop.org/attachment.cgi?id=34417

Created attachment 34429[details]
Chipset flushing quality script for freedesktop bug 26345
fixed script to correctly show failures ratio
now using patch v5, no failures within first 32768 flushes - will report further when a big amount of flushes have been done

Created attachment 34437[details]
dmesg after 7 GTT flush failures
I got 7 failures within the first 300k flushes; I have attached dmesg and debugfs dri dumps.
Failures seem harder to trigger now. I had to open 4 glxgears and one xeyes to trigger them.
I am using latest drm-intel kernel with patch in attachment 34377[details][review]
I am using the stock packages from Arch Linux (xorg-server 1.7.5-902, xf86-video-intel 2.10.0, intel-dri 7.7, libdrm-git).
It still crashes with the hangcheck bug when playing videos, but very rarely now.

Tried Daniel's patch from comment #61 with drm-intel-next. Didn't work on my VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01). In fact it was a little worse than the stock drm-intel-kernel as far as running time until freezing (using dwm -- time running xfce was about the same: mere seconds)
Scott

@Scott: in my case it lasts somewhat longer, but I have recently experienced sudden crashes... so D. Vetter's patch might be the way to go, but Intel has clearly not released a good open source driver for these devices since the beginning (Xorg 1.6 and the old (non-KMS) driver are definitely usable, at least).

hi there,
I'm also bitten by occasional x freezes on a i855GM rev 02 and I am following the respective bug reports.
I think a lot has been tested already, but in my case, I get those freezes *only* in a dual-head setup using LVDS and VGA together. I *never* had a freeze using only the LVDS on my laptop, linux 2.6.33 + libdrm 2.4.18-3 + xserver-xorg-video-intel 2.10.903 (all on debian) seems reasonably stable in this case.
hth, ben

Created attachment 34664[details][review]
Graphics-breaking workaround: skip i830_uxa_solid
At least with x11perf, I could not reproduce the hang with this patch. We should figure out if there are cases where the GPU can still hang if XY_COLOR_BLT is never called.
That most recent dump appears to point to a different operation, but in my testing, not all of my dumps implicated XY_COLOR_BLT and yet eliminating that prevented all hangs as far as I could tell.
If it never hangs without using XY_COLOR_BLT, perhaps we could find a substitute. If it can hang on other operations, we could eliminate them one-by-one to make a list of the problematic opcodes and see what they might have in common.

(In reply to comment #86)
> Created an attachment (id=34664) [details]
> Graphics-breaking workaround: skip i830_uxa_solid
Brian, are you saying that commenting out most of i830_uxa_solid still works better, even after the clip solids commit from comment # 72?

I haven't had a chance to test on the affected machine lately, so I don't know if that patch fixed the x11perf hangs. However, I didn't see invalid bounds in any of the dumps, so I don't see how it could. I'll test x11perf again next chance I get, though.

Created attachment 34676[details]
output of intel gpu dump after a crash.
The attached file is the output of intel_gpu_dump after a crash obtained while using the patch in Comment #86. I'm sorry I have no idea what the relevant part can be, so I just upload it all. See below for a small excerpt, though (containing the lines which mention 'HEAD' and 'TAIL').
Also I must say that the patch has some side effects: some text (mostly Gnome menus or text in Gnome applets) sometimes doesn't appear on the screen, and eventually appears if I roll over it with the mouse or highlight it in some way. I guess these were expected... if not, I can take screenshots.
Preview of the attached file:
ACTHD: 0x0686c000
EIR: 0x00000000
EMR: 0xffffff69
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x18000001
IPEIR: 0x00000000
INSTDONE: 0x01ffffc1
(7840 lines not shown)
0x00007a5c: 0x0686c001: MI UNKNOWN
0x00007a60: 0x0686c01c: MI UNKNOWN
0x00007a64: HEAD 0x00000000: MI_NOOP
0x00007a68: 0x02000004: MI_FLUSH
0x00007a6c: 0x00000000: MI_NOOP
0x00007a70: 0x10800001: MI_STORE_DATA_INDEX
0x00007a74: 0x00000080: dword 1
0x00007a78: 0x000ddabf: dword 2
0x00007a7c: 0x01000000: MI_USER_INTERRUPT
0x00007a80: 0x02000000: MI_FLUSH
0x00007a84: 0x00000000: MI_NOOP
0x00007a88: 0x10800001: MI_STORE_DATA_INDEX
0x00007a8c: 0x00000080: dword 1
0x00007a90: 0x000ddac0: dword 2
0x00007a94: 0x01000000: MI_USER_INTERRUPT
0x00007a98: TAIL 0x02000004: MI_FLUSH
0x00007a9c: 0x00000000: MI_NOOP
0x00007aa0: 0x18000001: MI UNKNOWN
(24921 more lines not shown)
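
For anyone wanting to pull the same HEAD-to-TAIL excerpt out of a full intel_gpu_dump file without scrolling through thousands of lines, here is a minimal Python sketch. It assumes the ring lines look like the ones quoted above, with HEAD and TAIL as separate words, and that HEAD comes before TAIL in the dump (no wrap-around handling). The function name is hypothetical and not part of intel_gpu_dump.

```python
def ring_between_head_and_tail(dump_text):
    """Return the dump lines from the HEAD marker to the TAIL marker, inclusive.

    Assumes the markers appear as separate words on a ring line, as in the
    excerpt above, and that HEAD comes before TAIL (no wrap-around).
    Returns an empty list if either marker is missing.
    """
    lines = dump_text.splitlines()
    head = next((i for i, l in enumerate(lines) if " HEAD " in l), None)
    tail = next((i for i, l in enumerate(lines) if " TAIL " in l), None)
    if head is None or tail is None or tail < head:
        return []
    return lines[head:tail + 1]
```

Printing the returned lines one per line reproduces the part of the ring between the two markers.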

Thanks for testing. I just wanted to make sure there were other graphics operations that could hang the GPU. Turns out there are. Yeah, that patch messes up graphics, because I'm ignoring all requests to fill a rectangular region with a solid color, which is needed to clear off a pixmap before drawing on it.

I tried to save the output of intel_gpu_dump for the last few crashes that happened to me. I don't know how to interpret these, but I thought I would share this with you :
# grep IPEHR *
intelgpudump-2010-04-05_17:23:34:IPEHR: 0x18000001
intelgpudump-2010-04-06_12:25:37:IPEHR: 0x41500000
intelgpudump-2010-04-06_13:24:52:IPEHR: 0x18000001
intelgpudump-2010-04-06_14:16:10:IPEHR: 0x41600000
intelgpudump-2010-04-06_14:32:29:IPEHR: 0x05000000
intelgpudump-2010-04-06_16:54:39:IPEHR: 0x18000001
intelgpudump-2010-04-07_13:29:26:IPEHR: 0x54300004
intelgpudump-2010-04-07_13:48:38:IPEHR: 0x04caf6e4
intelgpudump-2010-04-09_11:07:47:IPEHR: 0x18000001
intelgpudump-2010-04-09_11:58:45:IPEHR: 0x18000001
intelgpudump-2010-04-09_14:46:27:IPEHR: 0x18000001
intelgpudump-2010-04-09_16:34:45:IPEHR: 0x05000000
intelgpudump-2010-04-10_14:16:59:IPEHR: 0x0a103078
intelgpudump-2010-04-10_15:34:04:IPEHR: 0x00000000
(the date/time is when the dump was taken, which obviously is a few seconds/minutes after the crash happens)
From what I could understand from comment #10, the value of IPEHR could be almost anything, depending on (disk) activity. Am I experiencing distinct bugs? Is the value of IPEHR not linked to the crash? I have no idea.
Full logs (intel_gpu_dump output, usually also the dmesg output) are at http://dl.free.fr/v9nxyAGHx as a tar.bz2 file, in case they are of any interest.
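
To see at a glance how the IPEHR values are distributed over the dumps, the grep output can be tallied. The following Python sketch only counts the values as printed; it makes no claim about what any particular IPEHR value means. The function name is made up for illustration.

```python
import re
from collections import Counter

def tally_ipehr(grep_output):
    """Count how often each IPEHR value occurs in `grep IPEHR *` output."""
    values = re.findall(r"IPEHR:\s*(0x[0-9a-fA-F]+)", grep_output)
    return Counter(v.lower() for v in values)
```

Applied to the 14 lines listed above, this gives 0x18000001 six times, 0x05000000 twice, and each of the remaining six values once.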

I have been testing different kernels under Lucid on a desktop computer equipped with a 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01) with KMS and dri enabled. Analysing the underlying code is beyond my capabilities, nevertheless I would like to help.
1) The standard Lucid kernel leads to a crash usually within the first 10 min of usual usage (Openoffice, Firefox).
2) drm-intel-next kernels as provided by http://kernel.ubuntu.com/~kernel-ppa/mainline/ or https://launchpad.net/~brian-rogers/+archive/experimental, the latter with Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel usually crashes after 1 to 3h, the later ones do not survive the very first mode switch, even before Xorg is started, regardless of whether V9 is applied. This kind of crash does not allow for ssh or a graceful reboot.
3) The most interesting kernel seems to me Daniel Baumann's ( https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8 patch applied. Usually it crashes within the first 3h of usage. But now I have hit a >20h period and cannot kill it, by normal usage, running GL screensaver, switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL screensaver again ..., passing the x11perf test several times, .... /sys/kernel/debug/dri/0/i915_error_state, however shows
Time: 1274367670 s 906969 us
EIR: 0x00000010
PGTBL_ER: 0x00000049
INSTPM: 0x00000000
IPEIR: 0x00000000
IPEHR: 0x01000000
INSTDONE: 0x00ffffc0
ACTHD: 0x00000048
That one seems to happen always when the computer starts.
Given the randomness of the time after which the crashes/hangs happen, and the extent to which different i8xx based machines are affected (e.g. my i855 based laptop runs perfectly with a V8 patched kernel), it is difficult to interpret the above findings on the 82845G. What is not random, I think, are the early deaths with later drm-intel-next kernels (2); also, the times to crash of (3) do not look Gaussian distributed. The comparison of (1) and (3) seems to indicate that the V8 patch fixes one kind of mishap also for 82845G based hardware.
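
When comparing i915_error_state excerpts like the one quoted in this comment, it can help to turn the register lines into numbers and list which bits are set. The Python sketch below does only that. It assumes lines of the form "NAME: 0xVALUE" as shown above, and it does not attempt to name the bits, which would have to be looked up in the hardware documentation. Both function names are hypothetical.

```python
import re

def parse_error_state(text):
    """Map register names to integer values for lines like 'EIR: 0x00000010'."""
    regs = {}
    for line in text.splitlines():
        # Upper-case register name, a colon, then a single hex value.
        m = re.match(r"\s*([A-Z_]+):\s*(0x[0-9a-fA-F]+)\s*$", line)
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs

def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]
```

For the values quoted above, EIR (0x00000010) has only bit 4 set and PGTBL_ER (0x00000049) has bits 0, 3 and 6 set.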

(In reply to comment #94)
> 2) drm-intel-next kernels as provided by
> http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> crashes usually after 1 to 3h the later ones do not survive the very first mode
> switch even before Xorg is started, regardless V9 applied or not. This kind of
> crash does not allow for ssh or gracefully rebooting.
Are you sure that V9 is correctly applied? That early crash has always been an indicator of missing Daniel Vetter's patch (at least up to now).
> 3) The most interesting kernel seems to me Daniel Baumann's (
> https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> patch applied. Usually it crashes within the first 3h of usage. But now I have
> hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> screensaver again ..., passing the x11perf test several times, ....
> /sys/kernel/debug/dri/0/i915_error_state, however shows
> Time: 1274367670 s 906969 us
> EIR: 0x00000010
> PGTBL_ER: 0x00000049
> INSTPM: 0x00000000
> IPEIR: 0x00000000
> IPEHR: 0x01000000
> INSTDONE: 0x00ffffc0
> ACTHD: 0x00000048
> That one seems to happen always when the computer starts.
>
You mean that no crashes are found in dmesg during these >20h sessions? Do you have the logs to check it out?
> Given the randomness of the time after which the crashes/stucks happen and to
> which extent different i8xx based machines are affected (e.g. my i855 based
> laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> above findings on the 82845G. What is not random, I think, are the early deaths
> with later drm-intel-next kernels (2), and also the times to crash of (3) does
> not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> that the V8 patch fixes one kind of mishap also for 82845G based hardware.
Have you considered variability of the rest of code (drm-intel-next, Xorg, drivers/libraries), if there is any in your tests?
On my 855GM (rev2) I have reached a very stable situation, except that overlays (created by VLC or mplayer) do always crash the machine in a fairly short amount of time. I have also some background corruption but that might be a new bug in libdrm or in something else.
Can you confirm that the crashes you experienced were someway connected to video playing?

(In reply to comment #95)
> (In reply to comment #94)
> > 2) drm-intel-next kernels as provided by
> > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> > crashes usually after 1 to 3h the later ones do not survive the very first mode
> > switch even before Xorg is started, regardless V9 applied or not. This kind of
> > crash does not allow for ssh or gracefully rebooting.
> Are you sure that V9 is correctly applied? That early crash has always been an
> indicator of missing Daniel Vetter's patch (at least up to now).
It is the kernel which Brian Rogers has built and announced in https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all comment #105; I somehow trust him. I also tried to patch the provided sources with V9, only to find that it had already been applied.
>
> > 3) The most interesting kernel seems to me Daniel Baumann's (
> > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> > patch applied. Usually it crashes within the first 3h of usage. But now I have
> > hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> > screensaver again ..., passing the x11perf test several times, ....
> > /sys/kernel/debug/dri/0/i915_error_state, however shows
> > Time: 1274367670 s 906969 us
> > EIR: 0x00000010
> > PGTBL_ER: 0x00000049
> > INSTPM: 0x00000000
> > IPEIR: 0x00000000
> > IPEHR: 0x01000000
> > INSTDONE: 0x00ffffc0
> > ACTHD: 0x00000048
> > That one seems to happen always when the computer starts.
> >
> You mean that no crashes are found in dmesg during these >20h sessions? Do you
> have the logs to check it out?
First of all I wish to emphasize that it is one lucky strike which I have hit. More frequently this kernel gets stuck as well. Therefore I left the machine running the screensaver and it is still doing well. In dmesg there is a single drm related error during startup:
[ 23.288050] render error detected, EIR: 0x00000010
[ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
[ 23.288085] render error detected, EIR: 0x00000010
now the last line is:
[54005.830290] svc: failed to register lockdv1 RPC service (errno 97).
>
> > Given the randomness of the time after which the crashes/stucks happen and to
> > which extent different i8xx based machines are affected (e.g. my i855 based
> > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> > above findings on the 82845G. What is not random, I think, are the early deaths
> > with later drm-intel-next kernels (2), and also the times to crash of (3) does
> > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> > that the V8 patch fixes one kind of mishap also for 82845G based hardware.
> Have you considered variability of the rest of code (drm-intel-next, Xorg,
> drivers/libraries), if there is any in your tests?
>
> On my 855GM (rev2) I have reached a very stable situation, except that overlays
> (created by VLC or mplayer) do always crash the machine in a fairly short
> amount of time. I have also some background corruption but that might be a new
> bug in libdrm or in something else.
>
> Can you confirm that the crashes you experienced were someway connected to
> video playing?
On my i855 based laptop I have tested video only occasionally, not with mplayer or VLC and found no preferential crashing.

(In reply to comment #96)
> (In reply to comment #95)
> > (In reply to comment #94)
> > > 2) drm-intel-next kernels as provided by
> > > http://kernel.ubuntu.com/~kernel-ppa/mainline/ or
> > > https://launchpad.net/~brian-rogers/+archive/experimental, the latter with
> > > Daniel Vetter's V9 patch, behave differently. While the 2010-04-19-lucid kernel
> > > crashes usually after 1 to 3h the later ones do not survive the very first mode
> > > switch even before Xorg is started, regardless V9 applied or not. This kind of
> > > crash does not allow for ssh or gracefully rebooting.
> > Are you sure that V9 is correctly applied? That early crash has always been an
> > indicator of missing Daniel Vetter's patch (at least up to now).
> It is the kernel which Brian Rogers has build and announced in
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/541492?comments=all
> comment #105, I somehow trust him. I also tried to patch the provided sources
> with V9 to find that it had been applied already.
>
I see - can you pick the dump generation script (a small daemon) in attachment 34922[details] to hopefully get snapshots right before and after the total crash? It has worked for me also without keyboard or ssh, and perhaps one of those dumps might contain information about what's happening there.
> >
> > > 3) The most interesting kernel seems to me Daniel Baumann's (
> > > https://launchpad.net/~dnjl/+archive/kernel ), a standard Lucid one with V8
> > > patch applied. Usually it crashes within the first 3h of usage. But now I have
> > > hit a >20h period and cannot kill it, by normal usage, running GL screensaver,
> > > switching VTs, watching HTML5 Youtube videos normal and fullscreen, reverting
> > > to the standard Lucid xserver-xorg-video-intel and libdrms at stopped Xorg, GL
> > > screensaver again ..., passing the x11perf test several times, ....
> > > /sys/kernel/debug/dri/0/i915_error_state, however shows
> > > Time: 1274367670 s 906969 us
> > > EIR: 0x00000010
> > > PGTBL_ER: 0x00000049
> > > INSTPM: 0x00000000
> > > IPEIR: 0x00000000
> > > IPEHR: 0x01000000
> > > INSTDONE: 0x00ffffc0
> > > ACTHD: 0x00000048
> > > That one seems to happen always when the computer starts.
> > >
> > You mean that no crashes are found in dmesg during these >20h sessions? Do you
> > have the logs to check it out?
> First of all I wish to emphsize that it is one lucky strike which I have hit.
> More frequently this kernel gets stuck as well. Therefore I left the machine
> running the screensaver and it is still doing well. In dmesg is a single drm
> related error during startup:
> [ 23.288050] render error detected, EIR: 0x00000010
> [ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
> [ 23.288085] render error detected, EIR: 0x00000010
> now the last line is:
> [54005.830290] svc: failed to register lockdv1 RPC service (errno 97).
>
svc is not related.
In my case (drm-intel-next with patched v9) it never crashes unless I pick some video or some screen-intensive wine application. Perhaps your configuration is not as stable because you are using a compositing window manager?
> >
> > > Given the randomness of the time after which the crashes/stucks happen and to
> > > which extent different i8xx based machines are affected (e.g. my i855 based
> > > laptop runs prefectly with a V8 patched kernel) it is difficult to interpret
> > > above findings on the 82845G. What is not random, I think, are the early deaths
> > > with later drm-intel-next kernels (2), and also the times to crash of (3) does
> > > not look Gaussian distributed. The comparison of (1) and (3) seems to indicate
> > > that the V8 patch fixes one kind of mishap also for 82845G based hardware.
> > Have you considered variability of the rest of code (drm-intel-next, Xorg,
> > drivers/libraries), if there is any in your tests?
> >
> > On my 855GM (rev2) I have reached a very stable situation, except that overlays
> > (created by VLC or mplayer) do always crash the machine in a fairly short
> > amount of time. I have also some background corruption but that might be a new
> > bug in libdrm or in something else.
> >
> > Can you confirm that the crashes you experienced were someway connected to
> > video playing?
> On my i855 based laptop I have tested video only occasionally, not with mplayer
> or VLC and found no preferential crashing.
I suspect that some video-intensive application or window manager could make it crash more frequently; in such case crashes like mine, happening only under specific stress, would be masked because the GPU would be under constant stress. Can this be your case e.g. video-intensive desktop environment or applications being used?

(In reply to comment #97)
> (In reply to comment #96)
> > (In reply to comment #95)
> > > (In reply to comment #94)
snip
> >
> I see - can you pick the dump generation script (a small daemon) in attachment
> 34922 [details] to hopefully get snapshots right before and after the total crash? It has
> worked for me also without keyboard or ssh, and perhaps one of those dumps
> might contain information about what's happening there.
>
Thanks a lot. I will try that soon, once I have access to the 82845G/GL based hardware again. Currently I have only an 82852/855GM (rev 02) based laptop; I guess it's similar to yours, and it is rather well behaved with a V8 patched (see below) Lucid kernel.
> > >
snip
> > related error during startup:
> > [ 23.288050] render error detected, EIR: 0x00000010
> > [ 23.288062] [drm:i915_handle_error] *ERROR* EIR stuck: 0x00000010, masking
> > [ 23.288085] render error detected, EIR: 0x00000010
> > now the last line is:
> > [54005.830290] svc: failed to register lockdv1 RPC service (errno 97).
> >
> svc is not related.
>
> In my case (drm-intel-next with patched v9) it never crashes unless I pick some
> video or some screen-intensive wine application. Perhaps your configuration is
> not as stable because you are using a compositing window manager?
>
My 82845G/GL has much poorer performance than the 855GM and must be different in other respects as well. I guess that the 82845G/GL hardware is affected by an additional shortcoming which is not covered by the V8/V9 patches.
> > >
snip
> > > Have you considered variability of the rest of code (drm-intel-next, Xorg,
> > > drivers/libraries), if there is any in your tests?
> > >
> > > On my 855GM (rev2) I have reached a very stable situation, except that overlays
> > > (created by VLC or mplayer) do always crash the machine in a fairly short
> > > amount of time. I have also some background corruption but that might be a new
> > > bug in libdrm or in something else.
> > >
> > > Can you confirm that the crashes you experienced were someway connected to
> > > video playing?
> > On my i855 based laptop I have tested video only occasionally, not with mplayer
> > or VLC and found no preferential crashing.
> I suspect that some video-intensive application or window manager could make it
> crash more frequently; in such case crashes like mine, happening only under
> specific stress, would be masked because the GPU would be under constant
> stress. Can this be your case e.g. video-intensive desktop environment or
> applications being used?
This is for the 82852/855GM now: I found that Daniel Baumann's kernel ( https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when confronted with Xv-overlay, in a very similar way to what you describe. Stefan Glasenhardt's 855GM-fixed modules ( http://glasen-hardt.de/ , a lot in German only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of crash. In his Changelog it says "* i915-kernel module includes patch to get Xv-overlay mode working again.". That must be something in addition to V8/V9 (I really have to study the sources). If I use the latter modules with Lucid's xserver-xorg-video-intel and libdrms, video overlay in totem player only appears when I fiddle around resizing the window. If I use xserver-xorg-video-intel 2.11.0 and the libdrms 2.4.20 as provided by https://launchpad.net/~glasen/+archive/intel-driver, video overlay works, for me better than ever before on the 855GM. But Compiz gives in when changing from fullscreen to normal view. I guess one will have to recompile Compiz and all its dependencies against the new driver.
So I think for the 855GM the solution is really close, for the 82845G/GL not yet, I am afraid.

(In reply to comment #98)
> This is for the 82852/855GM now: I found that Daniel Baumann's kernel (
> https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when
> confronted with Xv-overlay in a very similar way as you describe it. Stefan
> Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German
> only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of
> crash. In his Changelog it says "* i915-kernel module includes patch to get
> Xv-overlay mode working again." . That must be something in addition to V8/V9
> (I really have to study the sources). If I use the latter modules with Lucid's
> xserver-xorg-video-intel and libdrms video overlay in totem player only appears
> when I fiddle around resizing the window. If I use xserver-xorg-video-intel
Where is such patch? I am begging anybody to bring it under my nose, because my Xorg crashes after a few seconds of video watching and it's becoming very stressing...and furthermore the system is often not recoverable by using VTs.
Flash videos are less likely to trigger the crash, while fullscreen or maximized windows do it best.
I can't find the changelog you are mentioning, where is it? Also looks like he is using only the patches from this bug tracker entry, since he only mentions it on his launchpad page.
> 2.11.0 and the libdrms 2.4.20 as provided by
> https://launchpad.net/~glasen/+archive/intel-driver video overlay works, for me
> better than ever before on the 855GM. But Compiz gives in when changing from
> fullscreen to normal view. I guess one will have to recompile Compiz and all
> its dependencies against the new driver.
> So I think for the 855GM the solution is really close, for the 82845G/GL not
> yet, I am afraid.
It probably has more quirks to be worked out - some hackwork is needed.

"../../intel/intel_bufmgr_gem.c:901: Error setting to CPU domain 3: Input/output error"
Just after "Failed to submit batchbuffer: Input/output error". Only gets written to the console. I'm getting this when I kill off X and try to start it with startx, consistently.
Up to date Lucid (clean, recent reinstall).
(II) intel(0): Integrated Graphics Chipset: Intel(R) 845G

(In reply to comment #99)
> (In reply to comment #98)
> > This is for the 82852/855GM now: I found that Daniel Baumann's kernel (
> > https://launchpad.net/~dnjl/+archive/kernel ) crashes on my hardware when
> > confronted with Xv-overlay in a very similar way as you describe it. Stefan
> > Glasenhardt's 855GM - fixed modules ( http://glasen-hardt.de/ , a lot in German
> > only, https://launchpad.net/~glasen/+archive/855gm-fix ) avoid this kind of
> > crash. In his Changelog it says "* i915-kernel module includes patch to get
> > Xv-overlay mode working again." . That must be something in addition to V8/V9
> > (I really have to study the sources). If I use the latter modules with Lucid's
> > xserver-xorg-video-intel and libdrms video overlay in totem player only appears
> > when I fiddle around resizing the window. If I use xserver-xorg-video-intel
>
> Where is such patch? I am begging anybody to bring it under my nose, because my
> Xorg crashes after a few seconds of video watching and it's becoming very
> stressing...and furthermore the system is often not recoverable by using VTs.
>
I do not know whether you are using an Ubuntu patched kernel. If so, my best bet would be that you need
http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff
A short account of the background is given in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15. Upstream kernels (e.g. the one from Brian Rogers mentioned above in comment #97) do not need that patch, which reverts a previous patch.
> Flash videos are less likely to trigger the crash, while fullscreen or
> maximized windows do it best.
>
> I can't find the changelog you are mentioning, where is it? Also looks like he
> is using only the patches from this bug tracker entry, since he only mentions
> it on his launchpad page.
Rather hidden in one of his German texts he mentions the xv fix. The changelog is the one in Stefan Glasenhardt's package /usr/share/doc/855gm-fix-dkms/changelog.gz when installed.

(In reply to comment #102)
> (In reply to comment #99)
> > Where is such patch? I am begging anybody to bring it under my nose, because my
> > Xorg crashes after a few seconds of video watching and it's becoming very
> > stressing...and furthermore the system is often not recoverable by using VTs.
> >
> I do not know whether you are using an Ubuntu patched kernel. If so, my best
> bet would be that you need
> http://launchpadlibrarian.net/44195111/xv_overlay_mode_fix.diff
> A short account on the background is given in
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/554432 comment #15.
> Upstream kernels do not need that patch, e.g. the one from Brian Rogers
> mentioned up in comment #97, for reverting a previous patch.
>
Your bet was correct! I can confirm that watching even 5 videos altogether no longer crashes the system!
The green-band glitches on vertical resync are still there (and pretty disturbing), but this is probably a separate bug.
> > Flash videos are less likely to trigger the crash, while fullscreen or
> > maximized windows do it best.
> >
> > I can't find the changelog you are mentioning, where is it? Also looks like he
> > is using only the patches from this bug tracker entry, since he only mentions
> > it on his launchpad page.
> Rather hidden in one of his German texts he mentions the xv fix. The changelog
> is the one in Stefan Glasenhardt's package
> /usr/share/doc/855gm-fix-dkms/changelog.gz when installed.
Thanks, I was not looking there because I downloaded the non-DKMS version.

I got one crash while watching a long video; I can't say right now whether it is exactly the same kind of crash as before, and I'll collect debug data next time. Anyway, xv_overlay_mode_fix.diff seems to drastically reduce how often the crash occurs, or possibly eliminates it entirely (if what I hit was a different bug).

I can confirm this same bug on a Dell Optiplex GX260. I see there is a hack for this, but this bug (at least this report) alone has been open for 6 months now with no sign of a fix. Given that this is a show-stopper for this chipset, I would have imagined that someone would have fixed it by now.
Because of this I am offering a $20 reward (payable via Paypal) to the FIRST person that fixes this issue properly, without simply commenting out code, and successfully feeds it into upstream. Once this is done please email me. :)
Thanks,
--Nick Betcher

I wonder how much work it would take if, on the broken chipsets, we just pre-allocate all GTT-mapped memory and make a copy in case a buffer is moved between CPU and GTT domains (that is, like the classic memory manager, only with a new API?). If my understanding is correct, this would eliminate the need for chipset flushes except when the GTT-mapped memory is first allocated (since the memory may have been touched by the CPU), where any failure would likely be detected early.

I'm currently running a recent version of libdrm and xf86-video-intel (not the most up to date, though) and I have to say that, although there are still messages like:
[11876.299009] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[11876.299300] render error detected, EIR: 0x00000000
[11876.299925] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 543130 at 543128)
in dmesg, X is usable and does not crash anymore. This is enough for me (having to reboot three or four times a day, sometimes much more, was really annoying), so thank you for the work on this!
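For anyone who wants to count how often the hang still fires, here is a minimal Python sketch that picks the hangcheck lines out of saved dmesg text. The function name `find_gpu_hangs` and the regular expression are assumptions made for illustration, matched only against the lines quoted above; other kernel versions may word the message differently.

```python
import re

# Assumed pattern: "[seconds.micros]" timestamp followed by the hangcheck
# message exactly as it appears in the dmesg excerpt quoted above.
HANG_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+\[drm:i915_hangcheck_elapsed\] \*ERROR\* "
    r"Hangcheck timer elapsed"
)


def find_gpu_hangs(dmesg_text):
    """Return the timestamps (in seconds) of every hangcheck line found."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


# The three lines quoted in this comment, used as sample input.
SAMPLE = """\
[11876.299009] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[11876.299300] render error detected, EIR: 0x00000000
[11876.299925] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 543130 at 543128)
"""

if __name__ == "__main__":
    print(find_gpu_hangs(SAMPLE))
```

Feeding it the text of a saved dmesg log gives one timestamp per detected hang, which makes it easier to compare driver versions than eyeballing the log.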

As a workaround, I've pushed a shadow branch to http://cgit.freedesktop.org/~ickle/xf86-video-intel/log/?h=shadow
which disables GPU acceleration and uses a static shadow buffer and uncached memory accesses. This avoids the dynamic reallocation of the GTT and the i845 errata and the general i8xx incoherency problems.
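To enable use of the shadow, add the option to the "Device" section of your xorg.conf (the Identifier below is only a placeholder; keep whatever your existing section uses):
Section "Device"
Identifier "Intel Graphics"
Driver "intel"
Option "Shadow" "True"
EndSection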
It is surviving the wtf test on my i845.

Nice work, Chris!
But, I can't get it to build:
intel_uxa.c: In function ‘intel_shadow_create’:
intel_uxa.c:1039: error: ‘size’ undeclared (first use in this function)
intel_uxa.c:1039: error: (Each undeclared identifier is reported only once
intel_uxa.c:1039: error: for each function it appears in.)
intel_uxa.c: In function ‘intel_uxa_create_screen_resources’:
intel_uxa.c:1255: error: ‘intel_screen_private’ has no member named ‘front_stride’
The master branch builds fine, however.
When I get this compiling, I'll put it in an Ubuntu PPA for people to try.

After applying
commit 15056d2c06862627ead868e035fcacc59dce1b1a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Dec 21 17:04:23 2010 +0000
drm/i915: Flush pending writes on i830/i845 after updating GTT
There is an erratum on these two chipsets that causes the wrong PTE
entries to be invalidate after updating the GTT and when used from the
BLT engine. The workaround is to flush any pending writes before those
PTEs are used by the BLT.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
this reduces to the general i8xx incoherency, bug 27187, for which I have a patch that appears to work on my i845, passing both the wtf test and Daniel's cache-coherency checker!

I have the notorious i845 rev01 chipset:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)
I don't know anything about graphics internals, but if there is anything I can do to help test things, let me know. Thanks.

I've been testing my Fujitsu-Siemens Amilo 7400M with newer kernel and Intel driver releases with openSUSE and thus far, while progress has been made, this is not yet completely fixed. It did work fine with the 2.6.27 kernel, but not as well since. There is a thread here on the subject: http://forums.opensuse.org/forums/english/get-technical-help-here/pre-release-beta/438965-intel-gpu-8xx-issues-will-11-3-have-them-too.html
I made a video as to how this worked properly under openSUSE-11.1 here with the 2.6.27 kernel: http://www.youtube.com/watch?v=lfnAPDt_bn0
Until openSUSE-11.4 Milestone-6, the behaviour in openSUSE was not very good at all, although milestone-6 has brought 'some' improvements.
I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (KDE liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver: http://www.youtube.com/watch?v=QRRyQn_h03Y
I made a video as to how this works now under 32-bit openSUSE-11.4 milestone-6 (Gnome liveCD version) with the 2.6.37-20 kernel and the recent Intel 2.14.0 video driver: http://www.youtube.com/watch?v=9-X3ZiYUbcc
The prevention of a total crash/freeze with the newer 2.6.37-20 kernel (w/Intel 2.14.0 driver) on openSUSE-11.4 milestone-6 is significantly superior to the 2.6.28 and later kernels, but still not as good as the 2.6.27 kernel on openSUSE-11.1.
I have not (yet) tried the older Intel 2.9.1 video driver with the 2.6.37-20 kernel.

Can someone explain what is the status of the solution for this bug?
Chris Wilson's last comment is that he still has hangs.
So I take there is still no upstream solution.
Upiter77 reports success with the latest debian packages. But what patches have caused this?
What patches should I apply and does anyone know the status of these patches wrt Ubuntu?
Regards,
Bert

Created attachment 51401
New stab at working around the i845 tlb issues.
It'd be great if anyone with a still-booting i845 could test this. Obviously you need to disable Shadow. Also, expect some slowdown, but hopefully not that bad.

Created attachment 53024
The relevant part of dmesg with Daniel's patch
Daniel's patch does not work for me. I tested with the 3.1 kernel, and I saw only some dotted lines and a cursor when Xorg loaded. (I attached the relevant part of dmesg.)
Current state of the driver with the 3.1 kernel:
- Still get random GPU hangs without ShadowFB. It is stable with ShadowFB.
- XVideo: contrast and saturation are misconfigured (Bug 42488)
- OpenGL without ShadowFB: it's fast and stable until a GPU hang. Once a GPU hang has occurred, OpenGL apps no longer work.
- OpenGL with ShadowFB enabled: it works, but very slowly, slower than llvmpipe.
- Trying to run GNOME Shell always causes an immediate GPU hang, even if ShadowFB is enabled.

Hi, I've hit this too with:
00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 03)
I tried the patch in comment #138 against 3.2.0-rc4, and I had similar results to comment #139: the display wasn't usable.
Is there anything else I can try? I have the system readily available.

I've put some recent patches into http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=845g which make my 845g much more stable, though I'm still getting spurious GPU hangs under memory pressure. In that regard SNA is performing better (quite apart from being a more complete acceleration architecture), as it thrashes the GTT far less.

Now in kernel form as well:
commit b75e53bac7f4164e1c53a636352faa3d177b4beb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sun Dec 16 18:08:07 2012 +0100
drm/i915: Implement workaround for broken CS tlb on i830/845
Now that Chris Wilson demonstrated that the key for stability on early
gen 2 is to simple _never_ exchange the physical backing storage of
batch buffers I've tried a stab at a kernel solution. Doesn't look too
nefarious imho, now that I don't try to be too clever for my own good
any more.
v2: After discussing the various techniques, we've decided to always blit
batches on the suspect devices, but allow userspace to opt out of the
kernel workaround assume full responsibility for providing coherent
batches. The principal reason is that avoiding the blit does improve
performance in a few key microbenchmarks and also in cairo-trace
replays.
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
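To picture what the commit message describes, here is a toy Python model, not kernel code. It only illustrates the idea of copying every batch into a slot whose backing storage is allocated once and never exchanged, with an opt-out for userspace. `StaticBatchSlot`, `submit`, and the `userspace_opted_out` flag are hypothetical names invented for this sketch; the real workaround lives in the i915 driver and, per the commit message, blits the batch rather than copying it with the CPU.

```python
class StaticBatchSlot:
    """Toy stand-in for a scratch buffer whose backing storage never changes."""

    def __init__(self, size):
        # Allocated exactly once; models pages that are never swapped out
        # from under the command streamer.
        self.storage = bytearray(size)
        self.copies = 0

    def load(self, batch):
        """Copy a batch into the fixed slot and return what was stored."""
        if len(batch) > len(self.storage):
            raise ValueError("batch larger than the static slot")
        self.storage[:len(batch)] = batch
        self.copies += 1
        return bytes(self.storage[:len(batch)])


def submit(batch, slot, needs_workaround, userspace_opted_out=False):
    """Return the bytes the (imaginary) command streamer would execute.

    On affected devices the batch is always staged through the static slot,
    unless userspace has opted out and taken responsibility for coherency.
    """
    if needs_workaround and not userspace_opted_out:
        return slot.load(batch)
    return bytes(batch)


if __name__ == "__main__":
    slot = StaticBatchSlot(4096)
    submit(b"\x01\x02\x03", slot, needs_workaround=True)
    submit(b"\x04", slot, needs_workaround=True, userspace_opted_out=True)
    print(slot.copies)
```

The point of the model is only that the commands are always read from the same storage, whichever batch was submitted; the cost is one extra copy per submission, which is why the v2 note allows userspace to opt out.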

(In reply to comment #154)
> These fixes are in the latest 3.8 kernel. Have the GPU hangs been fixed in
> earlier versions of the kernel?
The sna fixes work with any KMS/GEM (i.e. 2.6.29+) kernel. The kernel w/a is being backported by Julien Cristau for the debian kernel.

(In reply to comment #157)
> In regards to comment 145, is it recommended to use SNA? We are not using
> SNA and have seen GPU hangs on the 845G. Is it better to use SNA and apply
> the SNA patch to xorg-x11-drv-intel?
Comment 145 is stale, superseded by the genuine fixes in comments 152 and 153. I would recommend using SNA on gen2 as UXA pales in comparison.

Something is broken in the 3.8 kernel. When I'm using it, the colour depth is low, and my system freezes when I try to suspend the computer. I don't know whether it is caused by the applied workaround or not, but the problem is gone if I downgrade to kernel version 3.7.

(In reply to comment #159)
> Something is broken in the 3.8 kernel. When I'm using it, the colour depth
> is low, and my system freezes when I try to suspend the computer. I don't
> know whether it is caused by the applied workaround or not, but the problem
> is gone if I downgrade to kernel version 3.7.
Please file a new bug with a detailed description of your symptoms instead of cluttering this one.