Using ArchLinux 64, own built kernel (vanilla) with an HD 6950 (CAYMAN).
When testing kernel 3.7-rc1, the screen becomes unreadable. It's either completely black or in a flickering state making the display unusable. It is very similar to what was observed in bug 43655.
At the time of bug 43655, the problem had been fixed by http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=81ee8fb6b52ec69eeed37fe7943446af1dccecc5. So it was working OK with kernel 3.6.
I've bisected between kernel 3.6 and 3.7-rc1 and the faulty identified commit is the following:
62444b7462a2b98bc78d68736c03a7c4e66ba7e2 is the first bad commit
commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2
Author: Alex Deucher <alexander.deucher@amd.com>
Date: Wed Aug 15 17:18:42 2012 -0400
drm/radeon: properly handle mc_stop/mc_resume on evergreen+ (v2)
- Stop the displays from accessing the FB
- Block CPU access
- Turn off MC client access
This should fix issues some users have seen, especially
with UEFI, when changing the MC FB location that result
in hangs or display corruption.
v2: fix crtc enabled check noticed by Luca Tettamanti
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
:040000 040000 3e0d33c9b4eda29ced814fe9a863efe63e53f14c 4932561607b160734ec1eade927a9fe18c9f3f1b M drivers

(In reply to comment #1)
> Created attachment 68760[details][review] [review]
> possible fix
>
> Does this patch help?
Applied suggested patch on top of 62444b7462a2b98bc78d68736c03a7c4e66ba7e2 and it doesn't fix it.
By the way, I confirmed it is exactly as bug 43655 (same symptoms, same workaround).

So, I went further and I split commit 62444b in 3 patches. On for the register values, one for the stop function and one for the resume function. I applied the first and the second patches for now and it seems to work well. I suspended my computer, resumed and everything is going normal (for now at least). I'll test it a bit more, but the problem seems to be somewhere in the resume function.

Created attachment 69113[details][review]
possible fix
(In reply to comment #4)
> the bug appeared. So it seems blanking the display controllers with for(i =
> 0; i < rdev->num_crtc; i++) is not equivalent to the code that it replaces.
> The original code first wrote in the EVERGREEN_CRTC_UPDATE_LOCK registers,
> before setting EVERGREEN_CRTC_CONTROL registers and writing again in the
> EVERGREEN_CRTC_UPDATE_LOCK registers. On the other hand, the new code
> doesn't write in the EVERGREEN_CRTC_UPDATE_LOCK neither before nor after
> setting EVERGREEN_CRTC_CONTROL.
It should be equivalent. CRTC_UPDATE_LOCK turns off double buffering in the crtc which makes register updates atomic. The new code waits for the frame count to increase (the double buffered updates happen at vblank) so it should be equivalent. That said, it shouldn't hurt to take the lock. Does this patch help?

(In reply to comment #5)
> Created attachment 69113[details][review] [review]
> possible fix
>
> (In reply to comment #4)
> > the bug appeared. So it seems blanking the display controllers with for(i =
> > 0; i < rdev->num_crtc; i++) is not equivalent to the code that it replaces.
> > The original code first wrote in the EVERGREEN_CRTC_UPDATE_LOCK registers,
> > before setting EVERGREEN_CRTC_CONTROL registers and writing again in the
> > EVERGREEN_CRTC_UPDATE_LOCK registers. On the other hand, the new code
> > doesn't write in the EVERGREEN_CRTC_UPDATE_LOCK neither before nor after
> > setting EVERGREEN_CRTC_CONTROL.
>
> It should be equivalent. CRTC_UPDATE_LOCK turns off double buffering in the
> crtc which makes register updates atomic. The new code waits for the frame
> count to increase (the double buffered updates happen at vblank) so it
> should be equivalent. That said, it shouldn't hurt to take the lock. Does
> this patch help?
Sadly, it didn't help.

Alex, would it be possible to print what is going on or if an error occurred in evergreen_mc_stop()?
I see four things that could be going on:
1- we are not using the right path for CAYMAN -> (ASIC_IS_DCE6(rdev));
2- lock mechanism synced with vblank is not working properly;
3- all the registers should be locked at the same time, then all modified and finally unlocked together, which is not done with the for loop where we move through each at a time;
4- we are not setting the right registers.
What do you think?
I can test and investigate 1, 3 and 4. But what about 2?

Created attachment 69370[details][review]
possible fix
(In reply to comment #7)
> Alex, would it be possible to print what is going on or if an error occurred
> in evergreen_mc_stop()?
>
> I see four things that could be going on:
> 1- we are not using the right path for CAYMAN -> (ASIC_IS_DCE6(rdev));
cayman is DCE5. It is using the correct code path.
> 2- lock mechanism synced with vblank is not working properly;
Locking makes updates atomic rather than double buffered.
> 3- all the registers should be locked at the same time, then all modified
> and finally unlocked together, which is not done with the for loop where we
> move through each at a time;
doesn't matter.
> 4- we are not setting the right registers.
The existing sequence should be correct. It's the same sequence our hw team recommends. I can't reproduce this on my cayman boards unfortunately and this patch fixes the exact same problem you are having for a number of other people :/
Maybe an issue with the icon or cursor, but I think those should be disabled when we disable mem requests in the crtc. Does this patch help?

(In reply to comment #8)
> Created attachment 69370[details][review] [review]
> possible fix
>
> (In reply to comment #7)
> > Alex, would it be possible to print what is going on or if an error occurred
> > in evergreen_mc_stop()?
> >
> > I see four things that could be going on:
> > 1- we are not using the right path for CAYMAN -> (ASIC_IS_DCE6(rdev));
>
> cayman is DCE5. It is using the correct code path.
>
> > 2- lock mechanism synced with vblank is not working properly;
>
> Locking makes updates atomic rather than double buffered.
>
> > 3- all the registers should be locked at the same time, then all modified
> > and finally unlocked together, which is not done with the for loop where we
> > move through each at a time;
>
> doesn't matter.
>
> > 4- we are not setting the right registers.
>
> The existing sequence should be correct. It's the same sequence our hw team
> recommends. I can't reproduce this on my cayman boards unfortunately and
> this patch fixes the exact same problem you are having for a number of other
> people :/
>
> Maybe an issue with the icon or cursor, but I think those should be disabled
> when we disable mem requests in the crtc. Does this patch help?
Not working either. Also, with 3.7.0-rc3 + this patch, the computer freezes when coming back from suspend. I see Gnome-Shell for a couple of seconds, then some garbage and it stops. Kernel 3.6 works fine on that matter. I'll test with my split patches to see if it is caused by the same commit.
So, pretty much back to square 1, wherever that is.

Found what is wrong with the help of a few printk and by comparing to the code being replaced. All the logic is good (going through crtc, disabling them, waiting for vblank) BUT setting "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;" is wrong.
If I do as in the previous code by setting tmp = 0 and then continuing with:
radeon_wait_for_vblank(rdev, i);
WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp);
everything works fine as before.
What is expected from "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;"? From what I read with printk, it is far from a 0 or a 1. Is this normal?

(In reply to comment #11)
> Found what is wrong with the help of a few printk and by comparing to the
> code being replaced. All the logic is good (going through crtc, disabling
> them, waiting for vblank) BUT setting "tmp |=
> EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;" is wrong.
>
> If I do as in the previous code by setting tmp = 0 and then continuing with:
> radeon_wait_for_vblank(rdev, i);
> WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp);
> everything works fine as before.
>
> What is expected from "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;"?
> From what I read with printk, it is far from a 0 or a 1. Is this normal?
That's the most important bit in the entire sequence. It's a bit field in a register (bit 24 to be exact). That bit is the bit that actually disables the requests from the display controller in the memory controller. The whole point of this code is to disable all clients of the memory controller (mc_stop()) so that we can change the location of vram within the GPU's address space. Once we've moved vram, we can re-enable the clients (mc_resume()) so that they point to the new vram location.

(In reply to comment #12)
> (In reply to comment #11)
> > Found what is wrong with the help of a few printk and by comparing to the
> > code being replaced. All the logic is good (going through crtc, disabling
> > them, waiting for vblank) BUT setting "tmp |=
> > EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;" is wrong.
> >
> > If I do as in the previous code by setting tmp = 0 and then continuing with:
> > radeon_wait_for_vblank(rdev, i);
> > WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp);
> > everything works fine as before.
> >
> > What is expected from "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;"?
> > From what I read with printk, it is far from a 0 or a 1. Is this normal?
>
> That's the most important bit in the entire sequence. It's a bit field in a
> register (bit 24 to be exact). That bit is the bit that actually disables
> the requests from the display controller in the memory controller. The
> whole point of this code is to disable all clients of the memory controller
> (mc_stop()) so that we can change the location of vram within the GPU's
> address space. Once we've moved vram, we can re-enable the clients
> (mc_resume()) so that they point to the new vram location.
Thank you, you confirmed what I had assumed. I had already understood that it was the most important part in the sequence since it is associated to a "write" instruction. I don't have the (decimal/binary) values with me right now, but I'll be able to give you what was read from the register and from the result returned from the bitwise OR assignment we are pushing in the register. I'll confirm which bit(s) are changing. I'm sure the assignment was setting one (or more) bit in the register to "1". Is bit 24 the only one expected to change in the register? Should it be put to "1" or "0"?
Another question: why were we setting "0" in the register before? By doing so, we were setting the whole register to "0" (the whole 32 bits), right? If I remember correctly, from what I saw yesterday with the help of printk, it seems we are setting at least one bit to "1" in this register, which would be the opposite of what was being done before and therefore of what was working correctly with my video card. I'm just trying to understand why we were doing something in the first place that was working correctly and that was introduce to fix this problem in the first place, both at boot time for grub (set gfxpayload=keep) and when suspending/resuming, and we are now doing the opposite, thus breaking the code for some setups. Is it possible that the bit/register logic is not the same for all Radeon GPUs? After all, we already introduced a different path for DCE6.
I'll also try your patch when I'll get home tonight.

(In reply to comment #14)
> (In reply to comment #12)
> > (In reply to comment #11)
> > > Found what is wrong with the help of a few printk and by comparing to the
> > > code being replaced. All the logic is good (going through crtc, disabling
> > > them, waiting for vblank) BUT setting "tmp |=
> > > EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;" is wrong.
> > >
> > > If I do as in the previous code by setting tmp = 0 and then continuing with:
> > > radeon_wait_for_vblank(rdev, i);
> > > WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp);
> > > everything works fine as before.
> > >
> > > What is expected from "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;"?
> > > From what I read with printk, it is far from a 0 or a 1. Is this normal?
> >
> > That's the most important bit in the entire sequence. It's a bit field in a
> > register (bit 24 to be exact). That bit is the bit that actually disables
> > the requests from the display controller in the memory controller. The
> > whole point of this code is to disable all clients of the memory controller
> > (mc_stop()) so that we can change the location of vram within the GPU's
> > address space. Once we've moved vram, we can re-enable the clients
> > (mc_resume()) so that they point to the new vram location.
>
> Thank you, you confirmed what I had assumed. I had already understood that
> it was the most important part in the sequence since it is associated to a
> "write" instruction. I don't have the (decimal/binary) values with me right
> now, but I'll be able to give you what was read from the register and from
> the result returned from the bitwise OR assignment we are pushing in the
> register. I'll confirm which bit(s) are changing. I'm sure the assignment
> was setting one (or more) bit in the register to "1". Is bit 24 the only one
> expected to change in the register? Should it be put to "1" or "0"?
>
Setting bit 24 to 1 disables memory requests from the display controller. Setting bit 24 to 0, enables memory requests from the display controller.
> Another question: why were we setting "0" in the register before? By doing
> so, we were setting the whole register to "0" (the whole 32 bits), right? If
> I remember correctly, from what I saw yesterday with the help of printk, it
> seems we are setting at least one bit to "1" in this register, which would
> be the opposite of what was being done before and therefore of what was
> working correctly with my video card. I'm just trying to understand why we
> were doing something in the first place that was working correctly and that
> was introduce to fix this problem in the first place, both at boot time for
> grub (set gfxpayload=keep) and when suspending/resuming, and we are now
> doing the opposite, thus breaking the code for some setups. Is it possible
> that the bit/register logic is not the same for all Radeon GPUs? After all,
> we already introduced a different path for DCE6.
Bit 0 for CRTC_CONTROL turns on/off the entire crtc. If bit 0 is 0 the crtc is disabled. If bit 0 is 1, the crtc is enabled. If the crtc is disabled (bit 0 = 0) bit 24 is irrelevant. There are separate bits to enable the crtc and disable memory requests since there are cases (like this one) where we want to keep the crtc timing running so that the monitor stays synced, but disable reads from memory so we can reconfigure the memory controller. If we disable the crtc entirely, the monitor would lose sync and you would get additional flicker during boot up. Ideally, eventually we'd like to just hand over control from the firmware without touching the display hardware so the user gets a flicker free boot experience.
DCE4 and 5 have the same logic and bit layout for these registers. The logic is different on DCE6 chips. On DCE6, the the memory controller request bit is now tied in with the crtc blanking bit. When the crtc is blanked, memory requests are also disabled.

I'm about to test patches. But before, as promised, here are the values retrieved read and written to the registers. By the way, I have only a single monitor.
tmp = RREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i]); -> 272696081 (0001 0000 0100 0001 0000 0011 0001 0001)
tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE; -> 289473297 (0001 0001 0100 0001 0000 0011 0001 0001)
So, it is set as you explained me earlier. I'll be back soon with some news from the patches.

(In reply to comment #16)
> Created attachment 69573[details][review] [review]
> possible fix
>
> Actually, I think I found the problem. Missing index in mc_resume().
This seems to fix my resume problem I was experiencing where the display would come up, but then it would crash. However, it doesn't solve the boot/grub2 bug.
So, this patch should be kept (but not for the current bug).

(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #12)
> > > (In reply to comment #11)
> > > > Found what is wrong with the help of a few printk and by comparing to the
> > > > code being replaced. All the logic is good (going through crtc, disabling
> > > > them, waiting for vblank) BUT setting "tmp |=
> > > > EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;" is wrong.
> > > >
> > > > If I do as in the previous code by setting tmp = 0 and then continuing with:
> > > > radeon_wait_for_vblank(rdev, i);
> > > > WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp);
> > > > everything works fine as before.
> > > >
> > > > What is expected from "tmp |= EVERGREEN_CRTC_DISP_READ_REQUEST_DISABLE;"?
> > > > From what I read with printk, it is far from a 0 or a 1. Is this normal?
> > >
> > > That's the most important bit in the entire sequence. It's a bit field in a
> > > register (bit 24 to be exact). That bit is the bit that actually disables
> > > the requests from the display controller in the memory controller. The
> > > whole point of this code is to disable all clients of the memory controller
> > > (mc_stop()) so that we can change the location of vram within the GPU's
> > > address space. Once we've moved vram, we can re-enable the clients
> > > (mc_resume()) so that they point to the new vram location.
> >
> > Thank you, you confirmed what I had assumed. I had already understood that
> > it was the most important part in the sequence since it is associated to a
> > "write" instruction. I don't have the (decimal/binary) values with me right
> > now, but I'll be able to give you what was read from the register and from
> > the result returned from the bitwise OR assignment we are pushing in the
> > register. I'll confirm which bit(s) are changing. I'm sure the assignment
> > was setting one (or more) bit in the register to "1". Is bit 24 the only one
> > expected to change in the register? Should it be put to "1" or "0"?
> >
>
> Setting bit 24 to 1 disables memory requests from the display controller.
> Setting bit 24 to 0, enables memory requests from the display controller.
>
> > Another question: why were we setting "0" in the register before? By doing
> > so, we were setting the whole register to "0" (the whole 32 bits), right? If
> > I remember correctly, from what I saw yesterday with the help of printk, it
> > seems we are setting at least one bit to "1" in this register, which would
> > be the opposite of what was being done before and therefore of what was
> > working correctly with my video card. I'm just trying to understand why we
> > were doing something in the first place that was working correctly and that
> > was introduce to fix this problem in the first place, both at boot time for
> > grub (set gfxpayload=keep) and when suspending/resuming, and we are now
> > doing the opposite, thus breaking the code for some setups. Is it possible
> > that the bit/register logic is not the same for all Radeon GPUs? After all,
> > we already introduced a different path for DCE6.
>
> Bit 0 for CRTC_CONTROL turns on/off the entire crtc. If bit 0 is 0 the crtc
> is disabled. If bit 0 is 1, the crtc is enabled. If the crtc is disabled
> (bit 0 = 0) bit 24 is irrelevant. There are separate bits to enable the
> crtc and disable memory requests since there are cases (like this one) where
> we want to keep the crtc timing running so that the monitor stays synced,
> but disable reads from memory so we can reconfigure the memory controller.
> If we disable the crtc entirely, the monitor would lose sync and you would
> get additional flicker during boot up. Ideally, eventually we'd like to
> just hand over control from the firmware without touching the display
> hardware so the user gets a flicker free boot experience.
>
> DCE4 and 5 have the same logic and bit layout for these registers. The
> logic is different on DCE6 chips. On DCE6, the the memory controller
> request bit is now tied in with the crtc blanking bit. When the crtc is
> blanked, memory requests are also disabled.
If I followed you correctly, setting bit 24 to "1" disables memory but keeps the CRTC state as it is (hopefully in sync with the monitor). However, what I see when grub2 kicks in with the "gfxpayload=keep" is an unsynced monitor. Sometimes the display will be black, other times it will only appear in the first couple of vertical lines, in others it will be vertically synced but shifted to the right at half or at two third of the screen. In other words, this really seems to be a sync problem. Should I try to combine patch 69113 and patch 69370 with the others?

(In reply to comment #19)
> If I followed you correctly, setting bit 24 to "1" disables memory but keeps
> the CRTC state as it is (hopefully in sync with the monitor). However, what
> I see when grub2 kicks in with the "gfxpayload=keep" is an unsynced monitor.
> Sometimes the display will be black, other times it will only appear in the
> first couple of vertical lines, in others it will be vertically synced but
> shifted to the right at half or at two third of the screen. In other words,
> this really seems to be a sync problem. Should I try to combine patch 69113
> and patch 69370 with the others?
Are you saying the monitor gets messed up when grub kicks in? I.e., before the OS loads? If so, that sounds like a grub issue.

(In reply to comment #21)
> (In reply to comment #19)
> > If I followed you correctly, setting bit 24 to "1" disables memory but keeps
> > the CRTC state as it is (hopefully in sync with the monitor). However, what
> > I see when grub2 kicks in with the "gfxpayload=keep" is an unsynced monitor.
> > Sometimes the display will be black, other times it will only appear in the
> > first couple of vertical lines, in others it will be vertically synced but
> > shifted to the right at half or at two third of the screen. In other words,
> > this really seems to be a sync problem. Should I try to combine patch 69113
> > and patch 69370 with the others?
>
> Are you saying the monitor gets messed up when grub kicks in? I.e., before
> the OS loads? If so, that sounds like a grub issue.
I'll take some pictures or record a small video of what's going on. That will be easier to understand. But yes, I'm beginning to think Grub2 is playing a bad trick somewhere in there and is not playing nicely when handling the framebuffer.

I'll retest it with the latest two patches, but I think setting gfxpayload=text prevents the bug. I know for sure that with the index patch, suspend/resume cycle is fixed. Also, if I remove gfxpayload=keep completely, my monitor works fine.

I've been playing with different options in Grub2 and the workaround for the boot sequence bug, beside reverting commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2, is to force gfxpayload=text.
I tried many things by changing gfxmode, forcing it to known good values, in combination with gfxpayload. But whatever I tried, if gfxpayload=keep, it was not synced correctly (as described, sometimes I had a black screen, some other times, it is vertically synced but shifted, etc.) As already explained, if I either revert commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2 or if I set tmp=0 before WREG32(EVERGREEN_CRTC_CONTROL + crtc_offsets[i], tmp), which is disabling the whole crtc, the boot process is fine.
Also, I tested kernel 3.7.0-rc5 (which contains attachment 69573[details][review]). Since it also contains modified code in evergreen_mc_stop() from commit 62444b7462a2b98bc78d68736c03a7c4e66ba7e2, it crashes on resume. In other words, attachment 69573[details][review] fixes a part of the suspend/resume problem, but doesn't the part coming from evergreen_mc_stop().
While I can live with working around the boot process/grub2 problem, the suspend part is really annoying.

Alex, a simple question: you said bit 0 in EVERGREEN_CRTC_CONTROL stops the CRTC sync. With the culprit commit, when is it set? I mean, I had a quick look in the driver's code and I couldn't find it. When going in suspend state, shouldn't it be set to 0? Then, on resume, shouldn't it be set to 1? I may just have missed it, but could this be something missing? Same questions goes for GPU soft reset.
Before the culprit commit, we were setting bit 0 to 0 on stop and setting it back to 1 on resume, which was working great. Why aren't we doing it anymore when suspending and resuming?

(In reply to comment #28)
> Alex, a simple question: you said bit 0 in EVERGREEN_CRTC_CONTROL stops the
> CRTC sync. With the culprit commit, when is it set? I mean, I had a quick
> look in the driver's code and I couldn't find it. When going in suspend
> state, shouldn't it be set to 0? Then, on resume, shouldn't it be set to 1?
> I may just have missed it, but could this be something missing? Same
> questions goes for GPU soft reset.
It's set by the vbios at boot before the OS loads. Once the driver loads, it's then handled by the modesetting code. crtcs are enabled or disabled based on what's connected and what displays the user has chosen to enable.
These functions have nothing to do with system suspend/hibernate and resume directly. evergreen_mc_stop() and evergreen_mc_resume() are for disabling MC clients when we make changes to the GPU memory controller. System suspend/hibernate are handled by radeon_suspend_kms() and radeon_resume_kms(). radeon_suspend_kms() explicitly disables all of the display hardware:
list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
drm_helper_connector_dpms(connector, DRM_MODE_DPMS_OFF);
}
And then the displays are enabled again in radeon_resume_kms():
/* blat the mode back in */
drm_helper_resume_force_mode(dev);
/* turn on display hw */
list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
drm_helper_connector_dpms(connector, DRM_MODE_DPMS_ON);
}
>
> Before the culprit commit, we were setting bit 0 to 0 on stop and setting it
> back to 1 on resume, which was working great. Why aren't we doing it anymore
> when suspending and resuming?
The previous code[1] didn't do that. It set bit 0 to 0 in evergreen_mc_stop() and then left it disabled in evergreen_mc_resume(). The crtc was not subsequently re-enabled until the displays were set up later by the modesetting code. Prior to that patch[1], we just saved and restored the value of the register which is basically what you are proposing, but that didn't work otherwise we wouldn't have needed the subsequent patches.
We'd like to avoid disabling the crtc during driver load or GPU reset to avoid the visible flicker it causes.
[1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=81ee8fb6b52ec69eeed37fe7943446af1dccecc5

(In reply to comment #29)
> (In reply to comment #28)
> > Alex, a simple question: you said bit 0 in EVERGREEN_CRTC_CONTROL stops the
> > CRTC sync. With the culprit commit, when is it set? I mean, I had a quick
> > look in the driver's code and I couldn't find it. When going in suspend
> > state, shouldn't it be set to 0? Then, on resume, shouldn't it be set to 1?
> > I may just have missed it, but could this be something missing? Same
> > questions goes for GPU soft reset.
>
> It's set by the vbios at boot before the OS loads. Once the driver loads,
> it's then handled by the modesetting code. crtcs are enabled or disabled
> based on what's connected and what displays the user has chosen to enable.
>
> These functions have nothing to do with system suspend/hibernate and resume
> directly. evergreen_mc_stop() and evergreen_mc_resume() are for disabling
> MC clients when we make changes to the GPU memory controller. System
> suspend/hibernate are handled by radeon_suspend_kms() and
> radeon_resume_kms(). radeon_suspend_kms() explicitly disables all of the
> display hardware:
>
> list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
> drm_helper_connector_dpms(connector, DRM_MODE_DPMS_OFF);
> }
>
> And then the displays are enabled again in radeon_resume_kms():
>
> /* blat the mode back in */
> drm_helper_resume_force_mode(dev);
> /* turn on display hw */
> list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
> drm_helper_connector_dpms(connector, DRM_MODE_DPMS_ON);
> }
>
> >
> > Before the culprit commit, we were setting bit 0 to 0 on stop and setting it
> > back to 1 on resume, which was working great. Why aren't we doing it anymore
> > when suspending and resuming?
>
> The previous code[1] didn't do that. It set bit 0 to 0 in
> evergreen_mc_stop() and then left it disabled in evergreen_mc_resume(). The
> crtc was not subsequently re-enabled until the displays were set up later by
> the modesetting code. Prior to that patch[1], we just saved and restored
> the value of the register which is basically what you are proposing, but
> that didn't work otherwise we wouldn't have needed the subsequent patches.
>
> We'd like to avoid disabling the crtc during driver load or GPU reset to
> avoid the visible flicker it causes.
>
> [1]:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;
> h=81ee8fb6b52ec69eeed37fe7943446af1dccecc5
Your explanations are appreciated, but that kills another idea. By the way, pictures and videos are coming, my camera had no power yesterday and I had to charge it during last night.

Here are the videos. Sorry, they were recorded with my iPod and they are inverted.
Normal boot sequence before the culprit commit (kernel 3.6)
http://www.filedropper.com/k36
A normal boot sequence with kernel 3.6 from grub 2 to booting console. Resolution is kept from grub to console, there is no sync problem and no purple flicker.
A problematic boot with kernel 3.7-rc5 using gfxpayload=keep in Grub 2.
http://www.filedropper.com/k37keep
It starts normally, the booting console uses the same resolution as Grub 2. However, possibly when radeon driver kicks in, there is a purple flicker followed by a wrong (badly synced) display (shifted by about half a screen to the right). The result could have been a black screen, a couple of interlaced vertical lines or a different shifted display.
A non problematic boot with kernel 3.7-rc5 using gfxpayload=text in Grub 2.
http://www.filedropper.com/k37text
It doesn't show the purple flicker we see in the previous video, it just flickers as a normal boot would do, but it does show the purple color from time to time. However, the display is OK.
A problematic suspend/resume sequence with 3.7-rc5 (gfxpayload doesn't change anything)
http://www.filedropper.com/k37suspend
Going in suspend mode just after booting up. When resuming, X/gdm briefly appears, then it crashes. The computer continue to run, but the display stays black and no input can be used (tried to switch console and reboot but it doesn't work). Often, it will reboot by itself after some time. Suspend and resume work great before the culprit commit or if I remove the changes in "...mc_stop()"

Created attachment 70223[details]
Partial kernel.log
Here is a partial kernel.log tracking what is going on at boot with some debug printk in evergreen_mc_stop() and evergreen_mc_resume(). Strangely, only one crtc is saved as enabled in the _stop function, but four are restored in the _resume function. Pretty sure it's not the problem we are looking for, but it is abnormal since only one crtc should be restored on resume. This is a bug by itself showing a problem in the save(stop)/restore(resume) mechanism.

(In reply to comment #32)
> Created attachment 70223[details]
> Partial kernel.log
>
> Here is a partial kernel.log tracking what is going on at boot with some
> debug printk in evergreen_mc_stop() and evergreen_mc_resume(). Strangely,
> only one crtc is saved as enabled in the _stop function, but four are
> restored in the _resume function. Pretty sure it's not the problem we are
> looking for, but it is abnormal since only one crtc should be restored on
> resume. This is a bug by itself showing a problem in the
> save(stop)/restore(resume) mechanism.
I should have said "... but six are restored..." instead of "... but four are restored..." I also made a small error in a debug string, but it doesn't change the read value. If you have any question on my patch, let me know.

(In reply to comment #33)
> > Here is a partial kernel.log tracking what is going on at boot with some
> > debug printk in evergreen_mc_stop() and evergreen_mc_resume(). Strangely,
> > only one crtc is saved as enabled in the _stop function, but four are
> > restored in the _resume function. Pretty sure it's not the problem we are
> > looking for, but it is abnormal since only one crtc should be restored on
> > resume. This is a bug by itself showing a problem in the
> > save(stop)/restore(resume) mechanism.
>
> I should have said "... but six are restored..." instead of "... but four
> are restored..." I also made a small error in a debug string, but it doesn't
> change the read value. If you have any question on my patch, let me know.
That issue is fixed in attachment 69573[details][review] which is already upstream.

(In reply to comment #36)
> (In reply to comment #33)
> > > Here is a partial kernel.log tracking what is going on at boot with some
> > > debug printk in evergreen_mc_stop() and evergreen_mc_resume(). Strangely,
> > > only one crtc is saved as enabled in the _stop function, but four are
> > > restored in the _resume function. Pretty sure it's not the problem we are
> > > looking for, but it is abnormal since only one crtc should be restored on
> > > resume. This is a bug by itself showing a problem in the
> > > save(stop)/restore(resume) mechanism.
> >
> > I should have said "... but six are restored..." instead of "... but four
> > are restored..." I also made a small error in a debug string, but it doesn't
> > change the read value. If you have any question on my patch, let me know.
>
> That issue is fixed in attachment 69573[details][review] [review] which is already
> upstream.
I know it should have been fixed and that's why I added a printk(KERN_INFO "Restoring crtc[%d] since it was enabled.\n", i) just after it to check it. For some reason it seems to not be working properly. Have a look at my patch (attachment 70233[details]) which was written on a 3.7.0-rc6 kernel. I'll add some other printk when I'll get home (like a printk just before this check to output its value).

In evergreen_reg.h, /* CRTC blocks at 0x6df0, 0x79f0, 0x105f0, 0x111f0, 0x11df0, 0x129f0 */ is not used to define the registers following this comment it. It seems to correspond to display controller offsets.
Is the comment at the wrong place or are the registers not defined with the right addresses?

(In reply to comment #40)
> In evergreen_reg.h, /* CRTC blocks at 0x6df0, 0x79f0, 0x105f0, 0x111f0,
> 0x11df0, 0x129f0 */ is not used to define the registers following this
> comment it. It seems to correspond to display controller offsets.
>
> Is the comment at the wrong place or are the registers not defined with the
> right addresses?
They are correct. The crtc blocks are repeated at different offsets within the register aperture. So if you wanted to access EVERGREEN_CRTC_CONTROL for crtc 0, you access: EVERGREEN_CRTC_CONTROL + EVERGREEN_CRTC0_REGISTER_OFFSET, for crtc 1:
EVERGREEN_CRTC_CONTROL + EVERGREEN_CRTC1_REGISTER_OFFSET, etc.

Going fishing: Alex, you said you were unable to reproduce this bug with your cayman device. Could it be related to the display connector? I'm using the DVI connector from my card and it is connected to a VGA connector on my monitor through a DVI -> VGA adaptor. I may be able to test with a different monitor/connector if you think this could be related.

(In reply to comment #42)
> Going fishing: Alex, you said you were unable to reproduce this bug with
> your cayman device. Could it be related to the display connector? I'm using
> the DVI connector from my card and it is connected to a VGA connector on my
> monitor through a DVI -> VGA adaptor. I may be able to test with a different
> monitor/connector if you think this could be related.
You can try , but I doubt it will make a difference. It seem to be related to the display controllers and access to memory when we make changes to the MC, so the connector type shouldn't matter.

(In reply to comment #49)
> What if the problem is not from this code, but underneath? I'll try to
> suspend and resume without having Xorg running. Would that help in any way?
If the commit you bisected is actually the culprit, this is an issue with the memory controller and the displays on the GPU. When we reprogram the memory controller we need to stop all the GPU memory clients. We used to disable the display controllers but this caused additional flicker on boot up and caused hangs on some cards. So we switched to the current method which was recommended by the hardware team. This avoids the flicker by just stopping the MC interface in the displays but leaving them enabled and also fixes hangs related to this on a number of chips. Unfortunately, it seems to cause other problems in certain causes.

(In reply to comment #50)
> (In reply to comment #49)
> > What if the problem is not from this code, but underneath? I'll try to
> > suspend and resume without having Xorg running. Would that help in any way?
>
> If the commit you bisected is actually the culprit, this is an issue with
> the memory controller and the displays on the GPU. When we reprogram the
> memory controller we need to stop all the GPU memory clients. We used to
> disable the display controllers but this caused additional flicker on boot
> up and caused hangs on some cards. So we switched to the current method
> which was recommended by the hardware team. This avoids the flicker by just
> stopping the MC interface in the displays but leaving them enabled and also
> fixes hangs related to this on a number of chips. Unfortunately, it seems
> to cause other problems in certain causes.
Since I had just installed kernel 3.9.0-rc2, I tried a suspend and resume cycle. As expected, it end up frozen (I saw some text from the console, Xorg/Gnome tried to display something, couldn't, reset, couldn't, reset and froze). I then rebooted and limited the runlevel to 2 (no Xorg in the way). From the command line, I launched a pm-suspend. Once I woke up my computer, everything was responsive as expected. Does it help in anyway?

Also, nice to know: a suspend/resume cycle under Gnome-Shell or KDE will end up hung. However, a suspend/resume cycle under XFCE will work correctly.
In other words, the problem seems not to be with the memory controller, but with something that as to do with how the OpenGL/3D/else state is being restored.

(In reply to comment #52)
> Also, nice to know: a suspend/resume cycle under Gnome-Shell or KDE will end
> up hung. However, a suspend/resume cycle under XFCE will work correctly.
>
> In other words, the problem seems not to be with the memory controller, but
> with something that as to do with how the OpenGL/3D/else state is being
> restored.
Both use the 3D engine. The 3D driver uses VM however, so it's probably a duplicate of bug 60439.
*** This bug has been marked as a duplicate of bug 60439 ***