This issue has been a personal bugbear for some time. What started out to be a seemingly simple patch-monkey task had turned into a tedious, labour-intensive slog – encompassing hidden pitfalls and unexpected, non-standard behaviors exhibited by reasonably well known OEMs (Original Equipment Manufacturers). When I initially found, reported and assigned myself to this bug, I competed a quick Google search and discovered a set of patches [1] which appeared to be pertaining to the same issue I was facing. I quickly applied the patches and re-built the affected kernel for test. The behavior was the same with the patches applied.

LP Bug #592295
omapdss DISPC error: SYNC_LOST_DIGIT

Normally, when a SYNC_LOST_DIGIT IRQ fires only one occurs. The driver code usually cleans up the odd sporadic one or two as they are almost expected from time to time. This bug however, cripples the system, rendering it quite useless. So many IRQs are fired that the console is completely over-whelmed. This particular bug only exhibits this behavior when an HDMI device is plugged in to the port, but power is not applied to it. A SYNC_LOST_DIGIT IRQ can be received for a plethora of reasons. Some of which I’ll mention:

1. If a device has more than one graphical display port, and the HDMI and primary (usually LCD) ports are enabled at the same time, then the HDMI sub-system would display SYNC_LOST_DIGIT. This is fairly irrelevant in this case as the DVI-D port on the Panda Board (the development device where this issue is prevalent) is neither enabled, nor connected.

2. A SYNC_LOST_DIGIT can sometimes display if a downscaled video has been recently viewed on the primary device and the produced overlay is switched to the HDMI monitor. This occurs due to the fact that HDMI does not currently contain downscale support. This issue is known to be rectified by issuing the correct out_width and out_height using sysfs. Again, this is neither the reason for my error, nor does the fix correct the symptoms.

3. The IVAHD (Imaging Video Audio – High Definition) is Texas Instruments’ (TI) own multimedia co-processor DSP (Digital Signal Processor) used to encode/decode High Definition video formats. A SYNC_LOST_DIGIT is normally raised if the first IVAHD frame takes longer than 1ms to decode. If the first frame takes more than 1000ms to appear a dsi_framdone is more likely. This is not believed to be the issue here. Worryingly there is some hacky code in the current kernel which ‘solves’ this issue by commenting out ~30 lines of code. I’ll take a look to resolve this sometime in the near future.

As I’m not a subject matter expert on HDMI, I decided to contact TI’s graphics department. A friendly, helpful girl from India called Mythri replied. I sent her the contents of my start-up logs and she immediately spotted some anomalies. For instance:

Apparently, the omap driver code requires a chunk of memory in the order of 32MB to be reserved at boot-up. Each buffer, of which there can be a maximum of 3, also need to be given a slice of the chunk [2]. The line below, although unrelated to this bug, placed on the kernel command-line eradicated these warnings.

vram=32M, omapfb.vram=0:8M

In its current state HDMI is enabled directly with no interaction with external hardware. This is considered a bug and may be the cause of the spurious SYNC_LOST_DIGIT IRQ fires. Instead, TI provided a patch to change ‘hdmi_panel_enable’ to ‘hdmi_enable_hpd’ (hpd meaning hot-plug-device). HDMI will now only be enabled when a monitor/TV is detected through the physical connect interrupt. At which point the driver will attempt to read the EDID (Extended Display Identification Data) [3] from the monitor/TV and enable it. If no device is detected HDMI will be started in min_mode complete with minimal clocks.

- .enable = hdmi_panel_enable,
+ .enable = hdmi_enable_hpd,

This patch cleared the spurious SYNC_LOST_DIGIT errors, but still nothing was displayed on the connected monitor. I assumed this was a related bug and together with Mythri’s help perused a working fix. LP Bug #592295 was no longer relevant, so I opened another.

LP Bug #605832
LG monitor behaving incorrectly when used in conjunction with the Panda board and HDMI

Mythri suggested adding some more arguments to the kernel command-line. The first argument below dictates which timing code the driver should use. The second specifies whether it should use DVI or HDMI code (zero and one respectively). You’ll notice that hdmicode has been set to 4. This tells the driver to use a very low resolution of 640×480. This should display at least something on any monitor. A table of the remaining timing codes has been provided for reference [4].

omapdss.hdmicode=4 omapdss.hdmimode=1

These new kernel command-line arguments provided a very shaky arcane image, along with the warning message “Out of range” displayed by the monitor. As the monitor is rated at 1080p (AKA: 1920×1080 or Full HD), 640×480 should not be a problem and certainly should not be out of range. The issue is believed to be caused by underlying timing issues. Once again I sent Mythri the log to try and debug exactly what was happening.

In the meantime I decided to have a poke around within sysfs to find any related nodes which may prove useful. When the following command was issued the monitor ‘just worked’. It wasn’t a great resolution, but the “Out of range” error disappeared and the font was crystal in clarity.

Once Mythri was briefed of the findings, she knew almost instantly what was going on. The display was not camping because the vsync and hsync timing values of the monitor do not sync with any HDMI known values, thus the LG W2261VP’s EDID contains non-standard settings. A vsync and hsync check within the driver is necessary to distinguish between 50Hz and 60Hz devices. The fix: instead of reading all EDID blocks and checking each of the horizontal (hsw, hfp, hbp) and vertical (vsw, vfp, vbp) values individually, a checksum will be devised using all of them encompassed. This is the current (borked) implementation:

The reason the monitor refused to display on boot-up and was happy to display correctly using sysfs is that sysfs ignores all EDID values and forcefully displays whichever values are passed in. Hence, on boot-up it was using very weird arcane fall-back values due to the fact that the EDID was unrecognised – resulting in a “EDID TIMING DATA supported NOT FOUND” boot-up message. This was the solution:

Now if the monitor is on during start-up a suitable image is displayed. However, if the monitor is in a Power Saving Mode (PSW) the monitor is taken out of sleep during boot-up, then nods off again once the boot-up sequence has finished. After a few seconds it would wake up then go to sleep again. If left, this behavior would persist indefinitely. A few settings were edited, but its behaviour did not differ. In order to send Mythri a full log which she could use to debug this issue, HDMI was placed into debug mode using the kernel command-line arguments below.

debug omapdss.debug=1

Oddly, when in debug mode the problem disappeared. It was confirmed that this behaviour was reproducible by turning debug mode on and off over subsequent boots. Whenever debug mode was activated the problem would not occur. The same was true in the converse. We decided this must be a timing delay feature. After some investigation and experimentation with her own monitors Mythri was able to isolate the problem and insert a delay in the correct place.

I’ve been working on the bug below on and off for ~4 weeks now. As it’s taken so long, I felt obliged to write a few words detailing the processes and methods used to debug this most horrific bug. The problem occurs when a laptop with an internal SD card reader is put to sleep (suspend or hibernate) when a card is inserted and mounted. Yesterday I discovered that this bug only affects cards which support an ext2 or ext4 filesystem (vfat and ext3 filesystems do not succumb).

This bug and others relating to it have been reported to Ubuntu over 130 times and ~20 bugs are related. A couple of those have an importance rating of ‘High’. Annoyingly, while I was working on this bug someone marked it as a duplicate of LP Bug #477106, so I hopped over and used that as the reference bug instead. It is there you can see the true range of this bug and where all other related bugs point. A list of related bugs can be found at the bottom of this page.

LP Bug #477106
[regression] lucid alpha-2 and earlier freeze upon suspend with sd card plugged
in with some hardware

The first debugging phase was to decipher which device was causing the issue. As the most common factor, I decided the best course of action was to hunt down the offending SD/MMC card reader driver and disable it. An ‘lsmod’ informed me that the sdhci and sdhci_pci drivers were in current operation. I ‘rmmod’ed them and attempted to sleep – good night ThinkPad. I had a dig around in the Power Management Utilities (pm-utils) and found the variable SUSPEND_MODULES which is sourced from the pm configs. I posted this on the bug report which seemed to please most:

Naturally, we’d like this bug to be eradicated forever and for Ubuntu to work seamlessly ‘out-of-the-box’. So I endeavored to find a more permanent solution within the kernel. After turning debug on within the sdhci driver and adding a few of my own DBGs, it was apparent that the system was coming to a screaming halt within the interrupt handler. My initial theory was that the SD card was being accessed following a full filesystem sync and free, thus causing a kernel oops or panic before the system could be fully suspended. I placed in some extra debug code to try and follow the thread of execution, but further attempts were seemingly futile.

(Thanks to cnd) I enlisted the help of ftrace to aid further kernel prodding, by enabling function_graph (Kernel hacking -> Tracers -> Kernel function tracer & Kernel function graph tracer). Once enabled it can be controlled from within sysfs. To enable and disable, echo ‘function_graph’ and ‘nop’ to /sys/kernel/debug/tracing/current_tracer respectively. Once enabled simply cat /sys/kernel/debug/tracing/trace_pipe, either to stdout or a file. This produces output similar to this:

The function tracer almost always ended in a spinlock, which lead me to my next theory, spinlock corruption. I enabled every configuration related to spinlocks and other locking mechanisms within the kernel. These included; Detect soft lockups, Detect hung tasks, Spinlock and rw-lock debugging, Mutex debugging, Lock debugging: detect incorrect freeing of live locks/prove locking and correctness and finally, Spinlock debugging. None of this helped.

Next I tried enabling the system console. Oddly, GDM decided it wanted to jump on the system console instead of it’s own tty. This made things a little more tricky. I killed it and tried to suspend manually using pm-suspend. A message informing me of the suspend would appear, then the console would go to sleep, rendering me blind as to what was happening. (thanks to cking for suggesting) CONFIG_PM_DISABLE_CONSOLE was the solution to this. It prevents the console from being allowed to sleep and subsequently turning off. After enabling this feature, it was plain to see, well… nothing! Even though I was on the system console, no kernel messages were produced. I knew they were there, as many of them had been saved in /var/log/kern.log and /var/log/messages, but they weren’t being displayed. I believe this has something to do with gdm and the differences in suspend regimes. I tried displaying pm-suspend using set +x. The script was displayed and nothing else.

Next on the agenda was to enable system console via serial. As I don’t own a ThinkPad dock, I had to use a serial->USB converter. Once up and running it was clear that the thread of execution was going a great deal further than the sdhci interrupt handler. As it turns out, the debug messages were finishing there because the final sync to disk was already performed. The drives (and therefore the logging) were no longer available. It is worth mentioning that suspend/resume logs are written on the resume, hence if suspend fails, the logs are never written – as with this case.

Now we have serial working, I figured it might be worth giving kgdb a shot. I enabled it in Kernel hacking and tried to configure it over serial. The arguments kgdbwait and kgdboc=ttyUSB0,115200 were placed onto the kernel commandline, but kgdb just spat out “kgdb: Unregistered I/O driver kgdbts, debugger disabled.” I then tried to place them in manually via sysfs, giving: “echo ttyUSB0 | sudo tee -a /sys/module/kgdboc/parameters/kgdboc”, only to receive “No such device”. Supplementing ttyUSB0 for ttyS0 froze my computer solid and presumably passed all control to the serial port (which I don’t have the expansion/dock for). Result, kgdb only supports true serial devices.

I found that it was possible to write the ftrace results out over serial, however, all messages stopped when the USB->Serial and USB went to sleep. I tried to keep them awake by enabling CONFIG_USB_SUSPEND_OFF and hacking early returns into usb_serial_suspend and uart_suspend_port, but it didn’t seem to make any difference.

More debugging options were then added over successive builds, including; PM verbose, ACPI debug, ThinkPad ACPI debug, and Kobject debugging. The latter was a bad idea, as it swamped the console and logs making them unreadable. No further useful data was produced by enabling them.

Finally I decided to put an early return to the sdhci suspend routine, just in case the thread of execution was reaching it – this worked, but resume was not possible, therefore neither was access to the logs. An early return in the resume routine allowed a successful resume of the system, providing me with access to log functionality. Now I had printk I would follow the execution path with an advancing return statement, placing breadcrumbs (printks) behind it. This is where debugging became _slow_ but steady.

Due to the fact that sdhci functionality is plugged in as a module, debugging pointers to functions was a little tricky. In the end I compiled all MMC/SD code into the kernel as built-ins. This allowed me to decode them by looking them up in the System.map. It then seemed as though the MMC remove routine was being called recursively. I placed in a static variable and an “if (count++ > 10) return;” to unwind any accidental recursion.

I followed the thread of execution though the following path:

(Remember: I discovered this by advancing a return statement and commenting out functions as I went. This kind of debugging is very _slow_ progress.)

I eventually found myself in an inline function called root_tag_set, which contained the following line of code.

root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));

I tested this line to oblivion, trying to find out why it was failing. It turned out that it wasn’t failing at all, but it was making a difference. When this line was commented out suspend/resume would work perfectly, with the code left in, the system would come to a crashing halt. Another piece of code must be using the tag bit later down the line. The tag bit is used to mark memory pages which are dirty, or under write back [1]. Result, false alarm.

So I backed up to sync_filesystem and followed through another failing route I found. This went as follows:

I’m sure you can imagine the kind of debug vomit you’d expect from placing printks in here. Result, another dead end.

Unsure if it was happening during SD unload or something was referencing it after in the common code areas, I decided to insert debug in the tail-end of sdhci_suspend_host. But what? printks were no good. I had to run the code with the error residing, in which case I wouldn’t receive any output unless there was a successful resume. I ended up stealing a small for-loop (from cking) which flashed the CapsLk LED situated on the monitor several times. Result, the error was occurring somewhere during the SD unload.

I found another piece of information today which may prove to be useful:

- This problem only occurs when _both_ the HDD and SD drives are mounted. If either one of them are unmounted, the problem does not surface. I uncovered this by booting a live CD and suspending. Under normal live CD operation the system suspends without issue. When I manually mount the HDD, then try to suspend the computer crumbles.

From here I had two possible routes to follow up on:

- Working backwards, up the thread of execution from the tail end of sdhci_suspend_host (LED flashing).

- Change the suspend/resume sequence, so the sdhci suspends _before_ some of the useful debugging tools.

But before I had chance to act on them I met an Israeli gentleman named Maxim Levitsky. This wasn’t quite a chance meeting, I actually emailed him because it looked like he was toying with the same symptoms as me. It turned out has he is a bit of a subject matter expert in suspend/resume bugs within the Linux kernel (although, he denies it). He has written a patch which allow system logs to be printed into an area of RAM specified on the kernel commandline. Whenever a deadlock or hang occurs the system reboots automatically allowing power to be applied throughout and RAM to be persistent over said reboot. When the device reboots, the specified area of RAM can be read via sysfs and piped though strings for clarity. This one’s going in lag’s toolbox for sure.

So, what had been alluding us for all this time? The answer lies in the MMC subsystem. During a suspend operation, multiple areas of the kernel are being bedded down simultaneously. When the MMC subsystem is removed, a del_gendisk is attempted which eventually calls prepare_to_wait, but because the kernel and userspace threads have been frozen due to the suspend they will never respond. Result, OS fail!

To rectify situations such as these there is a feature called pm_notify. Once registered, pm_notify routines will be called _prior_ to any imminent sleep action (suspend or hibernate). It is here we must remove the SD card and sync its filesystem. This will ensure that the other kernel and user threads will respond when waited on.

For anyone interested the working patch written by Maxim Levitsky can be found here: