Bug Description

After upgrading to lucid, I find that xorg only starts successfully about 50% of the time. I am using nvidia-current. (Tried switching to nv and nouveau, but neither one worked properly.) I am attaching two xorg log files, one from a successful startup and one from an unsuccessful one.

We're now at 2.6.32-21; have you tried this again? I am running the same driver (on a C51, though) on 64-bit and can't reproduce your problem.
If it fails again, it would be important to have complete information, which you can attach by running:
apport-collect 561049

Hi, TJ and Fabián Rodríguez -- Thanks for your replies. Yes, I am still having the problem. I just did an apport-collect (see above), and am attaching kern.log. This was with kernel 2.6.32-19, however. I will update to 2.6.32-21 and see if I can reproduce it again.

After an apt-get update and an apt-get upgrade, I'm still experiencing intermittent failures to boot properly. Out of 9 boots, I had one failure, which was different from what I was experiencing before. I get a black screen with nothing on it but a blinking _ cursor. If I press Ctrl-Alt-F7, I see this message: "fsck from util-linux-ng 2.17.2 ..." I don't know if this is the same bug or a different one. I've seen reports that there is a problem with Plymouth that occurs when there is an fsck. This might be that problem. I'm unable to collect apport info or kern.log, because I don't get a login prompt. I'll try some more and see if I can get the same behavior I saw before.

Okay, I got the original bug to happen again, with my freshly updated system. The above apport info is from this occurrence of the bug, and I'm attaching the kern.log file. Since this is occurring with a freshly updated system, I'm going to update the status of the bug from incomplete.

So there is a missing device node. This could be because the nvidia
(nvidia-current) kernel module failed to load successfully, or because
of some problem with the X driver.

The documentation says (/usr/share/doc/nvidia-current/README.txt.gz):

Q. How and when are the NVIDIA device files created?

A. Depending on the target system's configuration, the NVIDIA device files
used to be created in one of three different ways:

o at installation time, using mknod

o at module load time, via devfs (Linux device file system)

o at module load time, via hotplug/udev

With current NVIDIA driver releases, device files are created or modified
by the X driver when the X server is started.

By default, the NVIDIA driver will attempt to create device files with the
following attributes:

UID: 0 - 'root'
GID: 0 - 'root'
Mode: 0666 - 'rw-rw-rw-'

Existing device files are changed if their attributes don't match these
defaults. If you want the NVIDIA driver to create the device files with
different attributes, you can specify them with the "NVreg_DeviceFileUID"
(user), "NVreg_DeviceFileGID" (group) and "NVreg_DeviceFileMode" NVIDIA
Linux kernel module parameters.

For example, the NVIDIA driver can be instructed to create device files
with UID=0 (root), GID=44 (video) and Mode=0660 by passing the following
module parameters to the NVIDIA Linux kernel module:
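
The README excerpt above stops before the parameters themselves. Going only by the parameter names and values it lists, the options would presumably be assembled as in the sketch below. The helper name and the module name given to it are assumptions (on this system the module file is nvidia-current.ko, but the name modprobe expects may differ, so check with lsmod first).

```shell
# Print a modprobe.d "options" line using the values quoted in the README text:
# UID=0 (root), GID=44 (video), Mode=0660.
# Hypothetical helper; the module name argument is an assumption, not from the README.
nvidia_options_line() {
    module="${1:-nvidia}"
    printf 'options %s NVreg_DeviceFileUID=%s NVreg_DeviceFileGID=%s NVreg_DeviceFileMode=%s\n' \
        "$module" 0 44 0660
}

# Only prints the line; nothing on the system is changed.
nvidia_options_line
```

To take effect, the printed line would have to go into a file under /etc/modprobe.d/ and the module would have to be reloaded.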

How do I get into recovery mode? Googling tells me to hit escape when prompted to do so by GRUB. I don't get any messages from GRUB displayed on my screen. I get BIOS messages, then the screen stays black for a long time, and then I get the purple gdm screen.

On Fri, 2010-04-23 at 21:24 +0000, bcrowell wrote:
> How do I get into recovery mode? Googling tells me to hit escape when
> prompted to do so by GRUB. I don't get any messages from GRUB displayed
> on my screen. I get BIOS messages, then the screen stays black for a
> long time, and then I get the purple gdm screen.

Hold down the Shift key when BIOS loads GRUB. GRUB will detect the shift
key and display the menu.

When I boot into recovery mode, the nvidia module is already loaded, but the device nodes don't exist. I then unload and reload the nvidia module. Nothing special happens, except for the warning "WARNING: Could not find old name in /lib/modules/2.6.32-19-generic/updates/dkms/nvidia-current.ko to replace!" Then I start X, and once X is running the device nodes have been created. I used the attached shell script to automate the process. Out of 36 boots, the results were:

31 successful starts of X
2 cases where I never got a shell prompt
2 cases where X failed to start, but I wasn't set up to do the logging
1 case where I think I got the necessary information. Not completely sure about this one, however, because it was one of my first attempts. I was still messing around with the perl script, and I wasn't paying close enough attention to what was happening.

In all cases, I was doing warm reboots.

I will post a log from a case where X worked, and the one where it failed.
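
For anyone wanting to repeat the recovery-mode procedure described above, it could be sketched roughly as follows. This is not the attached script, only an illustration: the function names, the log path and the module name are assumptions, and /dev/nvidiactl and /dev/nvidia0 are the usual node names on a single-GPU system.

```shell
# Report which NVIDIA device nodes exist under a directory (default: /dev).
report_nvidia_nodes() {
    dev_dir="${1:-/dev}"
    for node in nvidiactl nvidia0; do
        if [ -e "$dev_dir/$node" ]; then
            echo "$node: present"
        else
            echo "$node: missing"
        fi
    done
}

# One logging cycle: record the node state before and after reloading the module.
# Needs root (e.g. the recovery-mode shell). Module name and log path are assumptions.
# Defined here but deliberately not called.
reload_and_log() {
    module="${1:-nvidia-current}"
    log="${2:-/root/nvidia-nodes.log}"
    {
        date
        echo "before reload:"; report_nvidia_nodes
        modprobe -r "$module" && modprobe "$module"
        echo "after reload:"; report_nvidia_nodes
    } >>"$log" 2>&1
}

# Read-only check of the current state.
report_nvidia_nodes
```

After starting X, running report_nvidia_nodes once more shows whether the X driver created the nodes on that boot.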

bcrowell: You must be using an apt mirror that has not been updated in quite some time. Please change it to one of the primary mirrors (like http://archive.ubuntu.com/ubuntu) and update with sudo apt-get update and then sudo apt-get dist-upgrade. Do you have /usr mounted on a separate partition? This sounds like a bug that was fixed quite some time ago, where the blacklist wouldn't get loaded if the separate partition it was on was being fscked, and you just have not received the updates yet because of the mirror problem.

Thanks, Robert Hooker, for the suggestion in #52. But could you explain what makes you think that I have a stale version? My sources.list file points to us.archive.ubuntu.com/ubuntu, which is identical to what you suggested except for the "us." on the front. Is there something specific in the log files I've posted that convinces you conclusively that my system is in a stale state, or are you just guessing based on the symptoms of the problem? I updated from us.archive.ubuntu.com/ubuntu this morning, before my most recent reports.

TJ wrote:
"Well it gets more confusing! Both logs show the device nodes created. Are you *sure* the failed_log isn't accidentally a successful run?"

No, I'm not sure. That's what I was saying in #49. After spending several hours doing reboot cycles, I got a few failures to boot properly, but most of them were cases where I didn't even get a login prompt after my system was done booting. As I said in #49, the case I posted was one of the ones from early on, where I wasn't paying proper attention and wasn't sure I noticed correctly what was going on. That was the only one where it failed to start X, but also gave me a login prompt.

Recently, when I've rebooted normally (not in recovery mode), the failure to start X has been happening fairly often. Here are some statistics from a few days ago:
34 successful boots
2 boots that didn't result in a login prompt
6 boots that gave a login prompt but didn't start X

I see several possibilities here:
(a) There are two separate bugs, one of which causes X to fail, and one which causes me not even to get a login prompt.
(b) Same as a, but the X bug doesn't occur in recovery mode.
(c) There is only one bug, and I was just "unlucky" in not being able to reproduce it today after several hours of rebooting.
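
One rough way to weigh (c) against (b) is to ask how likely it would be to log at most one X failure in 36 recovery-mode boots if the failure rate were the same as in the normal boots above (6 of 42). The sketch below does that arithmetic with awk. It assumes boots are independent with a fixed failure rate, which warm reboots may well violate, and the function name is made up for illustration.

```shell
# Probability of seeing 0 or 1 failures in n boots when the per-boot failure
# rate is fails/total: (1-p)^n + n*p*(1-p)^(n-1).
# Assumes independent boots; hypothetical helper for illustration.
prob_at_most_one() {
    awk -v n="$1" -v fails="$2" -v total="$3" 'BEGIN {
        p = fails / total
        q = 1 - p
        printf "%.4f\n", q^n + n * p * q^(n - 1)
    }'
}

# 36 recovery-mode boots, with the rate taken from 6 X failures in 42 normal boots.
prob_at_most_one 36 6 42
```

With these figures the result comes out at a few percent, so plain bad luck is possible but not the most comfortable explanation.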

It takes less time for me to simply reboot than it does to go through the whole procedure involving recovery mode and logging. I'm going to try doing 30 cycles the quick way, and see if the X problem is still occurring or not.

Okay, I just did two reboots the normal way (not using recovery mode). The statistics were as follows:
1 boot that didn't result in a login prompt
1 boot that gave a login prompt but didn't start X

So to summarize what I've found so far:
(1) The bug is still present in a system freshly updated this morning from us.archive.ubuntu.com.
(2) The bug does not seem to occur when starting up in recovery mode. (In the session described in #39, I saw it either 0 or 1 times out of 36 attempts in recovery mode. In the session described in this post, I saw it on 1 out of 2 attempts when not using recovery mode.)

I'm going to switch the status back from Incomplete to In Progress, since I think Robert Hooker may have been making an incorrect inference in #54. Robert Hooker, maybe you could clarify: do you believe that us.archive.ubuntu.com is out of date compared to archive.ubuntu.com? This would surprise me (a lot), but if it is the case, I can certainly point my sources.list to archive.ubuntu.com and do another update.

As a test to see whether my apt server might be out of date, I tried deleting the "us." from all the URLs in my sources.list. After an apt-get update, an apt-get upgrade resulted in "0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded." So it looks to me like there was no problem with my apt server being stale.

and saying you were not receiving a kernel upgrade when the current kernel was 2.6.32-21. 2.6.32-20 was published 10 days before you said that, so there is something wrong elsewhere if you are not receiving updates, and an out-of-date mirror was the most likely culprit.

However, your comment just now sounds like there is another issue, since you are getting the "9 not upgraded" part. Can you please do a sudo apt-get dist-upgrade (not just sudo apt-get upgrade) and see what's offered? Do you have the package linux-generic installed?

Aha! An apt-get dist-upgrade brought me to kernel 2.6.32-21. Yeesh, I'm a little grumpy that it was so hard to find out about this. Maybe this was in a README that I didn't read? I had no idea that a dist-upgrade would do anything more than an upgrade, once you were already on the current version of the OS (letter of the alphabet in the ubuntu naming scheme).
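
For others who hit the same thing: apt-get upgrade never installs packages that are not already installed, and each new kernel ABI (2.6.32-21 here) arrives as a new package, so the kernel metapackage gets held back and shows up in the "not upgraded" count. apt-get dist-upgrade is allowed to install new packages. A small sketch for spotting this, assuming English (LC_ALL=C) apt output; the function name is made up:

```shell
# Extract the "not upgraded" count from apt-get's summary line on stdin.
# Hypothetical helper; assumes untranslated apt-get output.
held_back_count() {
    sed -n 's/.* and \([0-9][0-9]*\) not upgraded.*/\1/p'
}

# Example with the summary line quoted earlier in this report:
echo "0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded." | held_back_count

# On a live system (-s only simulates and changes nothing):
#   LC_ALL=C apt-get -s upgrade | held_back_count
#   LC_ALL=C apt-get -s dist-upgrade | grep '^Inst linux-'
#   uname -r    # kernel actually running
```

A non-zero count after apt-get upgrade is the hint that a dist-upgrade has something more to offer.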

I just did 30 reboots with no problems, so it looks like I'm all set now. Thanks, Robert Hooker, for the help!

Xorg.0.log is from one system boot, while Xorg.1.log and Xorg.2.log are from the previous system boot.

Also attached is the messages file, which the nvidia driver suggests reading for an error message, but I can't find one that doesn't show up under a successful boot up, as well.

I did find in comparing the dmesg and dmesg.0 (the current and previous boot, respectively) that when the failure occurs, it seems the nForce2 driver is failing ("nForce2_smbus 0000:00:03.2: Error probing SMB2.") in dmesg.0 while in dmesg no such line can be found. Do you think it could be some sort of race condition on communicating with the nForce2 chipset between the nForce2 driver and nvidia's proprietary graphics driver? (It is a black box, it could be touching who knows what.)
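
The comparison described above can be made repeatable by stripping the kernel timestamps and listing the messages that occur in only one of the two logs. This is a sketch using standard tools; the function names are made up, and it assumes the logs carry the usual "[   12.345678] " timestamp prefix.

```shell
# Remove the leading "[   12.345678] " timestamp and sort, so that identical
# messages from different boots compare equal.
strip_ts() {
    sed 's/^\[ *[0-9.]*] //' "$1" | sort -u
}

# Print messages that appear only in the first log (e.g. the failed boot).
only_in_first() {
    a=$(mktemp)
    b=$(mktemp)
    strip_ts "$1" >"$a"
    strip_ts "$2" >"$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage, with the previous (failed) boot first:
#   only_in_first /var/log/dmesg.0 /var/log/dmesg
```

Lines such as the nForce2_smbus error would then show up in the output only if they are really absent from the successful boot.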

Like ISV_Damocles, I'm finding that the problem wasn't actually fixed at the end of April. I've experienced it as recently as May 7, although it does seem much less frequent than it was in April. I'm going to change the status of the bug from Fix Released to In Progress.