VM Lock-Up after 300 seconds host uptime => Fixed in 2.0.4

I'm having a weird problem with VirtualBox 2. I'm trying to use it to
break my one server into two parts (the native server and a virtual
machine) and split services across the two. I'm using bridged HIF
networking, so the setup should look like two different machines.

The problem I'm seeing is that the virtual machine works fine for a
while (2-3 minutes) and then becomes unresponsive. The VM's CPU usage
goes way down, and it no longer responds reliably over the network or
through VRDP. In the end I have to do a "VBoxManage controlvm testvbox1
poweroff" to get rid of the VM.

The weird thing is the problem only occurs the first time I run the VM
after I reboot the physical machine. After I see the problem and stop
the VM, I can restart it and it works fine from then on -- until I reboot the host again.

When I see the problem, CPU usage drops from the 0.3%-2% load of an idle VM to the VM not showing up in "top" on most refreshes. So, the VM is definitely doing less.

The problem happens whenever I run the VM the first time after boot. Whether I run it as part of the boot scripts or wait a few minutes and run it by hand makes no difference. The second run always works right.

The problem happens whether I run the headless frontend or the standard GUI frontend.

This is weird: during the 2-3 minutes of "good" time before the VM goes all wonky, I ssh (with X forwarding) from my laptop into the VM and run an xterm so I can type commands into the VM and see what's going on. When the VM starts misbehaving, I can still type commands into the xterm, but the output stalls. For example, if I type "ps ax" and hit enter, it prints a few lines of the result and then stops. If I then move the mouse into and out of the xterm, it prints a few more lines. It seems like the focus events that the laptop's X server sends to the VM cause it to wake up and do work. It's as if the VM is dropping interrupts or something.

A VRDP connection works fine for the "good" 2-3 minutes, but completely locks up when the VM goes wonky.

Network connections to the host work fine through all of this.

This happens with both VirtualBox 2.0.0 and 2.0.2.

Thanks for reading all this. I'd be grateful for any help you guys could offer.

Change History

I've never enabled USB on the guest, so this may be irrelevant, but the bug appears whether or not I apply the mountdevsubfs.sh workaround mentioned in the FAQ.

The problem doesn't seem to be related to HIF networking: I see the same problem if I temporarily switch the guest to NAT. I didn't change any of the networking setup on the host, though, so I haven't yet ruled out the process of bringing up the vbox0 interface as a factor.

Recently, I've been playing with two VMs. I start both VMs at the same time from my boot scripts, they work fine for a few minutes, and then they both stop working at the same time. But, if I start one VM by itself and it screws up, I can start up the other and it will work fine. In other words, it's the first VM of any kind that has the problem, not the first run of each individual VM.

I've been testing a near-identical guest image on VB2 running on a Mac host. I have not seen any problems there at all.

I rebooted the host box. After the host box was up, I ran "VBoxManage startvm testvbox1 -type vrdp" and waited for the VM to lock up. I then copied out the VBox.log to get "VBox.log-broken". I then did a "VBoxManage controlvm testvbox1 poweroff" and copied the log to get "VBox.log-broken-after-poweroff".

Then, I restarted the same VM (with the same command as above) and copied the log to get "VBox.log-working". I waited a few minutes to prove the VM was working right, and then issued a "halt" in the guest OS and copied the log to get "VBox.log-working-after-halt".

I turned off the host's bridge, got rid of /etc/vbox/interfaces, and turned my VMs back to the NAT setting. I then rebooted. I still see the lock up. So, it definitely has nothing to do with HIF networking at all.

Thanks for your findings. Are you 100% sure that this is a 2.0.0 regression?

To help debug this problem, you could do the following: start the VM with

gdb -args /usr/lib/virtualbox/VBoxHeadless -startvm testvbox1

When the guest does not respond anymore, force the process to terminate with a core dump. I've updated the instructions at http://www.virtualbox.org/wiki/Core_dump. Remember to allow SUID root processes to dump core and to kill the process with -4 (as described there).
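Roughly, the preparation amounts to the following sketch; the wiki page above is authoritative, and the sysctl name and signal number here are ordinary Linux facts rather than anything VirtualBox-specific:

```shell
#!/bin/bash
# Rough sketch of the core-dump preparation described above; the linked
# Core_dump wiki page has the authoritative steps. SIGILL is signal 4,
# which is why the kill uses "-4".
ulimit -c unlimited            # lift the core-size limit for this shell
core_limit=$(ulimit -c)
echo "core limit now: $core_limit"
# These steps need root, so they are shown as comments only:
#   sysctl -w fs.suid_dumpable=1      # let SUID-root processes dump core
#   kill -4 "$(pidof VBoxHeadless)"   # SIGILL the hung VM to force a dump
```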

Send the core dump to frank _dot_ mehnert _at_ sun _dot_ com. If the compressed file is bigger than 4MB (very likely), try to make it available for me to download somewhere (preferred), or use a file-sharing service (megaupload.com, yousendit.com or similar).

All I know is that I saw the problem when I was running 2.0.0 and 2.0.2. I changed everything to version 1.6.6 and I wasn't able to reproduce the problem. (I did try for a while, too.) I just got through changing back to 2.0.2 and I see the problem again. I don't think I changed anything else of substance.

So, am I 100% sure it's a regression? No. It could be a bug in both versions that is masked by something else in 1.6.6 (a performance issue, perhaps). Or maybe I've made a mistake somewhere, but I don't think so.

I was trying to better quantify how long a VM had to be running before it hung, and I made an interesting (and really weird) discovery. It seems that any VM running on the host locks up exactly as the host's uptime (as shown by /proc/uptime) crosses the 300-second mark. It doesn't matter when in those first 300 seconds the VM starts up.

Test setup:

I have my boot scripts set up to sleep a bit and then start the VM. I wrote a script on the host with a loop that cats /proc/uptime and sleeps a second. I wrote a script in the guest that prints stuff to the screen every couple of seconds.

So, I reboot the host, log in, and start the script that watches /proc/uptime. I wait for the VM to boot, then log in to it and start the other script. That way I can tell exactly when the VM locks up by watching the output of the script on the VM. It always stops when the host's uptime reaches 300. I can vary the amount of sleep in the boot script to show that it's not how long the guest has been running that causes the lock-up -- it's how long the host has been up.
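The host-side watcher amounts to something like this (a sketch that assumes a Linux host with /proc mounted; the real one just loops forever, and the guest-side counterpart is any loop that echoes a timestamp):

```shell
#!/bin/bash
# Sketch of the host-side watcher: print the host's uptime in whole
# seconds, once a second, so the moment of the hang can be read off.
# The iteration count is capped here so the sketch terminates on its own.
for i in 1 2 3; do
    uptime_secs=$(cut -d. -f1 /proc/uptime)   # first field, whole seconds
    echo "host up ${uptime_secs}s"
    sleep 1
done
```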

I'm investigating your core dump. In the meantime, could you please check whether your VM works correctly if you don't start the Guest Additions? That is, make sure that /etc/init.d/vboxadd is not executed within the guest during boot (edit the script and prevent it from executing). If you have X running, you will lose mouse pointer integration, but X should start anyway. Please check this with version 2.0.2.

I made sure no VB processes or modules were ever loaded in the guest. I still get the lock up. Still at exactly 300 seconds after the host boots.

Back in my kernel hacking days, I had problems doing arithmetic on the "jiffies" variable (and the like), which are counters of time since the system booted. Things that worked right when the system had been up for a while would screw up on a freshly booted machine, because my arithmetic would underflow and my code wasn't expecting negative numbers. Could this be something similar?
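As a userspace sketch of the kind of bug I mean (the HZ value and the five-second offset below are made up for illustration; Linux really does initialize jiffies just below the 32-bit wrap point so wraparound bugs surface minutes after boot instead of ~49.7 days later):

```shell
#!/bin/bash
# Simulate a 32-bit jiffies counter that starts just below its wrap
# point, and compare a naive timeout check against the wrap-safe idiom.
HZ=100                       # assumed tick rate for this illustration
MASK=$(( 0xFFFFFFFF ))

# jiffies five seconds before the 32-bit counter wraps to zero:
jiffies=$(( (0 - 5 * HZ) & MASK ))

# Naive timeout arithmetic: "wake me up 10 seconds from now".
deadline=$(( (jiffies + 10 * HZ) & MASK ))

# The naive comparison breaks across the wrap: the deadline has wrapped
# to a small number, so this code thinks the timeout already expired.
if [ "$jiffies" -lt "$deadline" ]; then
    naive="still waiting"
else
    naive="already expired"
fi

# The kernel's time_after()/time_before() idiom survives the wrap by
# subtracting and interpreting the difference as signed 32-bit:
diff=$(( (deadline - jiffies) & MASK ))
if [ "$diff" -gt $(( 0x7FFFFFFF )) ]; then
    diff=$(( diff - 0x100000000 ))
fi
if [ "$diff" -gt 0 ]; then
    wrap_safe="still waiting"
else
    wrap_safe="already expired"
fi

echo "naive comparison:     $naive"
echo "wrap-safe comparison: $wrap_safe"
```

Code that uses the naive comparison behaves correctly once the counter is past the wrap, but misfires on a freshly booted machine -- which matches the "only the first run after host boot" symptom here.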

A very interesting finding. Of course we cannot rule out such a bug. Are there any messages in the kernel log when this hang appears (dmesg)? Are there any services on your host that are executed after 300 seconds? Some cron jobs? Could you have a look at /var/log/daemon.log?

No messages in dmesg or in any file in /var/log/. I turned off all the services I could on the host; it's only running udevd, syslogd, klogd, sshd, and a couple of gettys. So there's no cron or any other possibility of services starting.

I haven't tried any guest other than Ubuntu yet; I guess I can try one. On the other hand, the hang happens even if I stop the boot process so the VM sits at the GRUB screen waiting for me to select which kernel to boot. I've tried with both the standard GUI front-end and the headless front-end (with VRDP). At 300 seconds of host uptime, the cursor keys stop moving the selection between the different kernels.

Going even further, I created a dummy VM with no OS installed. I "boot" it and connect with rdesktop-vrdp, and I see a "FATAL: No bootable medium found! System Halted." message. For the first 300 seconds, I can connect and disconnect repeatedly and I always see the message. After 300 seconds, a new VRDP connection just shows a black screen with no message.

I was poking at all the places jiffies is referenced in the OSE code. I found usage of a macro I'd never seen before called INITIAL_JIFFIES. I looked up INITIAL_JIFFIES in the kernel source and found this: