This is a critical problem for Xen users.
The attachment is the patch for hardy's broken Xen netfront driver.
This fixes the duplicated memory allocation on the older Xen hypervisors,
and enables NAPI for correct message receiving.
Put this under debian/binary-custom.d/xen/patchset in the source tree.

As far as I can tell from reading the source code, the netfront driver available from kernel.org has already fixed this problem.
However, the current source code in Ubuntu (and perhaps in Debian) differs greatly from the one in kernel.org.

Could this please be released in time for Hardy? It makes the difference between an essentially useless DomU and a working DomU. See Bug 204010 for the types and numbers of people that have been affected.

I can now confirm that your packages helped me to get the domU network working.
Interestingly, testing an Ubuntu domU with another distro (SL5.1) as the host OS can reveal the problem more precisely
and lead to a fix more quickly than testing with Ubuntu.

Console is /dev/xvc0 - Just edit /etc/inittab to run 1 getty on console,
and comment out all the rest, add console=xvc0 to 'extra' line in domain
config file in /etc/xen.
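
As a rough illustration of the 'extra' line mentioned above: domain config files read by xm are evaluated as Python, so the setting is a plain string assignment. This is only a sketch of that one line, assuming nothing else needs to be passed to the kernel.

```python
# Fragment of a domain config file under /etc/xen (evaluated as Python by xm).
# Only the console argument comes from the comment above; keep whatever other
# kernel arguments your own 'extra' line already carries.
extra = "console=xvc0"
```

If your 'extra' line already has options, append console=xvc0 to them instead of replacing the whole string.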

Also, install udev in your DomUs, and create a minimal static /dev for
udev to mount on top of, so that you get console messages after init
has started. The console on the kernel config line will give you a
console while an initrd is loaded.

To get a linux 2.6.25 Xen kernel working, strip the vmlinux in the top
of the source tree after compile and compress:

I cannot believe it.
Ubuntu has been released and the bug is still there.
This bug is critical for xen users. That is also what hirano said in his first response.
It is not even in the release notes.
Shame on Canonical.

I am also quite outraged that Ubuntu 8.04LTS was released with such a major bug. This basically disqualifies 8.04LTS for a very large number of server deployments and puts Ubuntu's reputation as a well tested, stable server distribution at stake.

Can someone from Canonical (or whoever has some insight into the matter) please at least comment on if and when we can see a fix for this issue?

I was running into the exact same problem trying to run a PVM Ubuntu-8.04 within a CentOS-5.1 Dom0. I was able to install the fixed package in the DomU (running in HVM mode), copy the kernel and initrd to the Dom0 host, fix the guest configuration to point to the fixed kernel, and restart the DomU. It now works great.

As far as I can tell from the activity log, Colin scheduled the fix for 8.04.1.

I think waiting another two months for a fix for such a major bug, which isn't even mentioned in the release notes, is too long. This prevents production environments from being upgraded or installed and gives lots of people a headache. Please consider rethinking this decision.

On Tue, 2008-05-06 at 19:34 +0000, Bart Heinsius wrote:
> Colin Watson wrote:
>
> Accepted into hardy-proposed
>
> does this mean that the fixed kernel is now in the hardy-proposed
> repository and that I can safely upgrade Hirano's custom made kernel
> with the proposed one?
>

I too can confirm that the hardy-proposed kernel for 32bit systems doesn't fix the issue here either. I can see RX/TX packets on the vif1.0 interface in Dom0, but there is still no reply to pings / traffic from inside or outside DomU.

Sorry, that was everything I got from Xen. The machine wasn't started at all, and the boot process was cut off at that point as well.
I changed back a test setting in the hwclock scripts in the VM root, so now there is a little more:

Now Xen states that the machine is already present, but it isn't visible in 'xm top'.

I tried Hirano's 32bit kernel and had no luck, now I'm on the hardy-proposed kernel and still have no luck.

To answer your question, janevirt, I'm using 2.6.24-17-xen. I'm installing via xen-tools, which seems (to my surprise) to be booting using dom0's kernel. I'm going to investigate setting up a domU which boots off of its own older kernel, probably the gutsy one.

The "domain already present" thing was a configuration fault of mine... I had an on_crash=reboot in the configuration of that domain,
so the server tried to start it repeatedly...

@russel:
I already tried a 32bit hardy system with the gutsy Xen kernel as domU, with no luck.
The boot process gets stuck at "setting system clock". The mentioned workaround for that isn't working for me either.

PS: why is a domain that shows up as "paused" in 'xm list' not visible in 'xm top'?

I already posted in a duplicate bug; I hadn't seen that this is the main thread.

I was fighting with the same error described here. I installed Hirano's kernel (32-bit version) on my AMD Athlon X2 5600+ server.
With that kernel I am now able to get my network up and running, however the network connectivity is still broken for large data transfers.

I use right now Hardy Heron for Dom0 and my DomU Clients, all with the patched Kernel.

My problem is:

Every outgoing tcp-connection (from a domU) stalls and finally hangs, as soon as there is more data than a couple of bytes going out.

Doing an "ls -laR /" in an ssh session is already enough. As soon as I transfer a large file in a second session from my DomU guest via FTP to another physical server on the internet, IP connectivity to the DomU is no longer possible until I kill both processes and wait some time.

It is no problem transferring large files etc. TO my domU server; the problem only occurs when sending data out.

I am running the routed network with public IP addresses on all interfaces. I was now able to trace the issue a little further:

Running an scp from a virtual domU host to another physical server in the same datacenter, I had four tcpdumps sniffing.
I stopped the scp when the connection was stalling.

So I realized that the packet loss happens in the dom0 and no longer in the eth0-vifX.X connection.

However, communication with the dom0 itself is no problem, neither in nor out. Comparing the dumps from both interfaces of dom0, I see that every 3rd to 6th packet is missing on eth0 and is eventually resent if not confirmed with an ACK.
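
For anyone who wants to repeat this comparison, here is a minimal sketch of how the two dom0 dumps could be compared once the TCP sequence numbers have been extracted from each. The function names and the sample numbers are made up for illustration; they are not taken from the real dumps.

```python
from collections import Counter


def dropped_copies(vif_seqs, eth0_seqs):
    """Count, per sequence number, copies seen on the vif but not on eth0.

    Counter subtraction keeps only positive counts, so a segment that was
    sent twice on the vif (original plus retransmit) but reached eth0 only
    once is reported with a count of 1.
    """
    return Counter(vif_seqs) - Counter(eth0_seqs)


def loss_ratio(vif_seqs, eth0_seqs):
    """Fraction of vif-side segments that never made it onto eth0."""
    if not vif_seqs:
        return 0.0
    return sum(dropped_copies(vif_seqs, eth0_seqs).values()) / len(vif_seqs)


# Hypothetical sequence numbers: segment 2897 is lost once, then resent.
vif_capture = [1, 1449, 2897, 2897, 4345]
eth0_capture = [1, 1449, 2897, 4345]

print(dict(dropped_copies(vif_capture, eth0_capture)))
print(loss_ratio(vif_capture, eth0_capture))
```

Counting copies rather than just checking membership matters here, because a lost segment does show up on eth0 eventually once it has been retransmitted.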

I have no firewalling (empty iptables) in dom0

Any ideas??

Just repeating: There is no problem in the other direction: From outside TO domU....

The kernel linux-image-2.6.24-16-xen_2.6.24-16.30zng1_i386.deb made by HIRANO Takahito boots successfully and networking seems to work ok. This is with a Centos 5.1 dom0 on the same i386 box as was used in my last report (for 2.6.24-17.31)

I am experiencing the same problem as Wolfgang two posts above. Unsure if it is related to bug 218126.

Running stock feisty dom0, upgraded domU to hardy and now having this problem. Tried Hirano Takahito's kernel on the domU, yet the problem persists.

The server is quite heavily loaded as a web and mail server, but operates fine until one of the 12-hourly rsyncs runs to back up the database. At the backup times, there is about a 50% chance that the network connection on the domU will become inactive during the transfer. The only way I have found to restart the network connection is to destroy and recreate the domU.

Through the domU console, tcpdump shows arp packets being received by the domU but nothing sent out. ifconfig shows number of packets received increasing slowly (with the arp requests) and number of packets sent not increasing.

This domU has run fine on feisty for a considerable time, and only started to exhibit this problem after upgrading to hardy.

I'm seeing another effect. Not sure if this is the correct bug for it or not, and my apologies if it's not.

I'm running Xen on 32 bit hardy, on an HP Proliant DL 140 machine. Using Hirano's kernel or the 2.6.24-17-xen from hardy-upcoming, networking "works". But only sort of. If I try to transfer a file of any appreciable size (say 10M) between any of the domUs using any protocol, the network stalls out after about 2M go through. If it's only a very small file (say 100k) it will make it.

I have no problems with the 17.31 kernel on an amd64 domU. And dom0 is running the hardy kernel (-16.xx), also amd64. I'm using Debian's hypervisor (3.2.0-3~bpo4+2).
And I DO have checksumming offload off (as described using /etc/network/interfaces). It was necessary with previous kernels and I didn't even think about it.

There is a difference between checksum offload on and off. With offload OFF I have a throughput (domU->dom0) of 18M/s; with offload on it is only 12M/s (using a 36M file).
From domU to domU (same host) it doesn't matter whether offload is on or off.

Yeah, unfortunately for me, I'm planning on just wiping the box and trying to redo things using KVM at this point. Which I don't like for performance reasons. But the reason I found this issue is that I was trying to set up a mini-cluster for mogilefs to do development against. And if I can't store files of any real size (which I can't like this) then this setup is useless to me. If I turn off the checksumming, then I can move files around, but mogilefs stops working reliably in that case, so I'm pretty much up a creek. :(

I'm not entirely sure about this, but it might be necessary to turn off checksum offload in all doms on the same machine. If I remember correctly, it is required to turn off the offload in dom0 as well. I know I have it off in all domUs and dom0.
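
For reference, a minimal sketch of turning the offload off with ethtool from a script. The comments here do not say which offload setting was changed; this assumes transmit checksum offload (tx), and eth0 is only a placeholder interface name. It needs root and would have to be run in dom0 and in each domU.

```python
import subprocess


def tx_checksum_off_command(interface):
    """Build the ethtool invocation that disables TX checksum offload."""
    return ["ethtool", "-K", interface, "tx", "off"]


def disable_tx_checksum(interface):
    """Run the command; raises CalledProcessError if ethtool fails."""
    subprocess.check_call(tx_checksum_off_command(interface))


# Example (needs root and ethtool installed), run in dom0 and in every domU:
# disable_tx_checksum("eth0")
```

To make the setting survive a reboot, the same ethtool command could be hooked into /etc/network/interfaces with a post-up line for the interface.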

I tried turning off checksumming, first just in domU, then in both dom0 and domU, but the network still stopped.

However I have figured out what the problem is. The domU's network is becoming unresponsive immediately after 4GB of data has been transmitted by the eth0 device, as reported by ifconfig, and the tx counter wraps around to 0. This happens about once every 24 hours on my machine, and most often happens when sending a big file - however it's not the throughput that is the problem.
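
To put numbers on the wrap-around: a small sketch, assuming the tx byte counter shown by ifconfig is a 32-bit unsigned value (which matches the 4GB figure above, but is my assumption, not something confirmed from the driver source).

```python
COUNTER_BITS = 32                 # assumption: 32-bit unsigned byte counter
WRAP_BYTES = 2 ** COUNTER_BITS    # 4294967296 bytes, i.e. 4 GiB


def wrapped_counter(total_bytes_sent):
    """Value a 32-bit counter would show after sending this many bytes."""
    return total_bytes_sent % WRAP_BYTES


def hours_until_wrap(avg_bytes_per_second):
    """Hours of traffic at the given average rate before the counter wraps."""
    return WRAP_BYTES / avg_bytes_per_second / 3600.0


# Average outgoing rate at which the counter wraps once every 24 hours.
rate_for_daily_wrap = WRAP_BYTES / (24 * 3600)

print(wrapped_counter(WRAP_BYTES))   # counter is back at 0 after exactly 4 GiB
print(int(rate_for_daily_wrap))      # bytes per second
```

So an average of a bit under 50 kB/s outgoing is enough to hit the wrap about once a day, which fits a busy web and mail server regardless of how fast any single transfer is.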

Should I be opening a new report for this bug, or is it related to the above?

On THURSDAY, I upgraded to Hardy, and Xen networking between domU and dom0 stopped working immediately. I tried everything until this evening, when I found this bug description. After I installed Linux 2.6.24-17-xen #1 SMP x86_64 GNU/Linux from hardy-proposed, networking between domU and dom0 works like before.

This bug has been known for TWO MONTHS (see #204010) and there were no release notes about it. It is not even considered a serious bug! Lots of people have many Xen based servers and this bug can stop all of them. And it's still not resolved, still not in the main tree. And this is LTS. I am really shocked; I have to reexamine my confidence in Ubuntu.

Tomorrow I will test the problem with large files that some are talking about and let you know. Thank you again for the solution; I hope it will soon be distributed as a normal update.

I am running Hardy dom0 and domU's on i386 with the proposed kernel
without any troubles at all.

This was just for the record.

--
xen guest kernel bug: 'kernel BUG at /build/buildd/linux-2.6.24/debian/build/custom-source-xen/drivers/xen/netfront/netfront.c:785'
https://bugs.launchpad.net/bugs/218126
You received this bug notification because you are a direct subscriber
of a duplicate bug.

Status in Source Package "linux" in Ubuntu: In Progress
Status in linux in Ubuntu Hardy: In Progress

It is not accurate to say that this bug is not considered serious; this bug is marked to be fixed before the first point release, and an attempt at fixing it has been included in the first stable release update of the kernel in 8.04.

However, the patch included in 2.6.24-17 is evidently incomplete, so a fixed kernel is contingent on another stable release update of the kernel (which is already in the works).

Based on comments, I'm not sure whether a sufficient patch is currently committed to the Ubuntu kernel team's git tree, but this will certainly be followed through on.

I am running Hardy dom0 and domU's on i386 with Hirano's kernel without
any trouble at all for my mail, web and name servers.
Hardware: HP ML110 (7 domU's), ML115 (4 domU's) and Tyan GS21 (4 domU's)
I'm using domU's in TAP:AIO and LVM configurations without any
problems.

Thank you, Hirano, for your kernel.

Basically, if I do a large rsync between a domU (running gutsy) and an outside server, both my domU and dom0 network connections will fail. If I go to sit at the terminal, everything works correctly (minus networking, of course).

My Dom0 is running Hardy with a 2.6.24-28-xen kernel, using bridge networking specifically set to use only one of my NICs.

Hirano's kernels were good but still unstable, with 3-4 oopses a day on my Dell SC1425. My own experience was that I had to use the stock Xen kernel from xen.org (compiled most easily using the instructions from Gentoo's wiki) before I saw any stability or decent performance on my hardy system. This might be a good choice for anyone else needing a stable system until this is properly tested. I'm disappointed and surprised that such an unstable kernel made it into Ubuntu.

I found this thread and attempted to follow it. I am quite new to Ubuntu and Xen.

I have: dom0 - Centos 5.1
I would like to install Ubuntu hardy as a domU. The only Ubuntu version I can install properly with no problem is dapper (6.06).

What I have done so far:

1. Install dapper as DomU (using the virt-install Xen CLI tool)
2. Upgrade dapper - using Ubuntu instructions to upgrade.
3. Dapper is upgraded successfully to a kernel: 2.6.15-51-amd64-generic
4. I tried to upgrade to 8.04 - gksu "update-manager -c"
5. The long process completed successfully and I rebooted.
6. Upon reboot the new kernel got stuck and would not boot - it seems unable to load the virtual hard disk driver

7. I found this thread and downloaded the kernel and installed via: dpkg -i <file>

8. It generated the kernels:
vmlinuz-2.6.24-19-xen

and initrd image:
initrd.img-2.6.24-19-xen

9. When I attempt to boot, it does not succeed - the process pauses with the error: