Bug Description

[SRU Justification]

== Impact ==

Since 05c0b86b96 "ipv6: frags: rewrite ip6_expire_frag_queue()" the 16.04/4.4 kernel crashes whenever that function gets called (on busy systems this can be every 3-4 hours). While this potentially affects Cosmic and later, too, the fix differs on later kernels (Bionic is not yet affected as it does not yet carry updates to the frags handling).

== Fix ==

For Xenial and Cosmic, the proposed fix would be additional changes to ip6_expire_frag_queue(), taken from follow-up changes to ip_expire().
For Disco, I would hold back because we have a backlog of stable patches there, and depending on what gets backported to 5.0.y, a simpler fix may be possible.
For current development kernels, one just needs to ensure that the following upstream change is included: 47d3d7fdb10a "ip6: fix skb leak in ip6frag_expire_frag_queue()".
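For reference, the shape of that upstream fix is roughly the following (a paraphrased sketch of 47d3d7fdb10a, not the literal patch): instead of taking an extra reference on an skb that is still linked into the fragment queue, the head fragment is unlinked first, so the ICMP error path owns it exclusively.

    /* Sketch, paraphrased from upstream 47d3d7fdb10a (not the literal
     * patch): pull the head fragment out of the queue before using it,
     * instead of calling skb_get() on a still-queued skb.
     */
    head = inet_frag_pull_head(&fq->q);  /* unlink head from the queue */
    if (!head)
            goto out;

    head->dev = dev;
    spin_unlock(&fq->q.lock);

    icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
    kfree_skb(head);                     /* drop the now-exclusive reference */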

== Testcase ==

Unfortunately the crash could not be reproduced locally. However, a test kernel with the proposed fix applied showed good results in production testing (see comments #37 and #38).

== Risk of Regression ==

The modified function is only called in rare cases, and the positive testing in production covers those paths, so I would consider the risk low.

---

Description: Ubuntu 16.04.6 LTS
Release: 16.04

After upgrading our server to this kernel we experience frequent kernel panics (see attachment), roughly every 3 hours.
Our machine has a throughput of about 600 Mbit/s.
The panics are in the area of ip6_expire_frag_queue.

According to the reporter of the bug above, it also occurred when using a kernel carrying the "ipv6: frags: rewrite ip6_expire_frag_queue()" change.

As an intermediate solution, we disabled IPv6 on this machine to avoid further panics.
Please let me know what information is missing. The ubuntu-bug linux report was sent, and I hope it is attached to this report.

Knowing the last good kernel would help to minimize the delta of changes. Note that if you are able to interact with the GRUB loader at boot, you can go back to at least the kernel that was in use before the reboot.
For the trace it would be good to capture the full message. If the server has IPMI capabilities, you could add a console= option to the kernel command line to make the messages observable through SOL.

I spent the better part of two hours installing Java on an old Windows machine to satisfy the IPMI requirements.
I was able to start SOL, but it is just a black window displaying nothing.
I give up on this - sorry, I do not know how to produce a better screenshot
(yes, I googled). IPMI Viewer produced the same bad results.

The latter means that you still have 4.4.0-143 around and could select it if you had any way of interfacing with the booting server. That way you could go back and confirm the regression happened between -143 and -145.

About IPMI, I don't know how one would do that with Windows, but on a Linux box there is a package called ipmitool (that is the name in Ubuntu; it might vary on other distros) which can be used to run the SOL session from a terminal window without any Java. Of course, in any case, to see anything you have to figure out which ttyS# on the server is mapped to the SOL session (usually ttyS0 or ttyS1). Then something like "console=ttyS#,115200n8" has to be added to the default arguments in /etc/default/grub to tell the kernel to redirect the console to that serial port (see the example below).
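For example, assuming the SOL session is mapped to ttyS1 (check your BMC documentation), the relevant line in /etc/default/grub would look roughly like this, keeping tty0 so the local console still works; afterwards run "sudo update-grub" and reboot:

    # /etc/default/grub (excerpt); ttyS1 is an assumption
    GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS1,115200n8"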

I have had this crash, with the ip6_expire_frag_queue stack trace, more than 18 times since 2019-04-16, on more than 10 different servers in 8 different countries. There have been more crashes, but for these ones the panic dump made it out to a remote syslog server where it is easy to grep. Crash count by kernel version; these are on both trusty and xenial:

Downgrading to 4.4.0-143 now, as that build does not seem to have the "ipv6: frags: rewrite ip6_expire_frag_queue()" change; it first appears in the 4.4.0-144-generic image. I think by tomorrow it will be clear whether that kernel is stable, as we are now having multiple crashes per day (last crash 50 minutes ago).

Interestingly, the crashes only happen on bare hardware. We have a much
larger number of VMs doing the same thing, most of them now running
4.4.0-146, and none of them have crashed like this. The hardware instances
do have a larger number of CPU cores; the VMs only have 2 or 4.

I am also seeing crashes on 4.15.0-48-generic hwe kernel running on xenial,
but no stack trace to show yet.

The issue is a check which causes an oops/crash when a send buffer is referenced more than once at the time pskb_expand_head() is called. As mentioned in comment #18, this seems to have been introduced by a series of patches modifying the way fragments are handled.
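For context, pskb_expand_head() refuses to reallocate the head of a shared buffer; the check near the top of the function looks roughly like this (paraphrased from net/core/skbuff.c):

    /* Paraphrased from net/core/skbuff.c: reallocating the head of an
     * skb that somebody else still holds a reference to would corrupt
     * their view of it, so a shared skb triggers BUG() - the oops seen
     * in the attached traces.
     */
    int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
                         gfp_t gfp_mask)
    {
            ...
            if (skb_shared(skb))    /* skb->users > 1 */
                    BUG();
            ...
    }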

The networking code is quite complex, so I am not sure whether a detail I found is actually causing these issues (one backport claims to drop some extraneous initialization in IPv6 which was not done in the IPv4 counterpart), but I created a test kernel to see what happens. If someone could give http://people.canonical.com/~smb/lp1824687/ a try and let me know, I would highly appreciate it.

As a status update: thanks for testing. It is a pity it did not help. So far I have looked through all related changes in that set but could not find anything that immediately stuck out. Thinking more about the crash stack trace: it is a netfilter conntrack timer expiring which causes a call into ip6_expire_frag_queue(), and that function was rewritten in "ipv6: frags: rewrite ip6_expire_frag_queue()" to use the first entry in the frag list for sending an ICMP message. Before doing that, it calls skb_get(), which increments the user refcount. That might actually be the issue, but it is still done that way in every kernel since v4.18 upstream. It could be that nobody else exercises this path under heavy IPv6 traffic yet. Since I am not that familiar with the network stack, I would like to reach out to upstream with that question.
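The suspect pattern in the rewritten ip6_expire_frag_queue() looks roughly like this (a paraphrased sketch of the 05c0b86b96 code path, not the exact source):

    /* Sketch of the path introduced by the rewrite (paraphrased). The
     * head skb is still linked into the fragment queue when the extra
     * reference is taken, so skb->users becomes > 1 ("shared").
     */
    head = fq->q.fragments;     /* first fragment, still queued */
    head->dev = dev;
    skb_get(head);              /* user refcount now 2 */
    spin_unlock(&fq->q.lock);

    /* If the ICMPv6 error path ends up in pskb_expand_head() on this
     * skb while it is shared, the kernel hits the BUG() shown above.
     */
    icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
    kfree_skb(head);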

From the upstream discussion thread it looks like I was on the right track (https://marc.info/?l=linux-netdev&m=155688404826002&w=2). For confirmation I am building another set of test kernel packages, and once the fix is confirmed I will proceed to SRU it into the other series. This seems to have remained unnoticed upstream so far, so anything after 4.18, as well as all older kernels which have backported those changes, would be affected.

Unfortunately the 4.4.0-144-generic #170+lp1824687v2 test kernel still crashes. I have 4 hardware instances running it now, and there were 2 panics (Australia, Sweden) within 24 hours. I installed linux-crashdump on them after the first crash to capture the panic logs reliably. A log from the second panic is attached.

I spent a little more time on this yesterday. While it is somewhat clear that this results from fixing the original issue (now it crashes a little later, when releasing memory), my past experience with network issues like this is that memory dumps are of rather limited use: the causes lie in the past, and by the time the crash happens all the interesting state is already lost.
On the other hand, I would also rather avoid making experiments in production environments (if that can be avoided). But I am not sure how much chance there is of that.

So far I have not been successful in triggering the code path which leads to the crashes on my test system. I have, however, been able to extend the patch I had in v2 in a way that makes me a bit more hopeful that it might get us somewhere. It is potentially not the most optimized handling, but that can wait. Part of the problem is that all the changes come from a patch set where I am not sure upstream really tested the intermediate steps very well. Anyhow, you will find the new debs again at http://people.canonical.com/~smb/lp1824687/
I know it sucks, but I would appreciate it if we could put that into production stress again.

I reverted the changes for Cosmic because that series needs at least a different approach. In that version the rbtree usage is not yet present, the IPv4 expire function does exactly the same thing (increment the refcount of the skb), and we have no hard evidence that this actually causes crashes in the 4.18 kernel. So for now I will keep only the Xenial change.

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.