Bug Description

[Impact]

The bug affects multiple users and introduces a user-visible delay (~25 seconds) on SSH connections after a large number of sessions have been processed. This has a serious impact on big systems and servers running our software.

The currently proposed fix is actually a safe workaround for the bug, as proposed by dbus upstream. The workaround makes uid 0 immune to the pending_fd_timeout limit that kicks in and causes the original issue.

Then check the log file for any SSH sessions that take 25+ seconds to complete.

Multiple instances of the same script can be used at the same time.
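
A minimal sketch of such a check, in Python. It assumes a hypothetical log format of one line per login, `<timestamp> <duration in seconds>`, written by whatever login loop is in use. Neither the format nor the function names come from the original test script.

```python
import subprocess
import time

SLOW_THRESHOLD_S = 25.0  # the delay reported in this bug


def time_command(cmd):
    """Run cmd (a list of arguments) and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def slow_sessions(log_lines, threshold=SLOW_THRESHOLD_S):
    """Return (timestamp, duration) pairs for sessions at or above threshold.

    Assumed (hypothetical) line format: "<timestamp> <seconds>".
    """
    slow = []
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            duration = float(parts[1])
        except ValueError:
            continue  # duration field is not a number
        if duration >= threshold:
            slow.append((parts[0], duration))
    return slow
```

A loop could record `time_command(["ssh", "user@host", "true"])` for each login (the host and user are placeholders) and pass the resulting log lines to `slow_sessions`.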

[Regression Potential]

The fix has a rather low regression potential as the workaround is a very small change only affecting one particular case - handling of uid 0. The fix has been tested by multiple users and has been around in zesty for a while, with multiple people involved in reviewing the change. It's also a change that has been proposed by upstream.

[Original Description]

I noticed on a system that accepts large numbers of SSH connections that after a while, SSH sessions were taking ~25 seconds to complete.

Looking in /var/log/auth.log, systemd-logind starts failing with the following:

I backported the 230 fix for systemd to the xenial version of the package and started a test-build in my PPA. The patch didn't apply cleanly so I had to do some manual interventions - also, the changeset is pretty big itself. Since I am not a systemd maintainer I would prefer Martin to take a look at the patch before proceeding, so for now I am attaching the debdiff here for further review (along with links to the PPA, in case the package builds fine as I basically just did the dput).

But I also heard that the dbus fix alone might be sufficient. If anyone confirms that, I will prepare a new dbus distro-release for the yakkety and xenial SRU.

After updating dbus to the version in this PPA, I ran my "ssh to a container" test (which I used as a test case to reproduce the bug when filing this), and also tested on another system that was experiencing this issue with a real-world use case.

This time, I was able to SSH into the system several thousand times, and everything worked fine.

Next I turned it up to eleven by running eight continuous-SSH scripts in a loop. In a minute or two, it fell over and went back to the 25-second delay behavior. So while the behavior is *much* improved with the dbus patch, there are still lingering issues, and I think we should consider patching systemd as well (in addition to triaging further to determine if there is a larger design flaw that can be fixed separately).

I think it's worth patching dbus alone as a first step. I will test Łukasz's systemd PPA to see if that further improves things.

Thanks again to everyone in the community who helped pull together a fix!

Actually, please disregard the portion of my previous comment where I suspected we should consider patching systemd as well. I no longer think that is necessary. (My test case was flawed.) After correcting the issue (ensuring I was running with the properly-updated dbus fix), I was able to run eight parallel continuous-SSH scripts against the LXC with the fixed dbus (without the systemd patch).

That's correct: only the dbus patch is absolutely necessary, but since the patch hasn't been merged into dbus yet, I am still not 100% sure it is considered ready for prime time. It seems to work for us.

As Łukasz observed, the systemd patch is a lot more extensive. Even though it was merged to master, we were only going to attempt to backport the fix if it were absolutely necessary. It doesn't look like it was in this case, so we did not attempt to do it.

Ok, with everyone confirming that the systemd patch is not required, I am closing the systemd part of the bug as 'Invalid' - let's only concentrate on the dbus part here. That being said, I would not like to release a new patch for dbus downstream if the patch hasn't been fully reviewed and approved upstream.

In this case I would propose to wait a bit and see if a finalized patch will be available.

Some hints for working around this bug on Ubuntu 16.04 machines that can't be rebooted:

1) You can keep your SSH logins a secret from systemd-logind by adding `UsePAM no` to /etc/ssh/sshd_config; this will avoid the ~25 second delay.

2) `UsePAM no` requires unlocked accounts (passwd -u) with a password set, even if you are only using publickey authentication.

3) You can use `AuthenticationMethods publickey` to prevent login with the passwords set for those accounts.

4) `su` also uses PAM and therefore informs systemd-logind and hangs for ~25 seconds, but in some cases `ssh user@localhost` can work as a replacement for `su`. There doesn't seem to be a way to configure `su` to not use PAM.

5) If you were relying on PAM to set a ulimit -n (nofile) using /etc/security/limits.conf, you can add something like `LimitNOFILE=131072` to the [Service] section in /etc/systemd/system/sshd.service, then `systemctl daemon-reload && systemctl restart sshd`
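
To double-check that hints 1) and 3) are actually in effect, something like the following Python sketch could be used. It is a simplified reader for sshd_config text, not sshd's own parser: the function names are made up for illustration, `Match` blocks are ignored, and `Include` directives are not followed. It relies on two documented sshd_config rules: keywords are case-insensitive, and the first value obtained for a keyword wins.

```python
import re

# Keyword, then either whitespace or a single '=' as separator, then the argument.
_LINE_RE = re.compile(r"([^\s=]+)(?:\s*=\s*|\s+)(.*)$")


def parse_sshd_config(text):
    """Return {lowercased keyword: first value} for the global section of the text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        match = _LINE_RE.match(line)
        if match is None:
            continue  # keyword without an argument; ignored in this sketch
        keyword = match.group(1).lower()
        if keyword == "match":
            break  # conditional blocks are out of scope here
        settings.setdefault(keyword, match.group(2))  # first value wins
    return settings


def workaround_status(text):
    """Report whether the two sshd_config hints above appear to be applied."""
    cfg = parse_sshd_config(text)
    return {
        # A missing UsePAM line is reported as "not disabled"; the real
        # default depends on how sshd was built and packaged.
        "pam_disabled": cfg.get("usepam", "").lower() == "no",
        "publickey_only": cfg.get("authenticationmethods", "") == "publickey",
    }
```

For example, `workaround_status(open("/etc/ssh/sshd_config").read())` returns a small dict with both flags.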

Sadly, upstream still hasn't agreed on a concrete fix, as the one that was confirmed as working was actually reintroducing a security vulnerability. We tried one of the other proposed fixes but it didn't seem to help. I'll try to push a new dbus version with another of the proposed fixes, but there has been no notable movement on the original upstream bug for a long time [1].

I have prepared another xenial dbus package in my PPA containing the second WIP proposed fix from the upstream bug [1]. If you could try to reproduce the issue using dbus 1.10.6-1ubuntu3.1~test2 from this location, I would be grateful:

As with the previous package, there is no guarantee this will help. It's one of the changes proposed by the upstream developers to improve the situation, and they would very much welcome some feedback.

It works fine with the new test package in your PPA. I ran an SSH login flood for 10 minutes and didn't see systemd-logind fall over. I then purged the PPA and confirmed it was still broken without the test package (it dies after about 2 minutes / 5000 logins).

Excellent, let me forward these comments to the upstream bug. We'll still wait for a few more people to test it out with this change applied and then try to release it to the latest series + back-porting to xenial at least. Of course we can do that instantly once upstream accepts the patch, but I'm sure they'll like some real world feedback as well.

Tested on an OpenStack 16.04 instance with ten parallel copies of Mike Pontillo's while loop hammering it. With dbus 1.10.6-1ubuntu3 I quickly got logins taking about 25 seconds; after installing the dbus package from the sil2100 PPA I couldn't reproduce the issue anymore.

I think I will just go forward and start preparing the release of dbus with this fix in zesty and then backporting it to yakkety and xenial. Upstream didn't seem to officially review the fix or provide any feedback on our test results, but the fix is high-priority enough to consider including it anyway. I will of course get someone to review all this, but I suppose we'll be pushing upstream about it separately.

I have not been able to reproduce this on a Debian (jessie or sid) or Ubuntu (xenial) virtual machine prepared according to the instructions in autopkgtest-virt-qemu(1), even after reducing the pending_fd_timeout limit from 150000 (2.5 minutes) to 150 (150ms) with this configuration in /etc/dbus-1/system-local.conf:

<busconfig>
<limit name="pending_fd_timeout">150</limit>
</busconfig>

This is with 4 parallel loops repeatedly logging in via ssh, currently at around 280 logins each.

Is there something special that is needed in the OS image to exhibit this failure mode?
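
When comparing test setups, it can help to confirm which pending_fd_timeout value a given configuration file really sets. A small Python sketch using the standard library XML parser follows; the function name is made up, it only reads a single document (it does not follow `<include>` elements), and it returns the first matching `<limit>` without modeling how dbus-daemon resolves duplicates.

```python
import xml.etree.ElementTree as ET


def read_limit(config_xml, name="pending_fd_timeout"):
    """Return the integer value of the first <limit name=...> element, or None."""
    root = ET.fromstring(config_xml)
    for limit in root.iter("limit"):
        if limit.get("name") == name and limit.text:
            return int(limit.text.strip())
    return None


snippet = """<busconfig>
<limit name="pending_fd_timeout">150</limit>
</busconfig>"""

timeout_ms = read_limit(snippet)  # the value is in milliseconds
```

With the snippet above, `timeout_ms` is 150, i.e. 150 ms, compared with 150000 ms (2.5 minutes) for the value it replaces.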

@Shay: yes, we will prepare an SRU to the currently supported series soon - but please note that the current 'fix' is, in fact, just a workaround. But it works.

@Simon: I can try finding some people who can reproduce this easily, prepare a patched-up version of dbus with both of the proposed fixes, and ask them to run tests on it. It would be really cool if a real fix could be found this way. I'll take care of this this week and send feedback here and to the upstream bug.

Could anyone that was able to reproduce the original issue install the dbus packages from the above PPA and re-try the tests to see if the issue is reproducible? The following packages have the workaround reverted and the two requested patches applied. I prepared both xenial and zesty packages in the PPA for testing purposes.

Lukasz, could we get an updated release for Ubuntu 16.04 (xenial)? We're finding that the latest kernel updates are overwriting our custom dbus packages, and we would prefer to have an official release soon. Thank you!

Hello Stan! The dbus packages for xenial and yakkety were uploaded a long time ago (on the 25th of November, actually) but are still sitting in the UNAPPROVED queue for each release. I'll poke the SRU team to take a look at those as soon as possible. I didn't expect this to take so long.

In the meantime, could someone please test the new packages with the new 'proper' proposed fix (as per my PPA above)? This would help upstream in getting rid of the issue without using just this workaround.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

@Kai-Heng
Thanks for giving those a spin, that's really good news! Would be good if we got a few more people testing those - this way we could revert the workaround and get the real fix released both upstream and downstream.

@Laurent
Thank you for testing the xenial package!

Are there any yakkety users around that could potentially give the dbus package a try as well? We could then switch the bug to verification-done and get both of them released into the updates pocket.

We are running GitLab on 16.04 LTS and had this issue, which was causing a lot of frustration. I had been waiting about a month for the update to go through and decided to apply this updated version. After we applied it via the proposed repository, it solved the issue for us as well.

This must be affecting many others out there as well.

Anyway, thank you so much for making this available and hopefully they will release it soon.

Oh come on, I monitor machines via SSH remote execution of scripts - Icinga logs in every minute. You might get the picture in regard to this bug. We've been waiting for six months now. How should anyone take Ubuntu seriously anymore? *sadface*

This is taking much longer than expected; it was supposed to migrate a long time ago. All because of that additional fix that got attached by another developer, which caused a regression in another component *sigh*. I re-uploaded a new dbus to xenial-proposed with the other fix reverted (so it only has our logind workaround). I will now make sure that this one migrates ASAP (trying to get it in now).

Thanks for your quick reply. I don't want to sound harsh or be the guy who is always complaining, but this is just so annoying. I can't test packages on production servers or run "testing" packages on them, even if known good, when it's company policy to only run stable/official packages (especially when it's an LTS release).

And be honest: how much more basic from a user/admin perspective than "SSH is working" can it get?

A regression was discovered in another component. This is the reason for the delay. This is very uncommon, but also the entire reason for the SRU process.

If you are able to contribute as a developer, I will gladly mentor you or help find a mentor for you. Contributing solutions is the best way to speed up fixes you care about in the open-source world.

As it sounds as though you are using this in production in a mission-critical application, perhaps you'd consider financially supporting the project by purchasing a support contract, or donating when you download. https://buy.ubuntu.com/

As part of a recent change in the Stable Release Update verification policy, we would like to inform you that for a bug to be considered verified for a given release, a verification-done-$RELEASE tag needs to be added to the bug, where $RELEASE is the name of the series for which the package was tested (e.g. verification-done-xenial). Please note that the global 'verification-done' tag can no longer be used for this purpose.