This is a long-standing problem. It's a minor one, but I've finally set out to resolve it once and for all.

The issue:

With Gentoo (either on years-old systems or even the latest builds), when you log into a remote system and issue a shutdown/reboot/halt, the system goes down, but unless you explicitly 'logout' of the session first, any active ssh connections do not disconnect, which effectively leaves that terminal 'locked out'.

However, Ubuntu (and likely many other distros) does manage to close the connections within seconds of issuing the command. Any and all active ssh connections are disconnected with "Connection closed by remote host", or something similar.

I've googled around a bit. It seems there's no definitive solution. Several 'hacks' have been suggested to try to accomplish this, but most are completely beside the point and only offer a 'bandaid' fix.

So I've begun to dig into and compare the shutdown procedures of both distros.

On Ubuntu, I've isolated this behavior to the /etc/init.d/sendsigs init script, which is executed during the reboot and halt runlevels (rc0.d + rc6.d). Removing this script from execution also removes the ssh 'auto-logout' functionality: the sessions no longer disconnect automatically.

The functionality of this Ubuntu init script appears to be very similar to Gentoo's /etc/init.d/killprocs script, which is also executed during system shutdown. The main point of the script (on both systems) seems to be the execution of 'killall5 -15', which sends signal 15 (SIGTERM) to all processes.

So I tried just this command on both systems. Sure enough, 'killall5 -15' on ubuntu almost immediately closes any active ssh connections cleanly.
On gentoo, no-go, the sessions will hang.

So, I'm uncertain where to go from here. It doesn't appear to be a matter of sshd_config - I've tried the same ubuntu config on my gentoo systems and it doesn't result in an auto-logout.

Could it be a kernel config issue? sshd init scripts? Elsewhere in the system? I have no idea.

It would be great to solve this once and for all and get it upstream so no one ever has to complain about this again!

Thanks

Last edited by DNAspark99 on Fri Feb 08, 2013 10:25 pm; edited 1 time in total

What happens on the clients if you simply unplug the network at the server?

Well, physically yanking the cable will hang the client. It wouldn't get anything to tell it to terminate the connection.
Ubuntu seems to have a way to tell any active ssh connections that it's time to hang up and go home.

I'm still digging into it. strace on child and parent sshd processes doesn't seem to indicate anything different, though I suspect strace is being terminated first anyways.

I've been playing around with this, and what I've come up with to replicate this behavior (by way of a dirty hack!) is to modify the /etc/init.d/net.lo script to include an ssh-specific killall command as one of the last things to run during stop() (near the very bottom of it):

Code:

stop()
{
...
...
...

# send SIGTERM to every sshd process while the interface is still up
killall -s 15 sshd
return 0
}

A dirty hack, but at least it works in replicating the expected behavior....
Putting it into the /etc/init.d/killprocs script had no effect. Indeed, it looks like the network shutdown is run before killprocs....

You can configure the client to disconnect from a dead server using ServerAliveInterval and ServerAliveCountMax .. man ssh_config. Similarly, you can configure the server to disconnect from dead clients using ClientAliveInterval and ClientAliveCountMax .. man sshd_config.

You can also use TCPKeepAlive in either the server or client config, but that will kill your session much more quickly (eg. if you're running a ssh client on your laptop, via your wifi, to a server on the internet... your session won't survive restarting your laptop's wlan interface).
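As a rough sketch of the client-side options mentioned above: the values below are only examples, `alive_timeout` is a hypothetical helper, and `user@server.example` is a placeholder. The client gives up after about ServerAliveInterval times ServerAliveCountMax seconds without a reply.

```shell
# Hypothetical helper: approximate seconds before the client gives up,
# i.e. ServerAliveInterval multiplied by ServerAliveCountMax.
alive_timeout() {
    echo $(( $1 * $2 ))
}

interval=15   # seconds between keepalive probes (example value)
count=3       # unanswered probes tolerated (example value)

# Print the command rather than run it; user@server.example is a placeholder.
echo "ssh -o ServerAliveInterval=$interval -o ServerAliveCountMax=$count user@server.example"
echo "client disconnects after roughly $(alive_timeout "$interval" "$count") seconds of silence"
```

The same two options can go into ~/.ssh/config instead of the command line.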

Yes, I'm aware of ssh keepalive, and numerous other 'workarounds' that could _help_ mitigate the situation, but that's all completely beside the point. It's not the root of the problem.

The issue is, ubuntu seems to be doing something 'correctly' here, gentoo is not. Gentoo is missing something, or something is out of order.

Most likely, gentoo's killprocs script *should* be killing and closing these connections. It certainly looks to be the intention. But it's not operating as expected.

Sure, it's not overly critical by any means - but it's a minor annoyance that has bugged me - and undoubtedly numerous other users - for years now.
And, if I can track it down, great - we'll hopefully never have to deal with it again! :p

OK, I think I've found an acceptable fix. It comes down to the way Gentoo handles runlevels differently from most other distros.

For some reason the network interface is dropped long before the /etc/init.d/killprocs script is run. Effectively, "the cord has been yanked", hence any active ssh connections hang until 'keepalive' expires. The server-process undoubtedly (eventually) receives the signal, but it can't send the RST to the client without the interface up.

I've been playing with the various Gentoo-specific init script dependency structures of the killprocs vs net.lo scripts. I dunno, I couldn't get it to work, likely because I don't fully understand the way it determines its ordering (vs ubuntu/redhat's setup, where the symlinks in the respective runlevel dirs can be clearly prioritized for start/stop).

So, ultimately, in order to not muck up the order of the existing shutdown, it seems this should fall into the sshd init script itself. OK. That's what I've done, and it works as expected. YAY! (finally!)

A default interval for declaring a connection dead on either
the client or server, followed by an exit more graceful
than "kill -9 $PID" from a root console would seem to
be robust design in an environment where any number
of forces beyond the user's control can disconnect
the network between client and server. So it is perhaps
not surprising if Gentoo maintainers do not consider this
a critical error. Users' ssh clients should recognize and
react sanely to the network going down before the server
explicitly terminates the session (or they are broken).

That is no excuse to be sloppy in our shutdown scripts,
though. Telling sshd instances with a signal to terminate
all connections from clients and exit before the system
shuts down should be doable, too, and it sounds like it
is only an ordering issue in what happens at shutdown.

Maybe make network interface shutdown depend on
a while loop around checking for still-running network
servers? I remember the "Bernstein xmalloc" (more-or-less)
from qmail or something:

malloc() some space
if malloc() returns null, wait 60 seconds
try again
if malloc() still returns null, bail out with an error
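Applied to the shutdown case, that try / wait / try again / bail out pattern might look like the sketch below. `wait_for_exit` is a hypothetical helper, not part of OpenRC, and it assumes `pgrep` (from procps) is available.

```shell
# wait_for_exit NAME TRIES DELAY
# Poll until no process named NAME remains. Returns 0 once it is gone,
# 1 if it is still running after TRIES checks (DELAY seconds apart).
wait_for_exit() {
    name=$1
    tries=$2
    delay=$3
    while :; do
        # pgrep -x matches the exact process name; no match means it has exited
        pgrep -x "$name" >/dev/null 2>&1 || return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep "$delay"
    done
}
```

A network stop script could call something like `wait_for_exit sshd 2 60` before taking the interface down, and carry on regardless if it returns non-zero.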

Yea, as you see, I seem to have figured it out moments before you posted.
I do hope this change (or an improved variant of it) makes its way upstream.
It just seems fitting that, since you get a broadcast message anyways telling you that the 'System is going down NOW!', any active connections be closed cleanly.

wcg wrote:

A default interval for declaring a connection dead on either
the client or server, followed by an exit more graceful
than "kill -9 $PID" from a root console would seem to
be robust design in an environment where any number
of forces beyond the user's control can disconnect
the network between client and server. So it is perhaps
not surprising if Gentoo maintainers do not consider this
a critical error. Users' ssh clients should recognize and
react sanely to the network going down before the server
explicitly terminates the session (or they are broken).

That is no excuse to be sloppy in our shutdown scripts,
though. Telling sshd instances with a signal to terminate
all connections from clients and exit before the system
shuts down should be doable, too, and it sounds like it
is only an ordering issue in what happens at shutdown.

so what happens if you are restarting only the sshd service on your remote server and want to maintain connectivity .. if this is in your sshd init.d script you kill your existing connection .. can be very .. not good .. in many cases ..

The script appears to kill connections only when entering runlevel shutdown.

Yea, turns out there's a handy variable, RC_RUNLEVEL, so you can safely stop / start sshd as need be. But when the _system_ itself goes down, it'll kill the connections.
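A minimal sketch of that idea, not the exact change made to the init script: both function names are hypothetical, and it assumes OpenRC sets RC_RUNLEVEL to "shutdown" while the system goes down and that per-connection sshd processes have a title starting with "sshd: ".

```shell
# Hypothetical helper: true only while the whole system is going down.
system_going_down() {
    [ "${RC_RUNLEVEL:-}" = "shutdown" ]
}

# Sketch of an extra step for the sshd init script's stop() function.
drop_sessions_on_shutdown() {
    if system_going_down; then
        # Assumption: session processes show up as "sshd: user@pts/N" in ps.
        # '|| true' keeps a "nothing matched" status from failing stop().
        pkill -TERM -f '^sshd: ' || true
    fi
    return 0
}
```

With this, a plain restart of the sshd service leaves existing sessions alone, since RC_RUNLEVEL is not "shutdown" at that point.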

It looks like there may finally be an 'official fix' in the works anyways, that isn't ssh specific:

THE FORCE IS STRONG WITH YOU YOUNG SKY WALKER, BUT YOU ARE NOT A JEDI YET!!!! the command you posted started throwing errors at shutdown for me. it's not shutting down ssh only if there are connections, it's trying to shut down connections regardless of whether there are any or not.....

and back to my clean shutdowns... ahhh yes, the dark side of the force is the pathway to many abilities some consider to be unnatural.

i'll second your bug tracker comment saying it should be in the sshd init script, not net.lo... i disabled net.lo for networkmanager (though subsequently revived it). sshd should control its ssh connections, not some other random script somewhere else on the system.

Last edited by 666threesixes666 on Sun Mar 03, 2013 9:59 pm; edited 1 time in total

Also, awk is not simply for '{print $1}', awk can do the same regex matching that grep can, so, for example, your SSHCONNECTIONS var above could avoid using grep, grep -v, head, etc, by the use of awk alone:

Code:

ps ax | awk '/[s]shd/{print $5;exit}'

... that is an example, as I pointed out above 'pgrep' should be used for such things.
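To illustrate the point on canned input rather than a live process table, here is a small sketch. The sample lines and the three function names are made up for the example, and the long pipeline is only a guess at the style being replaced.

```shell
# Made-up 'ps ax' style output: PID TTY STAT TIME COMMAND
sample_ps() {
    cat <<'EOF'
  812 ?        Ss     0:00 /usr/sbin/sshd
 1290 ?        Ss     0:00 sshd: alice [priv]
 1302 pts/0    R+     0:00 awk /[s]shd/{print $5;exit}
EOF
}

# The long way: grep, drop the grep line, take the first hit, print field 5.
first_sshd_pipeline() {
    grep sshd | grep -v grep | head -n 1 | awk '{print $5}'
}

# awk alone: the [s] bracket keeps the pattern from matching awk's own
# command line, and 'exit' stops after the first match.
first_sshd_awk() {
    awk '/[s]shd/{print $5; exit}'
}

sample_ps | first_sshd_pipeline
sample_ps | first_sshd_awk
```

Both calls print the command field of the first sshd line. On a live system, `pgrep` avoids parsing ps output altogether.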

666threesixes666 wrote:

ahhh yes, the dark side of the force is the pathway to many abilities some consider to be unnatural.

I for one ...

Anyhow, this issue will be fixed in openrc-0.12.x (see bug #259183), and as William Hubbs pointed out (see comment #20) the issue can be fixed with the following as the first line of the stop() function in /etc/init.d/net.lo

you're right, i totally screwed that up on the '{print $#}' syntax, awk's still strange voodoo to me. idk regex. pgrep and pkill are new to me.... never seen them before... are they in other distributions? what package do they come from? gpasswd was new to me like a month ago. where do you suggest i get on with my regex education?

666threesixes666 ... I run it from the prompt, and not from a script or sub shell, I then assumed the reason the var was empty was the awk, and not the fact that metachars are protected with slashes rather than quoted.

THE FORCE IS STRONG WITH YOU YOUNG SKY WALKER, BUT YOU ARE NOT A JEDI YET!!!! the command you posted started throwing errors at shutdown for me.

Yes, correct - I wasn't overly concerned with the noisy output if there's nothing to kill (just appended the ol' "> /dev/null 2>&1" to it), as my reboots were usually done through ssh anyways, and as had already been pointed out elsewhere, 'the fix is in' - and that's all I was after.

... also spits a usage error for me if there are no child 'sshd:' processes connected to kill off...

So I took a quick glance at the man page for xargs, and obtained a correct shutdown by adding the '-r' option to the xargs portion. (The option is short for --no-run-if-empty: "If the standard input does not contain any nonblanks, do not run the command. Normally, the command is run once even if there is no input.") This covers the scenario where there's nothing for it to kill.
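The effect of '-r' can be seen in isolation with the sketch below. `kill_cmd_for_pids` is a hypothetical name, the PIDs are made up, and `echo` stands in for the real kill so nothing is actually signalled.

```shell
# kill_cmd_for_pids: read PIDs on stdin and print the kill command that
# would run. With -r, xargs runs nothing at all when stdin is empty.
kill_cmd_for_pids() {
    xargs -r echo kill -s TERM
}

printf '1290\n1344\n' | kill_cmd_for_pids   # prints: kill -s TERM 1290 1344
printf '' | kill_cmd_for_pids               # prints nothing
```

Without '-r', GNU xargs would run the command once with no PIDs, which is where the usage error comes from.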

works perfectly, shuts down clean, tested. pgrep -c is a bit dirtier, i'm actually testing for open connections, not counting up and including the sshd process(es). it's piped to the max and ugly, but gits er done. -r sounds like a better fix to me though. i don't like xargs because it's not familiar to me and i'm just piping what i know in a method that makes sense to me.

i think combining the 2 would fix the problem in an alternate way. my long ugly ssh connection check... with his short process kill.

The suggested fix [...] also spits a usage error for me if there are no child 'sshd:' processes connected to kill off...

DNAspark99 ... yes, and again, I didn't say "untested", not sure about the error, but it's basically harmless. Unless you're saying there were clients that weren't logged out? The following suppresses any errors.

This performs as expected in all scenarios I've tested. It's certainly not as clean though, so hopefully you can easily tweak your fix... for those who don't like pipes :p

Well, I don't understand how that works: with "kill -s 15" the '-s' option takes a SIGNAL name (which would be TERM), whereas a number would be given as '-n' or '-15', so I'm not sure why it works, perhaps the signal is just ignored. Anyhow, there are various reasons for not doing it that way, some of which I've provided. If you want an idea of the difference, try comparing the following (which I've fixed up somewhat):

pgrep -c is a bit dirtier, im actually testing for open connections, not counting up and including the sshd process(s)

That isn't what it's doing. With sshd running there will be *one* process (at least), so if there is more than one, kill those matching the regex. Nothing "dirty" about it: it doesn't count for its own sake, it asks for a count to use in an expression, where more than one sshd process means clients are connected.

Just tried it, and it's still causing an incorrect shutdown if there are no ssh clients connected?
It's in the 'green' if there _are_ sshd clients to kill, but if there are none, it exits with the red "ERROR: sshd failed to stop" msg at shutdown. That's not right...

So I'll continue to stick with my 'ugly' fix for now, as I can spare the .010th of a second

Just tried it, and it's still causing an incorrect shutdown if there are no ssh clients connected? It's in the 'green' if there _are_ sshd clients to kill, but if there are none, it exits with the red "ERROR: sshd failed to stop" msg at shutdown. That's not right...

DNAspark99 ... again, I didn't test it, except to see that the basics of it work, so please stop acting as though I've provided you with faulty information.

The issue is probably that the '&&' will return a non-zero exit status, and the script probably expects success.
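A small sketch of that suspicion, with hypothetical function names. The argument stands in for the sshd process count (for example from `pgrep -c sshd`), and `echo` stands in for the actual kill.

```shell
# Mirrors the suspected problem: when the test is false, the '&&' list
# returns the test's non-zero status, and so does the function.
stop_with_and() {
    [ "$1" -gt 1 ] && echo "terminating sessions"
}

# Same logic, but the function always reports success to its caller.
stop_with_if() {
    if [ "$1" -gt 1 ]; then
        echo "terminating sessions"
    fi
    return 0
}

stop_with_and 1 || echo "stop_with_and returned non-zero with nothing to kill"
stop_with_if 1 && echo "stop_with_if returned zero with nothing to kill"
```

If the init script treats a non-zero status from stop() as failure, the first form would explain the red "failed to stop" message when no clients are connected.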