I posted previously on this forum that when an MD is asleep and the core is restarted, or a certain period of time passes, the MD is orphaned from the core and will no longer transmit or receive DCE messages, i.e. it is not possible to use the MD to control itself, or to control it from another orbiter, etc.

Has this been addressed?

Wouldn't this just be that the TCP socket is lost? The DCE devices open TCP connections to the router and keep them open. If you suspend an MD, the TCP connection would in theory remain open; the core would just get no response on that session. If you reboot the core, the socket information and connection state on the core end are lost (i.e. the source/destination IP and port numbers, and the fact that the connection exists and is owned by the DCERouter process). When the MD resumes, it will attempt to continue communicating on the same socket, which will not respond because it no longer exists on the other end. But the MD isn't going to know this, so it will effectively be locked permanently until something resets the process responsible for that session or reboots the box.

I imagine that not rebooting the core but leaving it long enough would trigger either an application (DCERouter) or OS cleanup of the socket, as the MD isn't responding to ACKs/keep-alives...
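To illustrate the OS-level cleanup mentioned here, this is a minimal Python sketch of turning on TCP keep-alive probes for a socket, so that a peer that has gone away is eventually detected instead of the connection hanging forever. The function name `enable_keepalive` and the timing values are my own assumptions for illustration; I don't know whether DCERouter or the DCE devices set these options.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keep-alive probes for an existing socket.

    The timing values are illustrative assumptions, not LinuxMCE defaults.
    """
    # Portable part: ask the OS to send keep-alive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific tuning knobs; set only those this platform exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection drops
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With values like these, a dead peer would be noticed after roughly `idle + interval * probes` seconds, at which point a blocked read or write on the socket fails with an error rather than waiting indefinitely.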

Perhaps the simplest way to bulletproof this is to hook a little piece of code into the MD suspend-to-RAM function that 1) sets the MD state to OFF and 2) cleanly closes the TCP connections. Then, on resume, some code tells the MD to re-establish the DCERouter connections.

EDIT: Actually this looks like it might be a case of having to stop all child devices of the MD on suspend (which will close all the DCERouter connections), then start them all again on resume to reinitiate the connections....
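As a rough sketch of the close-on-suspend, reconnect-on-resume idea, here is what such a pair of hooks could look like for a single connection in Python. `ReconnectingLink`, `on_suspend` and `on_resume` are hypothetical names of mine, not part of the DCE code, and how they would be wired into the MD's actual suspend and resume scripts is left open.

```python
import socket

class ReconnectingLink:
    """Hypothetical stand-in for one device's TCP connection to the router."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.sock = None

    def connect(self):
        # Open a fresh TCP connection to the router.
        self.sock = socket.create_connection((self.host, self.port), timeout=5)

    def on_suspend(self):
        # Close cleanly so the other end sees a normal shutdown
        # rather than a connection that silently stops answering.
        if self.sock is not None:
            try:
                self.sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # the connection was already dead
            self.sock.close()
            self.sock = None

    def on_resume(self):
        # Never reuse the old socket after resume: drop it and reconnect.
        self.on_suspend()
        self.connect()

    @property
    def connected(self):
        return self.sock is not None
```

The point of the design is that `on_resume` always starts from a new socket, so it does not matter whether the core was rebooted or the router reloaded while the MD was asleep.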

I've used the method of restarting the devices on the MD at resume myself (this can be done with a script), but I see this as a hack, not a proper solution.

I agree that we should tell the core about our status, as it might need to know at some point. Perhaps it even should be some new status (MD_S2RAM/MD_S2DISK or whatever)?

Then there is of course the issue of maintaining or re-initiating the connections. I don't know the details of TCP programming, but it sure seems to just lock up the processes/devices in question after resume. Maybe a solution is to send a signal to every process to have it restart every connection to the core? At the other end, the core could restart the connections to the MD when it receives an MD_ON or MD_RESUMED status.
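The signal idea could look something like the following Python sketch (Unix only). The choice of `SIGUSR1` and all the names here (`ReconnectFlag`, `install_reconnect_handler`, `request_reconnect`) are assumptions for illustration; the real devices would need equivalent handling in their own code.

```python
import os
import signal

class ReconnectFlag:
    """Records that a reconnect was requested; the main loop acts on it."""

    def __init__(self):
        self.pending = False

    def handler(self, signum, frame):
        # Keep the handler tiny: only set a flag, do no I/O here.
        self.pending = True

    def consume(self):
        # Return True once per request, then reset the flag.
        was_pending, self.pending = self.pending, False
        return was_pending

flag = ReconnectFlag()

def install_reconnect_handler(signum=signal.SIGUSR1):
    # Called once at device start-up (must run in the main thread).
    signal.signal(signum, flag.handler)

def request_reconnect(pid, signum=signal.SIGUSR1):
    # What a resume script could call for each device process ID.
    os.kill(pid, signum)
```

Each device's main loop would then check `flag.consume()` periodically and, when it returns True, close and reopen its connection to the core.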

Just thinking out loud here...Anyway, I suppose this kind of talk belongs in the dev forum :-)

I really found this "solution" too hackish to add the script, but then again, as it is now, the whole suspend issue is quite hackish.

I just added the script to the wiki. It will basically kill the devices, causing the spawner to restart them. But note that it has a restart count of 50, so it will only work 50 times. You also need to adjust it to match the devices running on your MD.

When I think about it, it should probably be possible to find the list of device ids running on this MD and do a restart on devices based on that.

OK, I just had to google this. I found an interesting thread about TCP connections and suspend/resume: https://lists.linux-foundation.org/pipermail/linux-pm/2008-June/017742.html

According to this, connections can survive suspend as long as there is no NAT and both sides of the connection stay silent. So, if we could tell the DCERouter to "be silent" on all connections to the suspended MD, we should have solved one part of the problem. This is actually consistent with my findings: it's only the connections to the MD devices that are affected, and not the ones from the MD to the core (as the MD is very silent when suspended). The other part of the problem is whether the router recreates connections or otherwise does anything to them while the MD is suspended. Does it create new connections when reloaded?

Of course if the connections are silent and there are no keep-alives then the connection will never know that a suspend occurred. However:

1) We are talking about engineering a complex (and not necessarily guaranteed, especially at the OS level of the TCP connection) way of sustaining TCP connections when it isn't necessary. The whole point of (LMCE) devices is that they can stop and start, disconnect and reconnect as required. Modularity. Why build this complex extra functionality into the DCERouter code when it is more elegant and tidy just to close the device connections at suspend and reconnect on resume? Both require code changes, but the latter is much more bulletproof and tidy.

2) The retain-connections option cannot survive a core reboot, or even a DCERouter reload, so this case would have to be handled anyway... and the simplest way to handle it would be to implement the close-and-reconnect option above... which would solve the issue in the first place...

I think in the meantime we would be best directing our attention to cleanly reopening connections from the MD to the core. Is there a better way than killing all the child processes? I remember trying killing all SCREEN processes and then running the launch manager again from a script but for some reason this was not 100% bulletproof. I will revisit this soon when I have time. Any better ideas?