Non-responsive backend locks up all networked systems

Description

I have three systems running MythTV 0.25: one master backend (MB) and two secondary backends (S1 and S2).

For some reason (there is nothing in the log), the backend process on S1 decided to take a vacation (sort of). The frontend task could not do anything on that system that required the services of the backend process (e.g. play a recorded program). However, the backend process was able to start and complete a recording at its appointed time.

Meanwhile, on the MB and S2 machines, the recorded program showed up in the list and was briefly available for concurrent (while recording) viewing until it abruptly stopped playing. At this point, the frontend process on the MB system (or the S2 system, take your choice) appeared to be hung. Several attempts to kill and restart it always left it in the same apparently hung state whenever the program was viewed. Eventually, patience prevailed and an attempt to view the program did not hang but returned to the recorded programs list (after about 5 minutes).

Attempts to view any other programs on the MB or S2 systems met with similar hangs, although the recording files were local to those machines. This behavior continued until the recording ended on the S1 system, at which point the MB and S2 started working normally (i.e. recordings could be played).

However, any attempts to play the recording from either the MB or S2 system resulted in the "File doesn't exist" message. Incidentally, this message is often a lie. The file is right where it should be, on the system that is not responding to the helm. Changing the message to reflect the true state of the problem would be swell. How about, "The file is located on a server which isn't responding to the helm," instead of misleading the user into thinking the file didn't get created? They might be foolishly tempted to delete the missing recording from the database. Anyway, other recordings could be played but not the one in question.

Meanwhile, the S1 system continued to be essentially broken. Several restarts of the frontend process had no effect.

Not until the backend process on the S1 system was restarted did anything start to work reasonably well. Incidentally, the new upstart method of controlling the backend does not work: it gives some lame "stop/waiting" message while the actual task soldiers on (in its semi-hung state). A "ps x -A" followed by a "kill -9" of the appropriate task number did the job. Then, "start mythtv-backend" brought up a working backend.

Note that the recording on the S1 system worked fine. Once the backend process on that system was restarted, the recording file appeared on all systems (MB, S1, S2) and could be played. The frontend process on S1 was restarted and it worked too. All was right with the world.

So, it would appear that the portion of the backend process on S1, that responds to file transfer requests from other processes, went south. Perhaps there was more to it than that, since the frontend process on S1 apparently could not talk to the backend process either. The recording portion worked fine, however, since the scheduled recording started (perhaps this was before the problem began) and stopped at the correct time and the recording file itself is 100% fine.

This lack of response on one system (S1) had the effect of hanging all of the other systems (MB, S2). In the backend log file on S1, there is absolutely nothing to indicate any problem whatsoever. All I see is the recording starting and stopping normally.

Meanwhile, in the backend log file on MB (or S2, take your choice), I see a boatload of "E FreeSpaceUpdater playbacksock.cpp:139 (SendReceiveStringList) PlaybackSock::SendReceiveStringList(): Response too short" messages (and I do mean a boatload), which go on for at least 30 minutes from around the time the event appears to have happened.

I'll be happy to send you whatever logs you need.

Here are some general observations:

1) The network timeouts are far too long. Waiting for two minutes and then retrying a couple of times (for a total of 5 or 6 minutes) is way too long. If the answer from a locally-connected device does not come back in 500ms, something is probably wrong. Ten to twenty seconds is more than enough. Remember, the user is sitting there waiting for something to happen. After two minutes, they are probably thinking about buying a gun. From my years of experience in the online systems business, the goal should be two-second response time (not always achievable but still a good goal).
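To illustrate the point, here is a minimal Python sketch of a bounded-timeout query with a couple of retries. This is not MythTV code and not its actual protocol; the host, port, and one-line request/response framing are placeholder assumptions. The idea is just that a wedged peer gets reported after a few seconds instead of blocking the caller for minutes.

```python
import socket

def query_backend(host, port, request, timeout=2.0, retries=2):
    """Send one request line and wait at most `timeout` seconds per attempt.

    Hypothetical framing: one request line out, one response chunk back.
    A dead or wedged peer surfaces as an error after a bounded total wait
    (here at most (retries + 1) * timeout seconds), not after minutes.
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)  # bound the read as well as the connect
                s.sendall(request.encode() + b"\n")
                return s.recv(4096).decode()
        except OSError as err:
            last_err = err             # peer not answering: retry on a fresh socket
    raise last_err
```

With `timeout=2.0` and `retries=2`, the worst case is about six seconds before the user sees an error, in the spirit of the response-time goal above.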

2) Perhaps a separate task that just looks up the recorded programs list and streams files would be good. (For example, I have a server that serves videos for MythVideo, using Samba. One can select a video from the menu, hit Enter, and it starts playing in 2 seconds. Once it begins, nothing ever interrupts it.) Asking the backend to do everything may not be such a good idea.
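The suggestion above can be sketched in a few lines: a standalone process whose only job is serving recording files, so playback does not depend on the monolithic backend staying healthy. This is an illustrative assumption, not MythTV's architecture; the directory path and port below are placeholders.

```python
import functools
import http.server

def make_recording_server(directory, port=0):
    """Build a threaded HTTP server that does nothing but stream files
    from `directory`.  Run as its own process, separate from the backend,
    so a wedged backend cannot take playback down with it.
    Port 0 asks the OS to pick a free port."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=directory)
    return http.server.ThreadingHTTPServer(("", port), handler)

if __name__ == "__main__":
    # Placeholder path; a real version would use the storage group dirs.
    srv = make_recording_server("/var/lib/mythtv/recordings", port=8554)
    srv.serve_forever()
```

Because the server does nothing else, there is nothing to wedge it: no scheduler, no tuner handling, just file reads.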

3) Error recovery from network problems seems poor. Often, when something is wrong (e.g. one can't view a recording), the fix is simply to restart the frontend task. This would imply that error recovery could be had simply by closing the socket and reopening it.
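If restarting the frontend fixes things, the same effect could presumably be had in code by discarding the wedged socket and dialing again. A minimal sketch of the idea, with placeholder names (this is not an actual MythTV class or its protocol):

```python
import socket

class ReconnectingConnection:
    """Wraps a socket; on any send/receive error, drops it and
    reconnects once before giving up -- the programmatic equivalent
    of 'restart the frontend'.  Host/port are placeholders."""

    def __init__(self, host, port, timeout=5.0):
        self.host, self.port, self.timeout = host, port, timeout
        self.sock = None

    def _connect(self):
        self.close()
        self.sock = socket.create_connection((self.host, self.port),
                                             timeout=self.timeout)
        self.sock.settimeout(self.timeout)

    def close(self):
        if self.sock is not None:
            try:
                self.sock.close()
            finally:
                self.sock = None

    def request(self, line):
        for attempt in (1, 2):            # existing socket, then one fresh one
            try:
                if self.sock is None:
                    self._connect()
                self.sock.sendall(line.encode() + b"\n")
                return self.sock.recv(4096).decode()
            except OSError:
                self.close()              # throw away the bad socket...
                if attempt == 2:
                    raise                 # ...and only fail after a retry
```

The point is that a single stale socket should cost one automatic reconnect, not a manual restart of the whole frontend.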

4) It would be great if recording tasks could be spawned separately from the backend task. Then, one could restart the backend without losing all of one's recordings. Believe it or not, this situation arises quite often (i.e. when one would like to recycle the backend process but must wait for valuable recordings to end).

5) A kill-everything-and-start-it-all-back-up-fresh (except for recordings) command that actually works would be swell. It would be really swell if it could be bound to a key on the keyboard/remote that wasn't routed through the hung frontend process. It would be even sweller if pressing this key would package up all of the log entries (in the vicinity of the problem), plus any other pertinent information, and send it from a background task (heavy emphasis on the word background) to MythTV Command Central for debugging purposes. There are users who do not even know how a computer works who, nonetheless, would then be capable of restarting the broken system and simultaneously sending in a bug report. The Evil Empire has, for example, had great success with such a system in improving the reliability of their software.

Fundamental problems that are probably network-related have been around since MythTV 0.21. I realize that debugging problems that are spread across multiple, networked devices is very difficult, if not impossible. So, whatever assistance I can render in the form of capturing packets, creating log files, running debugging code, etc., I'd be happy to help out.
