So I'm still stumped. Sadly, the linked question has no final answer either.

(I'm also wondering if this question is related, but it seems unlikely that S3 would be failing to honor Connection: close headers.)

I have several Debian (Wheezy, x86_64) servers all exhibiting the following behavior:

All of the servers have a set of cron jobs that, among other things, pull data from S3. These usually run fine, but occasionally ps aux reveals that some of the jobs started hours or days ago are still running, and have not finished cleanly.

Inspecting them with strace -p <pid> shows that, in all cases, the process is hung on a read call. For example, the output of a process I checked just now was:

At first I assumed this was just due to a lack of setting a read timeout in the Python scripts, but this does not appear to be the case, for a couple of reasons:

This doesn't happen when the same jobs are run on our OS X boxes (all 10.5, i386), using identical code.

A variant of the script that does set a timeout (of 60 seconds, using socket.setdefaulttimeout -- this is in Python 2.7, but the code base has to be 2.5 compatible) has been hung since yesterday.

Another process that isn't Python occasionally seems to exhibit similar behavior. In this case, a Python script is executing an svn up --non-interactive process (using subprocess.Popen, for what that's worth).

(There appears to be an svn:externals property set to ez_setup svn://svn.eby-sarna.com/svnroot/ez_setup on the directory being updated; based on their website, I think this is redirecting to telecommunity.com)

Additional possibly-relevant points:

The Python environment on the Macs is 2.5. On the Debian boxes, it is 2.7.

I am not well versed in SVN, and I have no idea if the reason it is hanging is fundamentally the same thing or not. Nor am I entirely sure what the implications of svn:externals are; this was set up before my time.

The Python scripts themselves are retrieving large-ish (~10MB in some cases) chunks of data from Amazon S3, and this has a tendency to be slow (I'm seeing download times as long as three minutes, which seems long compared to how long it takes the servers -- even in different data centers -- to communicate with each other). Similarly, some of our SVN repositories are rather large. So some of these operations are long-running anyway; however, they also seem to hang for hours or days in some cases.

On one server, the OOM killer took out MySQL this morning. On closer inspection, memory usage was at 90%, and swap usage was at 100% (as reported by Monit); killing a large backlog of hung Python jobs reduced these stats to 60% and 40% respectively. This gives me the impression that at least some (if not all) of the data is being downloaded/read (and held in memory while the process hangs).

These cron jobs are requesting a list of resources from S3, and updating a list of MySQL tables accordingly. Each job is started with the same list, so will try to request the same resources and update the same tables.

I was able to capture some traffic from one of the hung processes; it's all a bit inscrutable to me, but I wonder if it indicates that the connection is active and working, just very very slow? I've provided it as a gist, to avoid clutter (I should note that this is about two hours worth of capture): https://gist.github.com/petronius/286484766ad8de4fe20b This was a red herring, I think. There's activity on that port, but it's not the same connection as the one to S3 -- it's just other random server activity.

I've tried to re-create this issue on a box in another data center (a VM running the same version of Debian with the same system setup) with no luck (I'd thought that perhaps the issue was related to this one, but the boxes experiencing these issues aren't VMs, and there are no dropped packets according to ifconfig). I guess this indicates a network configuration issue, but I'm not sure where to start with that.

So I guess my questions are:

Can I fix this at a system level, or is this something going wrong with each individual process?

Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?

The tcpdump indicates the trace should be showing multiple reads, provided you are tracing the right process -- there is plenty of data being transmitted. The only case where this might not happen is if the MSG_WAITALL read flag is in use together with a very large buffer.
– Matthew Ife, Jul 9 '14 at 13:56

I left the tcpdump running while I went out to lunch. The first fifteen minutes (before I left the office) it wasn't showing anything. I think I was not so patient with the strace commands, but I'll leave one going and see what it comes up with.
– Michael Schuller, Jul 9 '14 at 14:16

2 Answers

Can I fix this at a system level, or is this something going wrong with each individual process?

It is difficult to say, because it's unknown what is happening at the protocol level. Basically, read(2) will block indefinitely provided that:

The TCP connection remains open.

You expect at least 1 byte of data to arrive.

The sender is not ready to send you data.

Now, it could be that something is wrong with the process, such as the other end expecting a response from you before sending more data, or a previous response from the other end requiring SVN to do something else before requesting more data. Suppose, for example, that an error response came back which should force the client to resend some information.

You cannot fix this gracefully, because it's impossible, from the information you have, to determine what the sender of this data is expecting you to do. However, there are a few possible ways to avoid the problem and report it.

Rather than using wait in a simple blocking mode, run wait and configure an alarm in the parent process. Then, when the process fails to complete within a fixed period of time, you can kill it and report that this happened. A cheap way to do this is to alter the subprocess.Popen call to invoke the timeout command.

Modify the read so that it sets a read-timeout socket option. You can do this by altering the code, or by using an interposer to override the default socket calls and add a timeout to the receiver. Neither is trivial, and this may cause svn to behave in unexpected ways.

Is there something fundamentally different about how OS X and Linux handle read calls that I need to know to avoid infinitely-hanging processes?

I do not know the answer to this; however, if both are behaving in a POSIX-correct manner, they should behave the same way. If you attempt to read from a socket whose peer is not yet prepared to send you data, blocking indefinitely is the anticipated behaviour.

Overall, I think your best line of attack is to expect your svn command to complete within a specific time period. If it does not, kill it and report that you did so.

Due to the nature of the environment, I think it has to be assumed that the Python scripts are modifiable, but hacking SVN is not an option. The SVN command is less pressing an issue because it is being run in the foreground, as part of a process that I am watching in a terminal. The Python jobs are more problematic because they are run by cron and stack up (and consume memory as they do so). The behavior does seem to be different between OS X and Linux, but I hesitate to accuse one or the other of not following the POSIX standard (since that seems less likely than a problem somewhere else).
– Michael Schuller, Jul 9 '14 at 11:18

For pure python scripts you can use the socket.settimeout option to ensure a timeout occurs.
– Matthew Ife, Jul 9 '14 at 11:30

Interestingly enough, I've just discovered that one variant of the script that's been hung since yesterday does set a timeout. I've updated my question accordingly. Also, does SVN really not set its own network read timeouts? Although I appreciate the help, I'd really like to see if I can solve this at a network/system level, rather than for each process that might someday need to run on these boxes.
– Michael Schuller, Jul 9 '14 at 11:56

I think I have figured out the issue(s) described above, and most of the mystery stems from my misunderstanding of what was happening on the servers.

There were the following fundamental problems:

Python scripts which should have had a timeout set (and which I assumed did) did not. Some of those hung indefinitely when connecting to S3, waiting forever for a read to complete. Combing through the code and making sure that global socket timeouts were set, and not getting unset, seems to have solved that part.

Some of the old Python processes appeared to be hung, but on closer inspection (once the truly blocked processes were weeded out), they were simply listing large S3 buckets to check the status of keys in those buckets, and this operation was taking hours or days to complete.

The SVN checkout command was (still is) hanging for long periods of time when updating very large projects with many files in very deep directory structures. The client is waiting for a read to complete, but this is entirely legitimate (it appears to take the repository server a long time to collect the data it needs to send back to the client).

I'm leaving this answer here to explain what was going on, but I'm going to accept Matthew's, because he was right about what the actual possible issues were.