I have been running a heavily used twisted epoll server, loaded to the point
that it saturated the CPU (100% shown by "top", about 5000 connections,
intense message relaying).
I am using twisted 2.5.0, which I patched for an epoll bug.
It ran on python 2.4.4 and a 2.6.11 kernel, on a single core xeon 3.0 GHz
CPU. This server was up for many months, and it was rock-stable.
A couple of days ago I migrated that server to a newer machine: same patched
twisted 2.5.0, same python 2.4.4, newer 2.6.24 kernel and a quad core xeon
L5420 CPU.
CPU usage dropped from 100% to 30%, as expected, with the same rate of
client connections.
However, the server now has the following intermittent problem: about twice a
day, it stops accepting new connections for a short period of 5-10 minutes.
During these periods telnet times out, and this is all I get:
root@serv2:/proc/net/netfilter# telnet localhost 5229
Trying 127.0.0.1...
Existing connections are not cut; the server receives/delivers messages
to/from them just fine.
These short periods of not accepting connections do not correlate with
increased CPU load or with the overall number of connections to the server.
I have had a problem with the same symptoms before, when a server process
ran out of its quota of file descriptors.
However, there were clear messages in the twisted log at that time, and
upping the ulimits solved the problem.
This time, there are no errors in ANY logs (twisted log, /var/log/messages,
etc.).
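
Since nothing is logged, one way to rule out descriptor exhaustion directly
would be to compare the number of open descriptors with the process limit
during one of the stalls. This is only a rough sketch: the helper name
fd_usage is made up for illustration, it relies on the Linux /proc filesystem,
and it is written for a current python rather than 2.4. It has to run inside
the server process (for example from a periodic task) to report that
process's numbers.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc/self/fd is unavailable (non-Linux systems).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one open descriptor. The count
        # includes the descriptor briefly opened to read the directory.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print("open fds: %s, soft limit: %s, hard limit: %s" % (used, soft, hard))
```

If the open count stays well below the soft limit while new connections are
being refused, the descriptor quota is probably not the cause this time.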
I am out of ideas on what this could be, because my setup is exactly the
same as I have been using for the last year, except for a faster CPU and a
newer kernel.
I suspect that there are some new uncaught accept() exceptions in
internet/tcp.py in the part where it's looking for EMFILE, ENOBUFS, ENFILE,
ENOMEM, ECONNABORTED errors.
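
To test that suspicion, every errno coming out of accept() could be logged,
including the ones that are not in that list. Below is a standalone sketch of
the idea, not the actual code from internet/tcp.py: the names
classify_accept_error and accept_burst are hypothetical, it uses only the
standard library, and it is written for a current python (on 2.4 the except
syntax and the errno lookup would need adjusting).

```python
import errno
import logging
import socket

log = logging.getLogger("accept-debug")

# Nothing more to accept right now; normal for a non-blocking listener.
RETRY_LATER = (errno.EWOULDBLOCK, errno.EAGAIN)
# The errors named above: resource shortage or a client that went away.
RESOURCE_ERRORS = (errno.EMFILE, errno.ENOBUFS, errno.ENFILE,
                   errno.ENOMEM, errno.ECONNABORTED)


def classify_accept_error(code):
    """Sort an errno from accept() into 'retry', 'resource' or 'unexpected'."""
    if code in RETRY_LATER:
        return "retry"
    if code in RESOURCE_ERRORS:
        return "resource"
    return "unexpected"


def accept_burst(listener, max_accepts=100):
    """Accept up to max_accepts connections from a non-blocking listener.

    Returns (accepted, stop_reason). stop_reason is 'limit' when max_accepts
    was reached, otherwise the classification of the error that ended the loop.
    """
    accepted = []
    for _ in range(max_accepts):
        try:
            conn, addr = listener.accept()
        except socket.error as e:
            kind = classify_accept_error(e.errno)
            if kind != "retry":
                # Log every non-trivial failure, expected or not, so that
                # nothing is swallowed silently.
                log.error("accept() failed (%s): errno=%s %s", kind,
                          errno.errorcode.get(e.errno, e.errno), e)
            return accepted, kind
        accepted.append((conn, addr))
    return accepted, "limit"
```

If the stalls come from an errno outside that list, it would show up in the
log as "unexpected" instead of vanishing.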