Re: venus crash

On Thu, Mar 15, 2007 at 05:14:08PM +0200, Florin Muresan wrote:
> And the logs ....
...
> This is what was recorded in the SrvLog file, on the SCM at that time:
>
> 15:53:21 VGetVnode: vnode 1000002.6e06 is not allocated
> 15:53:21 VGetVnode: vnode 1000002.7078 is not allocated
> 15:53:21 VGetVnode: vnode 1000002.af3a is not allocated
These are probably a result of the other clients trying to revalidate
their cached copies of the removed files.
> 15:56:43 client_GetVenusId: got new host 192.168.110.4:32768
> 15:56:43 Building callback conn.
> 15:58:31 Callback failed RPC2_DEAD (F) for ws 192.168.110.4:32781
This is somewhat interesting: we get a new connection from a client, and
two minutes later we time out a connection to the same IP address but at
a different port. Possibly because the client was restarted.
> 16:00:08 Callback failed RPC2_DEAD (F) for ws 192.168.110.3:34405
Different IP address, so now we couldn't send a callback to one of the
other two clients (i.e. the ones that are still stuck).
> 16:09:37 Callback failed RPC2_DEAD (F) for ws 192.168.110.4:32768
> 16:09:37 Deleting client entry for user System:AnyUser at 192.168.110.4.32768 rpcid 880750331
And almost 9 minutes later we fail to send a callback to the client we
saw connect at 15:56:43. And we only tear down one RPC2 connection,
almost as if this client never actually managed to complete its
connection setup.
> 16:20:16 Shutdown received
Nothing in the logs between 16:09 and 16:20, and then you are restarting
the server. This looks like some lock was held by the server and none of
the clients were able to make any progress.
> The venus.err file contains this:
>
> [ W(2566) : 0000 : 15:54:33 ] WAITING(2.7f000001.bb.819e): level = RD, readers = 0, writers = 1
>
> [ X(00) : 0000 : 15:55:49 ] TERM: About to terminate venus
Ok, so there was a write lock on the root volume in your realm. Normally
such locks are only taken for the duration of the RPC2 operation, which,
if the server is unresponsive, should time out in at most 60 seconds or
so. Because we're clearly waiting longer than that, the server must have
been returning BUSY responses, which happens when an operation takes a
long time (i.e. it may be waiting on the completion of resolution, or
another client still holds a lock on the volume).
> [ X(00) : 0000 : 15:56:40 ] Coda Venus, version 6.0.16
> [ X(00) : 0000 : 15:56:40 ] Logfile initialized with LogLevel = 0 at Wed Mar 7 15:56:40 2007
...
> [ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtcGfkZw, 2.7f000001.7f6.d1e3), attempting to GC...
> 0x508f5f48 : fid = (2.7f000001.7f6.d1e3), comp = wrtcGfkZw, vol = 50867548
> state = Normal, stat = { 0, 1, 1173275016, -2, 0600, 1, File }, rc rights = 0
...
> [ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtIJzt5Z, 2.7f000001.7f8.d1e4), attempting to GC...
> 0x50906b08 : fid = (2.7f000001.7f8.d1e4), comp = wrtIJzt5Z, vol = 50867548
> state = Normal, stat = { 0, 0, 1173275016, -2, 0600, 1, File }, rc rights = 0
...
> [ X(00) : 0000 : 15:56:42 ] fsobj::Recover: invalid fso (wrtL9Tbk3, 2.7f000001.7fa.d1e5), attempting to GC...
> 0x509ac108 : fid = (2.7f000001.7fa.d1e5), comp = wrtL9Tbk3, vol = 50867548
> state = Runt, stat = { 0, 0, -1, -1, 0177777, 255, ??? }, rc rights = 0
The client restarted and discovered some inconsistencies for files in the
cache. I don't immediately see why the first two are considered invalid.
The last one is a file that we were trying to get the attributes for
(the object was created, but it is still a 'Runt' and has no valid
status). I'm pretty sure that the thread that was waiting for the
writelock to expire wanted to call GetAttr.
One thing I notice here is that all your files are in the volume
identified by '7f000001'. I think this is the first volume created when
a server is set up, so I assume you never created any additional
volumes. Almost all locks are taken on a per-volume basis, so adding
extra volumes and distributing the data across them will reduce lock
contention, i.e. a writer will only block readers of the subset of
objects located in that same volume.
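As a rough sketch of what that could look like (the volume names, realm
name, and /vicepa partition below are made-up examples, not taken from
your setup), creating and mounting a couple of extra volumes would be
something along these lines:

```shell
# On the SCM: create additional volumes, one per subtree that sees
# independent write traffic (example names, adjust to your server)
createvol_rep www.logs myserver.example.org/vicepa
createvol_rep www.data myserver.example.org/vicepa

# On a client: graft them into the existing tree as mount points
cfs mkmount /coda/example.org/www/logs www.logs
cfs mkmount /coda/example.org/www/data www.data
```

After that, a writer updating something under /www/logs only takes the
lock on the www.logs volume, and readers in /www/data are unaffected.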
Also, you described your setup as 3 clients and a single server, and we
will never actually have any resolution in such a setup. There is only a
single replica, so clients will never see differences between replicas.
This makes the whole server lockup more suspicious, since I tend to
associate locking issues with resolution. Maybe those 'memory vnode
snatched away' messages indicate that the server followed some error path
and forgot to release a read or write lock on the volume. I'll have to
look into that.
Finally, with 200 apache worker threads sending requests down, the Coda
client probably has some difficulty keeping up. It is possible to
increase the number of worker threads by starting the client with
venus -maxworkers 200
I just started my own client with 200 worker threads and it doesn't seem
to have a problem with that. Of course my desktop isn't really pushing
the client all that much; even when I try to push it by opening several
email clients, it looks like I'm still only using about 2 of venus's
worker threads.
If you are tracking your webserver load with something like rrdtool,
it is probably possible to tune maxworkers to the expected number of
busy apache server processes. Our webserver doesn't have much trouble
with the default settings, average number of busy apache servers is
typically less than 1.8, the maximum over the past month was around 19.
So even though I have between 50 and 100 apache workers, only a fraction
is active at any given time.
I'm kind of guessing that most of the active apache processes aren't
actually waiting for Coda, but are blocked sending the response back to
slower clients. However, apache doesn't record a separate state between
reading the request and sending the reply, which is where a process
would be if it were blocked while trying to open files from /coda.
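To illustrate the last two points, here is a small sketch (the
`parse_status` helper and the sample text are mine, not part of apache)
that digests the machine-readable `?auto` output of apache's mod_status.
Note that a worker blocked in open() on /coda is counted in the same 'W'
(sending reply) scoreboard state as one writing to a slow client, which
is why apache alone can't tell the two apart:

```python
# Summarize Apache mod_status "?auto" output (e.g. fetched from
# http://localhost/server-status?auto) into a busy-worker count and a
# per-state breakdown of the scoreboard string.

from collections import Counter

# Scoreboard characters as documented for mod_status
SCOREBOARD_STATES = {
    "_": "waiting", "S": "starting", "R": "reading request",
    "W": "sending reply", "K": "keepalive", "D": "DNS lookup",
    "C": "closing", "L": "logging", "G": "graceful finish",
    "I": "idle cleanup", ".": "open slot",
}

def parse_status(text):
    """Parse the 'Key: value' lines of a mod_status ?auto page."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    busy = int(info.get("BusyWorkers", 0))
    states = Counter(info.get("Scoreboard", ""))
    return busy, states

# Hypothetical sample: 3 busy workers, two in 'W' (could be slow
# clients OR blocked opening a file in /coda) and one reading a request.
sample = "BusyWorkers: 3\nIdleWorkers: 5\nScoreboard: WWR_____.."
busy, states = parse_status(sample)
print(busy, states["W"], states["R"])  # -> 3 2 1
```

Feeding the real `?auto` page into something like this on a schedule is
one way to get the "busy apache servers" numbers into rrdtool.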
Jan