… interesting tidbits of release engineering.

2014-06 try server update

Chatting with Aki the other day, I realized that all the wonderful
improvements to the try server have not been publicized.
A lot of folks have done a lot of work to make things better - here’s a
brief summary of the good news.

Before:

Try server pushes could appear to take up to 4 hours, during which
time others would be locked out.

Now:

The major time taker has been found and eliminated: ancestor
processing. And we understand that the remaining occasional slowdowns
are related to caching. Fortunately, there are some steps that
developers can take now to minimize delays.

What folks can do to help

The biggest remaining slowdown is caused by rebuilding the
cache. The cache is only invalidated if the push is
interrupted. If you can avoid causing a disconnect until your push is
complete, that helps everyone! So, please, no Ctrl-C during the
push! The other changes should address the long wait times you used to
see.
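
If you are prone to a reflexive Ctrl-C, one option is to run the push
through a small wrapper that shields it from the keyboard interrupt. The
sketch below is only an illustration, not something we ship: the name
run_uninterruptible is made up for this example, and it assumes a POSIX
system and an ssh setup that does not need to prompt on the terminal
(the child is started in its own session, so it has no controlling
terminal to prompt on).

```python
import signal
import subprocess
import threading


def run_uninterruptible(cmd):
    """Run cmd to completion, shielding it from Ctrl-C typed at the terminal.

    Hypothetical helper for illustration only. Returns the command's
    exit status.
    """
    # signal.signal() may only be called from the main thread.
    in_main_thread = threading.current_thread() is threading.main_thread()
    previous = None
    if in_main_thread:
        # Ignore SIGINT in the wrapper itself while the child runs. An
        # ignored signal disposition is also inherited by the child.
        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        # start_new_session=True (POSIX only) puts the child in its own
        # session, so the terminal's Ctrl-C, which goes to the foreground
        # process group, is not delivered to it.
        return subprocess.call(cmd, start_new_session=True)
    finally:
        # Put the caller's original Ctrl-C behaviour back, if we changed it
        # and the previous handler is one Python can reinstall.
        if in_main_thread and previous is not None:
            signal.signal(signal.SIGINT, previous)
```

You would call it with whatever push command you normally use, for
example run_uninterruptible(["hg", "push", "-f", "try"]), where "try" is
assumed to be a path alias you have already configured. This only guards
against Ctrl-C; a dropped network connection still interrupts the push.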

What has been done to infrastructure

There has long been a belief that many of our hg problems, especially on
try, came from the fact that we had r/w NFS mounts of the repositories
across multiple machines (both hgssh servers & hgweb servers). For
various historical reasons, a large part of this was due to the way
pushlog was implemented.

What has been done to our hooks

All along, folks have been discussing our try server performance issues
with the hg developers. A key confusing issue was that we saw processes
“hang” for VERY long times (45 min or more) without making a system
call. Kendall managed to observe an hg process in such an
infinite-looking-loop-that-eventually-terminated a few times. A stack
trace would show it was looking up an hg ancestor without making system
calls or library accesses. In discussions, this confused the hg team,
as they did not know of any reason that ancestor code would be
invoked during a push.

Thanks to lots of debugging help from glandium one evening, we found and
disabled a local hook that invoked the ancestor function on every
commit to try. \o/ teamwork!

Caching – the remaining problem

With the ancestor-invoking-hook disabled, we still saw some longish
periods of time where we couldn’t explain why pushes to try appeared
hung. Granted, it was a much shorter time, and always self-corrected,
but it was still puzzling.

A number of our old theories, such as “too many heads”, were discounted
by hg developers as both (a) we didn’t have that many heads, and (b)
lots of heads shouldn’t be a significant issue – hg wants to support
even more heads than we have on try.

Greg did a wonderful bit of sleuthing to find the impact of ^C during
push. Our current belief is that once the caching is fixed upstream,
we’ll be in a pretty good spot. (Especially with the inclusion of some
performance optimizations that also become possible with the new
cache-fixed version.)

What is coming next

To take advantage of all the good stuff upstream hg versions have,
including the bug fixes we want, we’re going to be removing the
roadblocks to staying closer to the tip. Historically, we had
some issues due to HTTP header sizes and load balancers; ancient Python
or hg client versions; and similar. The client issues have been
addressed, and a proper testing/staging environment is on the horizon.

There are a few competing priorities, so I’m not going to predict a
completion date. But I’m positive the future is coming. I hope you have
a glimpse into that as well.