Beta Labs isn't synchronising; as far as I can see it hasn't done so since
~11 hours ago (15:10 UTC on 2014-09-08). I noticed this while prepping a
patch for tomorrow.

Going to https://integration.wikimedia.org/ci/view/Beta/ I found that
"beta-update-databases-eqiad" had been executing for 12 hours, and
initially assumed that we had a runaway update.php issue again. However,
on examination it looks like "deployment-bastion.eqiad", or the Jenkins
executor on it, isn't responding in some way:

pending—Waiting for next available executor on deployment-bastion.eqiad

I terminated the beta-update-databases-eqiad run to see if that would help,
but it just switched over to beta-scap-eqiad being the pending task.
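When poking at a stuck queue like this, the Jenkins JSON API can show what each pending item is waiting on. A minimal sketch (the `fetch_queue` helper and its URL handling are illustrative; the /queue/api/json endpoint itself is standard Jenkins, but auth is omitted):

```python
import json
import urllib.request

def pending_reasons(queue_json):
    """Extract (task name, why) pairs from Jenkins /queue/api/json output."""
    return [
        (item.get("task", {}).get("name", "?"), item.get("why", ""))
        for item in queue_json.get("items", [])
    ]

def fetch_queue(base_url):
    """Fetch the live queue; base_url is the Jenkins root URL."""
    with urllib.request.urlopen(base_url + "/queue/api/json") as resp:
        return json.load(resp)
```

Running `pending_reasons(fetch_queue(...))` against the instance above would have listed beta-scap-eqiad with the "Waiting for next available executor" reason.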

Having chatted with MaxSem, I briefly disabled the deployment-bastion.eqiad
node in the Jenkins interface and then re-enabled it, to no effect.

This happened again today. It always seems to involve the database update job. Maybe related to https://issues.jenkins-ci.org/browse/JENKINS-10944: the parent build for a matrix job is not supposed to occupy an executor, but sometimes it does. It still appears to be an open bug upstream.

I have manually changed the config for the beta-update-databases-eqiad job in Jenkins to use the "Throttle Concurrent Builds" option in an attempt to keep Jenkins from confusing itself. If this "works", the changes should be backported into the JJB configuration to keep the problem from coming back the next time the job is updated from config.

Maximum Total Concurrent Builds: 2

Maximum Concurrent Builds Per Node: 2

Throttle Matrix master builds: true

Throttle Matrix configuration builds: true
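In effect these settings put a counting-semaphore cap on the job. A rough Python analogy of what "at most 2 concurrent builds" means (illustrative only, not how the plugin is implemented):

```python
import threading
import time

MAX_CONCURRENT = 2  # mirrors "Maximum Total Concurrent Builds: 2"
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

running = 0   # builds currently inside the throttled section
peak = 0      # highest concurrency observed
lock = threading.Lock()

def build(job_id):
    global running, peak
    with slots:  # a build must claim one of the 2 slots before starting
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for the actual build work
        with lock:
            running -= 1

threads = [threading.Thread(target=build, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)  # never exceeds MAX_CONCURRENT
```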

(reopening to keep this on my radar for a few days)

The throttle settings were still in place when the problem recurred, so they didn't fix it. The fact that we didn't see this error for a couple of weeks is apparently coincidental.

That happened again on Oct 23 2014. Looking at Jenkins thread dumps, the deployment-bastion executor threads are locked by the Gearman plugin. It must have a logic error somewhere, but I can't really debug Java ;(

You can assume that this upstream bug will not be fixed within our lifetime, and we should not invest in learning all of Jenkins' internals to fix it ourselves (it's a significant problem in how Jenkins works).

I have upgraded the Gearman plugin from 0.1.1 to f2024bd. From git log --reverse --no-merges 0.1.1..master, the three interesting changes are:

commit 6de3cdd29bb8c4336a468985af6f8e0e4fd88e66
Author: James E. Blair <jeblair@hp.com>
Date: Thu Jan 8 08:37:50 2015 -0800
Protect against partially initialized executer workers
The registerJobs method of an executorworker can be invoked by an
external caller before the worker is completely initialized by
its run method. We protected against that by checking one instance
variable, but there's still a race condition involving another.
Add a check for that variable as well.
Change-Id: I8e2cfffb54aa8a4cf8b1e61e9a9184b091054462

That one might solve the issue we are encountering with executors no longer being available.

commit 7abfdbd2d00010a1121cefebf479bcf104e7ef18
Author: James E. Blair <jeblair@hp.com>
Date: Tue May 5 10:38:25 2015 -0700
Stop sending status updates
Don't send status updates every 10 seconds. Only send them at the
start of a job (to fill in information like worker and expected
duration, etc). We don't actually do anything with subsequent
updates, and if Zuul wants to know how long a job has been running
it's perfectly capable of working that out on its own.
Change-Id: I4df5f82b3375239df35e3bc4b03e1263026f0a68

commit 65a08e0e959b0853538eabeec030d594a01c4385
Author: Clark Boylan <clark.boylan@gmail.com>
Date: Mon May 4 18:09:31 2015 -0700
Fix race between adding job and registering
Gearman plugin had a race between adding jobs to the functionList and
registering jobs. When registering jobs the functionMap is cleared, when
adding a job the plugin checks if the job is in the function Map before
running it. If we happen to trigger registration of jobs when we get a
response from gearman with a job assignment then the functionMap can be
empty making us send a work fail instead of running the job.
To make things worse this jenkins worker would not send a subsequent
GET JOB and would live lock never doing any useful work.
Correct this by making the processing for gearman events synchronous in
the work loop. This ensures that we never try to clear the function map
and check against it at the same time via different threads. To make
this happen the handleSessionEvent() method puts all events on a thread
safe queue for synchronous processing. This has allowed us to simplify
the work() loop and basically do the following:
while running:
    init()
    register()
    process one event
    run function if processed
    drive IO
This is much easier to reason about as we essentially only have
bookkeeping and the code for one thing at a time.
Change-Id: Id537710f6c8276a528ad78afd72c5a7c8e8a16ac
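The structure that commit describes, with all Gearman session events funnelled through one thread-safe queue and handled one at a time in the work loop, can be sketched as follows (class and method names are illustrative, not the plugin's actual API):

```python
import queue

class Worker:
    def __init__(self):
        self.events = queue.Queue()  # thread-safe: producers may be IO threads
        self.function_map = {}
        self.handled = []

    def handle_session_event(self, event):
        # Called from any thread: just enqueue; never touch shared state here.
        self.events.put(event)

    def register(self, names):
        # Safe to rebuild the map here: only the work loop mutates shared state.
        self.function_map = {name: True for name in names}

    def work(self, max_events):
        # Process one event per iteration, synchronously.
        for _ in range(max_events):
            kind, payload = self.events.get()
            if kind == "register":
                self.register(payload)
            elif kind == "job":
                # No other thread can clear function_map between this check
                # and the run, so no spurious "work fail" is possible.
                if payload in self.function_map:
                    self.handled.append(payload)
```

Because registration and job dispatch both happen inside `work()`, the race the commit fixes (registration clearing the map while a job assignment checks it) cannot occur.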

The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

The workaround I found was to remove the label from the node. Once that was done, the jobs showed in the queue with 'no node having label deployment-bastion-eqiad'. I then applied the label to the host again and the jobs managed to run.
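For scripting that workaround: a node's labels live in the <label> element of its config.xml, which Jenkins exposes at /computer/<node>/config.xml. A sketch of the label-stripping half (auth and CSRF-crumb handling omitted; the element name matches standard node configs, but verify against your instance):

```python
import xml.etree.ElementTree as ET

def strip_label(config_xml):
    """Return a node config.xml with its <label> element emptied."""
    root = ET.fromstring(config_xml)
    label = root.find("label")
    if label is not None:
        label.text = ""
    return ET.tostring(root, encoding="unicode")
```

The flow would be: GET the config.xml, POST back the `strip_label` result, then POST the original again to re-apply the label.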

This still happens, albeit very rarely nowadays, to the point that it is almost a non-issue. I have only noticed it once over the last few months, and the root cause was unrelated (thread starvation in the Jenkins SSH plugin that caused it to stop properly connecting slaves).
