When I examine the PR, I see the failed status was correctly posted to it. I have bors connected to a GitHub Enterprise (GHE) instance, and I verified in the GHE logs that the response was returned within a few ms.

For this request, the entries in the GHE logs show that it was answered promptly and without issue.

PRs that were queued up at the time are not picked up or rescheduled, and they need a bors r- before they can be bors r+ again…

Around the time these crashes happen, I notice webhooks (from GHE to bors) taking > 10s to return. From what I can tell, this begins to happen when GHE is under high load (usually when it is sending a lot of webhooks). The instance running bors is never fully utilized (at least according to monitoring metrics).

We now believe the timeout is not a GenServer.call() timeout; instead, it is a timeout from httpoison/hackney. After increasing the default GenServer.call() timeout to 10s, we now see the nested timeout exception from httpoison.

FWIW, we have seen some intermittent DNS issues where some hosts can’t resolve our GitHub Enterprise hostname. Could that show up as a connection timeout in hackney?

From what I’ve tested, that would’ve produced an :nxdomain error. Similarly, if it was a connection timeout, it would have produced a :connect_timeout error; if it had run out of its connection pool, it would also have produced a :connect_timeout error; and if it had already started downloading data, it would have been a :recv_timeout.

The :timeout error is supposed to appear when HTTPoison/Hackney makes a successful connection to the server, sends its request, and doesn’t get back a response within the timeout window.
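As a rough sketch of the classification above, the mapping from error reason to likely cause could be written down like this. `TimeoutTriage` is a hypothetical helper, not part of bors, HTTPoison, or hackney, and the meanings are copied from the description in this thread rather than checked against hackney's source, so treat them as assumptions.

```elixir
defmodule TimeoutTriage do
  @moduledoc """
  Hypothetical helper (not part of bors, HTTPoison, or hackney) that turns
  the error reasons discussed in this thread into a readable diagnosis.
  """

  # The reason atoms and their meanings follow the description in the thread;
  # they are assumptions, not verified against a specific hackney version.
  def describe(:nxdomain), do: "DNS lookup failed: hostname did not resolve"
  def describe(:connect_timeout), do: "could not connect in time, or the connection pool was exhausted"
  def describe(:recv_timeout), do: "response started but stalled while downloading"
  def describe(:timeout), do: "connected and sent the request, but no response arrived in time"
  def describe(other), do: "unclassified error: " <> inspect(other)

  # Matches any `{:error, map}` whose map has a `:reason` key. An
  # `%HTTPoison.Error{}` struct is such a map, so its results fit this clause.
  def triage({:error, %{reason: reason}}), do: {reason, describe(reason)}
  def triage({:ok, _response}), do: :ok
end
```

Logging the result of `TimeoutTriage.triage/1` next to each failed request would make it easier to tell a plain :timeout apart from the DNS and connection cases when reading the crash reports.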

btw, we also see long response times from GHE webhooks. Do you think it might be the HTTP client, or maybe the Erlang VM? The box we provisioned is an x5.xlarge, and we don’t see the CPU, network, or anything else being stressed. Is there something we can tune?

The long webhook response times come from the webhook making HTTP requests.

I am talking about the bors webhook that GHE calls. Are you suggesting GHE is slow? Or are you confirming that Hackney’s pool manager is the culprit in that case? FWIW, those webhook calls (to bors) are pretty fast in general but sometimes slow, and anecdotally the slow ones happen around the same time we see the timeout exception on the bors side.