Buildkite Status - Incident Historyhttps://www.buildkitestatus.com
StatuspageSun, 15 Sep 2019 18:02:55 +1000Builds timing out, elevated Agent API error rate
<p><small>Sep <var data-var='date'>13</var>, <var data-var='time'>10:49</var> AEST</small><br><strong>Resolved</strong> - Agent API latency and error rates have returned to normal levels.</p><p><small>Sep <var data-var='date'>13</var>, <var data-var='time'>09:37</var> AEST</small><br><strong>Monitoring</strong> - We’ve rolled out additional capacity and are seeing metrics returning to more normal levels. We will continue monitoring as things continue to normalise.</p><p><small>Sep <var data-var='date'>13</var>, <var data-var='time'>09:05</var> AEST</small><br><strong>Identified</strong> - One of our build-log buffer DB instances experienced unusually high load and failed. We are rolling out additional capacity.</p><p><small>Sep <var data-var='date'>13</var>, <var data-var='time'>08:48</var> AEST</small><br><strong>Investigating</strong> - We’ve been alerted to elevated error rates and slow responses from the agent API and are investigating.</p> Fri, 13 Sep 2019 10:49:39 +1000https://www.buildkitestatus.com/incidents/9tn80z1ng2ql
https://www.buildkitestatus.com/incidents/9tn80z1ng2qlAgent and web connectivity issue
<p><small>Sep <var data-var='date'> 1</var>, <var data-var='time'>01:15</var> AEST</small><br><strong>Resolved</strong> - Connectivity issues with the Agent API have been resolved and traffic has returned to a normal level.</p><p><small>Sep <var data-var='date'> 1</var>, <var data-var='time'>01:02</var> AEST</small><br><strong>Monitoring</strong> - We've restored connectivity to the Agent API and things appear to be functioning again.</p><p><small>Sep <var data-var='date'> 1</var>, <var data-var='time'>00:01</var> AEST</small><br><strong>Identified</strong> - We've identified a connectivity issue with our Redis infrastructure caused by an AWS outage. We're working on a fix.</p><p><small>Aug <var data-var='date'>31</var>, <var data-var='time'>23:46</var> AEST</small><br><strong>Investigating</strong> - We're investigating degraded agent and web network connectivity, possibly related to an ongoing AWS EC2 incident.</p> Sun, 01 Sep 2019 01:15:21 +1000https://www.buildkitestatus.com/incidents/qb6t7jly69xj
https://www.buildkitestatus.com/incidents/qb6t7jly69xjErrors setting GitHub commit statuses
<p><small>Aug <var data-var='date'>22</var>, <var data-var='time'>22:16</var> AEST</small><br><strong>Resolved</strong> - Everything seems to be back to normal.</p><p><small>Aug <var data-var='date'>22</var>, <var data-var='time'>21:16</var> AEST</small><br><strong>Monitoring</strong> - Errors have subsided, we'll keep an eye on things.</p><p><small>Aug <var data-var='date'>22</var>, <var data-var='time'>21:06</var> AEST</small><br><strong>Identified</strong> - We're seeing errors when setting GitHub commit statuses for builds and jobs. It looks like GitHub are aware of the issue (https://www.githubstatus.com/incidents/nt6n1hs8bk3v) and are investigating.</p> Thu, 22 Aug 2019 22:16:49 +1000https://www.buildkitestatus.com/incidents/t3z5t2s8ccj9
https://www.buildkitestatus.com/incidents/t3z5t2s8ccj9Live updates to build & pipeline pages not working
<p><small>Aug <var data-var='date'>16</var>, <var data-var='time'>10:01</var> AEST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Aug <var data-var='date'>16</var>, <var data-var='time'>09:48</var> AEST</small><br><strong>Update</strong> - We're trying to make contact with our vendor. We believe the limits will reset in 15 minutes.</p><p><small>Aug <var data-var='date'>16</var>, <var data-var='time'>09:39</var> AEST</small><br><strong>Identified</strong> - We've exceeded the daily limit on the number of websocket messages our vendor allows us to send. We've reached out to them for a temporary upgrade. Builds, jobs and agents are still operating as normal.</p> Fri, 16 Aug 2019 10:01:16 +1000https://www.buildkitestatus.com/incidents/n3rx8byr9vm8
https://www.buildkitestatus.com/incidents/n3rx8byr9vm8Some agents failing to register
<p><small>Jul <var data-var='date'>23</var>, <var data-var='time'>04:35</var> AEST</small><br><strong>Resolved</strong> - The fix is out and we are seeing no further issues. Apologies for the interruption.</p><p><small>Jul <var data-var='date'>23</var>, <var data-var='time'>04:31</var> AEST</small><br><strong>Identified</strong> - Some new agents are failing to register. A fix is currently being deployed.</p> Tue, 23 Jul 2019 04:35:17 +1000https://www.buildkitestatus.com/incidents/slsy1r7p2t38
https://www.buildkitestatus.com/incidents/slsy1r7p2t38Elevated Error Rate on API Traffic
<p><small>Jul <var data-var='date'>19</var>, <var data-var='time'>13:07</var> AEST</small><br><strong>Resolved</strong> - With a temporary workaround in place we expect this incident has been resolved. We have identified some follow-up measures that we will schedule to be implemented during an upcoming off-peak period.</p><p><small>Jul <var data-var='date'>19</var>, <var data-var='time'>13:02</var> AEST</small><br><strong>Monitoring</strong> - We’ve identified the cause of the elevated error rates and have implemented a workaround, and are seeing response status rates return to expected levels.</p><p><small>Jul <var data-var='date'>19</var>, <var data-var='time'>12:22</var> AEST</small><br><strong>Investigating</strong> - We’re experiencing elevated error response rates from the REST and agent APIs. We’re working to identify the issue.</p> Fri, 19 Jul 2019 13:07:07 +1000https://www.buildkitestatus.com/incidents/3kzczv43gkb1
https://www.buildkitestatus.com/incidents/3kzczv43gkb1Elevated Error Rate on API and Agent Traffic
<p><small>Jul <var data-var='date'>18</var>, <var data-var='time'>18:38</var> AEST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jul <var data-var='date'>18</var>, <var data-var='time'>17:52</var> AEST</small><br><strong>Monitoring</strong> - We’ve rolled out a fix and are continuing to monitor impacted systems.</p><p><small>Jul <var data-var='date'>18</var>, <var data-var='time'>17:40</var> AEST</small><br><strong>Identified</strong> - We’re experiencing an elevated error rate from API endpoints. We have identified the issue and are working to deploy a fix; we expect this issue to be resolved shortly.</p> Thu, 18 Jul 2019 18:38:11 +1000https://www.buildkitestatus.com/incidents/k1m3gxc5m7q2
https://www.buildkitestatus.com/incidents/k1m3gxc5m7q2High error rate with dashboard and REST API
<p><small>Jun <var data-var='date'>30</var>, <var data-var='time'>20:30</var> AEST</small><br><strong>Resolved</strong> - No further issues have been observed.</p><p><small>Jun <var data-var='date'>30</var>, <var data-var='time'>18:37</var> AEST</small><br><strong>Monitoring</strong> - We've resolved the issue and systems are now operational.</p><p><small>Jun <var data-var='date'>30</var>, <var data-var='time'>17:19</var> AEST</small><br><strong>Identified</strong> - We've identified an issue with the puma web servers on our fleet of REST API instances and are working on a fix.</p><p><small>Jun <var data-var='date'>30</var>, <var data-var='time'>17:13</var> AEST</small><br><strong>Investigating</strong> - We're investigating a high error rate on the dashboard and REST API.</p> Sun, 30 Jun 2019 20:30:12 +1000https://www.buildkitestatus.com/incidents/tlxr3vkgrkc7
https://www.buildkitestatus.com/incidents/tlxr3vkgrkc7Web dashboard asset network connection failures
<p><small>Jun <var data-var='date'>24</var>, <var data-var='time'>23:11</var> AEST</small><br><strong>Resolved</strong> - We're seeing traffic returning to normal levels and connection errors cease in all regions.</p><p><small>Jun <var data-var='date'>24</var>, <var data-var='time'>23:04</var> AEST</small><br><strong>Monitoring</strong> - The upstream network issue has been resolved, we are monitoring.</p><p><small>Jun <var data-var='date'>24</var>, <var data-var='time'>22:46</var> AEST</small><br><strong>Update</strong> - We're continuing to work on mitigating the upstream networking issue. It's also affecting AWS and Google services in various regions. The upstream cause appears to be a BGP route leak.</p><p><small>Jun <var data-var='date'>24</var>, <var data-var='time'>21:48</var> AEST</small><br><strong>Identified</strong> - We've identified an issue with our web asset CDN provider (Cloudflare) where some networks are receiving connection failures, preventing the web dashboard from loading. Cloudflare are investigating the issue, and we're investigating fixes.</p> Mon, 24 Jun 2019 23:11:16 +1000https://www.buildkitestatus.com/incidents/3858lh2m8971
https://www.buildkitestatus.com/incidents/3858lh2m8971Reports of elevated error rates from API and Dashboard
<p><small>Jun <var data-var='date'>17</var>, <var data-var='time'>20:20</var> AEST</small><br><strong>Resolved</strong> - We've identified and fixed an issue caused by a broken deploy that resulted in new instances not registering as healthy in our auto-scaling groups.</p><p><small>Jun <var data-var='date'>17</var>, <var data-var='time'>19:52</var> AEST</small><br><strong>Investigating</strong> - We've had reports of elevated API and Dashboard errors which we are investigating.</p> Mon, 17 Jun 2019 20:20:22 +1000https://www.buildkitestatus.com/incidents/vmzf7llfbcl5
https://www.buildkitestatus.com/incidents/vmzf7llfbcl5Credential stuffing attack
<p><small>May <var data-var='date'>29</var>, <var data-var='time'>22:04</var> AEST</small><br><strong>Resolved</strong> - We’ve seen no further suspicious activity, and have rolled out additional measures to prevent similar attacks in the future.<br /><br />In light of the attack, we’re working on improvements to our documentation and platform to help you keep your Buildkite account secure. We’ll be posting these to the <a href="https://buildkite.com/changelog">Buildkite Changelog</a> as they’re released.<br /><br />If you have any questions please contact support@buildkite.com</p><p><small>May <var data-var='date'>27</var>, <var data-var='time'>21:16</var> AEST</small><br><strong>Update</strong> - We're continuing to monitor for further attacks, and have rolled out additional improvements to our authentication and monitoring systems.</p><p><small>May <var data-var='date'>26</var>, <var data-var='time'>17:44</var> AEST</small><br><strong>Update</strong> - We are continuing to monitor for any further issues.</p><p><small>May <var data-var='date'>26</var>, <var data-var='time'>17:40</var> AEST</small><br><strong>Monitoring</strong> - Last night we were alerted to an incident that occurred over May 18th-22nd (UTC), where an attacker managed to access a small number of Buildkite user accounts using email/password lists from publicly available data breach dumps. This attack is known as a "<a href="https://www.owasp.org/index.php/Credential_stuffing">credential stuffing</a>" attack and relies on the fact that users will often use the same email and password across services and forget to change it. <br /><br />We've reached out to admins of the few affected organizations, and are assisting them to determine the impact. 
If we haven't emailed you, your account hasn't been affected.<br /><br />In response to this attack, we're rolling out changes to our authentication and login systems to prevent this type of attack being possible, and will continue to monitor for any further suspicious activity. We'll update this incident as we go.<br /><br />As always, we're available for questions and assistance at support@buildkite.com.</p> Wed, 29 May 2019 22:04:30 +1000https://www.buildkitestatus.com/incidents/z4dn9qzvzt93
https://www.buildkitestatus.com/incidents/z4dn9qzvzt93GitHub Enterprise build events not being processed
<p><small>May <var data-var='date'>27</var>, <var data-var='time'>21:52</var> AEST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>May <var data-var='date'>27</var>, <var data-var='time'>21:49</var> AEST</small><br><strong>Monitoring</strong> - A fix has been implemented, and GitHub Enterprise build events are now being processed as normal.</p><p><small>May <var data-var='date'>27</var>, <var data-var='time'>21:28</var> AEST</small><br><strong>Identified</strong> - Build events from GitHub Enterprise are currently not being processed. We've identified the problem and are deploying a fix.</p> Mon, 27 May 2019 21:52:07 +1000https://www.buildkitestatus.com/incidents/dk5qh5w6qzyd
https://www.buildkitestatus.com/incidents/dk5qh5w6qzydDelayed job dispatches
<p><small>May <var data-var='date'>18</var>, <var data-var='time'>06:10</var> AEST</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>May <var data-var='date'>18</var>, <var data-var='time'>05:33</var> AEST</small><br><strong>Monitoring</strong> - An unusual request pattern created a backlog of high-intensity database operations; the queue has now returned to normal and dispatches are following</p><p><small>May <var data-var='date'>18</var>, <var data-var='time'>04:01</var> AEST</small><br><strong>Investigating</strong> - Monitoring has detected delays in job dispatching</p> Sat, 18 May 2019 06:10:59 +1000https://www.buildkitestatus.com/incidents/39bq004kmx68
https://www.buildkitestatus.com/incidents/39bq004kmx68Okta logins causing loss of administrator access
<p><small>May <var data-var='date'> 3</var>, <var data-var='time'>15:00</var> AEST</small><br><strong>Resolved</strong> - We received multiple reports of users logging in via Okta and losing their administrator access.<br /><br />An update to the Buildkite Okta config had been deployed several days before, which caused an empty admin attribute to be sent to us. <br /><br />This was resolved by deploying a fix that ignored the empty admin attribute such that admin access isn't removed on login. We'll be reaching out to affected teams to assist with anyone who has lost admin access for their organization.</p> Fri, 03 May 2019 15:00:35 +1000https://www.buildkitestatus.com/incidents/mwr107cq1p8x
https://www.buildkitestatus.com/incidents/mwr107cq1p8xEmail delivery delayed due to provider outage
<p><small>May <var data-var='date'> 3</var>, <var data-var='time'>08:11</var> AEST</small><br><strong>Resolved</strong> - Upstream mail provider appears to have resolved the errors we were seeing, mail is delivering as usual.</p><p><small>May <var data-var='date'> 3</var>, <var data-var='time'>07:43</var> AEST</small><br><strong>Monitoring</strong> - We are seeing elevated error rates when delivering notifications and transactional email through our upstream provider Mailchimp/Mandrill. Emails are being queued and will be delivered when the problem resolves.</p> Fri, 03 May 2019 08:11:01 +1000https://www.buildkitestatus.com/incidents/v5yy8lzq66rx
https://www.buildkitestatus.com/incidents/v5yy8lzq66rxContinued Database Upgrade
<p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>14:20</var> AEST</small><br><strong>Resolved</strong> - Done! We're sorry it took so long. We had run multiple test upgrades throughout the week without a hitch, but this one presented some new challenges that took a little extra time to work through.<br /><br />Buildkite is back and everything should be running smoothly again! If you notice jobs stuck in weird states, a "cancel" + "retry" should get them going again. If that doesn't work, reach out to us at hello@buildkite.com and we'll take a look.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>13:51</var> AEST</small><br><strong>Update</strong> - We're finishing up now, and our graphs are starting to look normal again.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>13:25</var> AEST</small><br><strong>Update</strong> - We've finished our maintenance tasks and services are coming online. You may see a few errors while our servers start warming up.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>12:45</var> AEST</small><br><strong>Update</strong> - We've just marked our service as having "major outage" to indicate that nothing is working. We're still running our final tasks.<br /><br />We'll post an update as soon as we're back!</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>12:05</var> AEST</small><br><strong>Identified</strong> - We're continuing our database upgrade. StatusPage automatically finished our previous scheduled maintenance incident before we had finished our maintenance tasks, so we'll post updates in this incident moving forward.</p> Sun, 14 Apr 2019 14:20:11 +1000https://www.buildkitestatus.com/incidents/sd78sgyl64t9
https://www.buildkitestatus.com/incidents/sd78sgyl64t9Database Upgrade
<p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>12:00</var> AEST</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>11:23</var> AEST</small><br><strong>Update</strong> - We're still working through all the tables ensuring all our maintenance tasks have been run on them. One of our larger tables is giving us a few troubles (that didn't come up during our testing of this earlier in the week). Still working through that one.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>10:53</var> AEST</small><br><strong>Update</strong> - While verifying, we found a problem in one of our tables. We're going back into maintenance mode while we continue verifying.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>10:39</var> AEST</small><br><strong>Verifying</strong> - The database has upgraded successfully. We're verifying the upgrade and making sure everything's returning to normal.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>10:22</var> AEST</small><br><strong>Update</strong> - There is an error in our maintenance page causing an error message while the database is down for upgrade. We're looking into it. But the upgrade is proceeding as planned.</p><p><small>Apr <var data-var='date'>14</var>, <var data-var='time'>10:13</var> AEST</small><br><strong>In progress</strong> - We're about to start the upgrade. We expect a few minutes of downtime when the database performs the upgrade process. We'll keep you posted.</p><p><small>Apr <var data-var='date'> 5</var>, <var data-var='time'>10:27</var> AEDT</small><br><strong>Scheduled</strong> - We'll be upgrading our database next weekend on Sunday 14th April 2019 between 10am-11am UTC+10 (AEST). We expect a few minutes of downtime for the database upgrade process during this window. 
For these few minutes the Dashboard and APIs will be unavailable, new build jobs will not be scheduled on agents, and running build jobs should run to completion but results will be delayed until the agent's retry behaviour successfully contacts the API.</p> Sun, 14 Apr 2019 12:00:54 +1000https://www.buildkitestatus.com/incidents/wgc06sz54v5l
https://www.buildkitestatus.com/incidents/wgc06sz54v5lSome job logs unavailable
<p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>21:49</var> AEST</small><br><strong>Resolved</strong> - The issue has been resolved and the underlying instance has been completely replaced. There was an underlying hardware issue which was not detected by usual hardware failure mechanisms and should no longer be affecting the instances.<br /><br />The issue only affected some instances in our cluster so agents which failed log submission should have retried, and dashboard log display may have been delayed or malfunctioned, but no data has been lost, and operations are now back to normal.</p><p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>21:37</var> AEST</small><br><strong>Monitoring</strong> - We've failed over and the new primary is doing nicely. We're recycling the old one so we're not affected by the underlying hardware fault in the event of a failover back.</p><p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>21:24</var> AEST</small><br><strong>Identified</strong> - We've been speaking to AWS who have identified an underlying hardware issue and are performing a failover.</p><p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>20:31</var> AEST</small><br><strong>Monitoring</strong> - The issue seems to have been resolved. Some network connections were dropped between our connection pooler and an upstream RDS instance, which appears to have been a transient networking issue. We're keeping an eye on things.</p><p><small>Apr <var data-var='date'> 8</var>, <var data-var='time'>20:20</var> AEST</small><br><strong>Investigating</strong> - One of our job log databases is having some trouble, we're looking into it.</p> Mon, 08 Apr 2019 21:49:33 +1000https://www.buildkitestatus.com/incidents/wdzlnzqqxnq3
https://www.buildkitestatus.com/incidents/wdzlnzqqxnq3Expanded job logs not always automatically refreshing
<p><small>Mar <var data-var='date'>27</var>, <var data-var='time'>07:39</var> AEDT</small><br><strong>Resolved</strong> - Job logs are now updating as normal. Any stuck logs should start updating again if you reload the page. Apologies for the interruption!</p><p><small>Mar <var data-var='date'>27</var>, <var data-var='time'>07:28</var> AEDT</small><br><strong>Identified</strong> - Job logs are currently not always refreshing automatically. They will refresh based on changes to the job or build’s state, but followed log output may lag behind until you refresh the page.</p> Wed, 27 Mar 2019 07:39:30 +1100https://www.buildkitestatus.com/incidents/zny6rpnmh0ly
https://www.buildkitestatus.com/incidents/zny6rpnmh0lyReported slow response times and increased error rates
<p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>09:00</var> AEDT</small><br><strong>Resolved</strong> - The database performance incident is resolved, we'll follow up with a post-mortem.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>08:12</var> AEDT</small><br><strong>Monitoring</strong> - We've identified and fixed the issue causing database load, latency issues should be returning to normal on all fronts.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>07:40</var> AEDT</small><br><strong>Update</strong> - We're continuing to investigate the issue that is causing degraded performance.</p><p><small>Mar <var data-var='date'>14</var>, <var data-var='time'>05:58</var> AEDT</small><br><strong>Identified</strong> - We’ve identified some performance problems with our main transactional database that are causing slow load times for dashboard requests and some API calls to time out. We’re working on addressing the issues now.</p> Thu, 14 Mar 2019 09:00:12 +1100https://www.buildkitestatus.com/incidents/s1ddrm88pltf
https://www.buildkitestatus.com/incidents/s1ddrm88pltfReported slow dispatches.
<p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>03:32</var> AEDT</small><br><strong>Resolved</strong> - This incident has been resolved</p><p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>03:15</var> AEDT</small><br><strong>Monitoring</strong> - We believe this is an isolated incident and will continue to monitor.</p><p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>03:14</var> AEDT</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>02:47</var> AEDT</small><br><strong>Investigating</strong> - We've received reports of slow dispatches. We are investigating.</p> Wed, 13 Mar 2019 03:32:00 +1100https://www.buildkitestatus.com/incidents/14f1qhfykk4r
https://www.buildkitestatus.com/incidents/14f1qhfykk4rDashboard pages failing to render
<p><small>Mar <var data-var='date'> 7</var>, <var data-var='time'>07:16</var> AEDT</small><br><strong>Resolved</strong> - The fix has rolled out, and we’re not seeing any further errors. Sorry for the trouble.</p><p><small>Mar <var data-var='date'> 7</var>, <var data-var='time'>06:56</var> AEDT</small><br><strong>Identified</strong> - Some dashboard pages are failing to render correctly. We’re deploying a fix, apologies for the interruption.</p> Thu, 07 Mar 2019 07:16:36 +1100https://www.buildkitestatus.com/incidents/lnbz1y9rfslv
https://www.buildkitestatus.com/incidents/lnbz1y9rfslvIncreased rate of error responses on Agent API
<p><small>Feb <var data-var='date'>28</var>, <var data-var='time'>00:00</var> AEDT</small><br><strong>Resolved</strong> - An increased rate of error responses on the Agent API was detected. It was resolved in 5 minutes, though some pipeline upload job failures might have occurred. Affected builds can be rebuilt/retried.</p> Thu, 28 Feb 2019 00:00:55 +1100https://www.buildkitestatus.com/incidents/jmqsps81h2dt
https://www.buildkitestatus.com/incidents/jmqsps81h2dtPartial database outage
<p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>18:33</var> AEDT</small><br><strong>Resolved</strong> - All impacted systems are fully functional and have remained stable with the fix deployed. We will continue to monitor the situation, and will take steps to prevent it occurring again.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>17:53</var> AEDT</small><br><strong>Monitoring</strong> - We detected high error rates on the dashboard and agent APIs due to a partial database outage. We've rolled out a fix and all systems have returned to normal. We'll continue to monitor all systems to ensure performance is fully restored.</p> Tue, 26 Feb 2019 18:33:25 +1100https://www.buildkitestatus.com/incidents/py2cv79drr8j
https://www.buildkitestatus.com/incidents/py2cv79drr8jJob log API elevated error responses
<p><small>Feb <var data-var='date'>10</var>, <var data-var='time'>15:22</var> AEDT</small><br><strong>Resolved</strong> - Agent API requests for job logs continue to be served correctly, and all systems are stable.</p><p><small>Feb <var data-var='date'>10</var>, <var data-var='time'>14:47</var> AEDT</small><br><strong>Monitoring</strong> - One of the job log storage databases experienced an unplanned failover, and as a result some Agent API requests for job log storage failed between the period 03:05 UTC and 03:10 UTC.<br /><br />Buildkite Agents > v3.8.3 will have continued to retry posting their job logs to the Agent API, but jobs running on earlier agents may have truncated job logs for jobs run between 03:05 UTC and 03:10 UTC. We recommend upgrading to Buildkite Agent v3.8.3 or above, which provides improved retry behaviour in the case of job log API problems.</p><p><small>Feb <var data-var='date'>10</var>, <var data-var='time'>14:39</var> AEDT</small><br><strong>Identified</strong> - An elevated error response rate was detected from our Job Log Agent API endpoint, and has since recovered. An automatic failover has already taken place, and responses have returned to normal, but we'll continue to investigate the underlying cause and continue to monitor the related systems.</p> Sun, 10 Feb 2019 15:22:04 +1100https://www.buildkitestatus.com/incidents/y8b3jm97zl1m
https://www.buildkitestatus.com/incidents/y8b3jm97zl1m