Improving Outage Response in Atlassian

When an outage hits, two things matter more than getting the system back up. First, you must not make things worse. Second, you must capture all of the information needed to diagnose the root cause. If you don’t, you’ll have to wait for another outage before you can start your investigation in earnest.

To help teams deal with these challenges, we have identified two simple improvements that any admin team can quickly implement to improve its outage response for Atlassian tools.

Damage prevention in Jira:

Delete the stop-jira.sh script from your Jira installation’s bin folder. This script kills the Jira process even if it is still in the middle of shutting down. On a larger instance, that is very likely to corrupt the index, and a corrupt index means a much longer outage than you first expected. Deleting the script entirely prevents an admin from using it by accident. Instead, your team should use the shutdown.sh script.
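As a minimal sketch of that cleanup step, the function below removes stop-jira.sh only when shutdown.sh is present in the same folder, so the team is never left without a shutdown script. The JIRA_INSTALL default path and the function name remove_stop_script are assumptions for illustration; point them at your own installation.

```shell
#!/bin/sh
# Hypothetical install path (an assumption); set JIRA_INSTALL to your real Jira directory.
JIRA_INSTALL="${JIRA_INSTALL:-/opt/atlassian/jira}"

# Remove stop-jira.sh from the given bin folder, but only if shutdown.sh
# is there to use instead.
remove_stop_script() {
    bin_dir="$1"
    if [ ! -f "$bin_dir/shutdown.sh" ]; then
        echo "No shutdown.sh in $bin_dir; leaving everything untouched." >&2
        return 1
    fi
    if [ -f "$bin_dir/stop-jira.sh" ]; then
        rm -- "$bin_dir/stop-jira.sh"
        echo "Removed $bin_dir/stop-jira.sh"
    else
        echo "stop-jira.sh is already absent from $bin_dir"
    fi
}

# Usage (run as the user that owns the Jira installation):
# remove_stop_script "$JIRA_INSTALL/bin"
```

You may prefer to keep a copy of the removed script somewhere outside the bin folder in case a later upgrade or support request needs it.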

Data capture in Atlassian tools:

Atlassian has provided some great tools at https://bitbucket.org/atlassianlabs/atlassian-support which aid users in data collection. But if the admin on duty isn’t experienced at data collection, for example a night-shift operator who has never touched Jira before, you may end up with an outage and nothing but your core logs. To help address this problem, we have updated the shutdown.sh script to ask the user whether they are shutting down in response to an outage. If so, it runs the support tools script if it is present, or tells the user where to get it if it is not. You can download the modified shutdown.sh file here👈.
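The behavior described above can be sketched roughly as follows. This is not the modified shutdown.sh itself, only an illustration of the prompt-then-collect flow. The SUPPORT_SCRIPT path and the function names are assumptions, since the actual script name inside the support tools repository is not stated here.

```shell
#!/bin/sh
# Hypothetical location of the support data-collection script (an assumption).
SUPPORT_SCRIPT="${SUPPORT_SCRIPT:-./atlassian-support.sh}"
SUPPORT_URL="https://bitbucket.org/atlassianlabs/atlassian-support"

# Return 0 when the answer means "yes" (y, Y, yes, YES, ...).
is_yes() {
    case "$1" in
        [Yy]|[Yy][Ee][Ss]) return 0 ;;
        *) return 1 ;;
    esac
}

# Run the support script if it is present and executable;
# otherwise tell the admin where to get it.
collect_outage_data() {
    if [ -x "$SUPPORT_SCRIPT" ]; then
        echo "Collecting diagnostic data with $SUPPORT_SCRIPT ..."
        "$SUPPORT_SCRIPT" || echo "Warning: data collection failed; continuing." >&2
    else
        echo "Support tools not found at $SUPPORT_SCRIPT."
        echo "Download them from $SUPPORT_URL before the next outage."
    fi
}

# Ask the question; anything other than yes (including no input) skips collection.
outage_prompt() {
    printf 'Is this shutdown in response to an outage? [y/N] '
    read -r answer || answer=""
    if is_yes "$answer"; then
        collect_outage_data
    fi
}

# In a modified shutdown.sh, call this before the existing shutdown logic:
# outage_prompt
```

Defaulting to "no" keeps routine maintenance shutdowns fast, while a failed data collection only warns, so it never blocks the shutdown.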

We believe that these two simple changes will go a long way toward ensuring that teams consistently capture the data they need to find the root cause of their outages.