Crowd Bootstrap Failed - The Crowd database is being updated by another instance

I recently upgraded to Crowd 3.0.0. On startup Crowd loads fully, but shortly afterwards it logs this error:

ERROR [atlassian.scheduler.core.JobLauncher] Scheduled job with ID 'com.atlassian.crowd.manager.cluster.ClusterSafetyManager-checkSafety' failed
java.lang.RuntimeException: The Crowd database (jdbc:postgresql://localhost:5432/crowd) is being updated by another instance. The instance IP is 127.0.0.1. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

I have a single-server deployment, not a clustered one. I found an article describing a similar issue in Confluence, but its resolutions did not apply to Crowd.

I also note that I'm connecting via localhost in my database connection string, while the error reports the instance IP as 127.0.0.1. Those are of course the same host. I tried replacing localhost with 127.0.0.1 in the connection string in case that was the issue, but I get the same error either way (the only difference being that localhost in the message above is replaced with 127.0.0.1).

This seems to be a situation where the Crowd Server is blocking itself, but I'm not sure where to go or how to troubleshoot.

6 answers

All my test instances are using localhost in the JDBC connection string so I think we are ok there.

Crowd thinks another instance is connecting to its database. We call this state a "cluster panic".

Please make sure there is only one entry in the table cwd_cluster_safety by running this query:

select * from cwd_cluster_safety;

There should be only one record in that table. If there are two, shut down Crowd, back up the database and delete one of the rows. If you would like the syntax to delete a row, please post the results of the query.

In my testing, I only had one record in the cwd_cluster_safety table and when I shut down Crowd it persisted. Because it is a test instance I experimented by deleting the only row and Crowd re-populated it on startup and seems to be running fine. I have to emphasize, since this seems to be your Production instance please back up the database before running any SQL against it.

Thanks Ann. There's only one entry in the cwd_cluster_safety table. I have a backup of the database from before the upgrade, so I tried the same thing yesterday before posting -- I shut down Crowd and just deleted the one entry from the cwd_cluster_safety table. On startup it repopulated, but I was still getting the same error.


It's a Linux server. I do have other java applications on that server (I have other Atlassian products), but crowd is not running. Running the grep for java apps shows confluence & jira, but not crowd. Running the grep for crowd only returns the grep command.

When I perform the upgrade for Crowd my first step is to stop previous instances and move the install directory so I don't inadvertently start it up or start up the wrong instance.

I'm also running Crowd in the foreground for testing so I can make sure nothing else is starting up (it happens whether I run it in the foreground or not though).


If you don't ever plan to run Crowd Data Center (clustered instance) and you are confident you will not accidentally spin up a test instance against the database, then we can disable the scheduled job that checks the cluster safety table.

If you are interested in that strategy, please let me know.

It will have to be done directly in the database. I will need to do some testing to see which job it is (I have to delete a value, spin up two instances and make sure the second one starts despite the first one being pointed at the database).

In the cwd_cluster_job table there are two jobs I need to test one at a time to see which one, or whether both, need to be disabled: "clusterMessageReaperJob" and "clusterNodeInformationPrunerJob". I am betting on "clusterMessageReaperJob".

It will be later today or tomorrow morning before I finish testing if you want me to try it.


@Ann Worley We have the same issue as well. Can you provide the information needed to disable the job?

We run Crowd behind a proxy and have configured it to listen only on 127.0.0.1. But according to the cwd_cluster_safety table, it expects requests from the public interface IP address. Could this cause the error?


The IP address shouldn't really matter. It's put into that table just for informational purposes, so that when the problem occurs we can display it. The only case that should trigger the error is multiple instances writing to the same database.

Have you verified there are no other Crowd instances running that might cause the issue, or an old instance that's still active?

Are you also using postgres? Which version?

To diagnose further, please try enabling logging, by adding the line:

log4j.logger.com.atlassian.crowd.manager.cluster=TRACE

to your <CROWD_INSTALL_DIR>/crowd-webapp/WEB-INF/classes/log4j.properties file, and restart Crowd. This should output extra information about the check to the logs. These should look like this:

Thanks for the logs. They do seem to show two instances of the ClusterSafetyManager running at once, which could be causing the issue.

Is this a fresh installation of Crowd? Which distribution did you download (zip, tar.gz or war)? Did you make any changes to config files in the Crowd installation directory (specifically web.xml)? Are you using any custom integrations or add-ons? Which JVM are you running?

I'd also be grateful if you could send the full logs to lpater@atlassian.com.


As Ann mentioned, this is triggered by a check that tries to determine whether multiple Crowd instances are accessing the same database. If it decides they are, it shuts down Crowd to prevent data corruption.

This check was introduced in 3.0.0. Every 2 minutes, the instance reads a value from the 'cwd_cluster_safety' table, compares it with the value it expects to find, and writes a new value there.

If the value read from the table differs from the one previously written the instance shuts down with the message you mentioned. This shouldn't be happening if only a single instance is running.
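
To make the mechanism above concrete, here is a minimal Python sketch of the check as described: a single-row table and a checker that compares the stored value with the one it last wrote. It is only a model, not Crowd's actual code. `SafetyTable`, `SafetyChecker` and `run_cycles` are hypothetical names, and the random token and the skipped comparison on the very first cycle are assumptions.

```python
import uuid


class SafetyTable:
    """Stand-in for the single-row cwd_cluster_safety table (hypothetical model)."""

    def __init__(self):
        self.value = None


class SafetyChecker:
    """One scheduled checker; the name is illustrative, not a Crowd class."""

    def __init__(self, table):
        self.table = table
        self.last_written = None  # value this checker expects to read next time

    def check(self):
        """Run one check cycle; return False if someone else changed the row."""
        current = self.table.value
        if self.last_written is not None and current != self.last_written:
            return False  # "cluster panic": the row changed between our cycles
        # Assumption: each cycle writes a fresh random token and remembers it.
        self.last_written = uuid.uuid4().hex
        self.table.value = self.last_written
        return True


def run_cycles(checkers, cycles):
    """Run every checker once per cycle and collect the results per cycle."""
    return [[checker.check() for checker in checkers] for _ in range(cycles)]
```

In this model, a single checker passes every cycle. Two checkers sharing one table both pass the first cycle, and then the first one fails on its second cycle because the other checker overwrote the value in between. The model does not care whether the two checkers live in separate processes or in the same one.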

I'd advise double-checking that there isn't a leftover instance still running, and that no other process is accessing the database and changing the table in this manner. Try running `ps aux` as a superuser to verify the process list, and search for Crowd instances (there should be exactly one). Enabling database query logging and checking the queries made against the table might also help you find the culprit.
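
If you would rather script the process check, the following is a rough Python sketch that filters `ps aux` output for Crowd processes. `find_crowd_processes` is a hypothetical helper, and the assumption that a Crowd process has "crowd" somewhere on its `ps` line (typically in the install path) may not hold on every setup. It would also match an unrelated process owned by a user named "crowd".

```python
def find_crowd_processes(ps_output, marker="crowd"):
    """Return ps lines that mention the marker, skipping grep itself.

    Assumption: the Crowd JVM can be recognised by `marker` appearing
    somewhere on its ps line, for example in the install directory path.
    """
    matches = []
    for line in ps_output.splitlines():
        lowered = line.lower()
        if marker in lowered and "grep" not in lowered:
            matches.append(line)
    return matches
```

The text to pass in can be captured with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. Exactly one matching line is the expected result for a single-server deployment.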

If you still have issues, please reach out to Atlassian Support, so that we can investigate further.

Yesterday while testing I rebooted the server just to make sure there was nothing left in memory. I have confirmed by running ps aux as superuser that there is only one Crowd process in the list, but I did not go as far as enabling database query logging.

Just wanted to note here that it's definitely not an issue of multiple instances, or of the same instance running twice.


Additionally, when I tail the logs I can see an error occurring, after which Crowd seems to restart itself (without spawning a new Linux process).

Filtered logs:

2017-08-24 10:22:09,640 ContainerBackgroundProcessor[StandardEngine[Catalina]] ERROR [atlassian.event.internal.AsynchronousAbleEventDispatcher] There was an exception thrown trying to dispatch event [com.atlassian.plugin.event.events.PluginFrameworkShutdownEvent@256fdd8e] from the invoker [com.atlassian.plugin.event.impl.MethodSelectorListenerHandler$1$1@68ff6095]
java.lang.RuntimeException: java.lang.NullPointerException

[...]

2017-08-24 10:22:09,664 ContainerBackgroundProcessor[StandardEngine[Catalina]] INFO [com.atlassian.crowd.startup] Stopping Crowd
2017-08-24 10:23:11,433 Caesium-1-2 ERROR [crowd.manager.cluster.ClusterSafetyManager] The Crowd database (jdbc:postgresql://localhost:5432/crowd) is being updated by another instance. The instance IP is 172.18.136.20. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

nginx, Crowd and Postgres are all on the same server, with Postgres listening on localhost only. After upgrading from 2.7.3 to 3.0 I get this error a few seconds after Crowd starts:

The Crowd database (jdbc:postgresql://127.0.0.1:5432/crowddb) is being updated by another instance. The instance IP is 127.0.0.1. Please make sure all the instances connected to this database are Data Center instances and have clustering enabled.

After that error I checked the cwd_cluster_safety table, and there is only one record, from 127.0.0.1. Of course there is only one Crowd instance.

Maybe the reverse proxy from nginx is causing this?

I just upgraded my Crowd instance following the Automatic Database Upgrade method, but I had to roll back because of that bug.

Please tell me how to disable that scheduled job as well.

Thank you.


occurring shortly after each other (there are no entries about Crowd being stopped in between).

This indicates that even though there is only one java Crowd process as returned by

$ ps -ef | grep java | grep -v grep | wc -l

there are two (or possibly more) Crowd applications deployed on the same application server (tomcat by default in Crowd standalone distribution).

Since running more than one instance of Crowd against the same database can become a problem, could you check your logs for the entries mentioned above, and check whether you have more than one Crowd application deployed to the same application server? (One way to check is to connect to the Tomcat Java process with jconsole.)
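
As a rough aid for that log check, here is a small Python sketch that pulls the timestamps of log lines mentioning ClusterSafetyManager and flags consecutive entries that are much closer together than the 2-minute check interval. The function names and the 60-second threshold are my own assumptions, and the timestamp layout is taken from the log excerpts posted in this thread. It also assumes one matching line per check; if a single check logs several lines at TRACE level, narrow `needle` to one specific message first.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout as seen in the log excerpts in this thread.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def safety_check_times(log_lines, needle="ClusterSafetyManager"):
    """Collect timestamps of log lines that mention the safety manager."""
    times = []
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match and needle in line:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def suspicious_pairs(times, min_gap=timedelta(seconds=60)):
    """Return consecutive entries that are closer together than min_gap.

    Assumption: one instance logs roughly one entry every 2 minutes, so
    entries well under that gap suggest more than one checker is running.
    """
    ordered = sorted(times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < min_gap]
```

Reading the Crowd log file line by line into `safety_check_times` and passing the result to `suspicious_pairs` gives a list of timestamp pairs worth a closer look. An empty list is what I would expect from a single deployed application.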

I would greatly appreciate your reply.

