We have a workflow model that sometimes has running instances that get stuck on a service task. The service task is configured with Asynchronous Before. This seems to happen randomly, but once an instance of the model gets stuck in this way, all subsequent instances of this model also get stuck at the same service task. Here's the really interesting behavior: if we make a new deployment of this model (with no changes to the model at all), we can then run instances of this new deployment successfully. However, the instances of the previous deployment that were stuck in the service task remain stuck there.

Our environment is as follows:

- Camunda 7.5.0
- Running in the Apache Tomcat container
- Using the standalone engine configuration
- Using a MySQL database running on Amazon RDS

The relevant portion of the model is shown below. The service task named "Redact Uploaded Documents" is the task that gets stuck. This task is an http-connector that issues a POST to one of our own application endpoints. We have extensive logging in our application and the logging indicates that our endpoint is not getting called. We have also examined the catalina log and find no errors there. So it appears that this service task is not getting executed at all.
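For reference, a service task like this is typically wired up along these lines in the BPMN XML (a sketch only; the IDs and endpoint URL here are invented for illustration, not taken from the actual model):

```xml
<serviceTask id="redactUploadedDocuments" name="Redact Uploaded Documents"
             camunda:asyncBefore="true">
  <extensionElements>
    <camunda:connector>
      <camunda:connectorId>http-connector</camunda:connectorId>
      <camunda:inputOutput>
        <!-- hypothetical application endpoint -->
        <camunda:inputParameter name="url">http://our-app/redact</camunda:inputParameter>
        <camunda:inputParameter name="method">POST</camunda:inputParameter>
      </camunda:inputOutput>
    </camunda:connector>
  </extensionElements>
</serviceTask>
```

With asyncBefore set, the engine creates a job for this task, and it is the job executor (not the token arriving) that actually runs the connector.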

I didn't review the model for errors... However, parallel gateways don't always behave as you might expect, due to the engine's database transaction behavior and requirements.

I did see the async continuation. But the engine's database transactions will still serialize work as the transaction boundaries demand.

A simulated 'parallel' of sorts can be achieved with the async settings in Camunda, but these are really transaction-boundary configuration. The documentation covers this topic in good detail.

I also ran into this issue.

For true parallel behavior, I ended up looking into taking advantage of the application container's executor service (on WildFly 10): configure the BPMN model to illustrate the process flow, keep the database requirements light, and offload the actual work to the executor service. The next problem, though, is collecting the results. The intermediate message event doesn't (yet?) support persistent message-event subscription. But there are workarounds, given that Camunda runs on current platforms; you could mix and match frameworks to meet your requirements. Specifically, I'm looking at Camunda with Apache Camel plus robust messaging.

Thank you for the suggestion. I tried the suggested POST and it did cause the job to execute, thereby getting this instance out of the "stuck" state.

Does that tell you anything about why these workflows are getting stuck on this task? Surely in an operational system, we can't be expected to somehow monitor running instances and issue this request if a workflow instance is stuck. I don't see any reason why the engine is not executing this job. This feels like a bug in the workflow engine.

Thank you for your reply. I have previously read the documentation on this topic. We don't really need true parallel behavior. In this case, we have a user task with our own custom form and UI that displays different status based on the status of tasks in the other path of the parallel gateway. We needed to make the service task async so that the user task properly displays the status information.

But none of this really explains to me why the async service task is not executing at all. It seems to me that this is a bug in the workflow engine that is causing this task to never execute.

Do you use an embedded or shared engine? Do you know whether your Job Executor is running? How many process instances do you have from this deployment? Do you observe that the jobs are executed after some time, like 1 or 2 minutes?

We are using a shared engine. I don't know if the Job Executor is running; can you tell me how I would check this? From the deployment that is failing, there are 11 running instances. No, the jobs are not executed after some time. They are stuck forever.

As I previously mentioned, the only way to get subsequent jobs to run is to make a new deployment of the workflow model. Then new instances run ok, but the old ones are still stuck forever.

We are running our application, including the Camunda engine, using virtual machines on Amazon Web Services. In our QA environment, we often bring down deployed instances and then bring up new instances whenever we deploy new code.

Also, we haven't touched the jobExecutorDeploymentAware setting, so this is true by default (as you noted).

So, could it be possible that the following is happening?

1) We have a virtual machine up in AWS.
2) We deploy a workflow model to this machine.
3) Instances of this model run fine.
4) We bring down this machine and bring up a new one.
5) Now when we try to run instances of this same deployment, the async service tasks don't run because of the deployment-aware setting. This is a different machine, so these jobs won't run on it because they are associated with the deployment registered on the machine that was brought down.
6) We observe that if we deploy the exact same model, then instances from the new deployment run fine, because now the deployment is associated with this new virtual machine.

Does this seem to you that this is the source of our problem? If so, then I guess we need to set jobExecutorDeploymentAware to false.
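If that turns out to be the cause, the property can be turned off in the engine configuration. For a shared engine on Tomcat, this would go in bpm-platform.xml, roughly as below (a sketch; the engine name and acquisition name shown are the common defaults, and any other properties your configuration already sets would remain alongside this one):

```xml
<process-engine name="default">
  <job-acquisition>default</job-acquisition>
  <properties>
    <!-- stop filtering job acquisition by registered deployments -->
    <property name="jobExecutorDeploymentAware">false</property>
  </properties>
</process-engine>
```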

Can you tell me exactly how nodes are identified in a deployment aware configuration? Is it based on the IP address of the machine or is some other id used?

How do you deploy processes? As part of a process application or directly via REST/Java API?

Stephen_Bucholtz:

Can you tell me exactly how nodes are identified in a deployment aware configuration? Is it based on the IP address of the machine or is some other id used?

This is not a centralized configuration. Instead, each engine (aka node, if you have one per node) keeps a set of deployment IDs (i.e. the database ID_ fields in ACT_RE_DEPLOYMENT) of deployments that are registered with it and uses that while acquiring jobs. There is no concept of node/machine/engine identifiers.
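The acquisition behavior described here can be illustrated with a toy model (plain Python, not Camunda code; all names are invented for illustration). The key point is that the set of registered deployment IDs lives in the engine's memory, so a freshly started machine knows nothing about deployments made on a machine that has been terminated:

```python
# Toy model of deployment-aware job acquisition. Each engine node keeps an
# in-memory set of registered deployment IDs and only acquires jobs whose
# deployment ID is in that set.

jobs = [  # simplified stand-ins for rows in ACT_RU_JOB
    {"id": "job-1", "deployment_id": "dep-A"},  # created by the old machine
    {"id": "job-2", "deployment_id": "dep-B"},  # created after redeploying
]

class EngineNode:
    def __init__(self):
        # registrations are in memory only -- lost when the VM goes away
        self.registered_deployments = set()

    def deploy(self, deployment_id):
        # deploying (or redeploying) registers the deployment with this node
        self.registered_deployments.add(deployment_id)

    def acquirable_jobs(self, jobs):
        # the acquisition query filters by the registered-deployment set
        return [j["id"] for j in jobs
                if j["deployment_id"] in self.registered_deployments]

# Machine 1 deployed dep-A and was then terminated. Machine 2 comes up fresh
# and only ever registers dep-B (the redeployment):
machine2 = EngineNode()
machine2.deploy("dep-B")

# job-1, from the old deployment, is never acquired -> "stuck forever"
print(machine2.acquirable_jobs(jobs))  # -> ['job-2']
```

This matches the observed symptom: instances of the new deployment run, while jobs of the old deployment sit in the database untouched until someone executes them manually or registers the old deployment again.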

Aha, this gives me a clue as to what might be happening. We are running our application, including the Camunda engine, using virtual machines on Amazon Web Services. In our QA environment, we often bring down deployed instances and then bring up new instances whenever we deploy new code. Also, we haven't touched the jobExecutorDeploymentAware setting, so this is true by default (as you noted). So, could it be possible that the following is happening? 1) We have a virtual machine up in AWS...

@menski I do not think this is related "specifically" to the Docker image, but I guess it's possible.

This is based on @thorben's comments above and the documentation about the deployment-aware setting (which is on by default).

The steps to reproduce:

Deploy the default Docker container setup.

Deploy a BPMN model with a few automated tasks, such as a script task. Make the start event and all tasks async.

Run the BPMN.

Stop the container.

Start the container.

Run the BPMN again.

On the second run of the BPMN, the behaviour we are seeing is that the process gets stuck on the start event and does not move forward: the job executor no longer runs jobs for that process deployment. If you redeploy the BPMN, the problem is resolved. It looks like the deployment-aware behaviour that @thorben describes here:
forum.camunda.org

Ok, then either deactivate the deploymentAware setting or register the deployments manually. For process applications, the process engine has logic that detects previous versions of a deployment and makes the respective registrations. For standalone deployments, that is not the case.