I'm reaching out to the community for advice from anyone who has a high volume, clustered environment, preferably using WildFly. We're experiencing performance issues with current volume and the projected volumes are a lot higher.

Peak message volume: 3300 per minute (aggregate of all sources as the load balancer)

Project Daily Volume of Process Starts: 5.5 million

Average process duration under "no load" conditions: 14 seconds

Average process duration under full load: 4+ hours

Average Size of Records Per Message: 100 KB+- This is simply the amount of storage consumed divided by the number of process instances started

The bulk of what most processes do is make calls to external APIs via REST or SOAP. There is very little work (e.g. complex calculations) done by the processes. The bulk of what a process does is the following:

Get a start message

Parse the contents

Create an API REST/SOAP request

Send the request through a gateway

Parse the API request response

Make a decision as to what to do next based upon the response

The above can repeat any number of times. Most process "execution" time is spent waiting for the response from the API call.

While I know there are innumerable factors that influence performance, I'm hoping someone out there can provide some general guidance here in terms of the number of servers, configuration, expected performance, etc. based upon their own experience.

Our goal is that any process start request that hits our platform should start and complete within 5 minutes plus the normal "no load" execution time of the process itself. In effect, we can allow for about 5 minutes of queuing.

Assuming you're not seeing any sort of competition for scarce computing resources... And, typically I see performance issues associated with a combination of unnecessarily serialized operations accompanied with latency (DB, Network, etc.).

For example, thousands of outbound ReST calls... none pulled - each requiring a new connection. This isn't an issue unless you're looking at many thousand requests. Noting that inbound requests can also stack up - though not sure on bottleneck specifics (obviously different approach here).

Another typical issue is a DB bottleneck. Not enough datasources/connections in the pool. Consequently, requests stack up. And, as new sessions back up in line for datasources, you see thread consumption, thread waiting... Each potentially not an issue until the system is driven into massive performance profiles. For example, the latency (slowdown) spikes up very quickly when the datasource pool is exausted.

Review implementation architecture:Using threads/multi-process provides massive increase in performance. But... make sure your code can handle async operations. Alternatly focus on multi-process (breaking apart functions into a distributed work-load). There are several new Java libraries available to help with the tedious details. The most enjoyable is rxJava followed by Streams (etc.).

We don't actually know where the bottleneck is as we haven't engaged in sufficiently controlled testing. That's something we're going to do next week. It will probably reveal the root cause and then we'll be able to focus on addressing that.

History - what level is the history set to? In your case I would consider setting the history level to ACTIVITY

Process Variables - Do you have a lot of process variables per process instance or just one large json structure? More or larger process variables mean more work for the DB I/O system.

Async continuation boundaries - is the process running in the foreground client thread or in a background job executor thread, or both? This could influence if you need to tune job executor or rightsize threadpools. Are your process transaction coarse grained or fine grained? (The larger you can make them, the less work the engine needs to do...)

NOTE: This is a little too generic a response... I'll follow-up later with some details.

Key here is to put together some telemetry/metrics and a suitable test environment to run the base-line profile(s). The system needs to be stressed to breaking point so the profiles are clearly understood. Otherwise... you'll end up thrashing at moving targets. A common problem given that early BPMN work, by nature of Agile dev' methodology, backlogs this effort for later follow-up - reasonably justified, given early code may not be ready for performance testing (however - IMHO, not so "reasonably").

BPM engines generally tend to run into high i/o situations when managing large numbers of BPMN event objects and serialization (ORM, etc). This is primarily resulting from implied heavy demands on DB i/o.

Another common problem is extending a process-app' while not regularly running profile/performance tests - tracking the overhead. Referring to this as the "magic" behind executable BPMN because marshalling events, objects, etc. into a store capable of managing long-running business transactions is... complicated. Big process applications may overlook the details and over-tax process-depended system resources.

A classic anti-pattern, or archetype in systems-terminology, is Tragedy of the Commons:

Lack of awareness on overall resource demands leads to general system collapse as total consumption depletes availability (Senge 1990).

Highway Gridlock Example:Each driver on their way to work requires additional room within a fixed highway system. Traffic slows to a crawl during rush hour demonstrating how each person’s commute participated in creating a parking lot from a highway system (Senge 1990).

Looking back to BPM-engines... A classic example here (on my check-list during most performance related reviews) is to look for unexpected extension of the XA-transaction context into unintended resource managers. For example, spotting transaction annotations in ReST clients or consumers. Additionally, what causes trouble, is simply re-using Camunda's built-in data-source rather than creating one specific per process-app'.

Here's an example of Camunda's data-source reference from the WildFly configuration:

So, unless the transaction is actually vital or critical to Camunda's system DB, I recommend avoiding the "jndi-name="java:jboss/datasources/ProcessEngine" datasource for reuse, or extension, into non-system areas such as custom process app's. Goal is to let Camunda use this datasource exclusively for its needs - don't mix process-application DB transactions unless its both noted and a key requirement to your design.

I've been digging deep into MySQL performance tuning and monitoring tools and there's a huge swath of possible changes that can be made depending upon the configuration.

My #1 objective in testing is to determine whether or not we have literally run out of server I/O capacity. If that is the case, no amount of tuning is going to help. One challenge here is that observing what's happening in detail on both the Camunda and database servers, and correlating all of it imposes its own overhead because of the volumes we would be pushing.

One final top-level thought that has some traction in our team is that we feel Camunda has not indexed as much as they should. This is not based upon any "smoking gun" evidence, but MySQL Enterprise Monitor is constantly complaining about inefficient index use. In previous versions of Camunda (7.4.2 was the last we tried this on), we were getting awful Camunda Cockpit performance on a large instance, so we tracked down the tables used, added indices and performance was 10X better. This would explain some, but not all of our issues. However, I've always operated under the assumption that Camunda have already optimized the schema (and related SQL scripts that create them), but perhaps they never anticipated the volumes we are running.

Here are a couple of details on our setup:

Primary server acts as a REPLICATION MASTER to another server on the same subnet. Replication is asynchronous.

The database server is only used for Camunda and nothing else. Camunda has complete control of the database. The only things accessing the database are: Camunda BPMN engine/REST API, Camunda adminstration GUI web application, MySQL Enterprise Monitor Agent.

We do not use an XA datasource.

We do use a largish connection pool (minimum 100 connections) and everything uses SSL (this is required).

Transactions tend to be extremely active in that every task is generally waiting for something (usually and external web service call) to complete, and will then move on to next activity. There are no intentionally long running processes (e.g. human interactive tasks) currently. Everything executes as fast as the servers will allow.

Here's a sample of our datasource configuration, which I have modified to remove proprietary information as well as to make it easier to read. We use the commercial MySQL Connector/J driver. We are also going through a local load balancer which supports a vURL, which points only to a single database server. The load balancer acts as a "switch" in the event rapid failover is required.

One question I can't figure out is, do each of the core Job Executor threads actively "poll" the database for a task? If that were the case, then setting a high "core-thread" count would result in useless chatter on the database server. In our case, if we get a backlog of 100,000 active processes, each of which actually wants to do something or is actively waiting for its task to complete, then where do we go?

One challenge we face is a classic "network engineering" problem of how much do you build to peak hour volume. To be sure, it depends upon your SLA, but economically one typically wouldn't build for volumes that occur only 1% of total operation time unless an SLA dictated it. If we need truly deterministic behavior like you would expect from a real-time operating system or application, then we would need to have a very fine grained understanding of the performance of server element of every task and then establish and a performance envelope around that.

My gut tells me what we're going to be doing is going to a truly clustered database that provides horizontal (adding servers) scalability. The problem with most "clustered" databases is that they do not actually perform as a single, unified instance (i.e. share most), but rather as a group of individual instances that use a high-performance synchronization process to keep them in sync. Camunda's READ COMMITTED isolation level requirement, which I completely understand, imposes additional challenges, particularly when a node in the cluster gets slightly behind. In this scenario, a query could be sent to it that would return a value of a committed transaction that did not reflect the current value elsewhere on the cluster. READ COMMITTED only guarantees you only get results from committed transaction and you don't get "dirty" reads. It does not guarantee that the value of a particular row on a specific node of the cluster is in fact the latest value.

The consequence of this is that if Camunda or your process logic is dependent upon "serial" execution of database updates based upon the finite, unique time they occurred (no two events truly occur at the same time, though the resolution of time keeping on the server dictates a practical limit to how you can confidently "order" transactions), then you could have problems. In essence, the state of any row in the database must be the same on every server at the same time, which is not possible with clustering technology like Galera or MySQL's new GROUP replication. In fact, the only MySQL technology I know of that provides a truly monolithic, scalable instance, is MySQL Cluster NDB, which we've never used.

I'm sure this is going to be a tedious, iterative process. I'll share as much of my experience, tooling, and results with the community as I can.

As always, thanks for your advice and your contributions to the Camunda community.

One question I can't figure out is, do each of the core Job Executor threads actively "poll" the database for a task?

No (At least not in the Tomcat case and I believe its the same pattern for other app servers). Each job executor has a single job acquisition thread. The logic is count number of ready jobs -> lock and acquire a subset -> dispatch jobs to the job executor thread pool -> repeat.

Thus the analogy I use is the job executor is a lot like a steam train. The executor thread pool is like the boiler, the jobs to execute table is like the coal supply, the job acquisition thread is like the engineer shoveling coal.

The (Tomcat) detail is as follows. Provided there are jobs ready to run, the job executor will keep locking a subset (jobs per acquisition cycle) and dispatch to the executor thread pool. When a job is dispatched to the thread pool, the next free thread will execute the job. If there are no free threads, the job will be queued (in memory). If max jobs are queued, the acquisition thread will back off. Thus the size of jobs acquired each acquisition cycle is a critical factor - too large and you get blocked by the thread pool, too small and you throttle throughput via repeated round trips to the DB. Note in a clustered environment job acquisition threads may compete with each other. Thus if the job acquisition thread cannot lock all the jobs it requests, it assumes it was competing against another node and thus backs off to avoid getting in lockstep with another node.

As an aside, if DB tier is saturated and you cant vertically scale, there are large production instances using a sharded architecture. For example Zalando use a sharded architecture with something like eight shards...

We have full history and are going to consider turning it down. We are initially sending a large JSON object into the system to get around the 4K variable (string) size limitation. The processes themselves then break down the object and extract what they need. These are instantiated into process variables.

The size of each set of process records, if I can base it on the number of process instances divided into the total disk space consumed (and I realize this is not a good measure), puts each process' total history size at around 100 KB (very roughly). In 4 weeks we went from a fresh database to 280+ GB and this was not under projected loads.

I am concerned about the threads, but I'm not sure the issue is "starvation" as much as the database simply cannot process the load. There are so many factors at play here that the only rational way forward is to set up a rigidly controlled test environment and start adjusting until we see improvements or identify the issue(s).

I continue to wonder if Camunda considered the volumes we will be moving (5.5 million start requests per day or more, with peak load of 3300 requests per minute) when they tested and if we're simply never going to be able to carry them and get reasonable execution times.

When you refer to coarse or fine-grained transactions, what do you mean?

When you refer to coarse or fine-grained transactions, what do you mean?

What I mean is say your process needs to perform three API calls in sequence, ie Task A->Task B-> Task C. You could configure this as three async continuations, this is what I mean by fine grained, or all three tasks could be performed one after the other in a single thread of control, this is what I mean by coarse grained.

Fine grained puts load on the DB as each task requires a trip to the DB at the start of the task and a trip to the DB at the end of the task. In addition, the process will require at least three job acquisition cycles to execute. On the other hand, the coarse grained approach has one trip to the DB at the start of the tasks and a trip to the DB at the end after all three have been run..

So why choose one over the other? My principles are more coarse grained is preferred over fine grained for performance and throughput. Use fine grained where an API or task is not idempotent and thus you want to minimise the chance of, or damage caused by retry semantics...

Good details. Since I'm catching up on this today I'll focus on the following...

Tasks are waiting on a reply back from web-service calls (ReST or SOAP clients):

mppfor_manu:

Transactions tend to be extremely active in that every task is generally waiting for something (usually and external web service call) to complete, and will then move on to next activity.

Assuming, given the term "waiting", these are not asynchronous. In-other-words, these task implementations (i.e. java delegates) are synchronous JAX-RS/WS clients.

It sounds like the BPM-engine is invoking synchronous Java web-service client requests against observably long-running SOA-provided operations. Given throughput requirements, this isn't good - though reasonably corrected. Additionally, beefing up both BPM container and resource (DB, etc.) capacities doesn't necessarily address the core problem that we're trying to scale up, from the start, a large collection of synchronous web-service client instances (task implementations or delegates).

Recommended next steps: 1) Review task-delegate source to determine web-service client architecture (sync/async)2) Estimate, per capacity requirements, what this means at scale. For example, how many tasks waiting per process instance? How many process instances? 3) If synchronous, and determined to lead or cause our performance bottleneck, refactor to request/reply EAI pattern or alternate workaround. I like messaging because it provides additional features for recovering state and message-event reliability (i.e. dead-letter, message time-out, transaction, load-balancing, etc. ).

Additionally, and given the need for high-capacity, I'd avoid using BPMN timers as a means of escalation on late service response. For example, rather than each task instance invoking a BPMN timer event, instead off-load this requirement to a specialized service. In this case, I'd load the events representing a SOA-client wait-state, into a message and set this message "time-out" per response escalation. And, this is only due to capacity (scale) requirements because I prefer the BPMN timer business-oriented representation.

Looking back on recommended next steps highlights our problem in that it appears that we have too much service-oriented function within current process implementation (such as binding BPM-task implementation into the role of Request-Reply services). It's because we're essentially looking to offload the "waiting for SOA response" back to the SOA layer (application services) - and, given that the act of "waiting" doesn't bode well for capacity/throughput requirements.

sending a large JSON object into the system to get around the 4K variable (string) size limitation. The processes themselves then break down the object and extract what they need. These are instantiated into process variables.

The BPM-engine's token data-store isn't an ideal place to manage records of this size. This highlights the difference, or argument, between BPM state management vs SOA. Though the BPM-engine plays a significant role, as process-manager, in the overall service-stack... this doesn't necessarily mean it's ideally suited for managing big JSON objects.

Recommended next steps:1) review JSON objects and ask "what's really necessary" for BPM-engine state management2) Offload large JSON object persistence to dedicated systems. For truly BIG capacity... maybe even look to grid-computing. We have options given the interest in "big data" (old problem... new name). Interesting topic for a later discussion.

In 4 weeks we went from a fresh database to 280+ GB and this was not under projected loads.

Referring back to above point regarding the management of "heavy" JSON objects.

Final note... (before I'm distracted with other work): Adding DB indexes may only provide a temporary, short-lived fix. Apologies if this sounds obvious, but adding an index will actually increase time required for storage (volatility). So, you'll see a short-lived performance increase that is later followed by an increasing latency reflecting additional effort now required for index-management. I'd enjoy sharing some stories on this ONLY because the recommendation to add indexes is typically made by our DBMS experts prior to code, or transaction/function, review. NOT saying this is the wrong approach... just saying.

Honestly, Gary, I'm sort of overwhelmed by your generosity and advice. I can't thank you enough and hope I can make a similar contribution to the community, though I doubt it will ever be at the depth that you do. I don't have a formal computer science background (I have a degree in theater), but 35 years of working with computers gives me just enough knowledge to occasionally sound like I know something.

That said, while I'm still a bit unclear on the difference between synchronous and asynchronous operations, I think what you're saying is that it's better for a process "fire and 'forget'" than to directly "supervise" an operation (i.e. a REST request) during its lifetime.

We use a gateway that we designed and built that masks all external services into a common language construct (think something like a WSDL on steroids, but with JSON). Basically if we can get characters to and from the endpoint, the process can talk to it. In the past I've suggested to my team that we consider sending the request and then letting the process use a catch event to consume and process the response, rather than holding open the session during its lifetime. This requires the gateway to maintain knowledge of the process instance ID so that it can return the response to the message interface of the "sleeping" process.

Part of what drive my thinking on indexing is past experience in 7.4.2 where we saw dramatic performance increases and partly from MySQL Enterprise Monitor, whose alarms constantly suggest we're not using indexing properly. I'm disinclined to mess with Camunda's schema because that means you must remember to include the modifications in every subsequent new installation.

As to the heaviness of our JSON objects, I'm not actually in control of that entirely. We work in a highly regulated industry and as such may be required to maintain detailed records of operations, including every state of a variable as a process executes. I really can't say at this point, but a shift to a reduced JSON object footprint may be non-trivial as it puts the burden of parsing out only what you need upstream of the process itself. In other words, we may just be shifting the problem. However, our architecture allows that upstream function to scale horizontally, whereas the database cannot currently do that.

I've been in contact with Oracle and after a review of our options, the one truly monolithic, inherently scalable solution they offer cannot support our requirements. MySQL Cluster NDB has a 14KB per row size restriction that we easily exceed. So in spite of the fact that it had enormous hardware requirements, it would scale horizontally without any issues.

My one concern with clustered databases is that you can have a READ COMMITTED isolation level on a cluster and still have a query read a row from one node that is actually not the most recent "version" of the row. For example, if a transaction commits on one node in a heavily loaded cluster, other nodes may not see that update immediately. However, a previously committed transaction on a different row may provide what the database considers to be a "clean" read of old data.

Camunda have "certified" MariaDB/Galera cluster for use with 7.6+. However, I wonder if they considered the preceding, and if they did, what effect it has on the integrity of operations, particularly when logic of a process is based upon knowing the "current" state of a row. Oracle wondered if Camunda would be willing to certify their GROUP REPLICATION technology. Oracle have also just released InnoDB Cluster, which looks like a fusion of GROUP REPLICATION and the MySQL Router technology. My gut tells me that the MySQL folks realize they don't have a truly WRITE scalable solution in the non-NDB space and so they're trying to plug that hole.

A final note to file away somewhere. Yesterday a system administrator came back to me about my database server and informed me that there's a bug in the VMware vmtools management utility that can have a profoundly negative effect on I/O performance if the utility is "out of date". Now why this is we don't know, but he said on one database server he worked came under heavy load and nearly fell over and died. Once he upgraded the utility, it ran great. We've updated every one of our servers.

As always, thanks for your time.

Michael

PS: Hey Camunda, you can weigh in here too (maybe they're scared to get involved).

My one concern with clustered databases is that you can have a READ COMMITTED isolation level on a cluster and still have a query read a row from one node that is actually not the most recent "version" of the row. For example, if a transaction commits on one node in a heavily loaded cluster, other nodes may not see that update immediately. However, a previously committed transaction on a different row may provide what the database considers to be a "clean" read of old data.

Camunda have "certified" MariaDB/Galera cluster for use with 7.6+. However, I wonder if they considered the preceding, and if they did, what effect it has on the integrity of operations, particularly when logic of a process is based upon knowing the "current" state of a row.

Hey, you know we love you guys, I was only kidding about the "scared" part.

The real question is, did my interpretation of this possibility seem plausible? READ COMMITTED does not mean it is consistent across an entire cluster, only that a transaction on a particular node is committed for the row being read. If the synchronization technologies used for Galera or MySQL GROUP REPLICATION are at best semi-synchronous, then a heavily loaded system could have two rows on different nodes that had fully committed data, but were actually different.

I cannot provide concrete evidence that this could happen. However, unless the database was specifically built for synchronous replication that produced atomic consistency(i.e. MySQL Cluster NDB) no matter how you access the data, I don't think any of these clusters actually meets your requirement in the strictest sense. Moreover, if they were to offer truly synchronous, atomic consistency for what is a "shared nothing" architecture, the performance would probably be worse than a standalone database as each node would have to wait for every node to complete every transactions before it could continue.

The one "arbitrary" way you that you might be able to mitigate this would be to have 100% of your reads occur on only a single node. Then its version of any particular row would be considered the only "current" version. The MySQL & MariaDB database driver supports this.

We may have to just accept the fact that there's a possibility of non-atomic consistency and try to minimize the amount of time this true.

If my understanding of any of this is flawed, please correct it.

By the way, I do believe Oracle would like Camunda to consider formal testing of their database. I can give you a contact if you are interested.

Hey, you know we love you guys, I was only kidding about the "scared" part.

No worries :). It is rather that text-intensive topics take time to follow.

mppfor_manu:

I cannot provide concrete evidence that this could happen. However, unless the database was specifically built for synchronous replication that produced atomic consistency(i.e. MySQL Cluster NDB) no matter how you access the data, I don't think any of these clusters actually meets your requirement in the strictest sense.

I was not involved in evaluating Camunda with Galera, but my understanding was that the settings listed in above link do indeed enforce atomicity in the entire cluster. Let us know if you have anything tangible that suggests another interpretation.

mppfor_manu:

Moreover, if they were to offer truly synchronous, atomic consistency for what is a "shared nothing" architecture, the performance would probably be worse than a standalone database as each node would have to wait for every node to complete every transactions before it could continue.

I agree. With the restriction Camunda makes, clustering is useful for failover but probably not for performance gains.

I've read the MariaDB driver documentation and the two modes permitted do not allow for load sharing. In effect, both failover and sequential permit the client to select a new "master" host within the cluster, but do not permit it to read from or write to more than one host at a time.

As I reflect on this, it makes sense based upon an assumption I made earlier about how one might mitigate the potential for reads from different nodes. Essentially, they're saying you can't do that. You must write to and read from the same node every time unless that node fails, in which case you can switch.

If this is the correct interpretation, then we potentially have a serious problem as it would require extremely powerful database servers.

If Camunda could confirm that this is the case, I would appreciate it. I would also like to know that if we do use another database (Oracle) that provides atomic consistency across the "instance" but still uses clustering technology, would it be acceptable?

Another team within the company has tested the MariaDB/Galera configuration. Their testing confirms what I suspected. MariaDB/Galera can only be used for failover, not scaling. If that is the case, then were are left with the conclusion that barring any unusual or custom configuration, Camunda will only use a single-server database. Scaling is a matter of the horsepower in that server and optimization of processes, database, Java container, and Camunda itself.

If a single server solution cannot be made to work within the preceding environment, you cannot use Camunda. And before anyone jumps all over me, I'm operating under the assumption that all of your processes must run within a single Camunda cluster.

The alternative that we are now considering is offloading the history to an external database, which would free up the Camunda database for task execution exclusively (plus other Camunda overhead). At a minimum, it would relieve part of the I/O burden. The 7.5 documentation at https://docs.camunda.org/manual/7.5/user-guide/process-engine/history/ indicates that such a configuration is possible without modifying Camunda itself.

We are also considering using a custom "function" that would read a process variable and output "debug level" variable states to a separate database. This would be something we could turn on and off as needed.

One final question? Can you change the history level of Camunda after the database has been initialized and first use commences? I know it has to be changed in the configuration and that you also have to go into the database and manually modify a value. However, I don't know what the consequences of this are.