Mistral didn’t log enough information about sending actions to the
executor and receiving them on the executor side, which made it hard to
debug situations where an action got stuck in the RUNNING state. This has
now been fixed by adding more log statements.

Fixed a backward compatibility issue: a change made in Rocky prevented
the ‘params’ property of a workflow execution from being None when
starting a workflow.

Transports are now cleaned up along with RPC clients. Fixed an odd failure
condition in the API server related to cron triggers and SIGHUP. The
parent API server creates an RPC connection when creating workflows from
cron triggers. If a SIGHUP signal arrives afterwards, the child inherits
the connection, but it’s non-functional.

Sometimes Mistral raised DetachedInstanceError for action definitions
coming from the cache. It’s now fixed by cloning objects before caching
them.
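
The idea of cloning before caching can be sketched as follows. This is
only an illustration: ActionDefinition and DefinitionCache are
hypothetical names, and copy.deepcopy stands in for whatever cloning
mechanism Mistral actually uses, which this note doesn’t describe.

```python
import copy


class ActionDefinition:
    """Hypothetical stand-in for a DB-backed action definition."""

    def __init__(self, name, spec):
        self.name = name
        self.spec = spec


class DefinitionCache:
    """Stores independent copies so cached entries outlive their source."""

    def __init__(self):
        self._entries = {}

    def put(self, definition):
        # Clone first: the cached copy no longer depends on the original
        # object (or, in the real case, on the session that loaded it).
        self._entries[definition.name] = copy.deepcopy(definition)

    def get(self, name):
        return self._entries.get(name)


original = ActionDefinition("std.echo", {"input": ["output"]})
cache = DefinitionCache()
cache.put(original)

# Later changes to the original do not leak into the cached copy.
original.spec["input"].append("extra")
cached = cache.get("std.echo")
```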

Fixed a bug that prevented any action from running if the OpenStack
catalog returned by Keystone was larger than 64kB and the backend was
MySQL/MariaDB. The limit has now been increased to 16MB.

Fixed an issue where the next link in some list APIs, when invoked with
pagination and filter(s), contained a JSON string. This made the next
link an invalid URL. The issue impacted all REST APIs where filters can
be used.
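
As an illustration of what a valid next link looks like, the sketch below
URL-encodes the filters as ordinary query parameters instead of embedding
a JSON string. The function name, the parameter names and the example URL
are assumptions made for this example, not Mistral’s actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def build_next_link(base_url, marker, limit, filters):
    """Build a pagination link with filters as plain query parameters."""
    params = {"marker": marker, "limit": limit}
    params.update(filters)
    # urlencode escapes every key and value, so the result is a valid URL.
    return f"{base_url}?{urlencode(params)}"


link = build_next_link(
    "http://localhost:8989/v2/workflows",  # hypothetical endpoint
    marker="abc",
    limit=10,
    filters={"name": "my wf"},
)
```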

Fixed an issue where a “join” task remained in WAITING state forever if
the last inbound task failed and it was not a direct predecessor.

If an action execution fails but returns a result as a list (error=[]),
the result of this action is assigned to the task execution ‘state_info’
field, which is a string according to the DB model. On Python 3 this list
is implicitly converted to a string; on Python 2.7 it isn’t. The reason
is probably in how SQLAlchemy works on different versions of Python. This
has now been fixed with an explicit type coercion.
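
A minimal sketch of such a coercion is shown below. The helper name
to_state_info is hypothetical, and passing None through unchanged is an
assumption of this example.

```python
def to_state_info(result):
    """Coerce an action error result to the string 'state_info' expects."""
    # Assumption: None and existing strings are passed through untouched.
    if result is None or isinstance(result, str):
        return result
    # Anything else (for example a list) is converted explicitly rather
    # than relying on the DB layer to do it.
    return str(result)
```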

Workflow output was sometimes calculated incorrectly due to a race
condition between two transactions: the one that checks workflow
completion (i.e. calls “check_and_complete”) and the one that processes
action execution completion (i.e. calls “on_action_complete”). The
output calculation was sometimes based on stale data cached by the
SQLAlchemy session. This has been fixed by expiring all objects in the
session so that they are refreshed automatically when their state is
read for the required calculations.

The workflow execution integrity checker mechanism was too aggressive for
big workflows that have many task executions in RUNNING state at the
same time. The mechanism was selecting them all in one query and calling
“on_action_complete” for each of them within a single DB transaction.
That could lead to situations where this mechanism would totally block
all normal workflow processing, whereas it should only be a “last chance”
aid in case of real infrastructure failures (e.g. MQ outage).
This issue has been fixed by adding a configurable batch size, so that
the checker can’t select more than this number of task executions in
RUNNING state at once.
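
The effect of the batch size can be sketched like this. The example works
on an in-memory list of dictionaries rather than a DB query, and the
function name select_running_batch is an assumption for illustration.

```python
from itertools import islice


def select_running_batch(task_executions, batch_size):
    """Return at most batch_size task executions in RUNNING state."""
    running = (t for t in task_executions if t["state"] == "RUNNING")
    # islice stops as soon as the batch is full, so the rest is left
    # for a later iteration of the checker.
    return list(islice(running, batch_size))


tasks = [
    {"id": i, "state": "RUNNING" if i % 2 == 0 else "SUCCESS"}
    for i in range(10)
]
batch = select_running_batch(tasks, batch_size=3)
```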

The action heartbeat checker was using the scheduler to process expired
action executions periodically. The side effect was that upon system
reboot there could be duplicate delayed calls in the database. Over
time, the number of such calls could become significant and those jobs
could even affect performance. This has now been fixed by using regular
threads instead of the scheduler. Additionally, the new configuration
property “batch_size” has been added under the group “action_heartbeat”
to control the maximum number of action executions processed during one
iteration of the action execution heartbeat checker.
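
To show where the new property sits, the sketch below reads it from an
INI-style fragment. Only the group and property names come from this
note; the value 10 is purely illustrative, and the standard library
configparser is used here as a stand-in for Mistral’s real configuration
loading.

```python
import configparser

# Illustrative fragment; the value 10 is an assumption, not a default.
SAMPLE_CONF = """
[action_heartbeat]
batch_size = 10
"""


def read_batch_size(conf_text, default=None):
    """Read action_heartbeat.batch_size from an INI-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("action_heartbeat", "batch_size", fallback=default)


batch_size = read_batch_size(SAMPLE_CONF)

# One checker iteration would then handle at most batch_size items.
expired_ids = list(range(25))
to_process = expired_ids[:batch_size]
```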

Removed DB polling from the logic that checks the readiness of a “join”
task. The polling led to situations where the CPU was mostly occupied by
the scheduler running the corresponding periodic jobs, which kept the
workflow from moving forward at a proper speed. This happened when a
workflow had lots of “join” tasks with many dependencies.

Eliminated an unnecessary update of the workflow execution object
when processing the “on_action_complete” operation. Without this fix,
all such transactions had to compete for the workflow executions
table, which caused lots of DB deadlocks (on MySQL) and transaction
retries. In some cases the number of retries even exceeded the limit
(currently hardcoded to 50), and such tasks could be fixed only by the
integrity checker over time.

The action execution checker didn’t set a security context before failing
expired action executions. This caused ApplicationContextNotFoundException
if the corresponding workflow specification was not in the cache and
Mistral had to load a DB object. The DB operation in turn tried to access
a security context that wasn’t set. This is now fixed by setting an admin
context in the action execution checker thread.

Workflow and join completion check logic is now simplified by using a
post-transactional queue of operations, which is a more generic version
of the action_queue module that previously served for scheduling action
runs outside of the main DB transaction. The workflow completion check
is now registered only once when a task completes, which reduces clutter,
and it’s registered only if the task may potentially lead to workflow
completion.
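
The general shape of a post-transactional queue can be sketched as
follows: operations are registered while the transaction is in progress
and run only after it commits. All names here (PostTxQueue,
run_with_post_tx_queue, on_task_complete) are hypothetical, and the
“transaction” is simulated with log entries rather than a real DB.

```python
class PostTxQueue:
    """Collects operations to run after the main transaction commits."""

    def __init__(self):
        self._operations = []

    def register(self, operation):
        self._operations.append(operation)

    def run_all(self):
        # Swap the list out first so operations can safely register more.
        pending, self._operations = self._operations, []
        for operation in pending:
            operation()


def run_with_post_tx_queue(transaction_body, log):
    """Simulate a transaction, then drain the queue after the commit."""
    queue = PostTxQueue()
    log.append("begin")
    transaction_body(queue)
    log.append("commit")
    queue.run_all()


log = []


def on_task_complete(queue):
    log.append("update task state")
    # Deferred: the completion check runs outside the main transaction.
    queue.register(lambda: log.append("check workflow completion"))


run_with_post_tx_queue(on_task_complete, log)
```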

The header X-Target-Insecure previously accepted any string and used it
for comparisons. This meant unless it was empty (or not provided) it would
always evaluate as True. This change makes the validation stricter, only
accepting “True” and “False” and converting these to boolean values. Any
other value will return an error.
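
The difference between the old and the new behavior can be sketched as
below. The function name is hypothetical, and treating a missing header
as False is an assumption of this example rather than something stated
in this note.

```python
def parse_target_insecure(value):
    """Strictly convert an X-Target-Insecure header value to a bool."""
    if value is None:
        # Assumption: a header that was not provided means False.
        return False
    if value == "True":
        return True
    if value == "False":
        return False
    raise ValueError(
        "X-Target-Insecure must be 'True' or 'False', got %r" % (value,)
    )


# Old behavior in a nutshell: any non-empty string is truthy,
# including the string "False".
old_style = bool("False")
new_style = parse_target_insecure("False")
```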