Cohesion: Rethinking Workflow Development

Cohesion is a workflow system designed for a delightful developer
experience and operational simplicity.

Why build a new workflow tool?

While there are many workflow systems in wide use already, they share
two areas of weakness: operational complexity and developer
experience.

Workflow engines are operationally complex

A workflow runs a set of tasks. A workflow system has to figure out
which tasks to run and when, actually run those tasks, and keep track
of each workflow’s state.

Consider just a few of the important parts of a workflow system:

[Diagram: workflow scheduler, persistent state, task compute runtime]

A workflow engine, which contains a scheduler and keeps track of
workflow state in some sort of persistent storage — either a
database or a message queue.

A compute runtime for tasks from the workflow — usually a
cluster of worker nodes.
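To make these parts concrete, here is a toy sketch in Python, not how any real engine is built. It assumes an in-memory dict as a stand-in for persistent state, direct function calls as a stand-in for the compute runtime, and the standard library's `graphlib.TopologicalSorter` as the scheduler. The names `run_workflow`, `tasks`, and `depends_on` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, depends_on):
    """Run each task once, only after the tasks it depends on have finished."""
    # Stand-in for persistent state; a real engine would use a database or queue.
    state = {name: "pending" for name in tasks}
    order = []

    scheduler = TopologicalSorter(depends_on)
    for name in tasks:
        scheduler.add(name)  # make sure tasks without dependencies are included
    scheduler.prepare()

    while scheduler.is_active():
        for name in scheduler.get_ready():
            state[name] = "running"
            tasks[name]()  # stand-in for dispatching to a worker node
            state[name] = "succeeded"
            order.append(name)
            scheduler.done(name)
    return state, order


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
# Each key maps to the set of tasks that must finish first.
depends_on = {"transform": {"extract"}, "load": {"transform"}}

state, order = run_workflow(tasks, depends_on)
print(order)
print(state)
```

Even this toy version shows where the operational burden comes from: if the process dies mid-run, the `state` dict is gone, which is why real engines need reliable storage behind it.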

Most workflow systems come with lots of operational complexity.

For one, a cluster has to be set up and managed for running the tasks
(often that’s Kubernetes, a complex thing in itself).

More importantly, workflow engines need reliable and fast storage for
their state. If it’s not reliable, users’ code must take on the
complexity of handling that unreliability, and much of the point of
using the workflow system is lost. And if it’s not fast, the workflow
engine becomes a bottleneck, slowing down all workflows.

Storage with these requirements is complex to manage at scale (whether
it’s in the form of a database or a message queue).

Finally, most engines also have fairly complex architectures and not
enough observability built into their internals, making it hard to
debug why something is slow.

Workflow developer experiences suck

The typical workflow development experience is not great, with two
major problems: language, and testing.

Most workflow systems have a custom language, often in JSON
or YAML. This works okay for “static” workflows with a few tasks; but
for a non-trivial workflow with conditionals, loops, error handling,
and more than a dozen or so tasks, YAML starts looking like a pretty
terrible programming language.

(Aside about Airflow, since it uses Python: Airflow uses Python to
create a static workflow DAG; so you don’t really get to use Python
control flow constructs. And it relies heavily on templating —
so the programmer has to think carefully about “template application
time” versus “run time”.)

Further, most workflows are a pain for a developer to test
properly. Most systems try to get testing to work on a single
developer machine, but workflows along with their dependencies often
form too big a unit to test on a single laptop. And local setups for
the workflow system and its dependencies are tricky to create and need
lots of maintenance effort as complexity and needs grow.

Cohesion

We’re building a workflow system to address these problems. Cohesion
lets you:

Build workflows in regular programming languages

Test in the cloud in an isolated test account

Deploy into AWS Step Functions, a “serverless” workflow system

1. Building Workflows in Code

Most workflow systems have a custom language to describe a workflow.
These languages vary in what they’re capable of, from being restricted
to simple task sequences, to DAGs, all the way to full Turing-complete
languages. But writing a workflow is programming — and the best human
interface for programming is a programming language.

So, Cohesion workflows are just regular code. For now, we focus on
Python. Here’s a small example:
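As an illustrative stand-in, the sketch below uses plain Python only. The task functions and their names are hypothetical, and it does not show Cohesion's actual API. The point is that a workflow with a loop, a conditional, and error handling can read like any other function:

```python
def fetch_orders():
    # Hypothetical task: would normally call an external service.
    return [
        {"id": 1, "total": 250},
        {"id": 2, "total": 40},
        {"id": 3, "total": -5},
    ]


def apply_discount(order, percent):
    # Hypothetical task: integer arithmetic keeps the example exact.
    return {**order, "total": order["total"] * (100 - percent) // 100}


def charge(order):
    # Hypothetical task that can fail.
    if order["total"] < 0:
        raise ValueError(f"invalid total for order {order['id']}")
    return {"id": order["id"], "charged": order["total"]}


def order_workflow():
    receipts, failures = [], []
    for order in fetch_orders():              # loop over tasks
        if order["total"] > 100:              # conditional branch
            order = apply_discount(order, 10)
        try:                                  # error handling
            receipts.append(charge(order))
        except ValueError:
            failures.append(order["id"])
    return {"receipts": receipts, "failures": failures}


result = order_workflow()
print(result)
```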

How does this workflow run? We need a workflow engine and a place to
run tasks. We could build that, but then our first problem applies:
workflow engines are hard to operate, and this difficulty is intrinsic
to the problem — a shiny new workflow system won’t necessarily
be any simpler to operate.

However, there is still a way to avoid operational complexity. We can
use a managed workflow runtime: a high-level service API that accepts
a workflow definition and runs it. Turns out, AWS Step Functions is
exactly what we want.

Cohesion transforms Python code into the workflow language for AWS
Step Functions, and a set of serverless functions. This transformation
is separate from the running of the workflow — this means you run a
"natively AWS" workflow, with nothing added at runtime.

You can think of it as a Python compiler, except it isn't targeting a
typical computer, but rather the Step Functions workflow runtime.

Cohesion lets you use the full set of Python control flow
constructs: if-statements, loops, try/except blocks, and
functions. All these constructs are transformed to the underlying
workflow language.
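To give a feel for what such a transformation produces, here is a hand-rolled sketch, not Cohesion's compiler or its real output. It maps a single Python `if`/`else` onto a Step Functions `Choice` state followed by two `Task` states. The helper names `task_state` and `compile_if` are hypothetical, and the Lambda ARNs are placeholders.

```python
import json


def task_state(function_name):
    """A terminal Task state that invokes a (placeholder) Lambda function."""
    return {
        "Type": "Task",
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        "End": True,
    }


def compile_if(variable, greater_than, then_fn, else_fn):
    """Sketch of the state machine for:

        if <variable> > <greater_than>:
            then_fn()
        else:
            else_fn()
    """
    return {
        "StartAt": "Check",
        "States": {
            "Check": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": variable,
                        "NumericGreaterThan": greater_than,
                        "Next": then_fn,
                    }
                ],
                "Default": else_fn,
            },
            then_fn: task_state(then_fn),
            else_fn: task_state(else_fn),
        },
    }


machine = compile_if("$.total", 100, "apply_discount", "charge_full_price")
print(json.dumps(machine, indent=2))
```

The JSON that comes out is an ordinary state machine definition, which matches the idea that nothing extra needs to run alongside the workflow.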

2. Testing in the cloud

Testing code on a developer’s laptop is attractive. The “inner loop”
of coding (change, test, repeat) thrives on fast feedback.
Deploying to the cloud on every change seems like it would just slow
things down when you can test in a local Docker container.

But as the system being tested gets larger, local testing gets harder.
If there are a few interconnected services, then a local setup
requires some combination of multiple Docker containers, a local
database, and so on. After a point, the test setup is a non-trivial
project in itself.

Workflows tend to be beyond the level of complexity you want to test
within a single Docker container, so local testing tends to be a poor
fit for them.

So: why not test in the cloud? We have to overcome two challenges:
multiple developers have to be able to test in isolation from each
other, and deploying changes to the cloud can be slow.

For isolation, Cohesion simply creates one AWS account per developer.
AWS does not charge for the accounts themselves, and they can be
created fairly quickly. For deployment speed, Cohesion contains some
optimizations that avoid slow provisioning operations for many kinds
of changes. (We'll dive into these details in a future post.)

All in all, you can get cloud testing for both task-level and
workflow-level changes with less than a second of overhead using
Cohesion.

3. AWS Step Functions

AWS Step Functions is a “serverless” workflow system. In the same
vein as AWS Lambda, there’s no cluster to manage, just a higher-level
service API. Give it a workflow definition, and it runs it. A
cloud-managed workflow runtime means you can avoid the complexity of
operating a workflow engine and its database. It also has
fine-grained usage-based pricing: you pay only for workflow state
transitions, and it’s free while the workflow is waiting.
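As a back-of-the-envelope illustration of transition-based pricing, the sketch below multiplies executions by transitions per execution. The price passed in is an assumed placeholder, not a quoted rate, so check the current AWS pricing page for your region and workflow type. The function name `estimate_cost` is hypothetical.

```python
def estimate_cost(executions, transitions_per_execution, price_per_1000_transitions):
    """Cost depends only on the number of state transitions, not on elapsed time."""
    total_transitions = executions * transitions_per_execution
    return total_transitions / 1000 * price_per_1000_transitions


# Assumption: placeholder price per 1,000 transitions, for illustration only.
ASSUMED_PRICE_PER_1000 = 0.025

# 10,000 executions of a 12-transition workflow; how long each one
# spends waiting does not enter the calculation at all.
cost = estimate_cost(10_000, 12, ASSUMED_PRICE_PER_1000)
print(f"estimated cost: {cost:.2f}")
```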

The workflow definition language that Step Functions uses is a set of
named "state" objects (in JSON). It's a flexible language: you can
express arbitrary control flow (branching, loops, error handling,
etc.) and you can manipulate data between workflow tasks.

However, the language feels artificial as a programming interface for
humans. Common patterns like loops are not simple to write. We think
Step Functions is a great workflow runtime, but lacks a good
programming interface. This is where Cohesion comes in: by
compiling Python to Step Functions, we bring an excellent developer
experience to an operationally simple workflow runtime.
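To see why loops feel awkward, here is a hand-written sketch (not Cohesion output) of what `for i in range(3): do_work()` can look like as a state machine, built as a Python dict. It assumes the JSONPath style of the workflow language and its `States.MathAdd` intrinsic function. The Lambda ARN is a placeholder, and `undefined_targets` is a hypothetical helper that only checks that every transition points at a state that exists.

```python
import json

loop_machine = {
    "StartAt": "InitCounter",
    "States": {
        # i = 0, stored under $.loop so the rest of the input is preserved
        "InitCounter": {
            "Type": "Pass",
            "Result": {"i": 0},
            "ResultPath": "$.loop",
            "Next": "CheckCounter",
        },
        # while i < 3
        "CheckCounter": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.loop.i", "NumericLessThan": 3, "Next": "DoWork"}
            ],
            "Default": "Done",
        },
        # loop body; a null ResultPath discards the task result and keeps the input
        "DoWork": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:do_work",
            "ResultPath": None,
            "Next": "Increment",
        },
        # i = i + 1
        "Increment": {
            "Type": "Pass",
            "Parameters": {"i.$": "States.MathAdd($.loop.i, 1)"},
            "ResultPath": "$.loop",
            "Next": "CheckCounter",
        },
        "Done": {"Type": "Succeed"},
    },
}


def undefined_targets(machine):
    """Return transition targets that do not name an existing state."""
    states = machine["States"]
    targets = {machine["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets - set(states)


print(json.dumps(loop_machine, indent=2))
print("states needed:", len(loop_machine["States"]))
```

Two lines of Python turn into five states with explicit counter bookkeeping, which is the kind of boilerplate a compiler can generate for you.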

Try it out!

Cohesion is an exciting new tool for workflows with dev and ops
simplicity. We're rolling it out to a few initial users: we'd like to
understand users better, and fine-tune our features based on user
feedback before opening it up to everybody.

If you're interested in trying out Cohesion right now, we'd love to
have you on board! Give us your email address below, and we'll send
you an invite. We'll include $100 of cloud credits for each person who
signs up, until we run out.