Explore and understand the risk of your system
before your pager tells you

Experiment, in production

The crash test dummy for your system

understand the risk of your distributed application

Sample

capture production

Sample your production deployment artifacts, topology, and traffic, so you
can construct a plan to build complete or partial variants of
your system and the communication paths between your
services.

Simulate

deterministic replay

Classify and replay requests deterministically across your system,
in hundreds of simulated variant environments running under controlled
failure conditions. Our algorithms reduce the problem space
and learn over time to zero in on the most vulnerable and
critical parts of your system.
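
To make "deterministic replay under controlled failure conditions" concrete, here is a minimal sketch in Python. It is an illustration only and not our actual implementation: the `Request` type, the `replay` function, the per-service failure rates, and the service names are all hypothetical. The idea it shows is that seeding the failure injector makes every run repeatable, so a failure found in one simulated variant can be reproduced exactly.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A captured request (hypothetical shape, for illustration only)."""
    request_id: str
    path: tuple  # ordered names of the services the request passes through


def replay(requests, failure_rates, seed):
    """Replay requests with seeded failure injection.

    failure_rates maps a service name to the probability (0.0 to 1.0) that a
    call to it fails. Returns (request_id, failing_service) pairs in order.
    """
    rng = random.Random(seed)  # a private, seeded generator keeps runs repeatable
    failures = []
    for req in requests:
        for service in req.path:
            if rng.random() < failure_rates.get(service, 0.0):
                failures.append((req.request_id, service))
                break  # the request stops at the first failing service
    return failures


def replay_variants(requests, failure_rates, seeds):
    """Run one replay per seed; each seed stands in for one simulated variant."""
    return {seed: replay(requests, failure_rates, seed) for seed in seeds}
```

Calling `replay_variants(captured, {"db": 0.2}, range(100))` with a list of captured `Request` objects would run one hundred variants, and re-running it with the same arguments yields the same failures every time.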

Score

quantify risk

Ongoing reports help you understand the safety of your
deployment and uncover critical components, unexpected
or emergent behavior, and coupling between services that you didn't know
existed. You can prioritize where to improve the resilience
of your system and monitor whether it is getting better or
worse over time.

Build high-availability cloud native applications

Simulate how cloud applications can fail by running continuous experiments, and discover which work leads to increased availability.

Resilience, safety, confidence

Why should I be thinking about failure?

As systems get more complex, failure is inevitable.

With distributed architectures, the unknown unknowns of
how the behavior of one component
can cause other parts of the system to fail become difficult to reason about
without observing the components interact.

With applications running on managed infrastructure and
partially composed of black-box third-party APIs, the
availability of your system is no longer exclusively
within your control. You have to shift your focus on
failure tolerance to application behavior.

Having a way to observe how your application behaves in failure conditions
helps you quantify the level of risk you are willing to accept in your system.
It also helps you prioritize which "technical debt" to pay off, and where NOT to.

Proactively experimenting with failure and observing how
components respond when chaos strikes helps you
understand the inherent risk in your system, so you can ship
changes confidently and safely and experience fewer incidents by: