Message Passing (and Dropping) Protocol Simulator

pick up a distributed algorithms textbook (Nancy Lynch's is my
favorite) or a research paper or a crazy idea scribbled in a notebook
for some message passing protocol, then...

write a small amount of code to implement the algorithm, then...

run a simulator to tell you if it works?

Ja, those things exist today. But there are things to be careful of,
including but not limited to:

What language are you writing your algorithm in?

What language are you verifying the algorithm's properties in?

Can the protocol simulator control process scheduling patterns?
(Irregular scheduling is far more interesting than
regular/consistent scheduling.)

Can the protocol simulator simulate dropped/lost messages?
(Most protocols do pretty well in perfect environments. It's much
more interesting to see what a protocol does when it starts
dropping messages.)

Can the simulator explore all possible scheduling and message
dropping states to verify that my protocol is correct?

This message passing simulator attempts to address almost all of the
above list.

We use Erlang for both writing and verifying the algorithm. It's a
nice, high-level language that's frequently used for distributed
algorithms anyway. If you're using Erlang, it's not as useful to
simulate your protocol outside of Erlang and then reimplement it in
Erlang.

Yes, the simulator controls scheduling. We use the QuickCheck tool to
simulate a token-based scheduler, where QuickCheck generates random
lists of tokens. The scheduler is as fair or as unfair as you wish.
There is nothing in the implementation (that I know of) that would
prevent the use of PropEr for generating any of the test cases
within the simulator. I simply haven't had time to try using
PropEr yet.

Yes, messages can be dropped. We use QuickCheck again to specify when
network partitions happen during a simulated test run. QuickCheck
generates zero or more tuples {t0,t1,A,B}; for each tuple, between
simulated times t0 and t1, any message sent from a process in set A
to a process in set B is dropped.

Note that these network partitions can be asymmetric: process
C can be partitioned from process D, but D can send
messages to C.

No. Exhaustive exploration of the entire state space is beyond the
scope of this tool. However, it's my hope that this tool points
the way for something similarly easy for McErlang, which is capable
of performing full state space exploration.
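The token-based scheduler and the partition rule described above can be sketched in plain Erlang, without QuickCheck. This is only a minimal illustration, not the simulator's actual code: the module name msgsim_sketch, the function names, the {From, To, Msg} message shape, the one-tick-per-token clock, and the choice to check partitions at delivery time with inclusive bounds are all assumptions. In the real tool, QuickCheck would generate the token list and the partition list.

```erlang
-module(msgsim_sketch).
-export([is_dropped/4, run/3]).

%% A partition is {T0, T1, A, B}: between simulated times T0 and T1
%% (assumed inclusive here), a message from a member of A to a member
%% of B is dropped. The reverse direction (B to A) is unaffected.
is_dropped(Time, From, To, Partitions) ->
    lists:any(fun({T0, T1, A, B}) ->
                      Time >= T0 andalso Time =< T1 andalso
                          lists:member(From, A) andalso
                          lists:member(To, B)
              end, Partitions).

%% Tokens is a list of process names. Each token advances simulated
%% time by one tick and lets the named process take the oldest pending
%% message addressed to it. Pending is a list of {From, To, Msg} in
%% send order. Returns {Delivered, Dropped}, each in processing order.
run(Tokens, Pending, Partitions) ->
    run(Tokens, Pending, Partitions, 0, [], []).

run([], _Pending, _Parts, _Time, Delivered, Dropped) ->
    {lists:reverse(Delivered), lists:reverse(Dropped)};
run([Proc | Tokens], Pending, Parts, Time, Delivered, Dropped) ->
    case lists:keytake(Proc, 2, Pending) of
        false ->
            %% Nothing is addressed to Proc, so this token is wasted.
            run(Tokens, Pending, Parts, Time + 1, Delivered, Dropped);
        {value, {From, Proc, _} = M, Rest} ->
            case is_dropped(Time, From, Proc, Parts) of
                true ->
                    run(Tokens, Rest, Parts, Time + 1,
                        Delivered, [M | Dropped]);
                false ->
                    run(Tokens, Rest, Parts, Time + 1,
                        [M | Delivered], Dropped)
            end
    end.
```

An unfair scheduler is simply a token list that names some processes far more often than others. With the single partition {0, 1, [c], [d]}, a message from c to d processed at time 0 is dropped while a message from d to c at time 1 gets through, which is the asymmetry described above.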

By default, QuickCheck will run 100 random test cases. For a protocol
simulation, that usually isn't enough cases to find bugs in even a
simple protocol.

The argument to eqc:quickcheck/1 is a property.

The return value of slf_msgsim_qc:prop_simulate(SimModuleName, OptionsList)
is a property.

So the example above,
eqc:quickcheck(slf_msgsim_qc:prop_simulate(SimModuleName,
OptionsList)), works as you'd expect.

If you wish to run 5,000 test cases with a property Property, then
use the construct eqc:quickcheck(eqc:numtests(5000, Property)).

If you wish to make each individual test case longer, i.e., to
contain longer sequences of protocol operations, use the construct
eqc:quickcheck(eqc_gen:resize(R, Property)), where R is an integer
larger than 40. The effect seems to be exponential: test cases with
R=100 are much, much longer than tests that use R=50.

The resize() and numtests() wrappers can be used together, e.g.,
eqc:quickcheck(eqc:numtests(5000, eqc_gen:resize(60, Property)))

For example, to run 10,000 test cases of the
distrib_counter_2phase_vclocksetwatch_sim simulator:

Describing the evolution of two flawed protocols

The sources for these are echo_bad1_sim.erl and echo_sim.erl,
respectively.

A distributed counter service, where all clients are supposed to
generate strictly-increasing counters. There are five variations of
the protocol; all of them are buggy.

The sources for these are distrib_counter_bad1_sim.erl through
distrib_counter_bad5_sim.erl.

Each buggy simulator source file foo.erl has a corresponding text
file called foo.txt, which contains:

Instructions on how to run the test case

Output from the test case

Annotations within the output, marked by %% characters, that help
explain what the output means.

The foo.txt file contains annotated simulator output and a discussion
of what's wrong with the implementation; see, e.g.,
echo_bad1_sim.txt and distrib_counter_bad1_sim.txt.

For the distributed counter simulations, it can be instructive to use
"diff" to compare each implementation, in sequence, to see what
changed.

diff -u distrib_counter_bad1_sim.erl distrib_counter_bad2_sim.erl

diff -u distrib_counter_bad2_sim.erl distrib_counter_bad3_sim.erl

diff -u distrib_counter_bad3_sim.erl distrib_counter_bad4_sim.erl

diff -u distrib_counter_bad4_sim.erl distrib_counter_bad5_sim.erl

How the simulator works

TODO Finish this section

The simulator attempts to maintain Erlang message passing semantics.
Those semantics are not formally documented but can loosely be
described as "send and pray", i.e. no guarantee that any message will
be delivered. In the case where process X sends messages A and
B to process Y, if Y receives both messages B and A,
then message A will be delivered before B. (I hope I got that
right ... if not, the Async Message Passing Police will come and
arrest me.)
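The ordering rule above suggests a simple check: whatever Y actually receives from X must be what X sent, possibly with some messages missing, but never reordered. The following is a minimal sketch of that check, assuming the messages sent and received between one pair of processes are available as two lists; the module name order_sketch and the function order_preserved/2 are hypothetical and are not part of the simulator.

```erlang
-module(order_sketch).
-export([order_preserved/2]).

%% True if Received could be obtained from Sent by dropping zero or
%% more messages without reordering any of them, i.e. Received is a
%% subsequence of Sent.
order_preserved([], _Sent) ->
    true;
order_preserved(_Received, []) ->
    %% Something was received that was never sent (or was reordered).
    false;
order_preserved([M | Rs], [M | Ss]) ->
    %% The next received message matches the next sent message.
    order_preserved(Rs, Ss);
order_preserved(Received, [_ | Ss]) ->
    %% Treat the next sent message as dropped and keep looking.
    order_preserved(Received, Ss).
```

For example, if X sent [a, b], then Y receiving [a, b], [a], [b], or [] is consistent with "send and pray", while Y receiving [b, a] is not.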

Write a callback module

gen_initial_ops/4
The simulator scheduler sends the messages created by this generator
to each of the simulated processes. QuickCheck will randomly choose
some number of client and server processes for each test case.

gen_client_initial_states/2
Generate the local process state data for each client process.

gen_server_initial_states/2
Generate the local process state data for each server process.

verify_property/11
After a simulated test case has run, verify that whatever protocol
properties should be true are indeed true. Any failure will cause
QuickCheck to try to find a smaller-but-still-failing
counterexample.

If your test passes 100 test cases, then you probably need to run
thousands or even millions of test cases. Use eqc:numtests() and/or
eqc_gen:resize(N, YourProperty), where N is a large number in the
range of 50-100.

Using the simulators with McErlang

The current work on McErlang integration is ... well, barely
recognizable as "integration". But it's trying to get there, slowly.

Short answer: look at the commit log entries starting on May 28,
2011. There are cut-and-paste'able examples and a fair amount of
commentary there.

One major complication is the simulator's support for Erlang's
"selective receive" feature. Take this bit of code from
distrib_counter_2phase_sim.erl:

If the simulated process receives an {unexpected, ...} message
while in the client_ph1_waiting state, that message will be
ignored. Why? "Selective receive" will only pull a message out of a
mailbox when there is a sufficiently general pattern to match it. In
the code for client_ph1_waiting() above, there are exactly three
possible messages that can be processed while in that state: