I like it when researchers and professors blog. Blogging gives us a different venue with an informal format to present our opinions. This enables us to provide intuitive and simpler explanations, opine open-endedly without needing to provide proof/validation, and even show our humorous and human-sides. Blogging is ideal for professing, I think.

Compositional security enforcement

An idea that worked for compositional security enforcement in software is to "control the flow of information throughout a computing system". The secret flow shouldn't get into public flow and leak outside. In other words, like in Ghostbusters, you must not let the secret and public information streams cross inside a component. The components that possess this property compose and by composing them together you get end-to-end security in your system. (I am not in the security field, but I had heard about this idea applied to achieve Android system security.)

Andrew had worked on this idea for software systems, and recently wondered if we can use this idea also for achieving hardware security. This is because even if you have security at the software level, if the hardware level leaks, you didn't achieve anything. And there were very interesting exploits in recent years using side-channel attacks and timing attacks to leak information from the hardware (i.e., data cache, instruction cache, computation unit, memory controller).

Secure HDLs

The idea is to develop a secure hardware description language (HDL) that uses the information flow type ideas described above to ensure that hardware is secure at design time. Chip design already uses Verilog as an HDL and synthesize chips from Verilog programs. (Chip design is still a relatively constrained domain that synthesis from high-level code is possible.) So Andrew's team add security annotations to Verilog to provide SecVerilog.

SecVerilog is essentially Verilog plus some dependent security labels. The idea is to design a chip system that doesn't leak, by modeling/verifying it in Secverilog. The security model is that the attacker sees contents of public hardware state (high/low) at each clock tick.

Using SecVerilog Andrew's team produced a formally verified MIPS processor. The static analysis overhead of SecVerilog was extremely low: it was 2 seconds for the complete MIPS processor.

Friday, December 8, 2017

For the second project in my distributed systems class Fall 17, I assigned modeling of the two-phase transaction commit protocol. I ask students to model what happens when the initiator/transaction manager (TM) fails, how would a backup (TM) take over, and what type of problems could arise.

Here is the description of the project. I will post the solution later on.

2 phase commit

In a distributed system, a transaction is performed by a collection of processes called resource managers (RMs), each executing on a different node. The transaction ends when the transaction manager (TM) issues a request either to commit or to abort the transaction. For the transaction to be committed, each participating RM must be willing to commit it. Otherwise, the transaction must be aborted. The fundamental requirement is that all RMs must eventually agree on whether the transaction is committed or aborted.

Here is a model of 2-phase commit. (This was built by modifying the /P2TCommit.tla/ at Lamport's site to add a TM agent). I decided to stay with the shared memory model rather than the message passing model to keep the project simple. The interesting scenarios is still possible under the shared memory model.
Here are the constraints on the shared memory communication. A RM node can only read tmState and read/update its own rmState. It cannot read other RM's rmState. A TM node can read all RM nodes' rmState and read/update tmState. A BTM node can read all RM nodes' rmState and read/update tmState.

2 phase commit modeling and validation

If no faults occur, the 2-phase commit algorithm is correct. In the presence of a crash fault, however, problems can arise. In the questions below, we will use TLA+/PlusCal to explore what problems may arise, and how to properly design the protocol to overcome those problems.

Model check /Consistency/ and /Termination/ with no failures (RMMAYFAIL=FALSE and TMMAYFAIL=FALSE). You should see no errors.

Model check with RM failure (RMMAYFAIL=TRUE and TMMAYFAIL=FALSE). You should see no errors.

Part 2.

Model check with RMMAYFAIL=FALSE and TMMAYFAIL=TRUE. (No need to recover a failed TM.)

Write in the comments section, after the "==================" line, your findings/observations. Comment whether the /Termination/ property is violated by a TM failure.

Part 3.

Add a backup TM process to take over if primary crashes. (Assume the BTM cannot fail. TMMAYFAIL can only affect TM not BTM.)

Test satisfaction of /Consistency/ and /Termination/ properties with no TM or RM failures. Make sure BackupTM terminates, so the /Termination/ property is also satisfied as well as /Consistency/ property.

Model check with both TM and RM failure allowed (RMMAYFAIL=TRUE and TMMAYFAIL=TRUE). Write down your observations. (No need to recover a failed RM or TM.)

Tuesday, December 5, 2017

The paper aims to demonstrate that "machine learned models have the potential to provide significant benefits over state-of-the-art database indexes".

If this research bears more fruit, we may look back and say, the indexes were first to fall, and gradually other database components (sorting algorithms, query optimization, joins) were replaced with neural networks (NNs).

In any case this is a promising direction for research, and the paper is really thought provoking.

Motivation

Databases started as general, one-size fits all blackboxes. Over time, this view got refined to "standardized sizes" to OLAP databases and OLTP databases.

Databases use indexes to access data quickly. B-Trees and Hash-maps are common techniques to implement indexes. But along with the blackbox view, the databases treat the data as opaque, and apply these indexes blindly without making any assumptions about the data. However, it is obvious that not knowing about the data distribution leaves performance on the table. Consider this thought experiment. If the keys are from the range of 0 to 500m, it is faster to just use the key as index, rather than using a hash. This observation can be extended to other data distributions, if we know the cumulative distributed function (CDF) of the data. We can generalize by saying "CDF*key*record-size" gives us the approximate position of the record the key refers to.

Ok, so, by knowing about the data distribution, we can achieve performance gains. But now we lost reusability when we go full whitebox. We can't afford to go full whitebox, inspecting the data, and designing the indexes from scratch every time.

The paper shows that by using NNs to learn the data distribution we can have a graybox approach to index design and reap performance benefits by designing the indexing to be data-aware.

The case for applying NNs to indexing is shown over the following three index-types:

B-trees, which are used for handling range queries

hash-maps, which are used for point-lookup queries

bloom-filters, which are used for set inclusion checks

I will only summarize the section on how to replace the B-tree structure. For the hash maps, the learned structure is a straightforward function based on CDF of the data.

B-trees

Why is it even conceivable to replace B-tree with an NN model? Conceptually, b-tree maps a key to a page. We can have a model that also performs key to position mapping, and for the error range, we can do a variant of binary search (or expanded ring search) to locate the page.

OK, then, how do we know min_error and max-error? We train the model with the data we have. The data is static, the NN makes a prediction and then learns from these errors. (Even simple logistic regression may work for simple distributions.)

What potential benefits can we reap by replacing B-tree with a model:

smaller indexes: less main-memory or L1 cache storage

faster lookup: as a result of smaller indexes

more parallelism (TPU), instead of hierarchical if statements as in B-tree.

The key insight here is to trade computation for memory, banking on the trend that computation is getting cheaper (and if you can do it on TPU/GPU you reap more benefits). The evaluation of the paper doesn't even go into using TPUs for this.

The paper includes several strategies to improve the performance of the learned index, including using a recursive model index, hierarchical models, and hybrid models. For evaluation results, please refer to the paper.

Monday, December 4, 2017

This morning I was at Taeyeon Ki's dissertation proposal. He presented his work on enabling openness (hackability/tinkerability) in Android platform-side.

He defined openness as being able to participate in innovation and distribution, and gave SDN and FUSE are examples of openness.

He argued that mobile vendors control their platforms tightly, and this prevents tinkering the platform and obstructs innovation. Even though you can innovate at the app side, the OS/platform side is closed for tinkering. He posed the question: how can we enable anyone to easily develop distribute new platform-level functionality?

To answer this question, he presented his work in two projects: Reptor and Mimic.

Reptor

Intersecting calls to methods is not straightforward. Doing just method name substitution cannot handle calls to class level features: class hierarchy, interface, abstract class, etc. So instead of replacing the method, the Reptor tool replaces entire class. Class level replacement approach works well with callbacks in Android as it ensures that typecasting is correctly handled.

One use case for Reptor is to change some platform-level components used in an app with a more suitable one depending on the region. China and Korea do not have Google Play store, they run their own app-stores for Android. So in China and Korea Google Maps (which is an integral component of Play Store) is not available. Apps there use a downgraded static web map mostly. But with Reptor, it is possible to wrap those applications and redirect the maps calls to Amazon map. Taeyeon showed a demo of this on the Airbnb app.

Reptor has been dummy-tested with 1200 popular apps from Google Play. It has been tested for 32 benchmarked apps with more in depth tests, where runtime checks performed. For this in-depth tests for showing runtime equivalence the Mimic tool described below has been used. The tests showed that Reptor achieves minimal overhead in memory, and performance.

While run-time testing via Mimic is helpful for checking if Reptor substitution at the application of one platform-component with another is "correct" with respect to the platform-level classes, better definition of "correctness" and better framing/formalization of the problem will help this work.

Mimic

Mimic is an automated UI behavior comparison testing system that provides runtime behavioral compatibility check between original and instrumented app. (I think this is under submission or to appear state.)

When I heard this description, I combined this and Reptor together in my mind, and all I could think of was that the combination can be great for automating click farms for smartphone apps!

Taeyeon had something more useful and honest in mind. There are a dozen different android versions in use. A typical app posts several updates a month and the developers struggle to test their apps with all these different environments. Mimic combined with Reptor can help perform these tests in an automated way? (Well, still sounds a lot like click farm to me :-)

Taeyon explained their follow-the-leader testing model, where Mimic replicates the user behavior at the leader to the followers.

OpenCL views a computing system as consisting of a number of compute devices (GPUs, CPUs, FPGAs) attached to a host processor (a CPU). It defines a C-like language for writing programs. Functions executed on an OpenCL device are called /kernels/. A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel.
In addition to its C-like programming language, OpenCL defines an API that allows programs running on the host to launch kernels on the compute devices and manage device memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.

OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. Conformant implementations are available from Altera, AMD, Apple, ARM, Creative, IBM, Imagination, Intel, NVIDIA, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS. Although OpenCL provides an alternative to CUDA, it has some support from NVDIA.

Thursday, November 30, 2017

After I teach a class I am debilitated/incapacitated for about an hour. I can't do anything productive in that time. This is because teaching demands a lot of mental power (at least for me). I am sure, being an introvert doesn't help either; I need some recovery time after too much exposure. When I am lecturing though, I let go (at least after the first 5-10 minutes or so), and I don't think about myself or being introvert. After the initial warm up period, I get excited about what I am teaching, and all I can think of is the message I am trying to convey. This is actually good. When I find myself lost in the moment (or maybe that is getting into the teaching zone), I know I am doing good. I had written about this earlier on presenting: you should center on the message not on yourself. Oh wow, actually I wrote twice on that topic.

So, in that down hour after teaching, what do I do? I have regret-flashbacks about some parts of the lecture I delivered. I have remorse about how I botched some parts of the lecture and how I couldn't answer a student's question better. The reason I botch up a part of the lecture is often because I haven't internalized that part yet. The reason I wasn't able to make a smooth transition from one part to another part is often because I didn't have a better understanding/insight about how those parts fit. It is likely I was missing some context about it. Or maybe more likely that I knew the context, but I didn't internalize it well enough. There are many different levels of knowing, and the more reliable/resilient/dependable level of knowing is knowing by experience. Teaching is a way to have hands-on training in your research area. You learn through your regrets about how you couldn't explain it better. So "teaching is an integral part of a faculty/researcher's job" is a cliche for a reason: it is very true and important.

Richard Feynman wrote about teaching in several memoirs. This one is especially relevant: "If you're teaching a class, you can think about the elementary things that you know very well. These things are kind of fun and delightful. It doesn't do any harm to think them over again. Is there a better way to present them? Are there any new problems associated with them? Are there any new thoughts you can make about them? The elementary things are easy to think about; if you can't think of a new thought, no harm done; what you thought about it before is good enough for the class. If you do think of something new, you're rather pleased that you have a new way of looking at it."

The paper introduces compositional operators of labels, and shows how to infer the coordination points in dataflow programs. When reading below, pay attention to the labeling section, the labels "CR, CW, OR, OW", and the section on reduction rules on labels.

To figure out these coordination points, the Blazes framework relies on annotations of the dataflow programs to be supplied as an input. This is demonstrated in terms of a Twitter Storm application and a Bloom application.

The paper has many gems. It says at the conclusion section in passing that when designing systems, we should pay attention to coordination locality, not just data-access locality. (Maybe often the two are related/correlated, and this leads people to confuse them.) It also makes practical lessons about how to organize your dataflow programs: move replication upstream, caching downstream. I wish the paper elaborated more on these points at the conclusion.

In my summary below I borrow from the paper figures and a lot of text with small edits.

Contributions

The key idea in Blazes is that even when components are order-sensitive, it is often possible to avoid the cost of global ordering without sacrificing consistency. In many cases, Blazes can ensure consistent outcomes via *sealing* which indicates when partitions of a stream have stopped changing. (Is "sealing" like "watermark" in the Google streaming paper?)

Some contributions of the Blazes work are:

The paper shows how to analyze the composition of consistency properties via a term-rewriting technique over dataflow paths.

The paper demonstrates how the semantics of components can be modeled using a system of programmer-supplied annotations (graybox approach).

The paper discusses two alternative coordination strategies, ordering and sealing, and shows how to automatically generate application-aware coordination code that uses the cheaper sealing technique in many cases.

Dataflow consistency

Look at these anomalies. Just look at them. This figure should have come with a trigger warning for distributed systems developers.

But it is possible to go to the first-principles and cut the Gordian knot. If we could ensure that every component in a dataflow produces a deterministic set of outputs regardless of the order in which its inputs are delivered, then we could prevent all of these anomalies.

There are two approaches to doing that.

*Consistency via Coordination*: A sufficient strategy for preventing the anomalies (or arbitrary components) is to remove the nondeterminism in delivery order by coordinating the (otherwise asynchronous) processing of inputs by using a coordination system such as ZooKeeper. This is a general (independent of semantics of computation) approach but imposes throughput and latency costs.

*Consistency via Component Properties*: A dataflow component is *confluent* if it produces the same set of outputs for all orderings of its inputs. At any time, the output of a confluent component (and any redundant copies of that component) is a subset of the unique, final output.

In an ideal world, we would build dataflows out of strictly confluent components. Unfortunately this is an infeasible assumption. When we add an order-sensitive component to an otherwise confluent dataflow, how can we avoid falling back on a global coordination strategy? This is the question the paper investigates.

Remark: Note that confluent is a stronger property than convergent. Convergent replicated components are guaranteed to eventually reach the same state, but this final state may not be uniquely determined by component inputs. As Figure 5 indicates, convergent components allow cross-instance nondeterminism, which can occur when reading "snapshots" of the convergent state while it is still changing. Consider what happens when the outputs of a convergent component (e.g., GETs posed to a key/value store) flow into a replicated stateful component (e.g., a replicated cache). If the caches record different stream contents, then the result is replica divergence.

Annotated dataflow graphs

The *CR* annotation indicates that a path through a component is confluent and stateless; that is, it produces deterministic output regardless of its input order, and the path does not modify the component's state.

*CW* denotes a path that is confluent and stateful. That is, calls to the API may modify the internal state of the component, but the final state is determined only by the input contents and not their order.

The annotations *OR_gate* and *OW_gate* denote non-confluent paths that are stateless or stateful, respectively.

The *Seal_key* annotation means that the stream is punctuated on the subset key of the stream's attributes.

The label *Async* corresponds to streams with deterministic contents whose order may differ on different executions or different stream instances. Streams labeled *NDRead* may exhibit cross-run nondeterminism, having different contents in different runs, or cross-instance nondeterminism within a single run if the system is replicated. Finally, streams labeled *Taint* may exhibit persistent nondeterministic contents, or replica divergence in replicated systems.

Reduction relation

Combination rules model what happens when multiple input streams contribute to a single output stream in a dataflow. Taint is the strongest label: if any of the paths contributing to an output stream can exhibit permanent nondeterministic contents, then so can that output stream. Async is the weakest, and will be dominated by any labels with which it combines.

Coordination selection

Blazes will repair nonconvergent and nonconfluent dataflows by constraining how messages are delivered to individual components. When possible, Blazes will recognize the compatibility between sealed streams and component semantics, synthesizing a seal-based strategy that avoids global coordination. Otherwise, it will enforce a total order on message delivery, say using ZooKeeper.

*Sealing Strategies.* To determine that the complete partition contents are available, the consumer must a) participate in a protocol with each producer to ensure that the local per-producer partition is complete, and b) perform a unanimous voting protocol to ensure that it has received partition data from each producer. The voting protocol is a local form of coordination, limited to the "stake holders" contributing to individual partitions. When there is only one producer instance per partition, Blazes need not synthesize a voting protocol. Once the consumer has determined that the contents of a partition are (henceforth) immutable, it may process the partition without any further synchronization.

Blazes framework

Blazes can be directly applied to existing programming platforms based on distributed stream or dataflow processing, including Twitter Storm and Spark Streaming. Programmers of stream processing engines interact with Blazes in a graybox manner: they provide simple semantic annotations to the blackbox components in their dataflows, and Blazes performs the analysis of all data ow paths through the program.

Case study: Word count example

The Storm word-count dataflow can be reduced to Taint in the following way once we supply an Async label for its input interface.
If, on the other hand, the input stream is sealed on batch, then Blazes instead produces this reduction:
Because a batch is atomic (its contents may be completely determined once a punctuation arrives) and independent (emitting a processed batch never affects any other batches), the topology will produce deterministic contents in its (possibly nondeterministically ordered) outputs, a requirement for Storm’s replay-based fault-tolerance—under all interleavings.
Figure 20 shows that the sealed solution outperforms the ordering based solution significantly as the number of workers increase.

The TLA+ approach seems to be more fine granularity and can do refinement checks between different models. But it is preliminary so far. The Blazes approach can handle components. It looks like it is more coarse granularity but can be applied easier.

Saturday, November 25, 2017

The below definition for dataflow programming is from Wikipedia (with some small edits):
"Traditionally, a program is modeled as a series of operations happening in a specific order; this may be referred to as sequential, procedural, or imperative programming. The program focuses on commands for programming, where data is normally /at rest/.

In contrast, dataflow programming emphasizes the movement of data; it models a program as a series of connections with explicitly defined inputs and outputs, and connect operations. An operation runs as soon as all of its inputs become available. Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems."

Map-Reduce uses a very simple directed-acyclic-graph (DAG) with only two operations: map and reduce. Spark extends map-reduce to a couple dozen operations in addition to map and reduce, and implements data storage/processing to be in-memory. More specifically, in Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex denotes a Resilient Distributed Dataset (RDD) and each edge denotes an operation on RDD. RDDs are collection of objects divided in logical partitions that are stored and processed as in-memory, with shuffle/overflow to disk.

I had written summaries about GraphX and GraphLab approaches to dataflow as well.

In this post, I will focus more on Naiad (and timely-dataflow), and try to compare/contrast that with TensorFlow.

Naiad, timely-dataflow, and differential-dataflow

Naiad introduced a dataflow programming framework that allows cycles in the dataflow graph. Naiad's dataflow model supports structured loops allowing feedback in the dataflow, stateful data flow vertices capable of consuming and producing records without global coordination, and notifications for vertices once they have received all records for a given round of input or loop iteration. Here is my summary of Naiad from 2014.

Differential dataflow (i.e., incremental dataflow) means streaming iterative computation and doing incremental computation only in response to changing data. "If you change one input to your computation, you would prefer not to re-evaluate the entire computation. Rather, changes to the input produces some changes in the output of your first dataflow operator, which are then changes to the inputs of subsequent operators. Eventually these changes either evaporate or spill out the end of your dataflow as changed outputs. As changes flow around the dataflow, they happen at various logical times. These times reflect the epoch of streaming input data, and iteration counters for any loops they may be contained in. Rather than eagerly aggregate these changes, we leave them disaggregated, allowing future computation to use arbitrary subsets of the changes. Although we leave ourselves more work to do for each change, we end up with an orders-of-magnitude reduction in the numbers of changes."

Theres is a technical distinction between differential and incremental. Incremental works with "sequence of arbitrary updates" whereas differential works with "partial order of arbitrary updates". The former has been around a while, has lots of prior art, etc. Differential is pretty new and while more general is pretty similar in most cases (e.g. streaming join is the same for both of them; streaming join in a loop is only possible in differential).

Naiad versus TensorFlow

TensorFlow is a special instantiation/application of Naiad's timely-dataflow model of cyclic dataflow processing for the domain of machine learning and specifically deep-learning. It is well integrated with tools, Python, GPUs, etc. Here is my summary of TensorFlow if you need a refresher.

I think TensorFlow traded off generality of Naiad and gained much more in return. This is what some people call "addition by subtraction".

Naiad aimed to satisfy real-time querying. TensorFlow is OK with batch training. TensorFlow can still output a result at the end of each minibatch, not a big deal. The real-time querying requirement in Naiad may be an unrealistic and unnecessarily/impractical requirement to shoot for in practical systems. Who knows. Are there practical applications that require tight real-time querying with the most recent updates reflected in the view? (More on this, at the end, on the applications discussion.)

Naiad aimed at precisely correct output. If something changes slightly, Naiad will recompute effected things (incrementally, yay) and give you the correct output. On the other hand, if something changes, TensorFlow will consider the changed things as a new minibatch to train with, as a result of which not much may change. Machine Learning and Deep Learning tolerate uncertainty anyhow. Minibatches are how TensorFlow handles differential/incremental dataflow implicitly. No complicated mechanism is required. If things change, that is another minibatch to train for, and most of the previously computed model parameters may remain unaffected.

(Naiad's differential/incremental processing imposes extra memory requirements. It is unclear to me how much state needs to be held at the stateful-operations about the input-graph so that incremental processing is still possible.)

Finally, Naiad has a join operation a very strong/capable yet a costly operation. TensorFlow does not have a join operation, but that makes TensorFlow operations easier to shard/partition across machines. Maybe TensorFlow doesn't have join operations because its inputs are assumed to be independent. Naiad assumes inputs can be dependent: like parts of a graph, or update to an earlier graph, streaming into Naiad. So Naiad has a join operator but pays the cost for that powerful operator.

Naiad's generality should open it up for more potential applications. Differential dataflow is ideal for graph algorithms that update the graph or operates on a dynamic graph. This should be useful for search engines; when the internet graph and links update, you shouldn't recompute the entire result in batch. I don't know what Google uses most recently, but Google has been using an incremental transactional system leveraging BigTable, called Percolator, to solve this problem, without needing the differential dataflow computing power of Naiad.

Other application areas for Naiad's computing power could be social network applications and recommendation systems. But it seems like social network applications do not need to be very precise, approximate answers are OK with them. And recommendations systems do not need very tight real-time constraints yet.

Heterogeneous GPU reallocation

It looks like they have developed a GPU virtualization tool called Crane recently. "General purpose GPU (GPGPU) computing in virtualized environments leverages PCI passthrough to achieve GPU performance comparable to bare-metal execution. However, GPU passthrough prevents service administrators from performing virtual machine migration between physical hosts. Crane is a new technique for virtualizing OpenCL-based GPGPU computing that achieves within 5.25% of passthrough GPU performance while supporting VM migration. Crane interposes a virtualization-aware OpenCL library that makes it possible to reclaim and subsequently reassign physical GPUs to a VM without terminating the guest or its applications. Crane also enables continued GPU operation while the VM is undergoing live migration by transparently switching between GPU passthrough operation and API remoting." Here is the video of conference presentation at SYSTOR'17.

The HotCloud paper is a position paper motivating the use of the Crane tool. They say GPU workloads underutilize GPU device memory (using less than 37%). So there is an opportunity to reallocate heterogenous GPUs within existing VMs, upgrading and downgrading them. They employ Crane to support reallocation of heterogenous GPU models within a VM and also by allowing live migration of VM to a new host.

The HCl scheduler: Going all-in on Heterogeneity

The paper builds a new scheduler HCl that tries to fully model and support resource heterogeneity. HCl operates on 2 annotated DAGs: one representing the Spark application, and one representing the available hardware resources.

HCl uses task runtime estimations for each resource to find the optimal mapping of a task, and estimate future availability of resources. It also tries to estimate the cost of data transfer and includes this as a factor in where to schedule tasks. It uses lookahead scheduling to place parent tasks closer to their children's optimal resource instead of only placing child tasks closer to their parents. Finally, it performs path balancing and tries to balance the execution times of converging paths in a DAG such that the waiting time for tasks at a convergence points is minimized.

This is a complicated scheduling algorithm, and would likely have problems scaling as the size of tasks and heterogenous resources grow. Scheduling needs to be done quickly, and these calculations will take time. The evaluations don't test at scale.

There is not much detail about HCl implementation in the paper, but it looks like HCl is not a Mesos framework, it is a separate scheduler. It must be more like Borg, where the scheduler gets to see not only resources but also all the tasks.

I am interested in the paper because it talks about TLA+ modeling of TensorFlow graphs and uses that for creating an operational semantics for TensorFlow programs. In other words, the paper provides a conceptual framework for understanding the behavior of TensorFlow models during training and inference.

As you recall, TensorFlow relies on dataflow graphs with mutable state. This paper describes a simple and elementary semantics for these dataflow graphs using TLA+. The semantics model does not aim to account for implementation choices: it defines what outputs may be produced, without saying exactly how. A framework of this kind does not just have theoretical/academical value; it can be useful to assess correctness of TensorFlow's dataflow graph (symbolic computation graph) rewriting optimizations such as those in XLA.

TensorFlow refresher

In TensorFlow, dataflow graphs support both training and inference. That is, a computation may perform one or more steps of training for a machine-learning model, or it may be the application of a trained model. TensorFlow models are assembled from primitive operations by function composition. The operations are implemented by kernels that can be run on particular types of devices (for instance, CPUs or GPUs).

In addition to edges for communicating tensors, a graph may include control edges that constrain the order of execution. This order can affect performance as well as the observable semantics in the presence of mutable state.

A client typically constructs a graph using a front-end language such as Python. Then the client can make a call to run the graph, specifying which inputs to "feed" and which outputs to "fetch". TensorFlow propagates the input values, repeatedly applying the operations prescribed by the graph, until no more nodes can fire. The order in which nodes fire is constrained by data dependencies and control edges, but is not necessarily unique. The execution ends with values on the graph's output edges. Often, a graph is executed multiple times. Most tensors do not survive past a single execution of the graph. However, mutable state does persist across executions.

Core Computational Model

This section presents TLA+ modeling of the TensorFlow.

TensorFlow values include tensors and variables. The TLA+ model distinguishes three kinds of edges: tensor edges, variable edges, and control edges.

Operations are of several kinds: Functions, Var(x), Read, Assign. A TensorFlow program consists of a directed acyclic graph G, plus a mapping (a "labelling") L from nodes of G to TensorFlow operations.
A TensorFlow program starts with non-EMPTY input edges, consumes the values on those edges, and repeatedly propagates them, applying operations, until no more nodes can fire. In the course of such an execution, each node fires exactly once, and the execution ends with non-EMPTY output edges. The order in which nodes fire is not necessarily unique: determinism is not always expected or desired. For example, lock-free stochastic gradient descent is a common source of intentional race conditions.

Each change of state in the behavior is caused by the execution (i.e., the firing) of exactly one node in the graph. A condition for whether a node n can cause a change from a state s to a state s′ is that for all its incoming control edges d, InTransit(d) = GO in s and InTransit(d) = EMPTY in s′, and for all its outgoing control edges e, InTransit(e) = GO in s′. Moreover, InTransit(d) must be the same in s and in s′ for all edges d not incident on n, and VarValue(x) must be the same in s and in s′ for all variables x, except in the case where L(n) = Assign-f for some function f.
Figure 3 shows a computation that *assign-adds* x:=x+D and x:=x+E. The *assign-adds* commute for x, but *writes* alone do not commute and leads to a
last-write-wins race condition, as in Figure 4.
A race condition can be avoided by other means, such as implementing mutexes using TensorFlow control primitives.

Further work

The paper claims that this semantics can be generalized to accommodate cyclic execution graphs in TensorFlow. In that generalization, instead of firing exactly once, each node fires at most once in each "execution context", where an "execution context" identifies an iteration of each (possibly nested) loop that contains the node.

Monday, November 20, 2017

I made it a tradition to start my CSE 4/586: Distributed Systems lectures with a short YouTube video. As the students trickle into their seats, I play a YouTube video related to the lecture of the day.

Video is an underutilized medium in teaching. I don't understand why more professors do not use them. They are visual and they wake the students up, and get them interested in the class. After the students are primed with the video clip about the context of the lecture, they listen the class more attentively. I get overwhelmingly positive feedback from the students over many years about this practice.

In the first couple weeks of the class I teach about specifying and reasoning about the safety and progress properties of distributed programs. Some students find this short forage into predicate logic and computation model tedious. So to motivate them I tell them that this understanding of program properties and reasoning about them will get very handy as they learn more complicated algorithms soon. I tell them to trust me on this: /before they can do karate, they need to learn about some tedious wax on and wax off/. Then I show them scenes from Karate Kid, an awesome movie from my childhood. The movie is from 1984, and recently I came to the sobering realization that it is older than my students in the class.

For consensus, I don't have anything good to show. Instead I bring a couple students to the front of the class and we re-enact the attacking generals problem with messengers running between the two generals.

Saturday, November 18, 2017

The programming model that underlies several popular toolkits such as TensorFlow uses a static declaration approach: they separate declaration and execution of the network architecture.

Static declaration has a number of advantages. After the computation graph is defined, it can be optimized in a number of ways so that the subsequent repeated executions of computation can be performed as quickly as possible. This also simplifies distribution of computation across multiple devices, as in TensorFlow. But static declaration is inconvenient for the following:

variably sized inputs

variably structured inputs

nontrivial inference algorithms

variably structured outputs

Of course, it is possible to process variable sized inputs if the computation graphs can represent objects whose size is unspecified at declaration time. Flow control operations such as conditional execution and iteration can be added to the inventory of operations supported by the computation graph. For example, to run an RNN over variable length sequences, Theano offers the scan operation, and TensorFlow offers the dynamic RNN operation.

While it is therefore possible to deal with variable architectures with static declaration in theory, that still poses some difficulties in practice:

Dynamic declaration

The dynamic declaration model in Dynet takes a single-step approach: the user defines the computation graph programmatically as if they were calculating the outputs of their network on a particular training instance. There are no separate steps for definition and execution: the necessary computation graph is created, on the fly, as the loss calculation is executed, and a new graph is created for each training instance. (To avoid the overhead, DyNet strives to provide very lightweight graph construction.)

Dynamic declaration reduces the complexity of the computation graph implementation since it does not need to contain flow control operations or support dynamically sized data. DyNet is designed to allow users to implement their models in their preferred programming language (C++ or Python). A symbolic computation graph is still constructed, but by using the host language (C++ or Python) rather than providing them separately at the computation graph level. Thus, dynamic declaration facilitates the implementation of more complicated network architectures.

What is the innovation in DyNet?

DyNet aims to minimize the computational cost of graph construction in order to allow efficient dynamic computation. This way DyNet aspires to remove barriers to rapid prototyping and implementation of more sophisticated applications of neural nets that are not easy to implement in the static computation paradigm.

DyNet's backend, which is written in C++, is optimized to remove overhead in computation graph construction, and support efficient execution on both CPU and GPU. This is feasible to do. Since flow control and facilities for dealing with variably sized inputs remain in the host language (rather than in the computation graph, as is required by static declaration), the computation graph needs to support fewer operation types, and these tend to be more completely specified (e.g., tensor sizes are always known rather than inferred at execution time).

DyNet programs

DyNet programs follow the following template
1. Create a Model.
2. Add the necessary Parameters and LookupParameters to the model. Create a Trainer object and associate it with the Model.
3. For each input example:
(a) Create a new ComputationGraph, and populate it by building an Expression representing the desired computation for this example.
(b) Calculate the result of that computation forward through the graph by calling the value() or npvalue() functions of the final Expression
(c) If training, calculate an Expression representing the loss function, and use its backward() function to perform back-propagation
(d) Use the Trainer to update the parameters in the Model

In contrast to static declaration libraries such as TensorFlow, in DyNet the "create a graph" step falls within the loop. This has the advantage of allowing the user to flexibly create a new graph structure for each instance and to use flow control syntax (e.g., iteration) from their native programming language.

Here is an example program.

This program shows the process of performing maximum likelihood training for a simple classifier that calculates a vector of scores for each class it will be expected to predict, then returns the ID of the class with the highest score. Notice that, at line 14: symbolic graph is defined dynamically, at line 15: forward pass is executed, and at line 16: backward pass automatic diff is executed. At line 19, after the training, inference is done. To account for dynamic input/graphs at inference, the graph is reconstructed for each serving input.

Dynet allows dynamic flow control at the inference time easily. This can allow the classifier to avoid wasting processing time when the answer is clear. It is also possible to perform dynamic flow control at training time, and this supports more sophisticated training algorithms using reinforcement learning. These algorithms require interleaving model evaluation and decision making on the basis of that evaluation.

How do we make DyNet distributed

Dynet is currently centralized. There is also support for automatic mini-batching to improve computational efficiency, taking the burden off of users who want to implement mini-batching in their models. For more complicated models that do not support mini-batching, there is support for data-parallel multi-processing, in which asynchronous parameter updates are performed across multiple threads, making it simple to parallelize (on a single machine) any variety of model at training time.

Thursday, November 16, 2017

It is a bad sign if the references section of a paper fails to cite any recent papers. That means, either the authors are not aware of any recent work on the area, or the area is not of interest to other researchers.

But it is also a bad sign if the references section of a paper fails to cite any old papers. That means, the authors likely do not know enough about the fundamental/foundational work in the area.

There is a case to be made for working from the fundamentals, the first principles. Elon Musk made that case in his work, and showed that you can make transformative work, even in the commercial technology world, by working from the first principals.

Working from the first principles is also essential for research. It is not uncommon to get your best ideas when preparing for a class. Sometimes reviewing fundamental work in a topic, you notice a gap, some weird under-explained assumption, and go "huh, why is it that way". Or sometimes the students (or outsiders of the field) ask a basic question from the left field, and that may start an investigation. And sometimes, you see the old idea/algorithm as a promising/useful fit in new emerging applications. Recently the flexible quorums extension to Paxos is a good example of working from first principles. Nobody expected that result after 30 years of Paxos.

Back to the seminar

Reading these golden oldies could be a chance to revisit the fundamentals of the field. When reading these papers, we can note about how they aged (which parts aged well, which not) and how the context changed from then to now. We can use Google Scholar to investigate how these papers got cited in the intervening years. We can also consider if some of these algorithms can find new applications in modern systems.

But I am not sure what the selected papers will be yet. Here is my first brain dump on this. It may take several weeks to finalize the list. If you have suggestions, please let me know in the comments. What should be the cutoff date for these papers? 2000 seems to be a reasonable cutoff date. We may even stretch this up to 2007 to include a paper or two.

Lamport. Time, clocks, and the ordering of events in a distributed system, 1978.

The growth mindset is best understood with its antithesis: the fixed mindset. The fixed mindset says that your abilities and capacity are predetermined at birth. The growth mindset says they can be altered and improved.

So, why is this important? This sounds like just an issue of perspective. But a perspective change (a paradigm shift) is capable of changing a lot of things.

The book argues that fixed mindset thinking leads people to play defensive. The fixed mindset people are scared of failures and embarrassment as they would show to the world their capacity (limited and unchanged). So in order not to fail and save face they don't attempt things.

In contrast, the growth mindset people (i.e., the people who embraces the growth mindset/perspective) love challenges. They are not scared of failure and embarrassment. They may chalk up failure and embarrassment as learning and improvement opportunities. Of course this does not mean that they can absolve themselves of any responsibilities or guilt about their failings/shortcomings. They need to accept the pain in order to grow: no pain, no gain. This just means that the growth mindset people are not affixed to the fixed capabilities and skills defining their self worth. They know they can grow. The growth mindset is akin to taking an antifragile approach to personality and life.

The fixed versus growth mindset has several implications in education, parenthood, relationship, sports-coaching. One practical advise (you have certainly heard this from media/blogs/newspapers) is about how to praise your kids. Don't praise your kids about how smart they are as this would reinforce the fixed mindset. Praise them about how hard they try, how persistent they are, how courageous they are to attempt/try new challenges. This way you reinforce the growth mindset in them.

One of the biggest harms we can do to our gifts is to raise them with the fixed mindset. And one of the biggest gifts we can give to our kids is to raise them with the growth mindset. Don't just teach them how to fish. Teach them the concept/notion that they can always become much better fisherman than you were and they currently are.

Sunday, November 12, 2017

I have recently read this 2017 book about the collaboration between two psychologists, Amos Tversky and Daniel Kahneman. The book is by Michael Lewis, a truly great storyteller. From his pen, the academic collaboration story of two social scientists becomes a love story and a thriller. I wouldn't ever imagine social science and behavioral economics could be this exciting. So I plan to read more of Michael Lewis's books: Flash Boys, The Big Short, Moneyball, The Blind Side, and Liar's Poker.

Here are some short notes from the book.

Danny Kahneman had a very tough childhood. His family survived (except the father) through Nazi prosecution and World War 2, and were able immigrate to Israel in 1948. He was a gifted child and starred in academia, although through out his life he was always had doubts about his talents and was always unsure of himself.

Amos Tversky was born in Israel and served in the Israel army for many years. He got educated at US for graduate school in psychology. He was very bright, which led to his colleagues coining an IQ test: "The faster you realized Tversky was smarter than you, the smarter you are."

The book contains many amazing quotes from Amos. When he was asked why/how he had become a psychologist, Amos replied: "The big choices we make are practically random. The small choices probably tell us more about who we are. Which field we go into may depend on which high school teacher we happen to meet. Who we marry may depend on who happens to be around at the right time of life. On the other hand, the small decisions are very systematic. That I became a psychologist is probably not very revealing. What kind of psychologist I am may reflect deep traits."

Amos's early research on similarity of things was also very interesting, and the book explains this very nicely. Similarity is not like distance, because it does not necessarily have symmetry. People assess similarity essentially by making a list of features. The more noticeable features two objects share, the more similar they are. Note that, objects have varying number of noticeable features: New York has more of them than Tel Aviv. As a result, New York is not as much like Tel Aviv as Tel Aviv is like New York. Hence, similarity can be asymmetric. (In other words, similarity is like Bayesian subset comparison of features.) The book has a nice anecdote about how Amos instated that sometimes a lack of a feature is a feature: e.g., a three legged dog.

When Amos and Danny got together as colleagues, they started an epic collaboration streak investigating the irrational ways humans make decisions about risk. Their collaborative research started the behavioral economics field.

Their research shows a prime example of academic courage and creativity. They were able ask very original and daring questions and they were brave enough to pursue those questions and break ground in uncharted territory. An example of the kind of thought experiments they performed is the Asian disease problem which elegantly demonstrates the power of framing.

I will not divulge more about the collaboration of this duo and about the book. I enjoyed the book immensely and I strongly encourage you to read the book to learn about the collaboration story of these two great men. The book gives great insights about how to approach research as well.

This paper was a little confusing to read. I think it could have been presented better to make its contributions more clear. The paper aims to enable multiple distributed ML frameworks, say TensorFlow, Petuum, MxNet, share the same cluster.

Enterprises have clusters, managed by a cluster management systems (CMSs). The paper starts with a review of existing CMSs, and mentions shortcomings with each. It is unhappy with application-level scheduling, because there each application reserves and keeps all allocated resources until completion, and this leads to low utilization of the resources as the scheduling is done for the peak/maximum resource needs of the application.

In order to enable distributed ML workloads share a single cluster, Dorm uses two techniques: a dynamically-partitioned cluster management mechanism and an utilization-fairness optimizer.

The solution Dorm proposes is simple. It uses Docker containers to partition a cluster and runs one application per partition. Each application places its tasks on the assigned partition without petitioning for resources. Dorm can then adjust the existing resource allocations (i.e., number of containers in a partition) to keep a high resource utilization.

When adjusting an application's resources, Dorm first saves its state to a reliable storage system (e.g., Lustre filesystem). Then Dorm kills this application and creates/destroys containers on corresponding servers. Finally, Dorm resumes the killed application from the saved state with new resource allocations. This way, distributed ML applications can dynamically scale up/down without recomputing from the first iteration.

But this leaves me with several questions. Is this checkpointing only for the parameter-server state and not the worker states? Would the checkpointing work with TensorFlow which has dataflow graphs and associated state at each worker? Would those worker states matter? Would the states of the channels (i.e., messages in transit) matter? Finally, how is the checkpoint done in a distributed manner? The checkpoints will naturally appear at different points in different workers/containers; does that cause a problem?

The paper reports that Dorm was implemented to work with Petuum, MxNet, TensorFlow and MPI-Caffe. But details lack about how the implementations are done and how the application frameworks are modified to work with Dorm.

Single round consensus algorithm

The code above is pretty straightforward. The while loop between lines 36-42 models how a node sends its vote to other nodes one by one. The sender node can fail in this while loop after some of the nodes received the vote. So if we model check with FAILNUM=1, the agreement invariant is violated in this single round algorithm as seen in the error trace below.
The blue highlighted line, state 15, is the last line in the error trace, and the value of the state variables are listed in the window below. If you inspect "up" you can see node 1 is down. Checking "mb" you can see node 2 received node 1's vote, but node 3 did not receive node 1's node. As a result, the decision "d" for node 2 is "1", whereas node 3 decides "2", and both decisions are finalized. So the invariant "Agreement" is violated.

(Note that even if we had 10 nodes, and FAILNUM=8, we could have extended this scenario by failing always the smallest id up node in each round after it delivers the message to the next node in the sequence keeping the "1" vote alive but hidden.)

Another interesting part in the code occurs at lines 43 and 44.
After sending its vote to the other nodes, the node increments its round number, pt, by 1. But then it "awaits" other nodes to catchup, and goes to the next round only after this synchronization await at line 44. Note that the node awaits only for the nodes that are "up". If it waits for a down node to increment its pt+1, it would have to wait forever.

This await at line 44 cuts corners: it assumes shared memory instead of message passing. One way to implement this unrealistic "await" is to use physical time, but even that is a brittle method. In reality it is hard to implement perfectly synchronized rounds. Physical clock synchronization is hard, and since the OS is not a real-time OS, timing assumptions can be violated, say due to garbage collection kicking in, or due to VM/container getting slow, or network contention.

When the round synchronization assumption is broken, this algorithm fails. Trust me, you don't want your consensus algorithm, that the rest of your coordination infrastructure depends on, to fail. That is why consensus algorithms adopted in practice, such as Paxos, Raft, Zab, Viewstamped Replication, do not rely on synchronized rounds, and can tolerate (in the sense that agreement is not violated) extreme asynchrony in the system. When the timing assumptions normalize a bit, those algorithms then achieve progress and solve consensus.

Crash tolerant synchronized round consensus algorithm

To tolerate crash faults, it is clear that the algorithm needs to be extended to delay decision to future rounds where each node can ensure that all the nodes have the same set of values from which to decide.

To this end, Line 32 introduces a top-level while loop to the original model iterate through multiple rounds.

The important question here is to determine which round is safe to decide.

The trivial way to do this is to always iterate FAILNUM+1 rounds before deciding, but that is a wasteful solution. FAILNUM is the maximum number of faults that can occur, and the algorithm should be able to decide in less number of rounds in the common case when faults do not occur. But how do you tell that asymmetrically, only using the node's own perception of the system, which is by definition partial and always slightly stale.

One way to do this is to look at the stability of the set of proposed votes and compare mb, the mailbox contents for this round, with pmb, the mailbox contents in the previous round. If there is a potential for fault, it follows that the algorithm should always go to round 2, to confirm with others. The delaying of the decision should continue until the proposed votes converge to a single value for two consecutive rounds and the cardinality of the mailbox is also important because it witnesses to the fact that there are no faults so the above pathological crash sequence of vote hiding is avoided.

Observations

I had written a faulty version of the multi-round algorithm in my first try. I had not taken the cardinality of the set into account and went with straightforward set union. It didn't give any violations of the invariant for N=5 and FAILNUM=3, but the progress part was taking more than an hour on my laptop and I stopped running it. Turns out that version was susceptible to the pathological crash sequence and vote hiding as above. All the nodes decide with "2" but there is this node who just received a "1" vote which was still alive but hidden. So this node goes to next round, but since others have decided, this node will await forever. This is a wacky flavor of consensus, which can still be acceptable maybe if this minority report node kills itself with up:=FALSE. This led me to improve the line 44 condition. Another bug was about a node in a higher round sending messages which gets consumed by a node in a lower round, which leads to nodes getting stuck. To solve this, I had to improve the condition at line 37.

I found out these mistakes when I started writing this blog post. So writing and explaining is an indispensable part of the design process. If TLA+ model checking was more performant, I wouldn't give up prematurely, and I would still know about the solution. If model checking is taking long, it may be best to do it on the cloud, say on an AWS, Azure, or GCE. But the state space explosion is an inherent and dangerous nemesis for model-checking and it will bite. The best precaution is to keep things simple/minimal in the modeling. But that is not always easy.

Monday, November 6, 2017

When you assign a distributed machine learning (ML) application resources at the application level, those resources are allotted for many hours. However, loss improvements usually occur during the first part of the application execution, so it is very likely that the application is underutilizing the resources for the rest of the time. (Some ML jobs are retraining of an already trained DNN, or compacting of a DNN by removing unused parameters, etc., so blindly giving more resources at the beginning and pulling some back later may not work well.)

To avoid this, SLAQ allocates resources to ML applications at the task level, leveraging the iterative nature of ML training algorithms. Each iteration of the ML training algorithm submits tasks to the scheduler with running times around 10ms-100ms. This is how Spark based systems operate readily anyways. (The Dorm paper criticized this iteration-based task scheduling approach saying that it causes high overhead for scheduling and introduces delays for waiting to get scheduled, but there was no analysis on those claims.)

SLAQ collects "quality" (measured by "loss" really) and resource usage information from jobs, and using these it generates quality-improvement predictions for future iterations and decides on future iteration task scheduling based on these predictions.

The paper equates "quality" with "loss", and justifies this by saying:
1) "quality" cannot be defined unless at the application level; so to keep it general let's use "loss"
2) for exploratory training jobs, reaching 90% accuracy is sufficient for quality, and SLAQ enables to get there in a shorter time frame.

On the other hand, there are drawbacks to that. While delta improvements on loss may correspond to improvements on the quality, the long-tail of the computation may still be critical for "quality", even when loss is decreasing very slowly. This is especially true for non-convex applications.

The paper normalizes quality/loss metrics as follows: For a certain job, SLAQ normalizes the change of loss values in the current iteration with respect to the largest change it has seen for that job so far.

SLAQ predicts an iteration's runtime simply by how long it would take the N tasks/CPUs can process through S the size of data processed in an iteration. (minibatch size.)

For scheduling based quality improvements, the paper considers couple metrics, like maximizing the total quality and maximizing the minimum quantity. The paper includes a good evaluation section.

In conclusion, SLAQ improves the overall quality of executing ML jobs faster, particularly under resource contention, by scheduling at a finer granularity task-level based on the observed loss improvements.

Saturday, November 4, 2017

In my distributed systems class for Fall 17, I assigned modeling of the synchronized round consensus algorithm as the first project. I have been assigning TLA+/PlusCal modeling projects in my class for the last 4 years and releasing the projects and their solutions. I believe this is useful for the distributed systems community, because at this point the barrier before wider adoption of TLA+ tools seems to be the lack of more TLA+ modeling examples of algorithms/systems. My goal is to provide a TLA+/PlusCal example for everything I teach in the class. This way the students will get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures.

Timing of the project

I think I timed this project well. In the first month of the semester, I have covered reasoning about distributed programs in terms of safety and progress properties and gave them sufficient introduction to TLA+/PlusCal as well. While the students worked on the project, I covered time/state in distributed systems (logical/vector clocks, distributed snapshots, asynchrony concerns), and distributed mutual exclusion and dining philosophers.

The submission deadline for the project was just before I cover consensus. Working on the project prepared the students to appreciate the challenges of consensus, and primed them about the issues with some hands on practice. So, this is in some sense this is my poor man's implementation of a flipped classroom.

The Project 1 makes unrealistic assumptions of synchronized rounds and reliable channels. Last week I told the students about why and how these needs to be relaxed further, and that made a very nice introduction to Paxos and the failure detectors discussion that comes after.

Synchronized consensus

Every process broadcasts (to all other processes, including itself) its initial value v. In a synchronous network, this can be done in a single "round" of messages. After this round, each process decides on the minimum value it received.

If no faults occur, this algorithm is correct. In the presence of a crash fault, however, a problem can arise. In particular, if a process crashes during a round, some processes may have received its (low) initial value, but others may not have. (Note that the channels are always assumed to be fault-free; they deliver messages reliably once a message is put to the channel.)

Use the template below as your starting point, and fill in the redacted parts. Write and test an invariant property to capture the Agreement property of the consensus protocol. The agreement property should be satisfied when FAILNUM=0, i.e., when no node is allowed to fail. The property will fail to be satisfied when FAILNUM>0. In that case, write in the comments section, after the "================" line, your findings/observations about how the Agreement property is violated.

Extending the algorithm to address crash faults

To address crash faults, consider this simplifying assumption: say that at most 1 process can crash. How can we modify the algorithm to handle such a failure? (Note again that the channels are always fault-free; they deliver messages reliably once a message is put to the channel.)

Answer: by using 2 rounds. In the 1st round, processes broadcast their own initial value. In the 2nd round, processes broadcast the minimum value they heard. Each process then decides on the min value among all the sets of values it received in the 2nd round.

If the one crash occurs during the first round, the second round ensures that all processes have the same set of values from which to decide. Else, if the one crash occurs during the second round, the first round must have completed without a crash and hence all processes have the same set of values from which to decide.

Without knowing/referring-to FAILNUM, modify your first PlusCal algorithm to achieve consensus in the presence of crash faults. The key observation is that if no crash occurs during a round, all processes have the same set of values from which to decide and they correctly decide on the same minimum value.

Future projects

For the second project, I am assigning two phase transaction commit modeling. There are already models of this available from Lamport's webpage, and I ask students to model what happens when the initiator/transaction manager (TM) fails, how would a backup (TM) take over, and what type of problems would arise in an asynchronous system where failure-detection timeouts may fail.

I don't know what projects I will assign next year. Viewstamped replication modeling maybe? Another interesting one would be modeling blockchains and consensus? Feel free to suggest me ideas in the comments or via email.

While the projects may change every year, one thing is invariant: My class enables you to start a lifelong career in modeling, regardless of your looks. And as I continue my career in modeling, I find that, yes, the night life is pretty exciting: TLA+ has a way of enticing people to burn midnight oil.