MapReduce for Decentralized Computation

I was reading Dean and Ghemawat’s MapReduce paper this morning. It describes a way to write
large-grain parallel programs to be executed on Google’s large
clusters of computers in a simple functional style, in C++. It
occurred to me that the system as described might be well-suited to
computational cooperation between mutually untrusting entities,
computation in what we at CommerceNet refer to as a “decentralized”
system.

Decentralized computational cooperation

There are five major problems with regard to computational
cooperation, by which I mean outsourcing some amount of your
computation to servers run by people you don’t trust completely:

Cracking

The server trusts the client not to break the server.

Storage

The client trusts the server to store the client’s data safely.

Correctness

The client trusts the server to execute the client’s code accurately and produce results quickly.

Confidentiality

The client trusts the server not to disclose the client’s data against the client’s wishes.

Payment

The server usually trusts the client to pay them for the service provided.

Cracking can be dealt with by known methods: metered resource usage
and well-known isolation techniques.

Payment can be dealt with in any number of ways; it should be noted
that it is not entirely separable from cracking.

Storage can be dealt with by replicating or erasure-coding data across
a number of administratively-independent servers.

This leaves the problems of correctness and confidentiality; I think
the MapReduce approach can help with correctness.

Correctness

The “map” function converts a record from an input file into a set of
records in an intermediate file; typically, each input record is
replicated in several places on the cluster, and the “map” function is
deterministic. The “reduce” function converts the set of all
intermediate records sharing the same key (gathered from the many
map-stage output files) into a set of output records, which are
written to an output file; typically the “reduce” function is also
deterministic.

In the usual case, where these functions are deterministic, they can
be executed on two administratively-independent servers, and the
results (which, in the Google case, are merely files) can be compared.
If they differ, the same results can be recomputed on more
administratively-independent servers to see which ones were correct.

(It may be worthwhile to compare Merkle tree hashes of the output
files, rather than the output files themselves, since moving the
output files over the network may entail significant expense.)

This prevents any single broken or dishonest machine or administrative
entity from affecting the correctness of the overall computation.
Higher levels of redundancy can be used to defend against stronger
attackers at relatively modest cost.

Some threats can be defeated by even weaker means with negligible
computational cost. Computing each function for a randomly selected
1% of input or intermediate records, then comparing the results, may
provide an acceptable probability of catching faults or compromises if
they are expected to affect a significant proportion of the output;
and it requires negligible computational cost. In the
ten-billion-record performance tests mentioned in the Google paper, a
corrupt “map” function would have to affect fewer than 694 of the
input records to have less than a 50% chance of detection by the 1%
sample (including 100 million randomly selected records). A corrupt
“map” function that affected 5000 input records — only one out of
every two million — would have only an 0.7% chance of not being
caught.

This probably deals adequately with machine failures and gross
attacks, but a careful attack might corrupt the output for only a
single input record — and would have only a 1% chance of being
caught. This may still be enough if the problem is a result of a
deliberate attack and the attacker is vulnerable to sufficiently
severe penalties.

Confidentiality

Confidentiality is a more difficult problem; computing with
confidential data on hardware that belongs to someone you don’t trust
requires that you compute with encrypted data. In the general case,
this is a very difficult problem.