Reverse hash encoding is a steganographic encoding scheme which transforms a
stream of binary data into a stream of tokens (eg, something resembling natural
language text) such that the stream can be decoded by concatenating the hashes
of the tokens.

TLDR: Tor over Markov chains

This encoder is given a word size (number of bits), a tokenization function (eg,
split text on whitespace), a hash function (eg, sha1), a corpus, and a modeling
function (eg, a markov model, or a weighted random model). The range of the
hash function is truncated to the word size. The model is built by tokenizing
the corpus and hashing each token with the truncated hash function. For the
model to be usable, there must be enough tokens to cover the entire hash space
(2^(word size) unique hashes). After the model is built, the input data bytes
are scaled up or down to the word size (eg, scaling [255, 18] from 8-bit bytes
to 4-bit words produces [15, 15, 1, 2]) and finally each scaled input word is
encoded by asking the model for a token which hashes to that word. (The
encoder’s model can be thought of as a probabilistic reverse hash function.)
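
For concreteness, here is a minimal sketch of the scheme in Python, assuming
sha1, a 4-bit word size, and a split-on-whitespace tokenizer; the helper names
(truncated_hash, build_model, rh_encode, rh_decode) are illustrative rather
than the codec’s actual API:

    import hashlib
    import random

    WORD_SIZE = 4  # bits per word

    def truncated_hash(token):
        # Hash the token and keep only the low WORD_SIZE bits of the digest.
        digest = hashlib.sha1(token.encode()).digest()
        return int.from_bytes(digest, 'big') % (2 ** WORD_SIZE)

    def build_model(corpus):
        # Map every possible word value to the corpus tokens which hash to it.
        model = {}
        for token in corpus.split():
            token += ' '  # the tokenizer must append its delimiter (see below)
            model.setdefault(truncated_hash(token), []).append(token)
        # The model is only usable if the whole hash space is covered.
        assert len(model) == 2 ** WORD_SIZE, "corpus does not cover hash space"
        return model

    def bytes_to_words(data):
        # Scale 8-bit bytes to WORD_SIZE-bit words,
        # eg, [255, 18] -> [15, 15, 1, 2].
        bits = ''.join(format(b, '08b') for b in data)
        return [int(bits[i:i + WORD_SIZE], 2)
                for i in range(0, len(bits), WORD_SIZE)]

    def rh_encode(data, model):
        # The model acts as a probabilistic reverse hash: for each input word,
        # emit some token which hashes to that word.
        return ''.join(random.choice(model[w]) for w in bytes_to_words(data))

    def rh_decode(text):
        # Decoding needs no model: re-tokenize, hash, concatenate the bits.
        bits = ''.join(format(truncated_hash(tok + ' '), '0%db' % WORD_SIZE)
                       for tok in text.split())
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

Note that decoding requires no model or corpus at all: anyone who knows the
tokenizer, hash function, and word size can decode the stream.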

The tokenization function needs to produce tokens which will always re-tokenize
the same way after being concatenated with each other in any order. So, for
instance, a “split on whitespace” tokenizer actually needs to append a
whitespace character to each token. The included “words” tokenizer replaces
newlines with spaces; “words2” does not, and “words3” does sometimes. The other
included tokenizers, “lines”, “bytes”, and “asciiPrintableBytes”, should be
self-explanatory.
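
To illustrate the property (this is a sketch, not the implementation of the
included tokenizers), a split-on-whitespace tokenizer which appends its
delimiter re-tokenizes stably; this one also replaces newlines with spaces,
loosely like the “words” tokenizer described above:

    def words_tokenizer(text):
        # Appending the delimiter makes tokens re-tokenize identically after
        # being concatenated with each other in any order.
        return [token + ' ' for token in text.replace('\n', ' ').split()]

    stream = ''.join(words_tokenizer('foo bar') + words_tokenizer('baz\nqux'))
    assert words_tokenizer(stream) == ['foo ', 'bar ', 'baz ', 'qux ']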

For streaming operation, the word size needs to be a factor or multiple of 8.
(Otherwise, bytes will frequently not be deliverable until after the subsequent
byte has been sent, which breaks most streaming applications). Implementing the
above-mentioned layer of timing cover would obviate this limitation. Also, when
the word size is not a multiple or factor of 8, there will sometimes be 1 or 2
null bytes added to the end of the message (due to ambiguity when converting
the last word back to 8 bits).
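
To see where those null bytes come from, here is a sketch of the scaling
round trip, assuming the last word is zero-padded (to_words and to_bytes are
illustrative helpers, not the codec’s API):

    def to_words(data, wordsize):
        # Zero-pad the bit string so it divides evenly into words.
        bits = ''.join(format(b, '08b') for b in data)
        bits += '0' * (-len(bits) % wordsize)
        return [int(bits[i:i + wordsize], 2)
                for i in range(0, len(bits), wordsize)]

    def to_bytes(words, wordsize):
        # Keep only whole bytes; leftover bits are dropped.
        bits = ''.join(format(w, '0%db' % wordsize) for w in words)
        usable = len(bits) - len(bits) % 8
        return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

    # 4 divides 8, so the round trip is exact:
    assert to_bytes(to_words(b'\xff\x12', 4), 4) == b'\xff\x12'
    # 13 is neither a factor nor a multiple of 8: the padding bits of the
    # last word complete an extra byte, and a null byte appears at the end.
    assert to_bytes(to_words(b'\xff\x12', 13), 13) == b'\xff\x12\x00'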

The markov encoder supports two optional arguments: the order of the model
(the number of previous tokens which constitute a previous state; the default
is 1), and --abridged, which removes from the model all states that do not
lead to complete hash spaces. If --abridged is not used, the markov encoder
will sometimes have no matching next token and will need to fall back to the
random model. If -v is specified before the command, the rate of model
adherence is written to stderr periodically. With a 3MB corpus of about half
a million words (~50,000 unique), at 2 bits per word (as in the SSH example
below) the unabridged model is adhered to about 90% of the time.
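
A sketch of that fallback behavior, assuming an order-1 model: markov_model
maps the previous token to its observed successors, random_model maps each
word value to every corpus token hashing to it, and both names (like
pick_token itself) are illustrative:

    import hashlib
    import random

    def truncated_hash(token, wordsize=2):
        # 2 bits per word, as in the adherence figure above.
        digest = hashlib.sha1(token.encode()).digest()
        return int.from_bytes(digest, 'big') % (2 ** wordsize)

    def pick_token(word, prev_token, markov_model, random_model):
        # Prefer a successor of the previous token which hashes to `word`.
        successors = markov_model.get(prev_token, [])
        candidates = [t for t in successors if truncated_hash(t) == word]
        if candidates:
            return random.choice(candidates)      # model adhered to
        return random.choice(random_model[word])  # no match: random fallback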

Example usage running the Bananaphone codec as a standalone app

encode “Hello\n” at 13 bits per word, using a dictionary and random picker:
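
(The exact invocation is version-dependent and the syntax below is a best
guess; “words,sha1,13” would name the tokenizer, hash function, and word
size, with /usr/share/dict/words serving as the corpus:)

    echo Hello | ./bananaphone.py pipeline 'rh_encoder("words,sha1,13", "random", "/usr/share/dict/words")'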

For troubleshooting, it is helpful to watch the obfsproxy log output and the
tcpdump output. Here is some sample tcpdump output:

    sudo tcpdump -A -ni lo port 4703

    He reached down some extent Party propaganda. white-jacketed chestnut palm bulk yapped Syme the fender, was NOT two children are furthest is a human nature. He drew the photograph we predict an exhaustive troop of the first visit to arrive of words except fear, only guess people in full nearly It
    15:09:03.890842 IP 127.0.0.1.54119 > 127.0.0.1.4703: Flags [.], ack 3725471, win 1007, options [nop,nop,TS val 21694622 ecr 21694622], length 0