EBML: A binary encoding format

1EBML

This is a Racket library to read and write EBML, the "extended
binary markup language." It was designed by the implementors
of the “matroska” project.

EBML is a packed format for representing structured data.

element = header data

header = encoded-id encoded-len

data = element ...

| non-element

The encoded-id and encoded-len fields use a
variable-length encoding. The first byte of this encoding
indicates—through the number of leading zeros—how many
bytes long the field is. the leading zeros and the first one
are then trimmed from the given number of bytes, and the result
is interpreted as a big-endian unsigned number.

So, for instance, the byte string

(bytes#b10001011)

... encodes the length 11, while the byte string

(bytes#b01000001#b00000010)

... encodes the length 258. Note that there is more than one
way to encode a given length; you could also encode the length
11 with the byte string

(bytes#b00100000#b00000000#b00001011)

Header IDs are limited to four bytes, and data lengths are limited
to 8.

EBML has at least one big problem, which is that the packed
representation is ambiguous. Specifically, there’s no way to
reliably distinguish data that is a sequence of elements from
data that is a binary blob.

To choose a concrete example, if you were using EBML to encode
s-expressions, you might choose a particular header id
(say, 1) to encode a pair, and another one (say, 2) to encode
atoms (let’s use 3 to indicate null, just to simplify).
The pairs would include two sub-elements, and the atoms
would contain, say string or integer. When decoding an encoded
stream, the reader needs to have a priori knowledge that the
header id 1 contains sub-elements, and the headers 2 and 3
do not.

Reads ebml elements from the given input-port, byte string, or file.
It continues reading ebml elements until the source is exhausted.
It signals an error if the end of the source doesn’t correspond with
the end of an element.

Since the EBML parser has no way of knowing which elements are
compound elements, using the parser usually requires manual
recursion into the parser. So, for instance, if we used the encoding
scheme given above where 1 encodes a pair, 2 encodes an atom, and
3 encodes null,
a decoder might look like this:

;ping-pongrequiredtoknowwhentorecur:

(define(read-ebml-sexpsbytes)

(mapexpand-ebml-sexp(ebml-readbytes)))

;recuronelementsthatarecontainers:

(define(expand-ebml-sexpelement)

(matchelement

[(list1exps-bytes)

(applycons(read-ebml-sexpsexps-bytes))]

[(list2atom-bytes)

(string->symbol(bytes->string/utf-8atom-bytes))]

[(list3atom-bytes)

empty]))

The file "examples/sexp-ebml.rkt" contains a working
example of this.

procedure

(ebml-read/optimisticin)→(listofebml-element?)

in:(or/cinput-port?path-string?bytes?)

Read a sequence of ebml-elements from the input. Optimistically
assume that any byte sequence that can be parsed as ebml elements
should be. The danger is that a sequence that was supposed to
represent a simple byte stream will be interpreted as a sequence
of bytes.