Overview

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code will probably be rewritten every time a new script is made, due to
slightly differing requirements.

The Bithenge project involves creating a powerful library and tools that will
make working with structured binary data faster and easier. It will consist of:

A core library that manages structured data and provides basic building
blocks for binary data interpretation.

Data providers to access various sources of raw binary data.

Format providers, which can load and save complex format specifications. In
particular, there will be a domain‐specific language for format
specifications.

Clients, programs which use the library to work with binary data. For
instance, there will be an interactive browser.

The initial goals for the project are an interactive browser for filesystem
structures as well as a debugger backend that can interpret core dumps and task
memory.

Requirements

Work in HelenOS—this means the code must be in C and/or an easily ported
language like Lua.

View on different layers. For instance, when viewing a FAT directory entry,
it should be possible to switch between viewing the formatted date and time,
the integers, and the original bytes.

Check whether data is valid; handle broken data reasonably well.

Parse pieces of the data lazily; don’t try to read everything at once.

Work in both directions (parsing and building) without requiring too much
extra effort when specifying the format.

Support full modifications. Ideally, allow creation of a whole filesystem
from scratch.

Trees

Bithenge represents all data in the form of a data structure called a “tree,”
similar to the data structure represented by JSON. A tree consists of a boolean
node, integer node, string node, or blob node, or an internal node with
children. A boolean node holds a boolean value, an integer node holds a signed
integer, and a string holds a Unicode string.

A blob node represents an arbitrary sequence of raw bytes or bits. Blob nodes
are polymorphic, allowing any source of raw binary data to be used. Bithenge
includes blob node implementations for in‐memory buffers, files, and block
devices. An implementation has also been written that reads another task’s
virtual memory, but it hasn’t been committed because it’s unclear whether it
will be useful in its current form.

An internal node has an arbitrary number of children, each with a unique key.
The key can be any node other than an internal node. Arrays can be represented
by internal nodes with integer keys starting at 0. The tree node can provide
children in an arbitrary order; the order will be used when displaying the
tree, but should have no semantic significance. Internal nodes are polymorphic
and can delay creation of child nodes until necessary, so keeping the whole
tree in memory can be avoided.

Internal nodes are currently responsible for freeing their own children. In the
future, it should be possible for there to be multiple references to the same
node, but it isn’t clear whether this should be implemented with symbolic
links, an acyclic graph with reference counting, or a full graph.

Note that all interpreted data is represented in Bithenge with nodes.
Therefore, the word “blob” usually refers to a blob node, and so on.

Transforms

A transform is a function from a tree to a tree. One example is uint32le,
which takes a 4‐byte blob node as the input tree and provides an integer node
as the output tree. Another example would be FAT16_filesystem, a transform
that takes a byte blob node as the input tree and provides a complex output
tree with various decoded information about the filesystem. Some transforms,
like uint32le, are built in to Bithenge; more complicated transforms can be
loaded from a script file.

Transforms are represented in Bithenge with a polymorphic object. The primary
method is apply, which applies a transform to an input tree and creates an
output tree. When a transform takes a blob node as input, it is sometimes
necessary to determine the prefix of a given blob that can be used as input to
the transform; the method prefix_length can be used for this.

Built‐in transforms

These transforms are implemented in C and included with Bithenge. Note that
precise names are preferred; scripts can define shorter aliases if necessary.

name

input

output

description

example

ascii

byte blob node

string

decodes some bytes as ASCII characters

hex:6869 becomes "hi"

bit

1‐bit blob node

boolean

decodes a single bit

1 becomes true

bits_be

byte blob node

bit blob node

decodes bytes as bits, starting with the most‐significant bit

hex:0f becomes bit:00001111

bits_le

byte blob node

bit blob node

decodes bytes as bits, starting with the least‐significant bit

hex:0f becomes bit:11110000

known_length(len)

blob node

blob node

requires the input to have a known length

nonzero_boolean

integer

boolean

decodes a boolean where nonzero values are true

0 becomes false

uint8

1‐byte blob node

integer node

decodes a 1‐byte unsigned integer

hex:11 becomes 17

uint16be

2‐byte blob node

integer node

decodes a 2‐byte big‐endian unsigned integer

hex:0201 becomes 513

uint16le

2‐byte blob node

integer node

decodes a 2‐byte little‐endian unsigned integer

hex:0101 becomes 257

uint32be

4‐byte blob node

integer node

decodes a 4‐byte big‐endian unsigned integer

hex:00000201 becomes 513

uint32le

4‐byte blob node

integer node

decodes a 4‐byte little‐endian unsigned integer

hex:01010000 becomes 257

uint64be

8‐byte blob node

integer node

decodes a 8‐byte big‐endian unsigned integer

hex:0000000000000201 becomes 513

uint64le

8‐byte blob node

integer node

decodes a 8‐byte little‐endian unsigned integer

hex:0101000000000000 becomes 257

uint_be(len)

bit blob node

integer node

decodes bits as an unsigned integer, starting with the most‐significant bit

uint_le(len)

bit blob node

integer node

decodes bits as an unsigned integer, starting with the least‐significant bit

zero_terminated

byte blob node

byte blob node

takes bytes up until the first 00

hex:7f0400 becomes hex:7f04

Basic syntax

Script files are used to define complicated transforms.

Transforms (including built‐in transforms) can be referenced by name:
uint32le.

Transforms can be given a new name: transform u32 = uint32le; defines a
shorter alias for uint32le.

Transforms can be composed to create a new transform that applies them in
order. The transform ascii <- zero_terminated first removes the 0x00 from the
end of the blob, then decodes it as ascii. Note that the order of composition
is consistent with function composition and nested application in mathematics,
and also consistent with the general idea that data moves from right to left as
it is decoded.

When such a transform is used, expressions must be given to calculate its
parameters. Currently, the only possible expressions are parameters given to
the current transform, boolean and integer literals, and previously decoded
fields in the current struct or an outer struct:

Structs

Structs are used when a blob contains multiple data fields in sequence. A
struct transform applies each subtransform to sequential parts of the blob and
combines the results to create an internal node. The result of each
subtransform is either assigned a key or has its keys and values merged into
the final tree. Each subtransform must support prefix_length, so the lengths
and positions of the data fields can be determined.

Example

If labeled_point is applied to hex:0600000041000300000008000000, the result
is {"id": 6, "label": "A", "x": 3, "y": 8}.

Flow Control

Boolean conditions can use
if (expression) { transform-if-true } else { transform-if-false }, for
example: if (little_endian) { uint32le } else { uint32be }. There is also
syntax for switches: switch (expression) { expression: transform; ... }. Both
of these have syntactic sugar for use in structs; in a struct,
if (expression) { fields... } is equivalent to
<- if (expression) { struct { fields... } };.

Three kinds of repetition are supported. repeat(expression) {transform}
applies the transform a given number of times. repeat {transform} applies the
transform until it fails or the end of the data is reached.
do {transform} while(expression) applies the transform until the expression,
evaluated within the result of the transform, is false; this can be used for
things like
do { struct { .keep_going <- nonzero_boolean <- uint8; .val <- uint8; } } while(.keep_going).
For each type of repetition, the result is an internal node with sequential
keys starting at 0.

Using Bithenge

The Bithenge source code is in uspace/app/bithenge and is built along with
HelenOS. It can be built for Linux instead with make -f Makefile.linux.

The program can be run with bithenge <script file> <source>. The script file
must define a transform called main. The source can start with one of the
following prefixes:

Prefix

Example

Description

file:

file:/textdemo

Read the contents of a file. This is the default if no prefix is used.

block:

block:bd/initrd

Read the contents of a block device. (HelenOS only.)

hex:

hex:01000000

Use a string of hexadecimal characters to create a blob node.

There are some example files in uspace/dist/src/bithenge.

Using the API

An overview of the API will be written later.

Nodes, expressions, and transforms use reference counting. Functions that
produce such objects (through a bithenge_xxx_t ** argument) create a new
reference to the object; you are responsible for ensuring the reference count
is eventually decremented. If a function’s documentation says it “takes
[ownership of] a reference” to an object, the function guarantees the object’s
reference count will eventually be decremented, even if an error occurs.
Therefore, if you create an object only to immediately pass it to such a
function, you do not need to worry about its reference count.

Future language ideas

In approximate order of priority.

Other ideas

Subblobs

When there are pointers to other offsets in the blob, the script
could pass the whole blob as a parameter and apply transforms to subblobs.
This is essential for non‐sequential blobs like filesystems.

Complex expressions

Expressions that use operators or call transforms.

Assertions

These could be implemented as transforms that don't actually
change the input. There could be multiple levels, ranging from “warning” to
“fatal error”.

Currently, struct, repeat, and so on only
work with blobs, which must be either byte sequences or bit sequences.
Numbered internal nodes (such as those made by repeat) should be supported
as well.

Transforming internal nodes

After binary data is decoded into a tree, it
should be possible to apply further transforms to interpret the data
further. For instance, after the FAT and directory entries of a FAT
filesystem have been decoded, a further transform could determine the data
for each file.

More information in repeat subtransforms

Repeat subtransforms should have
access to the current index and previously decoded items.

Hidden fields

Some fields, such as length fields, are no longer interesting
after the data is decoded, so they should be hidden by default.

Search

Decoding may require searching for a fixed sequence of bytes in the
data.

Automatic parameters

It could be useful to automatically pass some
parameters rather than computing and passing them explicitly. For instance,
a version number that affects the format of many different parts of the file
could be passed automatically, without having to write it out every time. A
more advanced automatic parameter could keep track of current offset being
decoded within a blob. There would need to be some sort of grouping to
determine which transforms have the automatic parameters.

Smarter length calculation

Bithenge should automatically detect the length
of certain composed transforms, such as repeat(8) {bit} <- bits_le. This
would also be addressed by the constraint‐based version.

Constraint‐based version

This and most other projects use an imperative design, where the format
specification is always used in a fixed order, one step at a time. The
imperative design causes problems when the user wants to modify a field,
because arbitrary changes to other fields may be necessary that cannot be
determined from the format specification.

It may be possible to solve this with a constraint-based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open-ended and unpredictable to fit
within GSoC.

Interesting formats

These formats will be interesting and/or difficult to handle. I will keep them
in mind when designing the library.

Filesystem allocation tables, which should be kept consistent with the actual
usage of the disk.

Filesystem logs, which should be applied to the rest of the disk before
interpreting it.

Formats where the whole file can have either endianness depending on a field
in the header.

The ​Blender file format
is especially dynamic. When Blender saves a file, it just copies the
structures from memory and translates the pointers. Since each Blender
version and architecture will have different structures, the output file
includes a header describing the fields and binary layout of each structure.
When the file is loaded, the header is read first and the structures will be
translated as necessary.

If the language is powerful enough, it might be possible to have a native
description of Zlib and other compression formats.

It could be interesting to parse ARM or x86 machine code.

Existing Tools

I researched existing tools related to my project, so they can be used for
inspiration.

Construct

​Construct is a Python library for creating
declarative structure definitions. Each instance of the Construct class has a
name, and knows how to read from a stream, write to a stream, and determine its
length. Some predefined Construct subclasses use an arbitrary Python function
evaluated at runtime, or behave differently depending on whether
sub‐Constructs throw exceptions. Const uses a sub‐Construct and makes
sure the value is correct. Also has lazy Constructs.

Unfortunately, if you change the size of a structure, you still have to change
everything else manually.

BinData

​BinData makes good use of Ruby syntax; it
mostly has the same features as Construct.

Imperative DSLs

DSLs in this category are used in an obvious, deterministic manner, and complex
edits (changing the length of a structure) are difficult or impossible. They
are simple imperative languages in which fields, structures, bitstructures, and
arrays can be defined. The length, decoded value, and presence of fields can be
determined by expressions using any previously decoded field, and structures
can use if/while/continue/break and similar statements. Structures can
inherit from other structures, meaning that the parent’s fields are present at
the beginning of the child. Statements can move to a different offset in the
input data. There may be a real programming language that can be used along
with the DSL.

A popular tool for extracting files from video game archives. Its main
strength is the broad number of compression formats supported. It can put
modified files back in the archive in trivial situations.

As the Construct wiki notes, you would expect these developers to have some
sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
Diameter, and MIDL for protocols developed with them.

Miscellaneous ideas

Code exporter

A tool could generate C code to read and write data given a specification. A
separate file could be used to specify which types should be used and which
things should be read lazily or strictly.