main is usually a function

Sunday, May 3, 2015

Formal methods for software verification are usually seen as a high-cost tool that you would only use on the most critical systems, and only after extensive informal verification. The Alloy project aims to be something completely different: a lightweight tool you can use at any stage of everyday software development. With just a few lines of code, you can build a simple model to explore design issues and corner cases, even before you've started writing the implementation. You can gradually make the model more detailed as your requirements and implementation get more complex. After a system is deployed, you can keep the model around to evaluate future changes at low cost.

Sounds great, doesn't it? I have only a tiny bit of prior experience with Alloy and I wanted to try it out on something more substantial. In this article we'll build a simple model of a garbage collector, visualize its behavior, and fix some problems. This is a warm-up for exploring more complex GC algorithms, which will be the subject of future articles.

I won't describe the Alloy syntax in full detail, but you should be able to follow along if you have some background in programming and logic. See also the Alloy documentation and especially the book Software Abstractions: Logic, Language, and Analysis by Daniel Jackson, which is a very practical and accessible introduction to Alloy. It's a highly recommended read for any software developer.

You can download Alloy as a self-contained Java executable, which can do analysis and visualization and includes an editor for Alloy code.

The model

The garbage-collected heap consists of Objects, each of which can point to any number of other Objects (including itself). There is a distinguished object Root which represents everything that's accessible without going through the heap, such as global variables and the function call stack. We also track which objects have already been garbage-collected. In a real implementation these would be candidates for re-use; in our model they stick around so that we can detect use-after-free.
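The original listing of these declarations didn't survive here; a minimal Alloy sketch consistent with the expressions used later (such as s.pointers and s.collected) is:

```alloy
open util/ordering[State]

sig Object {}
one sig Root extends Object {}

sig State {
    pointers: Object -> Object,  -- the heap: which objects point to which
    collected: set Object        -- objects that have been garbage-collected
}
```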

The open statement invokes a library module to provide a total ordering on States, which we will interpret as the progression of time. More on this later.

Relations

In the code that follows, it may look like Alloy has lots of different data types, overloading operators with total abandon. In fact, all these behaviors arise from an exceptionally simple data model:

Every value is a relation; that is, a set of tuples of the same non-zero length.

When each tuple has length 1, we can view the relation as a set. When each tuple has length 2, we can view it as a binary relation and possibly as a function. And a relation containing exactly one tuple can be viewed as a single tuple, or as a single atom when that tuple has length 1.

Since everything in Alloy is a relation, each operator has a single definition in terms of relations. For example, the operators . and [] are syntax for a flavor of relational join. If you think of the underlying relations as a database, then Alloy's clever syntax amounts to an object-relational mapping that is at once very simple and very powerful. Depending on context, these joins can look like field access, function calls, or data structure lookups, but they are all described by the same underlying framework.

The elements of the tuples in a relation are atoms, which are indivisible and have no meaning individually. Their meaning comes entirely from the relations and properties we define. Ultimately, atoms all live in the same universe, but Alloy gives "warnings" when the type system implied by the sig declarations can prove that an expression is always the empty relation.

Here are the relations implied by our GC model, as tuple sets along with their types:
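A sketch of that listing (atom names like O0 and S0 are illustrative):

```
Object:    {O0, O1, O2, Root}            -- atoms created by the sigs
Root:      {Root}
State:     {S0, S1, S2}
pointers:  State -> Object -> Object     -- one heap graph per State
collected: State -> Object               -- the collected set per State
first:     State                         -- from util/ordering
last:      State                         -- from util/ordering
next:      State -> State                -- from util/ordering
```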

The last three relations come from the util/ordering library. Note that a sig implicitly creates some atoms.

Dynamics

The live objects are everything reachable from the root:

fun live(s: State): set Object {
    Root.*(s.pointers)
}

*(s.pointers) constructs the reflexive, transitive closure of the binary relation s.pointers, which relates each object to every object reachable from it (including itself). Joining Root onto that closure yields the set of objects reachable from the root.

Of course the GC is only part of a system; there's also the code that actually uses these objects, which in GC terminology is called the mutator. We can describe the action of each part as a predicate relating "before" and "after" states.
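A sketch of these two predicates, consistent with the constraints described next (the names mutate and gc are assumptions):

```alloy
pred mutate(s, t: State) {
    t.collected = s.collected                            -- no collection
    t.pointers != s.pointers                             -- something changed
    all o: Object - s.live | t.pointers[o] = s.pointers[o]  -- only live objects change
}

pred gc(s, t: State) {
    t.pointers = s.pointers                              -- pointers untouched
    t.collected != s.collected                           -- something changed
    t.collected = s.collected + (Object - s.live)        -- collect all dead objects
}
```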

The mutator cannot collect garbage, but it can change the pointers of any live object. The GC doesn't touch the pointers, but it collects any dead object. In both cases we require that something changes in the heap.
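The initial-state and transition facts can be sketched like this (again assuming predicates named mutate and gc, as described above):

```alloy
fact {
    no first.collected
    first.pointers = Root -> (Object - Root)
    all s: State - last | mutate[s, s.next] or gc[s, s.next]
}
```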

This says that in the initial state, no object has been collected, and the root points to every object other than itself, so every object starts out live. That means we don't have to model allocation separately. Each state except the last must be followed by a mutator step or a GC step.

The syntax all x: e | P says that the property P must hold for every tuple x in e. Alloy supports a variety of quantifiers like this.

Interacting with Alloy

The development above looks nice and tidy — I hope — but in reality, it took a fair bit of messing around to get to this point. Alloy provides a highly interactive development experience. At any time, you can visualize your model as a collection of concrete examples. Let's do that now by adding these commands:

pred Show {}
run Show for 5

Now we select this predicate from the "Execute" menu, then click "Show". The visualizer provides many options to customise the display of each atom and relation. The config that I made for this project is "projected over State", which means you see a graph of the heap at one moment in time, with forward/back buttons to reach the other States.

After clicking around a bit, you may notice some oddities:

The root isn't a heap object; it represents all of the pointers that are reachable without accessing the heap. So it's meaningless for an object to point to the root. We can exclude these cases from the model easily enough:

fact {
all s: State | no s.pointers.Root
}

(This can also be done more concisely as part of the original sig.)

Now we're ready to check the essential safety property of a garbage collector:
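A sketch of that check, using the live function defined earlier and the scope mentioned below:

```alloy
assert Safe {
    all s: State | no (s.collected & s.live)
}
check Safe for 5 but 10 State
```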

SAT solvers and bounded model checking

"May be" valid? Fortunately this has a specific meaning. We asked Alloy to look for counterexamples involving at most 5 objects and 10 time steps. This bounds the search space, but that space is still vastly larger than anything we could ever check by exhaustive brute-force search. (See where it says "8617 vars"? Try raising 2 to that power.) Rather, Alloy turns the bounded model into a Boolean formula, and feeds it to a SAT solver.

This all hinges on one of the weirdest things about computing in the 21st century. In complexity theory, SAT (along with many equivalents) is the prototypical "hardest problem" in NP. Why do we intentionally convert our problem into an instance of this "hardest problem"? I guess for me it illustrates a few things:

The huge gulf between worst-case complexity (the subject of classes like NP) and average or "typical" cases that we encounter in the real world. For more on this, check out Impagliazzo's "Five Worlds" paper.

The fact that real-world difficulty involves a coordination game. SAT solvers got so powerful because everyone agrees SAT is the problem to solve. Standard input formats and public competitions were a key part of the amazing progress over the past decade or two.

Of course SAT solvers aren't quite omnipotent, and Alloy can quickly get overwhelmed when you scale up the size of your model. Applicability to the real world depends on the small scope hypothesis:

If an assertion is invalid, it probably has a small counterexample.

Or equivalently:

Systems that fail on large instances almost always fail on small instances with similar properties.

This is far from a sure thing, but it already underlies a lot of approaches to software testing. With Alloy we have the certainty of proof within the size bounds, so we don't have to resort to massive scale to find rare bugs. It's difficult (but not impossible!) to imagine a GC algorithm that absolutely cannot fail on fewer than 6 nodes, but is buggy for larger heaps. Implementations will often fall over at some arbitrary resource limit, but algorithms and models are more abstract.

Conclusion

It's not surprising that our correctness property

all s: State | no (s.collected & s.live)

holds, since it's practically a restatement of the garbage collection "algorithm":

t.collected = s.collected + (Object - s.live)

Because reachability is built into Alloy, via transitive closure, the simplest model of a garbage collector does not really describe an implementation. In the next article we'll look at incremental garbage collection, which breaks the reachability search into small units and allows the mutator to run in-between. This is highly desirable for interactive or real-time apps; it also complicates the algorithm quite a bit. We'll use Alloy to uncover some of these complications.

In the meantime, you can play around with the simple GC model and ask Alloy to visualize any scenario you like. For example, we can look at runs where the final state includes at least 5 pointers, and at least one collected object:

pred Show {
    #(last.pointers) >= 5
    some last.collected
}
run Show for 5

Thanks for reading! You can find the code in a GitHub repository which I'll update if/when we get around to modeling more complex GCs.

Wednesday, March 18, 2015

<kmc> maybe the whole project needs a better name, idk
<Ms2ger> htmlparser, perhaps
<jdm> tagsoup
<Ms2ger> UglySoup
<Ms2ger> Since BeautifulSoup is already taken
<jdm> html5ever
<Ms2ger> No
<jdm> you just hate good ideas
<pcwalton> kmc: if you don't call it html5ever that will be a massive missed opportunity

By that point we already had a few contributors. Now we have 469 commits from
18 people, which is just amazing. Thank you to everyone who helped
with the project. Over the past year we've upgraded Rust almost 50 times; I'm
extremely grateful to the community members who had a turn at this Sisyphean
task.

Several people have also contributed major enhancements. For example:

Clark Gaebel implemented zero-copy parsing. I'm in the process of reviewing
this code and will be landing pieces of it in the next few weeks.

Josh Matthews made it possible to suspend and resume parsing from the tree sink.
Servo needs this to do async resource fetching for external <script>s of the
old-school (non-async/defer) variety.

Chris Paris implemented fragment parsing and improved serialization. This means
Servo can use html5ever not only for parsing whole documents, but also for
the innerHTML/outerHTML getters and setters within the DOM.

Adam Roben brought us dramatically closer to spec conformance. Aside from foreign
(XML) content and <template>, we pass 99.6% of the html5lib tokenizer and tree
builder tests! Adam also improved the build and test infrastructure in a number
of ways.

I'd also like to thank Simon Sapin for doing the initial review of my code, and
finding a few bugs in the process.

html5ever makes heavy use of Rust's metaprogramming features. It's been
something of a wild ride, and we've collaborated with the Rust team in a number
of ways. Felix Klock came through in a big
way when a Rust upgrade
broke the entire tree builder. Lately, I've been working on improvements to
Rust's macro system ahead of the 1.0
release, based
in part on my experience with html5ever.

Even with the early-adopter pains, the use of metaprogramming was absolutely
worth it. Most of the spec-conformance patches were only a few lines, because
our encoding of parser rules is so close to what's written in the spec. This
is especially valuable with a "living standard" like HTML.

The future

Two upcoming enhancements are a high priority for Web compatibility in Servo:

Character encoding detection and conversion.
This will build on the zero-copy UTF-8 parsing mentioned above. Non-UTF-8 content
(~15% of the Web) will have "one-copy parsing" after a conversion to UTF-8. This keeps the
parser itself lean and mean.

document.write support. This API can
insert arbitrary UTF-16 code units (which might not even be valid Unicode) in the
middle of the UTF-8 stream. To handle this, we might switch to
WTF-8. Along with document.write we'll start
to do speculative parsing.

It's likely that I'll work on one or both of these in the next quarter.

Servo may get SVG support in the near future, thanks to
canvg. SVG nodes can be embedded in
HTML or loaded from an external XML file. To support the first case, html5ever
needs to implement WHATWG's rules for parsing foreign content in HTML. To
handle external SVG we could use a proper XML parser, or we could extend
html5ever to support "XML5", an
error-tolerant XML syntax similar to WHATWG HTML. Ygg01 made some progress
towards implementing XML5. Servo would most likely use it for XHTML as well.

Improved performance is always a goal. html5ever describes itself as
"high-performance" but does not have specific comparisons to other HTML
parsers. I'd like to fix that in the near future. Zero-copy parsing will be a
substantial improvement, once some performance issues in
Rust get
fixed.
I'd like to revisit SSE-accelerated
parsing as well.

I'd also like to support html5ever on some stable Rust 1.x
version, although it probably
won't happen for 1.0.0. The main obstacle here is procedural macros. Erick
Tryzelaar has done some great work recently with
syntex,
aster, and
quasi. Switching to this ecosystem will
get us close to 1.x compatibility and will clean up the macro code quite a
bit. I'll be working with Erick to use html5ever as an early validation of his
approach.

The C API for html5ever still builds, thanks to continuous integration. But
it's not complete or well-tested. With the removal of Rust's
runtime, maintaining the C API
does not restrict the kind of code we can write in other parts of the parser.
All we need now is to complete the C
API and write tests. This would
be a great thing for a community member to work on. Then we can write bindings
for every language under the sun and bring fast, correct, memory-safe HTML
parsing to the masses :)

Friday, February 20, 2015

Bitwise Cyclic Tag is an
extremely simple automaton slash programming language. BCT uses a program
string and a data string, each made of bits. The program string is interpreted
as if it were infinite, by looping back around to the first bit.

The program consists of commands executed in order. There is a single one-bit
command:

0: Delete the left-most data bit.

and a single two-bit command:

1x: If the left-most data bit is 1, append the bit x to the right end of the data string.

We halt if ever the data string is empty.

Remarkably, this is enough to do universal
computation. Implementing it in
Rust's macro system gives a proof
(probably not the first one) that Rust's macro system is Turing-complete, aside
from the recursion limit imposed by the compiler.

But this too is disallowed: an $x:tt variable cannot be followed by a
repetition $(...)*, even though it's (I believe) harmless. There is an open
RFC about this issue. For now I
have to handle the "one" and "more than one" cases separately, which is
annoying.

In general, I don't think macro_rules! is a good language for arbitrary
computation. This experiment shows the hassle involved in implementing one of
the simplest known "arbitrary computations". Rather, macro_rules! is good at
expressing patterns of code reuse that don't require elaborate compile-time
processing. It does so in a way that's declarative, hygienic, and high-level.

However, there is a big middle ground of non-elaborate, but non-trivial
computations. macro_rules! is hardly ideal for that, but procedural
macros have
problems of their own. Indeed, the bct! macro is an extreme case of a
pattern I've found useful in the real world. The idea is that every
recursive invocation of a macro gives you another opportunity to pattern-match
the arguments. Some of html5ever's
macros
do this, for example.

Saturday, January 10, 2015

Part of the sales pitch for Rust is that it's "as
bare metal as C".1 Rust can do anything C can do, run anywhere C
can run,2 with code that's just as efficient, and at least as safe
(but usually much safer).

I'd say this claim is about 95% true, which is pretty good by the standards of
marketing claims. A while back I decided to put it to the test, by making the
smallest, most self-contained Rust program possible. After resolving a
few issues along the way, I ended
up with a 151-byte, statically linked executable for AMD64 Linux. With the
release of Rust
1.0-alpha, it's time
to show this off.

This uses my syscall library, which
provides the syscall! macro. We wrap the underlying system calls with Rust
functions, each exposing a safe interface to the
unsafe syscall! macro. The
main function uses these two safe functions and doesn't need its own unsafe
annotation. Even in such a small program, Rust allows us to isolate memory
unsafety to a subset of the code.

Because of crate_type="rlib", rustc will build this as a static library, from
which we extract a single object file tinyrust.o:

Note that main doesn't end in a ret instruction. The exit function
(which gets inlined) is marked with a "return type" of !, meaning "doesn't
return". We make
good on this by invoking the unreachable
intrinsic after
syscall!. LLVM will optimize under the assumption that we
can never reach this point, making no guarantees about the program behavior if
it is reached. This represents the fact that the kernel is actually going to
kill the process before syscall!(EXIT, n) can return.

Because we use inline assembly and intrinsics, this code is not going to work
on a stable-channel
build of Rust 1.0. It
will require an alpha or nightly build until such time as inline assembly and
intrinsics::unreachable are added to the stable language of Rust 1.x.

Note that I didn't even use #![no_std]! This program is so tiny that
everything it pulls from libstd is a type definition, macro, or fully inlined
function. As a result there's nothing of libstd left in the compiler output.
In a larger program you may need #![no_std], although its role is greatly
reduced following the removal
of Rust's runtime.

Linking

This is where things get weird.

Whether we compile from C or Rust,3 the standard linker toolchain is
going to include a bunch of junk we don't need. So I cooked up my own linker
script:

Finally we stick this on the end of a custom ELF header. The header is written
in NASM syntax but contains no instructions, only data
fields. The base address 0x400078 seen above is the end of this header, when
the whole file is loaded at 0x400000. There's no guarantee that ld will
put main at the beginning of the file, so we need to separately determine the
address of main and fill that in as the e_entry field in the ELF file
header.

The final trick

To get down to 151 bytes, I took inspiration from this classic
article, which
observes that padding fields in the ELF header can be used to store other data.
Like, say, a string
constant.
The Rust code changes to access this constant:

A Rust slice
like &[u8] consists of a pointer to some memory, and a length indicating the
number of elements that may be found there. The module
std::raw exposes this as an
ordinary struct that we build, then
transmute to the actual
slice type. The transmute function generates no code; it just tells the type
checker to treat our raw::Slice<u8> as if it were a &[u8]. We return this
value out of the unsafe block, taking advantage of the "everything is an
expression" syntax, and then print the message as before.

The object code is the same as before, except that the relocation for the
string constant has become an absolute address. The binary is smaller by 7
bytes (the size of "Hello!\n") and it still works!

You can find the full code on
GitHub. The code in this article
works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub,
I will also update the version number printed by the included build script.

Wednesday, October 29, 2014

If, like me,
you've been frustrated with the status quo in systems languages,
this article will give you a taste of why Rust is so exciting.
In a tiny amount of code,
it shows a lot of ways that Rust really kicks ass compared to C and C++.
It's not just safe and fast,
it's a lot more convenient.

Web browsers do string interning
to condense the strings that make up the Web,
such as tag and attribute names,
into small values that can be compared quickly.
I recently added
event logging support to
Servo's string interner.
This will allow us
to record traces from real websites,
which we can use
to guide further optimizations.

Interned strings have a 64-bit ID,
which is recorded in every event.
The String
we store for "insert" events
is like C++'s std::string;
it points to a buffer in the heap,
and it owns that buffer.

This enum is a bit fancier than a C enum,
but its representation in memory
is no more complex than a C struct.
There's a tag for the three alternatives,
a 64-bit ID,
and a few fields that make up the String.
When we pass or return an Event by value,
it's at worst a memcpy
of a few dozen bytes.
There's no implicit heap allocation,
garbage collection,
or anything like that.
We didn't define a way to copy an event;
this means the String buffer
always has a unique owner
who is responsible for freeing it.

The deriving(Show) attribute
tells the compiler to auto-generate
a text representation,
so we can print an Event
just as easily as a built-in type.

lazy_static! will initialize both of them
when LOG is first used.
Like String, the Vec is a growable buffer.
We won't turn on event logging in release builds,
so it's fine to pre-allocate space for 50,000 events.
(You can put underscores
anywhere in an integer literal
to improve readability.)

lazy_static!, Mutex, and Vec are all implemented
in Rust
using gnarly low-level code.
But the amazing thing
is that all three expose a safe interface.
It's simply not possible
to use the variable before it's initialized,
or to read the value the Mutex protects without locking it,
or to modify the vector while iterating over it.

The worst you can do is deadlock.
Rust still considers that pretty bad,
which is why it discourages global state.
But global state is clearly what we need here.
Rust takes a pragmatic approach to safety.
You can always write the unsafe keyword
and then use the same pointer tricks
you'd use in C.
But you don't need to be quite so guarded
when writing the other 95% of your code.
I want a language that assumes I'm brilliant but distracted :)

Rust catches these mistakes at compile time,
and produces the same code you'd see
with equivalent constructs in C++.
For a more in-depth comparison,
see Ruud van Asseldonk's
excellent series of articles
about porting a spectral path tracer from C++ to Rust.
The Rust code performs basically the same as
Clang / GCC / MSVC on the same platform.
Not surprising,
because Rust uses LLVM
and benefits from
the same backend optimizations as Clang.

lazy_static! is not a built-in language feature;
it's a macro provided by
a third-party library.
Since the library uses Cargo,
I can include it in my project
by adding a dependency line to Cargo.toml
and an extern crate declaration to src/lib.rs.
Cargo will automatically fetch and build all dependencies.
Code reuse becomes no harder
than in your favorite scripting language.

Finally, we define a function
that pushes a new event onto the vector:

pub fn log(e: Event) {
    LOG.lock().push(e);
}

LOG.lock() produces an
RAII handle
that will automatically unlock the mutex
when it falls out of scope.
In C++ I always hesitate to use temporaries like this
because if they're destroyed too soon,
my program will segfault or worse.
Rust has compile-time lifetime checking,
so I can do things that would be reckless in C++.

If you scroll up you'll see
a lot of prose and not a lot of code.
That's because I got
a huge amount of functionality for free.
Here's the logging module again:

Any project which doesn't opt in to log-events
will see zero impact from any of this.

If you'd like to learn Rust,
the Guide is a good place to start.
We're getting close to 1.0
and the important concepts have been stable for a while,
but the details of syntax and libraries are still in flux.
It's not too early to learn,
but it might be too early to maintain a large library.

By the way,
here are the events generated by
interning the strings
foo, bar, baz, foo, blockquote:

There are
three different kinds of IDs,
indicated by the least significant bits.
The first is a pointer
into a standard interning table,
which is protected by a mutex.
The other two are created without synchronization,
which improves parallelism
between parser threads.

In UTF-8,
the string foo
is smaller than a 64-bit pointer,
so we store the characters directly.
blockquote is too big for that,
but it corresponds to a well-known HTML tag.
0xb is the index of blockquote in
a static list
of strings that are common
on the Web.
Static atoms
can also be used
in pattern matching, and
LLVM's optimizations
for C's switch statements will apply.

Wednesday, August 27, 2014

One reason I'm excited about Rust is that I can compile Rust code to a simple native-code library, without heavy runtime dependencies, and then call it from any language. Imagine writing performance-critical extensions for Python, Ruby, or Node in a safe, pleasant language that has static lifetime checking, pattern matching, a real macro system, and other goodies like that. For this reason, when I started html5ever some six months ago, I wanted it to be more than another "Foo for BarLang" project. I want it to be the HTML parser of choice, for a wide variety of applications in any language.

Today I started work in earnest on the C API for html5ever. In only a few hours I had a working demo. And this is a fairly complicated library, with 5,000+ lines of code incorporating lots and lots of generic code; if this library were written in C++, almost all of it would be in header files.

It's pretty cool that we can use all this machinery from C, or any language that can call C. I'll describe first how to build and use the library, and then I'll talk about the implementation of the C API.

html5ever (for C or for Rust) is not finished yet, but if you're feeling adventurous, you are welcome to try it out! And I'd love to have more contributors. Let me know on GitHub about any issues you run into.

The build process is pretty standard for C; we just link a .a file and its dependencies. The biggest obstacle right now is that you won't find the Rust compiler in your distro's package manager, because the language is still changing so rapidly. But there's a ton of effort going into stabilizing the language for a Rust 1.0 release this year. It won't be too long before rustc is a reasonable build dependency.

The struct h5e_token_ops contains pointers to callbacks. Any events we don't care to handle are left as NULL function pointers. Inside main, we create a tokenizer and feed it a string. html5ever for C uses a simple pointer+length representation of buffers, which is this struct h5e_buf you see being passed by value.

This demo only does tokenization, not tree construction. html5ever can perform both phases of parsing, but the API surface for tree construction is much larger and I didn't get around to writing C bindings yet.

Implementing the C API

Some parts of Rust's libstd depend on runtime services, such as task-local data, that a C program may not have initialized. So the first step in building a C API was to eliminate all std:: imports. This isn't nearly as bad as it sounds, because large parts of libstd are just re-exports from other libraries like libcore that we can use with no trouble. To be fair, I did write html5ever with the goal of a C API in mind, and I avoided features like threading that would be difficult to integrate. So your library might give you more trouble, depending on which Rust features you use.

The next step was to add the #![no_std] crate attribute. This means we no longer import the standard prelude into every module. To compensate, I added use core::prelude::*; to most of my modules. This brings in the parts of the prelude that can be used without runtime system support. I also added many imports for ubiquitous types like String and Vec, which come from libcollections.

I also had to remove all uses of format!(), println!(), etc., or move them inside #[cfg(not(for_c))]. I needed to copy in the vec!() macro which is only provided by libstd, even though the Vec type is provided by libcollections. And I had to omit debug log messages when building for C; I did this with conditionally-defined macros.

With all this preliminary work done, it was time to write the C bindings. Here's how the struct of function pointers looks on the Rust side:

The processing of tokens is straightforward. We pattern-match and then call the appropriate function pointer, unless that pointer is NULL. (Edit: eddyb points out that storing NULL as an extern "C" fn is undefined behavior. Better to use Option<extern "C" fn ...>, which will optimize to the same one-word representation.)

To create a tokenizer, we heap-allocate the Rust data structure in a Box, and then transmute that to a raw C pointer. When the C client calls h5e_tokenizer_free, we transmute this pointer back to a box and drop it, which will invoke destructors and finally free the memory.

You'll note that the functions exported to C have several special annotations:

#[no_mangle]: skip name mangling, so we end up with a linker symbol named h5e_tokenizer_free instead of _ZN5for_c9tokenizer18h5e_tokenizer_free.

extern "C": use the C calling convention, so the function can be called directly from C code.

One remaining issue is that Rust is hard-wired to use jemalloc, so linking html5ever will bring that in alongside the system's libc malloc. Having two separate malloc heaps will likely increase memory consumption, and it prevents us from doing fun things like allocating Boxes in Rust that can be used and freed in C. Before Rust can really be a great choice for writing C libraries, we need a better solution for integrating the allocators.

If you'd like to talk about calling Rust from C, you can find me as kmc in #rust and #rust-internals on irc.mozilla.org. And if you run into any issues with html5ever, do let me know, preferably by opening an issue on GitHub. Happy hacking!