Wednesday, October 29, 2014

If, like me,
you've been frustrated with the status quo in systems languages,
this article will give you a taste of why Rust is so exciting.
In a tiny amount of code,
it shows a lot of ways that Rust really kicks ass compared to C and C++.
It's not just safe and fast,
it's a lot more convenient.

Web browsers do string interning
to condense the strings that make up the Web,
such as tag and attribute names,
into small values that can be compared quickly.
I recently added
event logging support to
Servo's string interner.
This will allow us
to record traces from real websites,
which we can use
to guide further optimizations.

Interned strings have a 64-bit ID,
which is recorded in every event.
The String
we store for "insert" events
is like C++'s std::string;
it points to a buffer in the heap,
and it owns that buffer.

This enum is a bit fancier than a C enum,
but its representation in memory
is no more complex than a C struct.
There's a tag for the three alternatives,
a 64-bit ID,
and a few fields that make up the String.
When we pass or return an Event by value,
it's at worst a memcpy
of a few dozen bytes.
There's no implicit heap allocation,
garbage collection,
or anything like that.
We didn't define a way to copy an event;
this means the String buffer
always has a unique owner
who is responsible for freeing it.
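
As a rough illustration of the layout described above, here is a minimal sketch in present-day Rust syntax. `Insert` is the variant the text mentions; the names `Intern` and `Remove`, and the helper functions, are assumptions made for the example.

```rust
use std::mem::size_of;

/// Sketch of an interner event: a tag, a 64-bit ID, and (for inserts)
/// an owned String. `Intern` and `Remove` are assumed variant names.
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

/// Passing an Event by value moves it. Only the tag, the ID, and the
/// String's pointer/length/capacity fields are copied; the heap buffer
/// itself is not duplicated and keeps a single owner.
pub fn take_ownership(e: Event) -> Event {
    e
}

/// Size in bytes of one Event as it is passed around by value.
pub fn event_size() -> usize {
    size_of::<Event>()
}
```

On a typical 64-bit target `event_size()` comes out to a few dozen bytes, and because `Event` does not implement `Clone`, there is no way to end up with two owners of the same buffer by accident.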

The deriving(Show) attribute
tells the compiler to auto-generate
a text representation,
so we can print an Event
just as easily as a built-in type.
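
For readers on a current toolchain: the same idea is spelled `#[derive(Debug)]` today. The sketch below assumes the same hypothetical `Event` variants as before and a made-up helper called `describe`.

```rust
/// `#[derive(Debug)]` is the present-day spelling of `deriving(Show)`.
/// The variant names other than `Insert` are assumptions.
#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

/// Formats an event using the compiler-generated text representation.
pub fn describe(e: &Event) -> String {
    format!("{:?}", e)
}
```

With this in place, `println!("{:?}", event)` works for an `Event` the same way it does for an integer or a string.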

lazy_static! will initialize both the Mutex and the Vec
when LOG is first used.
Like String, the Vec is a growable buffer.
We won't turn on event logging in release builds,
so it's fine to pre-allocate space for 50,000 events.
(You can put underscores
anywhere in an integer literal
to improve readability.)
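
Here is a sketch of a lazily initialized global log that avoids external crates. It swaps in the standard library's `OnceLock` for `lazy_static!`, which is a substitution made for this example and not what the interner itself uses; the accessor name `global_log` and the `Event` variants are also assumptions.

```rust
use std::sync::{Mutex, OnceLock};

#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

// Starts out empty; nothing is allocated until the first access.
static LOG: OnceLock<Mutex<Vec<Event>>> = OnceLock::new();

/// Returns the global log, creating the Mutex and the pre-sized Vec
/// the first time it is called. Later calls return the same Mutex.
pub fn global_log() -> &'static Mutex<Vec<Event>> {
    LOG.get_or_init(|| Mutex::new(Vec::with_capacity(50_000)))
}
```

The only way to reach the vector is through `global_log()`, so there is no window in which it can be observed uninitialized.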

lazy_static!, Mutex, and Vec are all implemented
in Rust
using gnarly low-level code.
But the amazing thing
is that all three expose a safe interface.
It's simply not possible
to use the variable before it's initialized,
or to read the value the Mutex protects without locking it,
or to modify the vector while iterating over it.
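
To make two of those guarantees concrete, here is a small sketch using the standard library's `Mutex` on a current toolchain. The function name is made up, and the rejected line is left as a comment so the example still compiles.

```rust
use std::sync::Mutex;

/// Pushes `value` into the shared vector and returns the sum of its elements.
pub fn total_after_push(shared: &Mutex<Vec<u64>>, value: u64) -> u64 {
    // The Vec lives inside the Mutex; the only way to reach it is the
    // guard returned by lock(), so an unlocked read cannot be written.
    let mut guard = shared.lock().unwrap();
    guard.push(value);

    // Rejected at compile time if uncommented: the loop holds a shared
    // borrow of the vector while push() asks for a mutable one.
    // for x in guard.iter() { guard.push(*x); }

    guard.iter().sum()
}
```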

The worst you can do is deadlock.
And Rust still considers that pretty bad,
which is why it discourages global state.
But it's clearly what we need here.
Rust takes a pragmatic approach to safety.
You can always write the unsafe keyword
and then use the same pointer tricks
you'd use in C.
But you don't need to be quite so guarded
when writing the other 95% of your code.
I want a language that assumes I'm brilliant but distracted :)

Rust catches these mistakes at compile time,
and produces the same code you'd see
with equivalent constructs in C++.
For a more in-depth comparison,
see Ruud van Asseldonk's
excellent series of articles
about porting a spectral path tracer from C++ to Rust.
The Rust code performs basically the same as
the C++ code built with Clang, GCC, or MSVC on the same platform.
Not surprising,
because Rust uses LLVM
and benefits from
the same backend optimizations as Clang.

lazy_static! is not a built-in language feature;
it's a macro provided by
a third-party library.
Since the library uses Cargo,
I can include it in my project by adding

to src/lib.rs.
Cargo will automatically fetch and build all dependencies.
Code reuse becomes no harder
than in your favorite scripting language.

Finally, we define a function
that pushes a new event onto the vector:

pub fn log(e: Event) {
    LOG.lock().push(e);
}

LOG.lock() produces an
RAII handle
that will automatically unlock the mutex
when it falls out of scope.
In C++ I always hesitate to use temporaries like this
because if they're destroyed too soon,
my program will segfault or worse.
Rust has compile-time lifetime checking,
so I can do things that would be reckless in C++.
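
The guard's lifetime can be observed directly. This sketch uses today's standard library, where `lock()` returns a `Result` and so needs an `unwrap()`; it logs bare IDs rather than full events, and the helper names are assumptions made for the demonstration.

```rust
use std::sync::Mutex;

static LOG: Mutex<Vec<u64>> = Mutex::new(Vec::new());

/// The guard is a temporary: it is created by lock(), used for the
/// push, and dropped (unlocking the mutex) at the end of the statement.
pub fn log(id: u64) {
    LOG.lock().unwrap().push(id);
}

/// True if nobody currently holds the lock.
pub fn is_unlocked() -> bool {
    LOG.try_lock().is_ok()
}

/// While a named guard is alive, a second attempt to lock fails.
pub fn is_locked_while_held() -> bool {
    let _guard = LOG.lock().unwrap();
    LOG.try_lock().is_err()
}
```

After any number of calls to `log`, `is_unlocked()` reports true, because each temporary guard has already been dropped by the time `log` returns.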

If you scroll up you'll see
a lot of prose and not a lot of code.
That's because I got
a huge amount of functionality for free.
Here's the logging module again:

Any project which doesn't opt in to log-events
will see zero impact from any of this.
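
One way to get that opt-in behavior is conditional compilation. The sketch below assumes `log-events` is a Cargo feature and logs bare IDs for brevity; it is an illustration of the idea, not the interner's actual code.

```rust
use std::sync::Mutex;

static LOG: Mutex<Vec<u64>> = Mutex::new(Vec::new());

/// With the (assumed) `log-events` feature enabled, record the event.
#[cfg(feature = "log-events")]
pub fn log(id: u64) {
    LOG.lock().unwrap().push(id);
}

/// Without the feature, logging is an empty function the optimizer
/// can remove entirely.
#[cfg(not(feature = "log-events"))]
pub fn log(_id: u64) {}

/// Number of events recorded so far.
pub fn recorded() -> usize {
    LOG.lock().unwrap().len()
}
```

Built without the feature, calling `log` records nothing and takes no lock.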

If you'd like to learn Rust,
the Guide is a good place to start.
We're getting close to 1.0
and the important concepts have been stable for a while,
but the details of syntax and libraries are still in flux.
It's not too early to learn,
but it might be too early to maintain a large library.

By the way,
here are the events generated by
interning the three strings
foobarbaz, foo, and blockquote:

There are
three different kinds of IDs,
indicated by the least significant bits.
The first is a pointer
into a standard interning table,
which is protected by a mutex.
The other two are created without synchronization,
which improves parallelism
between parser threads.

In UTF-8,
the string foo
is smaller than a 64-bit pointer,
so we store the characters directly.
blockquote is too big for that,
but it corresponds to a well-known HTML tag.
0xb is the index of blockquote in
a static list
of strings that are common
on the Web.
Static atoms
can also be used
in pattern matching, and
LLVM's optimizations
for C's switch statements will apply.
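
The tagging scheme can be sketched in a few lines. The exact bit layout below (a two-bit tag, the length in the next nibble, string bytes above that, and static indices in the upper 32 bits) is an assumption chosen for illustration, and all of the names are made up.

```rust
#[derive(Debug, PartialEq)]
pub enum Kind {
    Dynamic, // pointer into the mutex-protected interning table
    Inline,  // short string stored directly in the ID
    Static,  // index into the static list of common strings
}

const TAG_MASK: u64 = 0b11;
const TAG_INLINE: u64 = 0b01;
const TAG_STATIC: u64 = 0b10;

/// Packs a string of at most 7 bytes into an ID; returns None if it is too long.
pub fn pack_inline(s: &str) -> Option<u64> {
    let bytes = s.as_bytes();
    if bytes.len() > 7 {
        return None;
    }
    // Low byte: tag in the low bits, length in the high nibble.
    let mut id = TAG_INLINE | ((bytes.len() as u64) << 4);
    // Remaining bytes: the UTF-8 encoding of the string itself.
    for (i, &b) in bytes.iter().enumerate() {
        id |= (b as u64) << (8 * (i + 1));
    }
    Some(id)
}

/// Builds the ID for an entry in the static list.
pub fn pack_static(index: u32) -> u64 {
    TAG_STATIC | ((index as u64) << 32)
}

/// Reads the kind of ID from its least significant bits.
/// An aligned pointer has zeros there, which marks the dynamic case.
pub fn kind(id: u64) -> Option<Kind> {
    match id & TAG_MASK {
        0b00 => Some(Kind::Dynamic),
        TAG_INLINE => Some(Kind::Inline),
        TAG_STATIC => Some(Kind::Static),
        _ => None,
    }
}
```

Neither `pack_inline` nor `pack_static` touches shared state, which is why those two kinds of ID need no synchronization.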