The hardest program I’ve ever written, once you strip out the whitespace, is
3,835 lines long. That handful of code took me almost a year to write. Granted,
that doesn’t take into account the code that didn’t make it. The commit
history shows that I deleted 20,704 lines of code over that time. Every
surviving line has about three fallen comrades.

If it took that much thrashing to get it right, you'd expect it to do something
pretty deep, right? Maybe a low-level hardware interface or some wicked graphics
demo with tons of math and pumping early-90s-style techno? A likely-to-turn-evil
machine learning AI Skynet thing?

Nope. It reads in a string and writes out a string. The only difference between
the input and output strings is that it modifies some of the whitespace
characters. I’m talking, of course, about an automated code
formatter.

Introducing dartfmt

I work on the Dart programming language. Part of my job is helping make Dart
code more readable, idiomatic, and consistent, which is why I ended up writing
our style guide. That was a good first step, but any style guide written in
English is either so brief that it's ambiguous, or so long that no one reads it.

Go’s “gofmt” tool showed a better solution: automatically format
everything. Code is easier to read and contribute to because it’s already in the
style you’re used to. Even if the output of the formatter isn’t great, it ends
those interminable soul-crushing arguments on code reviews about formatting.

Of course, I still have to sell users on running the formatter in the first
place. For that, having great output really does matter. Also, I’m pretty
picky with the formatting in my own code, and I didn’t want to tell users to use
a tool that I didn’t use myself.

Getting that kind of quality means applying pretty sophisticated formatting
rules. That in turn makes performance difficult. I knew balancing quality and
speed would be hard, but I didn’t realize just how deep the rabbit hole went.

I have finally emerged back into the sun, and I’m pleased with what I brought
back. I like the output, and the performance is solid. On my laptop, it can blow
through over two million lines of code in about 45 seconds, using a single core.

Why is formatting hard?

At this point, you’re probably thinking, “Wait. What’s so hard about
formatting?” After you’ve parsed, can’t you just walk the AST and
pretty-print it with some whitespace?

If every statement fit within the column limit of the page, yup. It’s a piece of
cake. (I think that’s what gofmt does.) But our formatter also keeps your code
within the line length limit. That means adding line breaks (or “splits” as the
formatter calls them), and determining the best place to add those is famously
hard.

Take a statement with thirteen places where a line break is possible according
to our style rules. That's 8,192 different combinations if we brute force them
all 1. The search space we have to cover is exponentially large, and even
ranking different solutions is a subtle problem. Is it better to split before a
chained call like .any()? Why or why not?

What is up with the footnotes?

I had two goals with this article: to explain how dartfmt works, and to show a realistic picture of how a real programmer solves a difficult problem, with all of the messiness that entails. Alas, the first is more than long enough to try your patience, so I shunted all of the dead ends and failed attempts into the footnotes. Follow the numbered markers to laugh at my misfortune.

In Dart, we made things harder on ourselves. We have anonymous functions, lots
of higher-order functions, and—until we added async and
await—used futures for concurrency. That means lots of
callbacks and lots of long method chains. Some Dart users really dig a
functional style and appear to be playing a game where whoever crams the most
work before a single semicolon wins.

One real-world example, sanitized from internal code (hence the funny names),
was a single statement, all 565 characters of it. There are about 549 billion
ways we could line break it.

Ultimately, this is what the formatter does. It applies some fairly
sophisticated ranking rules to find the best set of line breaks from an
exponential solution space. Note that “best” is a property of the entire
statement being formatted. A line break changes the indentation of the
remainder of the statement, which in turn affects which other line breaks are
needed. Sorry, Knuth. No dynamic programming this time 2.

I think the formatter does a good job, but how it does it is a mystery to
users. People get spooked when robots surprise them, so I thought I would trace
the inner workings of its metal mind. And maybe try to justify to myself why it
took me a year to write a program whose behavior in many ways is
indistinguishable from cat.

How the formatter sees your code

As you'd expect from a program that works on source code, the formatter is
structured much like a compiler. It has a front end that parses your code
and converts it to an intermediate representation 3. It does some optimization
and cleanup on that 4, and then the IR goes to a back end 5 that produces the
final output. The main objects here are chunks, rules, and spans.

Chunks

A chunk is an atomic unit of formatting. It’s a contiguous region of
characters that we know will not contain any line breaks. Given this code:

```
format /* comment */ this;
```

We break it into these chunks: "format", "/* comment */", and "this;".

Chunks are similar to a token in a conventional compiler, but they tend to
be, well, chunkier. Often, the text for several tokens ends up in the same
chunk, like this and ; here. If a line break can never occur between two
tokens, they end up in the same chunk 6.

Chunks are mostly linear. For example, given an expression like:

```
some(nested, function(ca + ll))
```

We chunk it to the flat list: "some(", "nested,", "function(", "ca +", "ll))".

We could treat an entire source file like a single flat sequence of chunks, but
it would take forever and a day to line break the whole thing 7. With things like long chains of asynchronous
code, a single “statement” may be hundreds of lines of code containing several
nested functions or collections that each contain their own piles of code.

We can’t treat those nested functions or collection literals entirely
independently because the surrounding expression affects how they are indented.
That in turn affects how long their lines are. Indent a function body two more
spaces and now its statements have two fewer spaces before they hit the end of
the line.

Instead, we treat nested block bodies as a separate little list of chunks to be
formatted mostly on their own but subordinate to where they appear. The chunk
that begins one of these literals, like the { preceding a function or map,
contains a list of child block chunks for the contained block. In other words,
chunks do form a tree, but one that only reflects block nesting, not
expressions.

The end of a chunk marks the point where a split may occur in the final output,
and the chunk has some data describing it 8. It keeps track of whether a blank line should be added
between the chunks (like
between two class definitions), how much the next line should be indented, and
the expression nesting depth at that point in the code.
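As a rough mental model, a chunk might look like this in Python. The field names here are my invention for illustration, not dartfmt's actual (Dart) source:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A run of text that will never contain a line break internally."""
    text: str                    # the chunk's literal characters
    indent: int = 0              # indentation of the next line if we split here
    nesting: int = 0             # expression nesting depth at this point
    blank_before: bool = False   # add a blank line, e.g. between two classes
    block_chunks: list = field(default_factory=list)  # nested block bodies

chunk = Chunk(text="function(", indent=4, nesting=1)
print(chunk.text, chunk.indent)  # function( 4
```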

The most important bit of data about the split is the rule that controls it 9.

Rules

Each potential split in the program is owned by a rule. A single rule may
own the splits of several chunks. For example, a series of binary operators of
the same kind like a + b + c + d uses a single rule for the splits after each
+ operator.

A rule controls which of its splits break and which don’t. It determines this
based on the state that the rule is in, which it calls its value. You can
think of a rule like a dial and the value is what you’ve turned it to. Given a
value, the rule will tell you which of its chunks get split.

The simplest rule is a “hard split” rule. It says that its chunk
always splits, so it only has one value: 0. This is useful for things like
line comments where you always need to split after it, even in the middle of an
expression 10.

Then there is a “simple” split rule. It allows two values: 0 means
none of its chunks split and 1 means they all do. Since most splits are
independent of the others, this gets used for most of the splits in the program.

Beyond that, there are a handful of special-case rules. These are used
in places where we want to more precisely control the configuration of a set of
splits. For example, the positional argument list in a function call is
controlled by a single rule. A function call like:

```
function(first, second, third)
```

Will have splits after "function(", "first,", "second,", and "third)". They are
all owned by a single rule that only allows the following configurations:

```
// 0: Don't split at all.
function(first, second, third)

// 1: Split before the first.
function(
    first, second, third)

// 2: Split before only the last argument.
function(first, second,
    third)

// 3: Split before only the middle argument.
function(first,
    second, third)

// 4: Split before all of them.
function(
    first,
    second,
    third)
```

Having a single rule for this instead of individual rules for each argument lets
us prohibit things like:

```
function(first,
    second,
    third)
```
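One way to picture a special-case rule like this is as a table from the rule's value to the set of split points it activates. This Python sketch is purely illustrative (the indices and names are mine); the point is that a disallowed configuration simply has no entry:

```python
# Split points for function(first, second, third):
#   0: after "function("   1: after "first,"   2: after "second,"
ALLOWED = {
    0: set(),       # don't split at all
    1: {0},         # split before the first argument
    2: {2},         # split before only the last argument
    3: {1},         # split before only the middle argument
    4: {0, 1, 2},   # split before all of them
}

def splits_for(value):
    """Return which split points the rule turns on for a given value."""
    return ALLOWED[value]

# "Split before second and third but not first" would be {1, 2},
# which is absent from the table, so it can never be produced.
print(splits_for(4))  # {0, 1, 2}
```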

Constraints

Grouping a range of splits under a single rule helps us prevent split
configurations we want to avoid like this, but it’s not enough. There are more
complex constraints we want to enforce like: “if a split occurs inside a list
element, the list should split too”. That avoids output like this:

```
[first, second +
    third, fourth]
```

Here, the list and the + expression have their own rules, but those rules need
to interact. If the + takes value 1, the list rule needs to as well. To
support this, rules can constrain each other. Any rule can limit the values
another rule is allowed to take based on its own value. Typically, this is used
to make a rule inside a nested expression force the rules surrounding itself to
split when it does 11.
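A minimal sketch of that interaction in Python (the Rule class and constrain method are invented for illustration; dartfmt's real constraint machinery is richer):

```python
class Rule:
    def __init__(self, name, values):
        self.name = name
        self.values = values
        self.constraints = {}  # (own value, other rule) -> allowed values

    def constrain(self, own_value, other, allowed):
        """When this rule takes own_value, limit other's allowed values."""
        self.constraints[(own_value, other.name)] = allowed

list_rule = Rule("list", values=[0, 1])
plus_rule = Rule("plus", values=[0, 1])

# If the + expression splits (value 1), the surrounding list must too.
plus_rule.constrain(1, list_rule, allowed=[1])

def is_valid(assignment):
    """Check a {rule_name: value} assignment against all constraints."""
    for rule in (list_rule, plus_rule):
        for (value, other), allowed in rule.constraints.items():
            if assignment[rule.name] == value and assignment[other] not in allowed:
                return False
    return True

print(is_valid({"plus": 1, "list": 0}))  # False: the bad layout is rejected
```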

Finally, each rule has a cost. This is a numeric penalty that applies when any
of that rule’s chunks are split. This helps us determine which sets of splits
are better or worse than others 12.

Rule costs are only part of how overall fitness is calculated. Most of the cost
calculation comes from spans.

Spans

A span marks a series of contiguous chunks that we want to avoid splitting.
I picture it like a rubber band stretching around them. If a split happens in
any of those chunks, the span is broken. When that happens, the solution is
penalized based on the cost of the span.

Spans can nest arbitrarily deeply. In an expression like:

```
function(first(a, b), second(c, d))
```

There will be spans around a, b and c, d to try to keep those argument lists
from splitting, but also another span around first(a, b), second(c, d) to keep
the outer argument list from splitting.

If a split occurs between a, and b, the a, b span splits, but so does the
first(a, b), second(c, d) one. However, if a split occurs after first(a, b),
then the a, b span is still fine. In this way, spans teach the formatter to
prefer splitting at a higher level of nesting when possible since it breaks
fewer nested spans.
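That preference falls straight out of the cost calculation. Here's a toy Python version (the costs and chunk indices are made up for illustration): a solution pays for each rule it splits, plus each span that ends up with a split inside it.

```python
def solution_cost(split_rule_costs, spans, split_positions):
    """Cost of split rules, plus the cost of every span broken by a split."""
    cost = sum(split_rule_costs)
    for (start, end, span_cost) in spans:
        # A span is "broken" if any split lands strictly inside it.
        if any(start <= pos < end for pos in split_positions):
            cost += span_cost
    return cost

# Spans over chunk ranges: (a, b), (c, d), and the whole outer argument list.
spans = [(1, 2, 1), (4, 5, 1), (0, 6, 1)]

# Splitting between a, and b breaks both the inner and outer spans...
print(solution_cost([1], spans, split_positions=[1]))  # 3
# ...while splitting after first(a, b) breaks only the outer one.
print(solution_cost([1], spans, split_positions=[3]))  # 2
```

The cheaper second option is exactly the "split at a higher level of nesting" behavior described above.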

Parsing source to chunks

Converting your raw source code to this representation is fairly
straightforward. The formatter uses the wonderful analyzer package to parse
your code to an AST. This gives us a tree structure that represents every
single byte of your program. Unlike many ASTs, it even includes comments.

There's no rocket science here, but there are a lot of hairy corner cases.
Comments can appear in weird places, so we have to handle things like:

```
function(argument, // comment
    argument)
```

Here, we normally would have a split after the first argument owned by an
argument list rule. But the line comment adheres to the , and has a hard split
after it, so we need to make sure the argument list rule handles that.

Whitespace is only implicitly tracked by the AST, so we have to reconstitute
it in the few places where your original whitespace affects the output.
Having a detailed test suite really helps here.

Once we’ve visited the entire tree, the AST has been converted to a tree of
chunks and a bunch of spans wrapped around pieces of it.

Formatting chunks

We've got ourselves a big tree of chunks owned by a slew of rules. Earlier, I
said a rule is like a dial. Now we get to dial them in.

Doing this naïvely is infeasible. Even a small source file contains hundreds of
individual rules and the set of possible solutions is exponential in the number
of rules.

The first thing we do is divide the chunk list into regions we know
can’t interfere with each other. These are roughly “lines” of code. So with:

```
first(line);
second(line);
```

We know that how we split the first statement has no effect on the second one.
So we run through the list of chunks and break them into shorter lists whenever
we hit a hard split that isn’t nested inside an expression.
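That pass is simple enough to sketch directly (a Python toy; dartfmt's chunks are richer objects than these dicts):

```python
def split_into_lines(chunks):
    """Break the chunk stream into independently splittable regions."""
    lines, current = [], []
    for chunk in chunks:
        current.append(chunk)
        # Only a hard split at the top level ends a region; a mandatory
        # newline nested inside an expression (a line comment, say) does not.
        if chunk["hard_split"] and chunk["nesting"] == 0:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines

chunks = [
    {"text": "first(line);", "hard_split": True, "nesting": 0},
    {"text": "second(", "hard_split": False, "nesting": 0},
    {"text": "// a line comment", "hard_split": True, "nesting": 1},
    {"text": "line);", "hard_split": True, "nesting": 0},
]
print(len(split_into_lines(chunks)))  # 2 independent regions
```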

Each of these shorter chunk lists is fed to the line splitter. Its job is to
pick the best set of values for all the rules used by the chunks in the line. In
most cases, this is trivial: if the whole line fits on the page, every rule gets
set to zero—no splits—and we’re done.

When a line doesn't fit, the splitter has to figure out which combination of
rule values produces the best result. That is:

- The one with the fewest characters that go over the column limit.

- The one with the lowest cost, based on which rules and spans were split.

Calculating the cost for a set of rule values is pretty easy, but there are
still way too many permutations to brute force it. If we can’t brute force it,
how do we do it?

How line splitting works

I'm a college dropout so my knowledge of algorithms was fairly, um, rudimentary.
So before I interviewed at Google, I spent two days in a hotel room cramming as
many of them—mostly graph traversal—into my head as I could. At the
time, I thought graphs would never come up in the interviews…

Then I had multiple interview questions that reduced down to doing the right
kind of traversal over a graph. At the time, I thought this stuff would never be
relevant to my actual job…

Then I spent the past few years at Google discovering that damn near every
program I have to write can be reduced down to some kind of graph search. I
wrote a package manager where dependencies are a transitive closure and
version constraint solving is graph based. My hobby roguelike uses
graphs for pathfinding. Graphs out the wazoo. I can do BFS in my sleep now.

Naturally, after several other failed approaches, I found that line splitting
can be handled like a graph search problem 13. Each node in the graph represents a
solution—a set of values for each rule. Solutions can be
partial: some rules may be left with their values unbound.

From a given partial solution (including the initial "no rules bound" one),
there are edges to new partial solutions. Each binds one additional rule to a
value. By starting from an empty solution and walking this graph, we eventually
reach complete solutions where all of the rules have been bound to values.

Graph search is great if you know where your destination is and you're trying to
find the best path. But we don't actually know that. We don't know what the best
complete solution is. (If we did, we'd be done already!)

Given this, no textbook graph search algorithm is sufficient. We need to apply
some domain knowledge—we need to take advantage of rules and conditions
implicit in the specific problem we’re solving.

After a dozen dead ends, I found three (sort of four) heuristics that are enough
to get it finding the right solution quickly:

Bailing early

We are trying to minimize two soft constraints at the same time:

- We want to minimize the number of characters that overflow the line length
  limit. We can't make this a hard constraint that there is no overflow
  because it's possible for a long identifier or string literal to overflow in
  every solution. In that case, we still need to find the one that's closest
  to fitting.

- We want to find the lowest cost—the fewest split rules and broken
  spans.

The first constraint dominates the second: we’ll prefer a solution with any cost
if it fits one more character in. In practice, there is almost always a solution
that does fit, so it usually comes down to picking the lowest cost solution 14.

We don’t know a priori what the cost of the winning solution will be, but we
do know one useful piece of information: forcing a rule to split always
increases the cost.

If we treat any unbound rule as being implicitly unsplit 15, that means the starting solution with
everything unbound always has the lowest cost (zero). We can then explore
outward from there in order of increasing cost by adding one rule at a time.

This is a basic best-first search: we keep a running queue of all of
the partial solutions we haven't explored yet, sorted from lowest cost to
highest. Each iteration, we pop a solution off.

If the solution completely fits in the page width, then we know we’ve won the
overflow constraint. Since we’re exploring in order of increasing cost, we also
know it’s the lowest cost. So, ta-da!, we found the winner and can stop
exploring. Otherwise, if it has any unbound rules, we enqueue new
solutions, each of which binds one of those to a value.

We basically explore the entire solution space in order of increasing cost. As
soon as we find a solution that fits in the page, we stop.
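In pseudocode-ish Python, the loop looks something like this. It's a toy model, not dartfmt's actual implementation: "rules" here are just independent binary splits with a cost, and the overflow function is a stand-in for actually measuring line widths.

```python
import heapq
from itertools import count

def best_first_split(rules, overflow):
    """Toy best-first search over rule assignments.

    rules: {name: cost added when that rule splits}.
    overflow(bound): overflow characters for a partial assignment, treating
    unbound rules as unsplit. Returns the cheapest fitting assignment.
    """
    counter = count()  # tie-breaker so heapq never compares dicts
    queue = [(0, next(counter), {})]  # start with nothing bound: cost zero
    while queue:
        cost, _, bound = heapq.heappop(queue)
        if overflow(bound) == 0:
            return bound, cost  # cheapest solution that fits: done
        for name in rules:
            if name not in bound:
                for value in (0, 1):  # unsplit or split
                    child = {**bound, name: value}
                    child_cost = cost + (rules[name] if value else 0)
                    heapq.heappush(queue, (child_cost, next(counter), child))
    return None

# Two potential splits; splitting "args" alone (cost 1) makes the line fit.
rules = {"args": 1, "operator": 2}
fits = lambda bound: 0 if bound.get("args") == 1 else 4
print(best_first_split(rules, fits))
```

Because the heap pops solutions in cost order, the first one with zero overflow is guaranteed to be the cheapest, which is exactly the "ta-da!" moment described above.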

Avoiding dead ends

The above sounds pretty promising, but it turns out that there can be an
imperial ton of “low-cost but overflowing” solutions. When you’re trying to
format a really long line, there are plenty of ways it can not fit, and this
algorithm will try basically all of them. After all, they’re low cost since they
don’t have many splits.

We need to avoid wasting time tweaking rules that aren’t part of the problem.
For example, say a partial solution has already formatted a call to
firstCall() so that its line fits, while a later secondCall() still overflows.
There are a bunch of ways we could split the arguments to firstCall(), but we
don't need to: its line already fits. The only line we need to worry about is
the secondCall() one.

So, when we are expanding a partial solution, we only bind rules that have
chunks on overflowing lines. If all of a rule’s chunks are on lines
that already fit, we don’t mess with it. In fact, we don’t even worry about
rules on any overflowing line but the first. Since tweaking the first line will
affect the others, there’s no reason to worry about them yet 16.

This dramatically cuts down “branchiness” of the graph. Even though a partial
solution may have dozens of unbound rules, usually only a couple are on long
lines and only those get explored.

Pruning redundant branches

This gets us pretty far, but the splitter can still go off the deep end in some
cases. The problem is that within large statements, you still run into cases
where how you format part of the statement is mostly independent of later
parts.

Picture a long call to check() with several named arguments, among them
assertions:, annotations:, and a nasty liveAnalysis: one. Each of those named
arguments can be split in a few different ways. And, since the others are less
nested—which means fewer split spans—than that liveAnalysis: line, the splitter
will try every combination of all of them before it finally gets down to the
business of splitting that check() call.

The best way to split the liveAnalysis: line is the best way to split it
regardless of how we split assertions: or annotations:. In other words,
there are big branches of the solution space that initially differ in irrelevant
ways but eventually reconverge to roughly the same solution. We traverse every
single one of them.

What we need is a way to prune entire branches of the solution space. Given two
partial solutions A and B, if we could say not just “A is better than B” but
“every solution we can get to from A will be better than every solution we can
get to from B” then we can discard B and the entire branch of solutions
stemming from it.

It took some work, but I finally figured out that you can do this in many
cases. Given two partial solutions, if one has a lower cost than the other and:

- They have the same set of unbound rules (but their bound rules have different
  values, obviously).

- None of their bound rules are on the same line as an unbound rule.

- None of their bound rules place constraints on an unbound rule.

If all of those are true, then the one with a lower cost will always lead to
solutions that also have a lower cost. Its entire branch wins. We can discard
the other solution and everything that it leads to. Once I got this
working, the formatter could line split damn near anything in record time 17.
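The check itself can be sketched like this (a simplification in Python; the field names are invented, and the two line/constraint conditions are reduced to precomputed booleans):

```python
def can_discard(a, b):
    """True if every solution reachable from a beats every one from b."""
    if set(a["unbound"]) != set(b["unbound"]):
        return False          # different remaining choices: not comparable
    for sol in (a, b):
        if sol["bound_touches_unbound_line"] or sol["bound_constrains_unbound"]:
            return False      # bound rules could still affect the rest
    return a["cost"] < b["cost"]

a = {"cost": 2, "unbound": {"liveAnalysis"},
     "bound_touches_unbound_line": False, "bound_constrains_unbound": False}
b = {"cost": 5, "unbound": {"liveAnalysis"},
     "bound_touches_unbound_line": False, "bound_constrains_unbound": False}
print(can_discard(a, b))  # True: everything reachable from b loses
```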

An escape hatch

Alas, that "damn near" is significant. There are still a few cases where the
formatter takes a long time. I've only ever seen this on machine-generated
code, and, yeah, it's my waking nightmare. Unsurprisingly, code like this bogs
down the formatter. I want dartfmt to be usable in things like presubmit
scripts where it will have a ton of weird code thrown at it and it must
complete in a reliable amount of time.

So there is one final escape hatch. If the line splitter tries, like, 5,000
solutions and still hasn’t found a winner yet, it just picks the best it found
so far and bails.

In practice, I only see it hit this case on generated code. Thank God.

Finally, output

Once the line splitter has picked values for all of the rules, the rest is easy.
The formatter walks the tree of chunks, printing their text. When a
rule forces a chunk to split, it outputs a newline (or two), updates the
indentation appropriately and keeps trucking.

The end result is a string of (I hope!) beautifully formatted Dart code. So much
work just to add or remove a few spaces!

Footnotes

1 Yes, I really did brute force
all of the combinations at first. It let me focus on getting the output correct
before I worried about performance. Speed was fine for most statements. The
other few wouldn’t finish until after the heat death of the universe.

2 For most of its development, the
formatter did use dynamic programming and memoization. I felt like a wizard
when I first figured out how to do it. It worked fairly well, but was a
nightmare to debug.

It was highly recursive, and ensuring that the keys to the memoization table
were precise enough not to cause bugs but not so precise that the cache
lookups always failed was a very delicate balancing act. Over time, the amount
of data needed to uniquely identify the state of a subproblem grew, including
things like the entire expression nesting stack at a point in the line, and the
memoization table performed worse and worse.

3 The IR evolved constantly.
Spans and rules were later additions. Even the way chunks tracked indentation
changed frequently. Indentation used to be stored in levels, where each level
was two spaces. Then directly in spaces. Expression nesting went through a
number of representations.

In all of this, the IR’s job is to balance being easy for the front-end to
produce while being efficient for the back end to consume. The back end
really drives this. The IR is structured to be the right data structure for the
algorithm the back end wants to use.

4 Comments were one of the
biggest challenges. The formatter initially assumed there would be no newlines
in some places. Who would expect a newline, say, between the keywords in
abstract class? Alas, there's nothing preventing a user from doing:

```
abstract // Oh, crap. A line comment.
class Foo {}
```

So I had to do a ton of work to make it resilient in the face of comments and
newlines appearing in all sorts of weird places. There’s no single clean
solution for this, just lots of edge cases and special handling.

5 The back end is where all of
the performance challenges come from, and it went through two almost complete
rewrites before it ended up where it is today.

6 I started from a simpler
formatter written by a teammate that treated text, whitespace, and splits all as
separate chunks. I unified those so that each chunk included non-whitespace
text, line split information, and whitespace information if it didn’t split.
That simplified a lot.

7 When I added support for
better indentation of nested functions, that broke the code that split source
into separately splittable regions. For a while, a single top-level statement
would be split as a single unit, even if it contained nested functions with
hundreds of lines of code. It was… not fast.

8 Ideally, the split information
in a chunk would describe the split before the chunk’s text. This would avoid
the pointless split information on the last chunk, and also solve annoying
special-case handling of the indentation before the very first chunk.

I’ve tried to correct this mistake a number of times, but it causes a
near-infinite number of off-by-one bugs and I just haven’t had the time to push
it all the way through and fix everything.

9 Rules are a relatively recent
addition. Originally, each chunk's split was handled independently. You could
specify some relations between them like "if this chunk splits, then this other
one has to as well", but you could not express things like "only one of these
three chunks may split".

Eventually, I realized the latter is what I really needed to get argument lists
formatting well, so I conceived of rules as a separate concept and rewrote the
front end and line splitter to work using them.

10 At first, I thought hard
splits weren’t needed. Any place a mandatory newline appears (like between two
statements) is a place where you could just break the list of chunks in two and
line split each half independently. From the line splitter’s perspective, there
would be no hard splits.

Which would work… except for line comments:

```
some(expression, // with a line comment
    rightInTheMiddleOfIt);
```

This has to be split as a single unit to get the expression nesting and
indentation correct, but it also contains a mandatory newline after the line
comment.

11 There used to be a separate
class for a “multisplit” to directly handle forcing outer expressions to split
when inner ones did. Once rules came along, they also needed to express
constraints between them, and eventually those constraints were expressive
enough to be able to handle the multisplit behavior directly and multisplits
were removed.

12 I spent a lot of time
tuning costs for different grammar productions to control how tightly bound
different expressions were. The goal was to allow splits at the places where the
reader thought code was “loosest”, so stuff like higher precedence expressions
would have higher costs.

Tuning these costs was a nightmare. It was like a hanging mobile where tweaking
one cost would unbalance all of the others. On more than one occasion, I found
myself considering making them floating point instead of integers, a sure sign
of madness.

It turns out spans are what you really want in order to express looseness.
Nested infix operators then fall out naturally because you have more spans
around the more deeply nested operands. The parse tree gives it to you for free.

These days, almost every chunk and span has a cost of 1, and it's the quantity
of nested spans and contained chunks that determines where it splits.

13 I had known that
clang-format worked this way for a long time, but I could never wrap my head
around how to apply it to dartfmt’s richer chunk/rule/span system.

I took a lot of walks along the bike trail next to work trying to think through
a way to get graph search working when the two numbers being optimized (overflow
characters and cost) are in direct opposition, and we don’t even know what the
goal state looks like. It took a long time before it clicked. Even then, it
didn’t work at all until I figured out the right heuristics to use to optimize
it.

14 For a long time, overflow
and cost were treated as a single fitness function. Every overflow character
just added a very high value to the cost to make the splitter strongly want to
avoid them.

Splitting overflow out as a separate metric turned out to be key to getting the
graph search to work because it let us order the solutions by cost independently
of overflow characters.

15 I went back and forth on
how an unbound rule should implicitly behave. Treating it as implicitly split
gives you solutions with fewer overflow characters sooner. Treating it as
unsplit gives you lower costs.

16 Oh, God. I tried a million
different ways to reduce the branchiness before I hit on only looking at rules
in the first long line. I’m still amazed that it works.

I could also talk about how controlling branchiness lets us avoid reaching the
same state from multiple different paths. After all, it’s a graph, but
everything I’ve described talks about it like it’s a tree. By carefully
controlling how we extend partial solutions, we ensure we only take a single
path to any given complete solution.

Before I got that working, I had to keep a “visited” set to make sure we didn’t
explore the same regions twice, but just maintaining that set was a big
performance sink.

17 Discarding overlapping
branches is the last macro-optimization I did and its behavior is very subtle.
Correctly detecting when two partial solutions overlap took a lot of
iteration. Every time I thought I had it, one random weird test would fail where
it accidentally collapsed two branches that would eventually diverge.

That bullet list was paid for in blood, sweat, and tears. I honestly don’t think
I could have figured them out at all until late in the project when I had a
comprehensive test suite.