Under the Hood: Building and open-sourcing flint

Lint programs are an odd class of program verifiers, and for a while I wasn’t convinced they were something I should focus on building out for Facebook. I don't like the style police on my back, and false error warnings can trip up an entire task. There's a lot of good, however, about a verifier that mechanically looks for issues that are not traditionally monitored by the compilers and that would almost always improve code quality once fixed.

flint, Facebook’s lint program, issues the lint errors so that warnings appear automatically in our code review system (phabricator) alongside each proposed code change, notifying the programmer where something may be going wrong. Flint has become important to the work that we do at Facebook, and I'm very excited to open-source flint so that everyone can check out what we’ve built and try it for themselves.

So, why not use an existing linter?

Writing a C++ linter in particular is not for the faint of heart because of C++'s high barrier to entry for parsing. That said, there are now dozens of C++ lint programs with a variety of features, some open sourced. So the natural question is why we chose to write our own instead of using a pre-existing one.

When we started this project, the lint programs we tried were too slow and most didn't support the C++11 features that our codebase had already started to use. Clang, which today would be logical starting point for a C++ analysis tool, offered too little support back then. Even now, clang cannot compile some of our C++ codebase.

Most importantly, lint rules have plenty of variability with the particular ethos of the organization using it. We knew what we were looking for, and figured that with whatever linter we chose, we'd had to do extensive surgery on it. So we decided to write our own.

Tokens and comments and the D language, oh my!

Based on the principle of "simplest design that could possibly work," flint is token-oriented, as opposed to building a parse tree. The linter loads the input file, converts it to an array of tokens, and then runs various analyses against that array. Each token saves the immediately preceding comment (if any) so comment information is preserved. Some of our lint rules require the use of stylized comments, as you'll see below.

The intent with this design was to implement a fast tokenizer, add a couple of simple rules, and let it loose around Facebook with the hope that people would add interesting things to it. The decision to add parsing got kicked down the road, but our engineers added some two dozen rules, which we’ll get to in a minute.

flint is written in the D language, making it the first D codebase open-sourced by Facebook. In fact, our initial version of flint was written in C++; the D rewrite started as an experiment. From the measurements and anecdotes we gathered, the translation turned out to be a win on all fronts: the D version of the linter turned out smaller, dramatically faster to build, significantly faster to run, and easier to contribute to.

Translating flint from C++ to D

In order to move from C++ to D, I set out to make a semi-mechanical translation, in the sense that I'd use the closest D code with the same semantics as the C++ code.

When I wrote the C++ implementation back in the day, I'd figured using a lexer generator was more trouble than it was worth, so I simply wrote a fast, dedicated lexer with the help of macros. D does not have macros, which makes mot-a-mot translation impossible. However, D has something much better--a full interpreter-during-compilation, combined with the ability to introspect, generate, and compile generated code on the spot.

I wrote a 58-line function that generates an unrolled trie matcher. For all of C++, the code expands to 965 lines, which are then fed back into the compiler with a mixin statement.

The nice part about the trie matcher is that it's independent from the language being lexed, which makes it possible to hoist it in a library and only have user code do the language-specific parts (such as parsing numbers or comments). In that setup, it becomes easy to create and use optimal lexers for any language, without ever needing to resort to an external generator. This code generation business took the better part of the first day, after which I had a working lexer; subsequent days were spent on porting the lint checks themselves.

After translation, flint's edit/build/test cycle became quite enjoyable. Building flint/D is roughly five times faster than flint/C++, which essentially made build times immaterial while iterating development. The run speed was quite a bit better, too - between 5% and 25%, depending on the file, with more pronounced improvement for larger files.

Quick builds have had an interesting subjective effect on me. The difference in typical build times of C++ and D is no surprise, but this was the first time I could work on similar parallel projects, switch between them, and feel the difference.

C++ can produce fast code all right, but C++ builds involve many moving pieces that heat, hiss by the gaskets, and occasionally jam. Initiating a build is inevitably accompanied by some pomp and circumstance ("back off boys, this thing has some recoil!") and over the day I'd carefully space out my builds so as to maximize the value out of the time invested. In contrast, in a sustained development cycle, D builds disconcertingly fast – even offensively so. The glibness with which the D compiler rummages through code at first made me wonder if I’d made an error or if my code was not complicated enough. In truth, the new language is simply better able to run through the code at significantly greater speeds.

Checks made by flint

Each check follows the actual function name in the codebase, so you can just grep for it and see how it exactly works.

1. Blacklisted token sequences (checkBlacklistedSequences). Some token sequences may be simply verboten by a given organization. At Facebook we consider "volatile" blacklisted, but with a twist - we do allow for the sequence "asm volatile" which means something quite different.

2. Blacklisted identifiers (checkBlacklistedIdentifiers). Certain identifiers may be deemed non grata by an organization. For Facebook, we seeded the list with the infamous strtok C library function – there are safe alternatives to it so there's no reason to ever use it.

3. Reserved identifiers (checkDefinedNames). A C and C++ naming rule that often gets forgotten is that all identifiers starting with an underscore followed by an uppercase letter, plus all identifiers containing two consecutive underscores, are reserved by the implementation. (Of course there are exceptions to the rule in our code, such as _GNU_SOURCE or _XOPEN_SOURCE, which is why flint keeps aside a whitelist while checking for reserved identifier patterns.)

4. The include guard idiom (checkIncludeGuard). The majority of header files should be guarded appropriately (by using either #pragma once or the #ifndef idiom), so we added a rule for ensuring that.

5. memset argument order (checkMemset). Many of us wrote at some point memset(&foo, sizeof(foo), 0); and lived to tell stories about it to terrified nieces on Halloween. A lint rule is the perfect fix for this pernicious mistake.

6. Questionable #includes (checkQuestionableIncludes). Many organizations have a few libraries that are only kept for backward compatibility reasons and should not be used in new code, meaning the corresponding headers should not be included—a perfect job for flint. One problem that we had identified at Facebook was that certain headers were large enough after preprocessing to make their inclusion in other headers prohibitive due to long compile times. We were able to rig flint to solve this issue.

7. Dedicated "...-inl.h" file inclusion (checkInlHeaderInclusions). A popular approach to organizing inline and template-heavy code is to separate inline and template artifacts corresponding to e.g. "Widget.h", into "Widget-inl.h". The latter file is not supposed to be included outside of "Widget.h" itself, and a dedicated lint rule is in order.

8. Initializing a variable from itself (checkInitializeFromItself). We found that people wrote constructors like:

The intent was to use a_(rhs.a_) in the first constructor and a_(a) in the second. That hardly ever helps, and the compiler keeps mum about it. We like to say, "There's a flint rule for that," in order to resolve the problem.

9. Prefer shared_ptr<T> p(make_shared<T>(args)) instead of shared_ptr<T> p(new T) (checkSmartPtrUsage). The former version makes only one memory allocation instead of two.

10. Open Source includes (checkOSSIncludes). Facebook uses internally-developed libraries across different projects, but there is one dependency we don’t use: an open-source project. Folly, for example, may not depend on a private library. A linter rule keeps the project in order and makes the process of open-sourcing projects safe and easy. Did I mention we love open source?

11. Namespace-level static data in headers (checkNamespaceScopedStatics). Marking namespace-level data as static inside a header is almost always a bad idea. Labeling the data as such potentially generates one instance of the static data inside each compilation unit, including that header. Fixing these issues has led to measurably smaller and faster executables.

12. "Widget.cpp" must include "Widget.h" before anything else (checkIncludeAssociatedHeader). This is a good coding standard because it reveals possible mistakes in defining "Widget.h" such as forgetting to include files that "Widget.h" itself needs.

13. All exceptions must be caught by reference (checkCatchByReference). 'Nuff said.

14. Eliminate common mistakes in defining constructors. Forgetting explicit with one-argument constructors is a common mistake, so flint warns the author about this issue. The author may shut off the warning by prefixing the constructor with the stylized comment /* implicit */. flint warns about non-const copy constructors and useless const move constructors, as follows:

15. Throw specifications (checkThrowSpecification). Throw specifications are deprecated and should be weeded out of code. flint allows for one notable amend - classes that inherit std::exception must add throw() to the destructor and the what() method.

17. Conflicting using namespace directives (checkUsingNamespaceDirectives). Certain namespaces define the same identifiers, for example "boost" and "std" both define shared_ptr for certain versions of boost and std. We've experienced library interoperability issues because of that, so we added a lint rule to prevent using conflicting namespaces simultaneously.

18. Namespaces in headers (checkUsingDirectives). "using namespace" should not occur at namespace level inside a header. However, it may occur inside an inline function in a header. Facebook prescribes that the eponymous namespace "facebook" may only be introduced at top level. Replace appropriately with your own namespace convention.

19. Using library internals improperly (checkFollyDetail). It is common practice to place library-private code in a namespace such as "detail". Sometimes such code cannot be encapsulated away from the client due to the white-box nature of C++ templates. flint warns against uses of such a namespace, namely folly::detail.

20. Pass cheap types by value (checkFollyStringPieceByValue). Certain user-defined types, such as iterators or pair of iterators, are small and cheap to copy so it's virtually always better to pass them by value instead of the conservative reference to const. Folly's StringPiece is an example - it occupies two words and has raw copy semantics.

21. No protected inheritance (checkProtectedInheritance). Protected inheritance is an oddity that's there for completeness, but has no practical use.

The author wanted to define a custom exception but forgot to specify public inheritance. Per a somewhat regrettable language rule , the inheritance protection will default to private. Ultimately, the interesting effect here is that if code later issues catch (const std::exception&) to ensure all standard and custom exceptions are caught, it will not be successful because private inheritance precludes implicit conversion (is not "is-a"). This is not the kind of error that can be discovered without rather involved testing. To prevent this occurrence, flint statically disables non-public inheritance of std::exception.

25. Ensure against the "fleeting rvalue" construct, which is sometimes useless (checkMutexHolderHasName). Consider:

mutex m_lock;
...
lock_guard<mutex>(m_lock);

Regardless what the intent was, this code does nothing of use: it's a bare call of the lock_guard<mutex>'s constructor. It will create an rvalue that will last about as long as the room between the closing ")" and the semicolon. Corroborate that with John Carmack's Law: "Whatever is syntactically valid will wind up in your codebase,” and you get an imperious need for a lint rule. Ideally, that would catch all unnecessary temporaries. At this time we only detect a few usual suspects.

Open-sourcing flint

We’re really excited to open-source flint because it's a good illustration of the "simplest design we could get away with,” and an interesting example of cross-language translation. Hopefully you'll find flint useful--we sure did. Stay tuned for future improvements, and we look forward to hearing your feedback.

Acknowledgments

Many thanks are due to Nicholas Ormrod and Robbert Haarman for reviewing an early draft of this post. Too many engineers to mention here have contributed to the source code of flint proper; I’ll leave that to git log.