Type-based alias analysis, where pointers to different types are assumed to point to distinct objects, gives compilers a simple and effective way to disambiguate memory references in order to generate better code. Unfortunately, C and C++ make it easy for programmers to violate the assumptions upon which type-based alias analysis is built. “Strict aliasing” refers to a collection of rules in the C and C++ standards that restrict the ways in which you are allowed to modify and look at memory objects, in order to make type-based alias analysis work in these weakly-typed languages. The problem is that the strict aliasing rules contain tricky and confusing corner cases and also that they rule out many idioms that have historically worked, such as using a pointer type cast to view a float as an unsigned, in order to inspect its bits. Such tricks are undefined behavior. See the first part of this post for more of an introduction to these issues.
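For instance (my own example, not one from the post or the paper): both functions below inspect the bits of a float, but only the memcpy version is defined behavior, and modern compilers optimize it to the same single move instruction the cast would produce.

```cpp
#include <cstdint>
#include <cstring>

// Undefined behavior: accesses a float object through an unrelated
// pointer type, violating the strict aliasing rules.
uint32_t bits_of_ub(float f) {
    return *reinterpret_cast<uint32_t*>(&f);  // UB: don't do this
}

// Well-defined: memcpy copies the object representation.
uint32_t bits_of(float f) {
    static_assert(sizeof(uint32_t) == sizeof(float), "assumes 32-bit float");
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

On an IEEE 754 platform, bits_of(1.0f) yields 0x3f800000, and the generated code for both functions is typically identical; only one of them is a contract the compiler must honor.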

The purpose of this piece is to call your attention to a new paper, Detecting Strict Aliasing Violations in the Wild, by Pascal Cuoq and his colleagues. C and C++ programmers should read it. Sections 1 and 2 introduce strict aliasing; they’re quick and easy reading.

Section 3 shows what compilers think about strict aliasing problems by looking at how a number of C functions get translated to x86-64 assembly. This material requires perseverance but it is worth taking the time to understand the examples in detail, because compilers apply the same thinking to real programs that they apply to tiny litmus tests.
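To see the kind of reasoning involved, here is a minimal litmus test of my own (not one of the paper’s examples). Because int and float are incompatible types, the compiler is entitled to assume the two stores below touch distinct objects and fold the returned load to the constant 1.

```cpp
// Under strict aliasing, the compiler may assume *f cannot alias *i,
// so it can return 1 without reloading *i after the store through f.
int set_and_read(int *i, float *f) {
    *i = 1;
    *f = 2.0f;
    return *i;  // typically compiled as "return 1"
}
```

Called with pointers that actually refer to the same memory (something only reachable via a strict aliasing violation), the folded code would silently return a stale value; that is the kind of surprise the paper’s Section 3 examples walk through.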

Section 4 is about a new tool, built as part of Trust-in-Soft’s static analyzer, that can diagnose violations of the strict aliasing rules in C code. As the paper says, “it works best when applied on definite inputs,” meaning that the tool should be used as an extended checker for tis-interpreter. Pascal tells me that a release containing the strict aliasing checker is planned, but the time frame is not definite. In any case, readers interested in strict aliasing, but not specifically in tools for dealing with it, can skip this section.

Section 5 applies the strict aliasing checker to open source software. This is good reading because it describes problems that are very common in the wild today. Finding a bug in zlib was a nice touch: zlib is small and has already been looked at closely. Some programs mitigate these bugs by asking the compiler to avoid enforcing the strict aliasing rules: LLVM, GCC, and Intel CC all take a -fno-strict-aliasing flag and MSVC doesn’t implement type-based alias analysis at all. Many other programs contain time bombs: latent UB bugs that don’t happen to be exploited now, that might be exploited later when the compiler becomes a bit brighter.

ARM processors are ubiquitous in mobile devices, but you won’t find a lot of developers using ARM-based boxes for their day-to-day work. Recently I needed some ARM machines for compiler work and got a few Raspberry Pi 3 boards, which seemed reasonable since these have four cores at 1.2 GHz. However, while these boards are far faster than the original Raspberry Pi, they’re still very painful for development work; the SD card interface seems to be a major bottleneck, and I’d guess the 512 KB last-level cache doesn’t hold enough of a developer workload’s working set to be very effective.

The obviously desirable ARM chip for a developer box is a ThunderX2, with up to 54 cores at 3 GHz, but these don’t seem to be available at all until later in 2017 and who knows when someone will get around to packaging these up in an affordable, developer-friendly box. The original ThunderX is available for servers but I didn’t find an inexpensive boxed-up version (there’s the Gigabyte R270-T61 that’s hard to even get a price on). Various fastish Qualcomm chips are also available on development boards but not as complete systems, as far as I could tell. I’ve done my time bringing up software environments on random development boards and am no longer interested in that.

Anyway, I settled on the SoftIron OverDrive 1000, which has pretty good specs: 8 GB of DDR4 RAM, 1 TB disk, and an AMD A1120 processor, an implementation of ARM’s Cortex-A57 design. The cores run at 1.7 GHz and each pair of cores shares a 1 MB L2 cache, with 8 MB of L3 cache shared across all four cores. The box is headless but has a USB serial console, gigabit Ethernet, SATA, and USB interfaces available. At $600 the price is right. Here’s the documentation.

The OverDrive 1000 ships with 64-bit openSUSE installed. I hadn’t used its zypper package manager before, but it is easy and pleasant to interact with. My one gripe with the default configuration so far is that it ships with under a GB of swap space, and when doing big compiles (and especially links, and even more especially links of debug builds) it’s easy to get zapped by the OOM killer. The disk is formatted with btrfs, which does not support swapfiles. SoftIron’s support was responsive but didn’t have a recipe for resizing the root partition without removing the hard drive and plugging it into a different machine, which I haven’t bothered to do yet.

Now let’s look at how fast this thing is. I’m not trying to be scientific here, just giving a general idea of what to expect from this box. Since I often sit around waiting for compilers, the benchmark is going to be LLVM 4.0 rc1 compiled using itself. The test input (program being compiled) is Crypto++ 5.6.5, which I chose since it compiles fairly quickly and doesn’t seem to have a lot of external dependencies. I compiled it with the -DCRYPTOPP_DISABLE_ASM flag to disable use of assembly language that might add compile-time differences across the platforms.

The processors I’m testing are just some that happened to be convenient. One machine is based on a Core i7-2600, a quad-core from 2011 running at 3.4 GHz with hyperthreading turned off. Another is based on an i7-6950X, a 10-core from 2016 running at 3.0 GHz with hyperthreading turned on. Finally, there’s a mid-2015 2.2 GHz Macbook Pro retina model with a Core i7-4770HQ, a quad-core, also with hyperthreading turned on.

The i7-2600 and the i7-6950X run hot: they are in the 100 W range. The i7-4770HQ is rated at 47 W but this includes the GPU. I’ll speculate that perhaps the power used by the CPU part is not that different from the 25 W used by the AMD A1120 (please leave a comment if you know more about this) (update: see this comment).

First, build time using one core (i.e. make -j1):

OverDrive 1000: 390 s
Macbook Pro: 177 s
i7-2600: 113 s
i7-6950X: 139 s

So the ARM chip is about 3.5x slower than the fastest of the Intel chips.

Second, compile times using 4 cores:

OverDrive 1000: 137 s
Macbook Pro: 57 s
i7-2600: 36 s
i7-6950X: 41 s

The OverDrive 1000 gets about a 2.8x speedup from four cores. Of course some of the non-linearity is due to sequential processing in the software build; when compiling other projects, I’ve seen more like a 3.5x speedup from using four cores.

Finally, compile times using all cores:

OverDrive 1000: 137 s
Macbook Pro: 49 s
i7-2600: 36 s
i7-6950X: 19 s

So here’s the worst-case slowdown of the OverDrive 1000: it’s 7.2x slower than the big Intel chip, but it’s a pretty unfair comparison since that chip costs $1,600.

Overall, the OverDrive 1000 is an inexpensive and capable machine, but it isn’t going to compete performance-wise with Intel boxes at the same price point (for example, here are the PCs you can buy at NewEgg for between $500 and $600). If you buy an OverDrive 1000, buy it because you want an ARMv8 machine that shows up ready to use and that isn’t an embedded systems toy like the Raspberry Pi family.

[My father, David Regehr, encouraged me to write this piece, provided some of its content, edited it, and agreed to let me use data from his farm.]
[For readers outside the USA: Alas, we do not farm in metric here. In case you’re not familiar with the notation, 10″ is ten inches (25.4 cm) and 10′ is ten feet (3.05 m). An acre is 0.4 hectares.]

Agriculture and technology have been intimately connected for the last 10,000 years. Right now, information technology is changing how we grow food; this piece takes a quick look at how that works.

Measurement

If soil conditions aren’t right, crops will grow poorly. For example, alfalfa grows best in soils with a pH between 6.5 and 7.5. Soils that are too acidic can be “fixed” by applying ground limestone (CaCO3) at rates determined by formulae based on chemical analysis. The process typically begins with taking soil samples (to an appropriate depth) in a zig-zag pattern across each field, mixing the samples in a bucket, and then sending a sub-sample to a laboratory where it’s analyzed for pH, cation exchange capacity, major nutrients such as phosphorus, potassium, and sulfur, and micro nutrients such as zinc. For more details, see this document about interpreting soil test results.

Applying a uniform rate of ag (agricultural) lime to an entire field is suboptimal where there is variation in soil pH within the field. Ag lime applied where it is not needed is not only a waste of money; it can raise soil pH to a point that is detrimental to crop growth. To characterize a field more accurately, it needs to be sampled at a finer granularity. For example, GPS grid lines can be superimposed on a field to locate points, each representing an area of, say, 2.5 acres. Around each such point, ten or more soil samples would be taken within a 30’ radius, mixed, sub-sampled, and GPS-tagged. From the resulting analysis, the lime requirement, along with the adequacy of other nutrients essential for plant growth, can be interpolated for all areas in the field using a model.

Let’s look at an example. This image shows the farm near Riley KS that my parents bought during the 1980s. I spent many afternoons and weekends working there until I moved out of the area in 1995. It’s a quarter-section; in other words, a half-mile on a side, or 160 acres. 135.5 of the acres are farmland and the remaining 24.5 are used by a creek, waterways (planted in grass to prevent erosion), buildings, and the driveway.

This image shows the points at which the fields were sampled for soil analysis in November 2015:

Each point represents a 1.25 acre area; this is pretty fine-grained sampling, corresponding to relatively small fields with terraces and other internal variation. A big, relatively homogeneous field on the high plains might only need to be sampled every 5 or 10 acres.

Here are the soil types:

This image shows how much sulfur the soil contains:

In the past it wasn’t necessary to fertilize with sulfur, because enough of it arrived as fallout from coal-burning power plants. This is no longer the case.

Another quantity that can be measured is crop yield: how much grain (or beans or whatever) is harvested at every point in a field? A combine harvester with a yield monitor and a GPS can determine this. “Point rows,” where a harvested swath comes to a point because the field is not completely rectangular, need to be specially taken into account: they cause the grain flow to be reduced not because yield is low but rather because the full width of the combine head is not being used. Yield data can be aggregated across years to look for real trends and to assess changes in how low-yield areas are treated.

Aerial measurement with drones or aircraft can be used to look for irregularities in a field: color and reflectivity at various wavelengths can indicate problems such as weeds (including, sometimes, identification of the offending species), insect infestations, disease outbreaks, and wet or dry spots. The alternative, walking each field to look for problems, is time consuming and risks missing things.

Some of the procedures in this section (maintaining a drone, intensive grid-sampling, interpreting soil test and yield results) are time-consuming and complicated, or require expensive equipment that would be poorly utilized if owned by an individual farmer. Such jobs can be outsourced to crop consultants, who may be hired on a per-acre basis during the growing season to monitor individual fields for pests and nutrient problems, handle irrigation scheduling, and so on. During the off-season, consultants may do grid sampling, attend subject-matter updates to maintain certification, and assist growers with data interpretation and planning. Many crop consultants have years of experience and see many fields every day; the services of this sort of person can reduce risks. Here’s the professional society for crop consultants and some companies that provide these services (1, 2).

Application

“Variable-rate application” means using the results of intensive soil grid sampling to apply seed, fertilizer, herbicide, insecticide, etc. in such a way that each location in the field receives the appropriate amount of whatever is being applied. For example, fewer seeds can be planted in parts of a field that have weaker capacity to store water in the soil, reducing the likelihood of drought stress.

Variable-rate can apply to an entire implement (planter or whatever) but it can also be applied at a finer granularity: for example, turning individual spray heads on and off to prevent harmful overspray or turning individual planter rows on and off to prevent gaps or double-planting on point rows and other irregularities. Imagine trying to achieve this effect using a 12-row planter without computer support:

Here’s the soil pH for my Dad’s farm and also the recommended amount of ag lime to apply for growing alfalfa:

For cropland on this farm, 443,000 pounds (221 US tons / 201 metric tons) of ag lime are needed to bring the soils to the target pH of 6.5, the minimum pH for good alfalfa or soybean production. Purchase, hauling, and variable-rate application of ag lime in this area run $20-25/ton, so the cost is roughly $5,000. However, because the land is farmed with no-till practices (i.e., no deep tillage to incorporate the lime), no more than about 1 ton/acre of ag lime is applied per year; the parts of the farm needing more than that will see their application costs double or triple, spread over several years. Soil conditions will change in fairly predictable ways, and it should be at least five years before these fields need to be sampled again.

Of course there are limits on how precisely a product can be applied to a field. Ag lime would typically be applied using a truck that spreads a 40′ swath of lime. Even if the spreader is calibrated well, there will be some error due to the width of the swath and also some error stemming from the fact that the spreader can’t instantaneously change its application rate. There might also be error due to latency in the delivery system but this could be compensated for by having the software look a few seconds ahead.
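The look-ahead fix is straightforward: if the delivery system lags by L seconds and the truck moves at v feet per second, command the rate prescribed for the point v·L feet ahead. A sketch with invented numbers (the step-function "prescription map" here is a stand-in for a real variable-rate map):

```cpp
// Prescribed application rate as a function of distance along the pass
// (a toy stand-in for a lookup into the prescription map).
double prescribed_rate(double position_ft) {
    return position_ft < 100.0 ? 1.0 : 2.0;  // tons/acre
}

// Command the rate for where the material will actually land.
double commanded_rate(double position_ft, double speed_ftps,
                      double latency_s) {
    return prescribed_rate(position_ft + speed_ftps * latency_s);
}
```

At 90 feet into the pass, moving 10 ft/s with 2 seconds of delivery latency, the spreader should already be commanding the 2 tons/acre rate, because the lime it releases now will land at the 110-foot mark.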

Here’s an analogous recommendation, this time for phosphorus in order to meet a target of 60 bushels per acre of winter wheat:

Phosphorus fertilizer application is an annual cost, which can vary greatly depending on type and price of formulation used. Most cropland farmers in this part of the world would figure on $25-35/acre for purchase and variable-rate application.

And finally, here’s the zinc recommendation for growing soybeans:

As you can see, much less zinc than lime is required: less than a ton of total product across the entire farm.

Automation

Driver-assist systems for cars are primarily about safety, and driverless cars need to pay careful attention to the rules of the road while not killing anyone. Automated driving solutions for tractors and harvesters seem to have evolved entirely independently and have a different focus: following field boundaries and swaths accurately.

An early automated row-following technology didn’t do any steering, but rather provided the farmer with a light bar that indicated deviation from an intended path. This was followed by autosteer mechanisms that at first just turned the steering wheel using a servo and that, in modern machines, issue steering commands via the power (hydraulic) steering system. The basic systems only handle driving across a field, leaving the driver to turn around at the end of each row. To use such a system you might make a perimeter pass and then a second pass around a field; this provides room to turn around and also teaches the autosteer unit about the area to be worked. Then, you might choose one edge of the field to establish the first of many parallel lines that autosteer will follow to “work” the interior of the field. Static obstacles such as trees or rocks can be marked so the GPS unit signals the driver as they’re approached. Dynamic obstacles such as animals or people are not accounted for by current autosteer systems; it’s still up to the driver to watch out for them. Autoturn is an additional feature that automates turning the tractor around at the end of the row.

Autosteer and autoturn aren’t about allowing farmers to watch movies and nap while working a field. Rather, by offloading the tiring, attention-consuming task of following a row to within a couple of inches, they free the farmer to monitor the field work: Is the planter performing as expected? Has it run out of seed? Autosteer also enables new farming techniques that would otherwise be infeasible. For example, one of my cousins has corn fields in central Kansas with 30″ row spacing that are sub-surface irrigated using lines of drip tape buried about 12″ deep, spaced 60″ apart. Sub-surface irrigation is far more efficient than overhead sprinkler irrigation, as it greatly reduces water loss to evaporation. As you can imagine, repairing broken drip tape is a difficult, muddy affair. So how does my cousin knife anhydrous ammonia into the soil to provide nitrogen for the corn? Very carefully, and using RTK guidance (next paragraph) to stay within 1-2 cm of the intended path, to avoid cutting the drip lines.

GPS readings can drift as atmospheric conditions change. So, for example, after taking a lunch break you might find your autosteer-guided tractor a foot or two off the line it was following an hour earlier. My Dad says this is commonplace, and there can be larger variance over larger time scales. Additionally, it is expected that a GPS will drop out or give erratic readings when signals reflect and when satellites are occluded by hills or trees. So how do we get centimeter-level accuracy in a GPS-based system? First, it is augmented with an inertial measurement unit: an integrated compass, accelerometer, and gyroscope. I imagine there’s some interesting Kalman filtering or similar going on to fuse the IMU readings with the GPS, but I don’t know too much about this aspect. Second, information about the location of the GPS antenna on the tractor is needed, especially the height at which it is mounted, which comes into play when the tractor tilts, for example due to driving over a terrace. Third, real-time kinematic (RTK) positioning uses a fixed base station to get very precise localization along a single degree of freedom. Often, this base station is located at the local co-op and farmers pay for a subscription. This web page mentions pricing: “Sloan Implement charges $1000 for a 1 year subscription to their RTK network per radio. If you have multiple radios on the farm, then it is $2500 for all of the radios on a single farm.”

A farm’s income depends entirely on a successful harvest. Often, harvesting is done during a rainy time of year, so fields can be too wet to harvest; in the meantime, if a storm knocks the crops down, yields will be greatly reduced. Thus, as soon as conditions are right, it is imperative to get the harvest done as quickly as possible. In practice this means maximizing the utilization of the combine harvester, which isn’t being utilized when it is parked next to a grain wagon to unload. It is becoming possible to have a tractor with a grain cart autonomously pull up alongside a working combine, allowing it to unload on-the-go, without requiring a second driver.

Conclusions

The population of the world is increasing while the amount of farmland is decreasing. Precision agriculture is one of the things making it possible to keep feeding the human race at an acceptable cost. I felt that this piece needed to be written up because awareness of this material seemed low among computer science and computer engineering professionals I talk to.


Background

Once a piece of software reaches a certain size, it is guaranteed to be loosely specified and not completely understood by any individual. It gets committed to many times per day by people who are only loosely aware of each others’ work. It has many dependencies including the compiler, operating system, and libraries, all of which are buggy in their own special ways, and all of which are updated from time to time. Moreover, it usually has to run atop several different platforms, each one individually quirky. Given the massive number of possibilities for flaky behavior, why should we expect our large piece of software to work as expected? One of the most important reasons is testing. That is, we routinely ensure that it works as intended in every important configuration and on every important platform, and when it doesn’t work we have smart people tracking down and fixing the issues.

Today we’re talking about testing LLVM. In some ways, a compiler makes a very friendly target for testing:

The input format (source code) and output format (assembly code) are well-understood and have independent specifications.

Many compilers have an intermediate representation (IR) that has its own documented semantics and can be dumped and parsed, making it easier (though not always easy) to test internals.

It is often the case that a compiler is one of several independent implementations of a given specification, such as the C++ standard, enabling differential testing. Even when multiple implementations are unavailable, we can often test a compiler against itself by comparing the output of different backends or different optimization modes.

Compilers are usually not networked, concurrent, or timing-dependent, and overall interact with the outside world only in very constrained ways. Moreover, compilers are generally intended to be deterministic.

Compilers usually don’t run for very long, so they don’t have to worry too much about resource leaks or recovering gracefully from error conditions.
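The differential-testing idea from the list above can be sketched in a few lines: run two implementations that should agree on pseudorandom inputs and flag any disagreement. Here the two "implementations" are toy popcount functions standing in for, say, two compilers or two optimization levels of the same compiler.

```cpp
#include <cstdint>
#include <random>

// Reference implementation: count bits one at a time.
int popcount_ref(uint32_t x) {
    int n = 0;
    for (; x; x >>= 1) n += x & 1;
    return n;
}

// "Optimized" implementation: standard SWAR bit trick.
int popcount_tricky(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    return (((x + (x >> 4)) & 0x0f0f0f0fu) * 0x01010101u) >> 24;
}

// Differential loop: any input where the two disagree exposes a bug
// in one of them -- no hand-written expected outputs required.
bool differential_test(int trials) {
    std::mt19937 rng(12345);  // fixed seed for reproducibility
    for (int i = 0; i < trials; ++i) {
        uint32_t x = rng();
        if (popcount_ref(x) != popcount_tricky(x)) return false;
    }
    return true;
}
```

The appeal is that the oracle comes for free from the second implementation; the hard part for compilers, as noted below, is generating inputs that avoid undefined behavior so that both implementations are actually obligated to agree.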

But in other ways, compilers are not so easy to test:

Production compilers are supposed to be fast, so they are often written in an unsafe language and may skimp on assertions. They use caching and lazy evaluation when possible, adding complexity. Furthermore, splitting compiler functionality into lots of clean, independent little passes leads to slow compilers, so there tends to be some glomming together of unrelated or not-too-closely-related functionality, making it more difficult to understand, test, and maintain the resulting code.

The invariants on compiler-internal data structures can be hellish and are often not documented completely.

Some compiler algorithms are difficult, and it is almost never the case that a compiler implements a textbook algorithm exactly, but rather a close or distant relative of it.

Compiler optimizations interact in difficult ways.

Compilers for unsafe languages do not have lots of obligations when compiling undefined behaviors, placing the responsibility for avoiding UB outside of the compiler (and on the person creating test cases for the compiler). This complicates differential testing.

The standards for compiler correctness are high: miscompilations are tough to debug, and they can quietly introduce security vulnerabilities in any code that they compile.

So, with that background out of the way, how is LLVM tested?

Unit Tests and Regression Tests

LLVM’s first line of defense against bugs is a collection of tests that get run when a developer builds the check target. All of these tests should pass before a developer commits a patch to LLVM (and of course many patches should include some new tests). I have a fairly fast desktop machine that runs 19,267 tests in 96 seconds. The number of tests that run depends on what auxiliary LLVM projects you have downloaded (compiler-rt, libcxx, etc.) and, to a lesser extent, on what other software gets autodetected on your machine (e.g. the OCaml bindings don’t get tested unless you have OCaml installed). These tests need to be fast so developers can run them often, as mentioned here. Additional tests get run by some alternate build targets such as check-all and check-clang.

Some of the unit/regression tests are at the API level; these use Google Test, a lightweight framework that provides C++ macros for hooking into the test framework. In a typical test from ValueTrackingTest.cpp, the first argument to the TEST_F macro indicates the name of the test case (a collection of tests) and the second names the individual test; the test body calls parseAssembly() to drive an LLVM API and then expectPattern() to check that this had the expected result. Many tests can be put into a single file, keeping things fast by avoiding forks/execs.

The other infrastructure used by LLVM’s fast test suite is lit, the LLVM Integrated Tester. lit is shell-based: it executes commands found in a test case, and considers the test to have been successful if all of its sub-commands succeed.

Here’s how a lit test case works; the example discussed below comes from the top of this file, which contains additional tests that don’t matter to us right now. The test case makes sure that InstCombine, the LLVM-level peephole optimization pass, is able to notice some useless instructions: a zext, a shl, and an add that are not needed. The CHECK-LABEL line looks for the line of optimized code that begins the function; the first CHECK-NEXT makes sure that the and instruction is on the next line; and the second CHECK-NEXT makes sure the ret instruction is on the line following the and (thanks, Michael Kuperstein, for correcting an earlier explanation of this test).

To run this test case, the file is interpreted three times. First, lit scans it looking for lines containing RUN: and executes each associated command. Second, the file is interpreted by opt, the standalone optimizer for LLVM IR; this happens because lit replaces the %s variable with the name of the file being processed. Since comments in textual LLVM IR are preceded by a semicolon, the lit directives are ignored by opt. The output of opt is piped to the FileCheck utility which parses the file yet again, looking for commands such as CHECK and CHECK-NEXT; these tell it to look for strings in its stdin, and to return a non-zero status code if any of the specified strings isn't found. (CHECK-LABEL is used to divide up a file into a collection of logically separate tests.)
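The CHECK/CHECK-NEXT semantics are simple enough to sketch in code (a toy of my own, obviously; the real FileCheck supports regexes, pattern variables, and much more): a CHECK scans forward for a line containing its pattern, while a CHECK-NEXT must match the line immediately following the previous match.

```cpp
#include <string>
#include <vector>

struct Directive { bool next; std::string pat; };  // next => CHECK-NEXT

// Returns true if the directives match the input lines, FileCheck-style.
bool filecheck(const std::vector<std::string>& lines,
               const std::vector<Directive>& dirs) {
    size_t i = 0;
    for (const auto& d : dirs) {
        if (d.next) {  // must match the line right after the previous match
            if (i >= lines.size() ||
                lines[i].find(d.pat) == std::string::npos)
                return false;
        } else {       // scan forward for any matching line
            while (i < lines.size() &&
                   lines[i].find(d.pat) == std::string::npos)
                ++i;
            if (i == lines.size()) return false;
        }
        ++i;  // a following CHECK-NEXT must match the next line
    }
    return true;
}
```

Run against the three lines of an optimized function, the directive sequence {CHECK "@f(", CHECK-NEXT "and", CHECK-NEXT "ret"} passes, while asking for ret directly after the label fails, mirroring how the test above pins down the exact shape of InstCombine's output.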

An important part of a long-term testing campaign is using coverage tools to find parts of the code base that aren't being tested. Here's a recent LLVM coverage report based on running the unit/regression tests. This data is pretty interesting to poke around in. Let's take a quick look at coverage of InstCombine, which is generally very good. An interesting project for someone wanting to get started with LLVM would be to write and submit test cases that cover untested parts of InstCombine. For example, consider the first uncovered code (colored red) in InstCombineAndOrXor.cpp: the comment above it tells us what the transformation is looking for, so it should be fairly easy to target this code with a test case.

Code that can't be covered is dead. Some dead code wants to be removed; other code, such as a case in the same file that is marked unreachable, is a bug if it isn't dead. Trying to cover those lines is a good idea, but then you're trying to find bugs in LLVM, as opposed to trying to improve the test suite. It would probably be good to teach the coverage tool not to tell us about lines that are marked unreachable.

The LLVM Test Suite

In contrast with the regression/unit tests, which are part of the main LLVM repository and can be run quickly, the test suite is external and takes longer to run. It is not expected that developers will run these tests prior to committing; rather, these tests get run automatically and often, on the side, by LNT (see the next section). The LLVM test suite contains entire programs that are compiled and run; it isn't intended to look for specific optimizations, but rather to help ascertain the quality and correctness of the generated code overall.

For each benchmark, the test suite contains test inputs and their corresponding expected outputs. Some parts of the test suite are external, meaning that there is support for invoking the tests, but the tests themselves are not part of the test suite and must be downloaded separately, typically because the software being compiled is not free.

LNT

LNT (LLVM Nightly Test) doesn't contain any test cases; it is a tool for aggregating and analyzing test results, focusing on monitoring the quality of the compiler's generated code. It consists of local utilities for running tests and submitting results, plus a server-side database and web frontend that make it easy to look through results. The NTS (Nightly Test Suite) results are here.

BuildBot

The Linux/Windows BuildBot and the Darwin one (I don't know why there are two) are used to make sure LLVM configures, builds, and passes its unit/regression tests on a wide variety of platforms and in a variety of configurations. The BuildBot has some blame support to help find problematic commits and will send mail to their authors.

Eclectic Testing Efforts

Some testing efforts originate outside of the core LLVM community and aren't as systematic in terms of which versions of LLVM get tested. These tests represent efforts by individuals who usually have some specific tool or technique to try out. For example, for a long time my group tested Clang+LLVM using Csmith and reported the resulting bugs. (See the high-level writeup.) Sam Liedes applied afl-fuzz to the Clang test suite. Zhendong Su and his group have been finding a very impressive number of bugs. Nuno Lopes has done some awesome formal-methods-based testing of optimization passes that he'll hopefully write about soon.

A testing effort that needs to be done is repeatedly generating a random (but valid) IR function, running a few randomly-chosen optimization passes on it, and then making sure the optimized function refines the original one (the desired relationship is refinement, rather than equivalence, because optimizations are free to make the domain of definedness of a function larger). This needs to be done in a way that is sensitive to LLVM-level undefined behavior. I've heard that something like this is being worked on, but don't have details.

Testing in the Wild

The final level of testing is, of course, carried out by LLVM's users, who occasionally run into crashes and miscompiles that have escaped other testing methods. I've often wanted to better understand the incidence of compiler bugs in the wild. For crashes this could be done by putting a bit of telemetry into the compiler, though few would use this if opt-in, and many would (legitimately) object if opt-out. Miscompiles in the wild are very hard to quantify. My hypothesis is that most miscompiles go unreported since reducing their triggers is so difficult. Rather, as people make pseudorandom code changes during debugging, they eventually work around the problem by luck and then promptly forget about it.

A big innovation would be to ship LLVM with a translation validation scheme that would optionally use an SMT solver to prove that the compiler's output refines its input. There are all sorts of challenges including undefined behavior and the fact that it's probably very difficult to scale translation validation up to the large functions that seem to be the ones that trigger miscompilations in practice.

Alternate Test Oracles

A "test oracle" is a way to decide whether a test passes or fails. Easy oracles include "compiler terminates with exit code 0" and "compiled benchmark produces the expected output," but these miss lots of interesting bugs, such as a use-after-free that doesn't happen to trigger a crash, or an integer overflow (see page 7 of this paper for an example from GCC). Bug detectors like ASan, UBSan, and Valgrind can instrument a program with oracles derived from the C and C++ language standards, providing lots of useful bug-finding power. To run LLVM under Valgrind when executing it on its test suite, pass -DLLVM_LIT_ARGS="-v --vg" to CMake, but be warned that Valgrind will give some false positives that seem to be difficult to eliminate. To instrument LLVM using UBSan, pass -DLLVM_USE_SANITIZER=Undefined to CMake. This is all great, but there's more work left to do: UBSan/ASan/MSan don't yet catch all undefined behaviors, and there are also defined-but-buggy behaviors, such as the unsigned integer overflow in GCC mentioned above, that we'd like to flag when they are unintentional.

What Happens When a Test Fails?

A broken commit can cause test failure at any level. The offending commit is then either amended (if easy to fix) or backed out (if it turns out to be deeply flawed or otherwise undesirable in light of the new information supplied by failing tests). These things happen reasonably often, as they do in any project that is rapidly pushing changes into a big, complicated code base with many real-world users.

When a test fails in a way that is hard to fix right now, but that will get fixed eventually (for example when some new feature gets finished), the test can be marked XFAIL, or "expected failure." These are counted and reported separately by the testing tool and they do not count towards the test failures that must be fixed before a patch becomes acceptable.
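For the lit-based tests in the test directory, the marking is a comment directive at the top of the .ll file; a sketch (the RUN line here is illustrative, not from a real test):

```llvm
; XFAIL: *
; RUN: opt < %s -instcombine -S | FileCheck %s
```

An XFAIL line can also name specific targets instead of *, so that a test is expected to fail only on some platforms.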

Conclusions

Testing a large, portable, widely-used software system is hard; there are a lot of moving parts and a lot of ongoing work is needed if we want to prevent LLVM's users from being exposed to bugs. Of course there are other super-important things that have to happen to maintain high-quality code: good design, code reviews, tight semantics on the internal representation, static analysis, and periodic reworking of problematic areas.

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.

The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.

The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.

bindings: code that permits LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).

cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.

docs: documentation in reStructuredText. See, for example, the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here). The material in the tutorial subdirectory is particularly interesting, but don’t look at it there; rather, go here. This is the best way to learn LLVM!

examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.

include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.

projects: doesn’t contain anything by default, but it’s where you check out LLVM components, such as compiler-rt (the runtime library for things like sanitizers), OpenMP support, and the LLVM C++ standard library, that live in separate repos.

runtimes: another placeholder for external projects, added only last summer; I don’t know what actually goes here.

test: this is a biggie; it contains many thousands of unit tests for LLVM, which get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR, and they test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.

tools: LLVM itself is just a collection of libraries, there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.

unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.

utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and, plenty of other random stuff. You probably don’t care about most of this.

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look at its subdirectories now:

Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.

Bitcode: serialize IR into the compact format and read it back into RAM

CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.

DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.

ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.

Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.

IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.

Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.

LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.

MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.

Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.

ProfileData: Read and write profile data to support profile-guided optimizations

Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.

TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.

Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.

Transforms: this is my favorite directory; it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries; they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports Objective-C Automatic Reference Counting. Scalar contains a pile of compiler-textbooky kinds of optimizers; I’ll try to write a more detailed post about the contents of this directory at some point. Utils contains helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

The other day Geoff Challen posted a blog entry about his negative tenure vote. Having spent roughly equal time on the getting-tenure and having-tenure sides of the table, I wanted to comment on the process a little. Before going any further I want to clarify that:

I know Geoff, but not well

I wasn’t involved in his case in any capacity, for example by writing a letter of support

I have no knowledge of his tenure case beyond what was written up in the post

Speaking very roughly, we can divide tenure cases into four quadrants. First, the professor is doing well and the tenure case is successful — obviously this is what everybody wants, and in general both sides work hard to make it happen. Second, the professor is not doing well (not publishing at all, for example) and the tenure case is unsuccessful. While this is hugely undesirable, at least the system is working as designed. Third, the professor is not doing well and the tenure case is successful — this happens, but very rarely and usually in bizarre circumstances, for example where the university administration overrules a department’s decision. Finally, we can have a candidate who is doing well and then is denied tenure. This represents a serious failure of the system. Is this what happened to Geoff? It’s hard to be sure but his academic record looks to me like a strong one for someone at his career stage. But keep in mind that it is (legally) impossible for the people directly involved in Geoff’s case to comment on it, so we are never going to hear the other side of this particular story.

So now let’s talk about how tenure is supposed to work. There are a few basic principles (I suspect they apply perfectly well to performance evaluations in industry too). First, the expectations must be made clear. Generally, every institution has a written document stating the requirements for tenure, and if a department deviates from them, decisions they make can probably be successfully appealed. Here are the rules at my university. Junior faculty need to look up the equivalent rules at their institution and read them, but of course the university-level regulations miss out on department-specific details such as what exactly constitutes good progress. It is the senior faculty’s job to make this clear to junior faculty via mentoring and via informal faculty evaluations that lead up to the formal ones.

If you look at the rules for tenure at Utah, you can see that we’re not allowed to deny tenure just because we think someone is a jerk. On the other hand, there is perhaps some wiggle room implied in this wording: “In carrying out their duties in teaching, research/other creative activity and service, faculty members are expected to demonstrate the ability and willingness to perform as responsible members of the faculty.” I’m not sure what else to say about this aspect of the process: tenure isn’t a club for people we like, but on the other hand the faculty has to operate together as an effective team over an extended period of time.

The second principle is that the tenure decision should not be a surprise. There has to be ongoing feedback and dialog between the senior faculty and the untenured faculty. At my institution, for example, we review every tenure track professor every year, and each such evaluation results in a written report. These reports discuss the candidate’s academic record and provide frank evaluations of strengths and weaknesses in the areas of research, teaching, and service (internal and external). The chair discusses the report with each tenure-track faculty member each year. The candidate has the opportunity to correct factual errors in the report. In the third and sixth years of a candidate’s faculty career, instead of producing an informal report (that stays within the department), we produce a formal report that goes up to the university administration, along with copies of all previous reports. The sixth-year formal evaluation is the one that includes our recommendation to tenure (or not) the candidate.

A useful thing about these annual evaluations is that they provide continuity: the reports don’t just go from saying glowing things about someone in the fifth year to throwing them under the bus in the sixth. If there are problems with a case, this is made clear to the candidate as early as possible, allowing the candidate, the candidate’s mentor(s), and the department chair to try to figure out what is going wrong and fix it. For example, a struggling candidate might be given a teaching break.

Another thing to keep in mind is that there is quite a bit of scrutiny and oversight in the tenure process. If a department does make a recommendation that looks bad, a different level of the university can overrule it. I’ve heard of cases where a department (not mine!) tried to tenure a research star who was a very poor teacher, but the dean shot down the case.

If you read the Hacker News comments, you would probably come to the conclusion that tenure decisions are made capriciously in dimly lit rooms by people smoking cigars. And it is true that, looking from the outside, the process has very little transparency. The point of this piece is that internally, there is (or should be) quite a bit of transparency and also a sane, well-regulated process with plenty of checks and balances. Mistakes and abuses happen, but they are the exception and not the rule.

Phil Guo, Sam Tobin-Hochstadt, and Suresh Venkatasubramanian gave me a bit of feedback on this piece but as always any blunders are mine. Sam pointed me to The Veil, a good piece about tenure.

I ran into this derivation when I was nine or ten years old and it made me deeply uneasy. The explanation, that you’re not allowed to divide by (a – b) because this term is equal to zero, seemed to raise more questions than it answered. How are we supposed to keep track of which terms are equal to zero? What if something is equal to zero but we don’t know it yet? What other little traps are lying out there, waiting to invalidate a derivation? This was one of many times where I noticed that in school they seemed willing to teach the easy version, and that the real world was never so nice, even in a subject like math where — you would think — everything is clean and precise.
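For readers who haven’t seen it, the derivation in question is presumably the classic fallacious “proof” that 2 = 1, which hinges on exactly the division described above:

```latex
\begin{align*}
a &= b \\
a^2 &= ab \\
a^2 - b^2 &= ab - b^2 \\
(a+b)(a-b) &= b(a-b) \\
a + b &= b \qquad \text{(dividing both sides by } a-b\text{, which is } 0\text{)} \\
2b &= b \\
2 &= 1
\end{align*}
```

Every step is valid except the division by $a-b$: since $a = b$, that step divides by zero, and the derivation silently leaves the domain where the rules apply.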

Anyway, the point is that undefined behavior has been confusing people for well over a thousand years — we shouldn’t feel too bad that we haven’t gotten it right in programming languages yet.

I’ve had a post with this title on the back burner for years, but I was never quite convinced that it would say anything I haven’t said before. Last night I watched Chandler Carruth’s talk about undefined behavior at CppCon 2016; it is good material, and he says it better than I think I would have, so I wanted to chat about it a bit.

Chandler is not a fan of the term nasal demons, which he says is misleadingly hyperbolic, since the compiler isn’t going to maliciously turn undefined behavior (UB) into code for erasing your files or whatever. This is true, but Chandler leaves out the fact that our 28-year-long computer security train wreck (the Morris Worm seems like as good a starting point as any) has been fueled to a large extent by undefined behavior in C and (later) C++ code. In other words, while the compiler won’t emit system calls for erasing your files, a memory-related UB in your program will permit a random person on the Internet to insert instructions into your process that issue system calls doing precisely that. From this slightly broader point of view, nasal demons are less of a caricature.

The first main idea in Chandler’s talk is that we should view UB at the PL level as being analogous to narrow contracts on APIs. Let’s look at this in more detail. An API with a wide contract is one where you can issue calls in any order, and you can pass any arguments to API calls, and expect predictable behavior. One simple way that an API can have a wider contract is by quietly initializing library state upon the first call into the library, as opposed to requiring an explicit call to an init() function. Some libraries do this, but many libraries don’t. For example, an OpenSSL man page says “SSL_library_init() must be called before any other action takes place.” This kind of wording indicates that a severe obligation is being placed on users of the OpenSSL API, and failing to respect it would generally be expected to result in unpredictable behavior. Chandler’s goal in this first part of the talk is to establish the analogy between UB and narrow API contracts and convince us that not all APIs want to be maximally wide. In other words, narrow APIs may be acceptable when their risks are offset by, for example, performance advantages.

Coming back to programming languages (PL), we can look at something like the signed left shift operator as exposing an API. The signed left shift API in C and C++ is particularly narrow and while many people have by now internalized that it can trigger UB based on the shift exponent (e.g., 1 << -1 is undefined), fewer developers have come to terms with restrictions on the left hand argument (e.g., 0 << 31 is defined but 1 << 31 is not). Can we design a wide API for signed left shift? Of course! We might specify, for example, that the result is zero when the shift exponent is too large or is negative, and that otherwise the result is the same as if the signed left-hand argument was interpreted as unsigned, shifted in the obvious way, and then reinterpreted as signed.

At this point in the talk, we should understand that “UB is bad” is an oversimplification, that there is a large design space relating to narrow vs. wide APIs for libraries and programming language features, and that finding the best point in this design space is not straightforward since it depends on performance requirements, on the target platform, on developers’ expectations, and more. C and C++, as low-level, performance-oriented languages, are famously narrow in their choice of contracts for core language features such as pointer and integer operations. The particular choices made by these languages have caused enormous problems and reevaluation is necessary and ongoing. The next part of Chandler’s talk provides a framework for deciding whether a particular narrow contract is a good idea or not.

Chandler provides these four principles for narrow language contracts:

1. Checkable, at least probabilistically, at runtime
2. Provide significant value: performance, simplicity, etc.
3. Easily explained and taught
4. Not widely violated by existing code that works correctly and as intended

The first criterion, runtime checkability, is crucial and unarguable: without it, we get latent errors of the kind that continue to contribute to insecurity and that have been subject to creeping exploitation by compiler optimizations. Checking tools such as ASan, UBSan, and tis-interpreter reduce the problem of finding these errors to the problem of software testing, which is very difficult, but which we need to deal with anyhow since there’s more to programming than eliminating undefined behaviors. Of course, any property that can be checked at runtime can also be checked without running the code. Sound static analysis avoids the need for test inputs but is otherwise much more difficult to usefully implement than runtime checking.

Principle 2 tends to cause energetic discussions, with (typically) compiler developers strongly arguing that UB is crucial for high-quality code generation and compiler users equally strongly arguing for defined semantics. I find the bug-finding arguments to be the most interesting ones: do we prefer Java-style two’s complement integers or would we rather retain maximum performance as in C and C++ or mandatory traps as in Swift or a hybrid model as in Rust? Discussions of this principle tend to center around examples, which is mostly good, but is bad in that any particular example excludes a lot of other use cases and other compilers and other targets that are also important.

Principle 3 is an important one that tends to get neglected in discussions of UB. The intersection of HCI and PL is not incredibly crowded with results, as far as I know, though many of us have some informal experience with this topic because we teach people to program. Chandler’s talk contains a section on explaining signed left shift that’s quite nice.

Finally, Principle 4 seems pretty obvious.

One small problem you might have noticed is that there are undefined behaviors that fail one or more of Chandler’s criteria, yet that many C and C++ compiler developers will defend to their dying breath. I’m talking about things like strict aliasing and termination of infinite loops, which violate (at least) principles 1 and 3.

In summary, the list of principles proposed by Chandler is excellent and, looking forward, it would be great to use it as a standard set of questions to ask about any narrow contract, preferably before deploying it. Even if we disagree about the details, framing the discussion is super helpful.

The other day a non-CS friend remarked to me that since computer science is a quantitative, technical discipline, most issues probably have an obvious objective truth. Of course this is not at all the case, and it is not uncommon to find major disagreements even when all parties are apparently reasonable and acting in good faith. Sometimes these disagreements spill over into the public space.

The purpose of this post is to list a collection of public debates in academic computer science where there is genuine and heartfelt disagreement among intelligent and accomplished researchers. I sometimes assign these as reading in class: they are a valuable resource for a couple of reasons. First, they show an important part of science that often gets swept under the rug. Second, they put discussions out into the open where they are widely accessible. In contrast, I’ve heard of papers that are known to be worthless by all of the experts in the area, but only privately — and this private knowledge is of no help to outsiders who might be led astray by the bad research. For whatever reasons (see this tweet by Brendan Dolan-Gavitt) the culture in CS does not seem to encourage retracting papers.

N-version programming is a software development method where several implementations of a specification are run in parallel and voting is used to determine the correct result. Knight and Leveson wrote a paper showing that the assumption of independent faults in independent implementations may not be a good one. This finding did not sit well with the proponents of n-version programming and while I cannot find online copies of their rebuttals, Knight and Leveson’s reply to the criticisms includes plenty of quotes. This is great reading, a classic of the genre.