Menu

Stuff Matters

This is a short, enjoyable read on material science, explaining some basic properties of materials that are the building blocks of our modern world, such as cement, steel, plastic, and glass. I learned a few interesting things, such as the use of bacteria in concrete, to make it self-healing, and the existence of a wonderous type of material called Aerogel.

The Secret War

One of the central themes of Max Hasting’s book on espionage in the Second World War is the disparity between human intelligence—sending spies out into the field, or recruiting double agents—and signals intelligence—intercepting and decrypting enemy communication. All major powers invested significant resources in humint, but for a variety of reasons almost all of this effort was in vain.

The three big totalitarian belligerents’ intelligence agencies suffered pressure from the top to prefer fantasy to reality. Germany:

Before the June 1941 invasion of Russia, Gen. Georg Thomas of the WiRuAmt—the Wehrmacht’s economics department—produced estimates of Soviet weapons production which approached the reality, though still short of it, and argued that the loss of European Russia would not necessarily precipitate the collapse of Stalin’s industrial base. Hitler dismissed Thomas’s numbers out of hand, because he could not reconcile their magnitude with his contempt for all things Slavonic. Field-Marshal Wilhelm Keitel eventually instructed the WiRuAmt to stop submitting intelligence that might upset the Führer.

Japan:

In Japan as in Nazi Germany, it had become an institutional precept that no intelligence assessment could be countenanced by policy-makers which ran contrary to a desired national course. Again and again between the 1930s and 1945, strategy was distorted to conform with the visceral inclinations and ambitions of commanders, rather than with realities, of which by far the most important were America’s economic superiority and Germany’s precarious strategic predicament.

And the Soviet Union, in particular before and in the earlier stages of the war:

Soviet intelligence officers feared for their lives, with good reason, if they told Stalin what he did not want to hear. He seemed to credit only reports that identified plots against himself or the state, at home and abroad. Where these did not exist, Russia’s most senior intelligence officers invented them. Stalin used the product of his codebreakers to some effect where and when this was available, but entered the greatest conflict in history almost blind through his own acts of will.

Stalin smartened up as the war went on, and spent much effort spying not only on his Axis enemies but also on his allies who, in turn, were wholly focused on the Axis, a fact the Soviets had difficulty believing:

The Russians had renewed contact with [a spy in London] nine months earlier, but his initial reports about life at Broadway earned their scorn. He asserted that the Soviet Union stood only tenth on MI6’s penetration target list, an incredible proposition to [Soviet intelligence], which was convinced that the existential purpose of the British secret service was to achieve the destruction of the Soviet Union. Russia’s leaders inhabited a society in which nobility of conduct was alien, indeed dangerous to the state. They were thus unable to credit the fact that for the war’s duration even the most impassioned anti-communists, including Churchill, had set aside their hostility to throw everything into the struggle against the Axis.

As a result of that Stalin had a huge advantage when negotiating with the other Allies:

The Russians’ default diplomatic posture towards the Western Allies of stone-faced indignation successfully concealed from Washington and London the fact that, at summit meetings, the Soviet delegation was fully informed in advance of intended British and American positions. Churchill especially, who often awaited apprehensively Stalin’s response to unwelcome surprises, especially about delays to D-Day, might have spared himself discomfort. The ‘surprises’ were nothing of the sort: the Soviet dictator merely brilliantly simulated amazement, then unleashed anger to order. It was impossible for Churchill and Roosevelt to play poker with the Kremlin, because Stalin knew their hands.

Aversion to truth was much less pronounced in the Western Allies’ hierarchies, but it did appear in some areas:

So fixated were senior RAF and USAAF officers with their determination to demonstrate that strategic air bombardment could win the war, that the history of the bomber commands’ intelligence departments shows an institutionalised commitment to fantasy, of a kind more usual in the German and Japanese high commands.

Humint for the Western Allies was ineffectual for other reasons, one of which being that they just weren’t very good at it:

Meanwhile two Dutch SOE agents were dropped under circumstances which suggested fantastic carelessness in Baker Street: both were issued with forged identity cards on which the royal arms of Holland were represented by two lions which both faced the same way, instead of addressing each other.

One aspect of intelligence the Americans were particularly successful at was economic analysis:

[…] R&A became fascinated by the possibilities of tabulating vehicle serial numbers to compute German production.

The book is full of incredible stories, like that of agent “Max”, a German “collaborator” invented by the Soviets to feed the Germans (mis)information. Max’s most important contribution was informing the Germans of the impending offensive “Mars”. All the information given to the Germans about Mars was correct. The purpose of telling them was to shift Wehrmacht troops away from Stalingrad, where an even bigger Soviet offensive loomed, of which the Germans knew nothing, eventually encircling their army there. The Mars offensive still went ahead and was repulsed by the prepared Germans, at the cost of 70,000 Soviet lives. Max, having correctly predicted Mars, continued to be trusted by Berlin until the end of the war.

Hastings also has much to say about signals intelligence, which, especially for the Western Allies, was spectacularly successful. He leaves the cryptographic theory to other authors (see, for example Simon Singh’s “The Code Book”, or Andrew Hodges’ “Alan Turing: The Enigma”[1]), and focuses instead on how it was exploited, and on the practical aspects of its acquisition. Many ciphers could not be broken automatically, but required secret codes which had to be obtained in the field, without the enemy’s knowledge, lest they changed them:

On 1 April, two Imperial Japanese Navy flying-boats were damaged in a tropical storm, en route from Palau to Davao. One of them carried Admiral Mineichi Koga, commander-in-chief of the Combined Fleet. In the second was Vice-Admiral Shigeru Fukudome. When this plane ditched off Cebu island, Fukudome floundered ashore without his attaché case, containing Japanese codes and important strategy documents in plain language. Guerrillas on Cebu alerted the Americans, who got to the plane. A US submarine rushed the attaché case to the Australian Army’s intelligence department, where Fukudome’s codes and documents were photographed. Then the case was hastened back to the crash area for local people to hand over to the Japanese, claiming that they had chanced upon it. Fukudome himself eventually got home, to be forgiven and promoted. The Japanese navy never suspected that its haul of secrets had passed through American hands.

Once they could read the enemy’s dispatches, it was in their interest that they kept coming:

Signals intelligence became so central to the Allied war effort that from 1944 onwards the Americans became reluctant to bomb identified Japanese wireless communications centres, because their output seemed more useful to Allied military operations than to those of Nippon.

One of my favorite tidbits in the book is this, on the very unstereotypical treatment by the Nazis of prisoners:

The Wehrmacht’s ‘Guidelines for the interrogation of English prisoners of war’, dated Berlin, 16 April 1940, urged commanders whenever possible to use interrogators familiar with Britain and the British. ‘If cordially addressed,’ said the briefing note, ‘every Englishman will at once answer all questions entirely frankly.’

Doing Good Better

William MacAskill’s canonical book on Effective Altruism deals with the question of how to seriously, truly, improve the lives of other people (and animals).

A large part of the book is devoted to charities, which have historically been very bad at measuring how much good they actually do, or even whether their net impacts are positive at all. Effective Altruism tries to change that trend. GiveWell, for instance, evaluates charities’ impacts and gives recommendations. If you give money to charity, you should most definitely read this book. If you can’t be bothered, please just give your money to the Against Malaria Foundation.

MacAskill touches on many other topics, such as what kind of career you should choose if you want to do the most good, whether you should buy FairTrade, how terrible (or not) sweatshops are, and whether immigration is a good thing. The answers to most of these questions might surprise you.

The most stunning factoid I learned from this book, about the eradication of smallpox in 1973:

Suppose we’d achieved world peace in 1973. How many deaths would have been prevented? That timescale includes the killings of Cambodia’s Khmer Rouge, the Rwandan genocide, the two Congo wars, the 9/11 attacks and the wars in Afghanistan and Iraq. If you add up all the wars, genocides, and terrorist acts that occurred since 1973, the death toll is a staggering twelve million. Prior to its eradication, smallpox killed 1.5 to 3 million people every year, so by preventing these deaths for over forty years, its eradication has effectively saved somewhere between 60 and 120 million lives. The eradication of smallpox is one success story from aid, saving five times as many lives as world peace would have done.

The Three Body Problem Trilogy

The trilogy by Cixin Liu is a science fiction story about contact with an alien civilization. It is as brilliant and full of wonderful ideas as it is frustrating and inscrutable. When I watch or read fiction I mostly enjoy stories about smart people solving difficult problems in smart ways. That’s why I can’t watch “The Walking Dead”, for instance, which devolves from “we’ll have to be really smart and work together to not get eaten by those zombies” to “whatever, let’s just kill each other instead”. That’s why “Captain America: Civil War” is a bad movie.

The most infuriating character is the unbearably sentimental protagonist of the third book, who on multiple occasions has to make decisions that determine the fate of humanity, and every single time she chooses the obviously wrong option. The most egregious instance is when she considers being a candidate for a position crucial to the deterrence of alien aggression. She gives three reasons for why she should be a candidate, all of them vacuous. The question of whether she is actually qualified does not enter her mind, which of course means that she isn’t. Alien aggression follows shortly thereafter.

Still, the books are brilliant and full of wonderful ideas.

Worst book I’ve read in 2016: Aurora

Kim Stanley Robinson’s book, on the other hand, is horribly frustrating while being neither brilliant nor full of wonderful ideas. A short summary of it is: everything sucks, so why bother?

A generation ship is sent to the star Tau Ceti, to colonize the moon Aurora there. Upon disembarkation the colonists discover a substance on the surface that kills them on direct, unprotected contact. Half the ship promptly decides they want to go back to Earth, the other faction wants to go on to find another suitable planet. Neither faction considers solving the problem, or contacting Earth to ask them to solve the problem. In fact, we only discover very late in the book, and almost as an aside, that the ship even has bidirectional communication with Earth. It doesn’t seem important. What help could the science of a ten billion strong civilization possibly provide, after all? The answer is cryogenics, finally developed, which the homeward-bound faction uses to survive their long trip back to Earth where, they learn, it sucks, too.

One of the revenants lays it all out in a speech at the end:

There are ecological, biological, sociological, and psychological problems that can never be solved to make this idea work.

And please disregard the movie “The Imitation Game”, which is historically accurate inasmuch as people with the names of the movies’ protagonists did exist, and were involved in one way or another in code breaking. Hastings actually refers to the movie when he talks about Turing, saying it “offered a version of his experience at Bletchley Park that was a travesty of the reality: […]” ↩

Anathem

Maybe 15 years ago I read William Gibson’s Neuromancer and hated it. At some point after that my brain started confusing Gibson with Neal Stephenson, with the result that I would not touch any of Stephenson’s books, until this year when finally an overwhelming number of recommendations accumulated. I relented and picked up “Anathem”, expecting confused and unfinished sentences, when instead I was hooked right from the start.

“Anathem” is set on a world that is home to an ancient system and culture of secular[1] monasteries, dedicated to science and philosophy, and carefully sealed from the rest of civilization. Most of the plot plays out in those monasteries, slowly revealing their weird culture, such as how even the monasteries are strictly separated into parts that open every ten years, every hundred years, and every thousand years, respectively.

The civilizations outside the monasteries are volatile, sometimes reverting from a modern culture to medieval barbarism within decades, it seems, something that’s never explained. The book starts off with a tenner monk interviewing somebody from the outside world, in preparation for the opening of the tenner-part of the monastery, leading off with a question from a script for making contact, “Do your neighbors burn one another alive?”

During the time the monastery is open, a significant and shocking discovery is made, the nature of which would be a spoiler if you haven’t read it, but it took me from appreciating the world Stephenson had built, and being curious about its many intricacies to being totally engrossed in the plot and longing for explanations, which Stephenson, through his characters’ philosophical discussions, does provide.

I proceeded to read “Cryptonomicon” and “Seveneves”, but out of those three “Anathem” was by far the strongest book for me.

Zero to One

This is a book about entrepreneurship and innovation, by Peter Thiel, one of PayPal’s founders. He puts out a lot of ideas, not all of which I agree with, but many of which I found good food for thought.

One of Thiel’s central theses is that only monopolies can innovate. Businesses in competition have too much on their hands just to survive and cannot do long-term development:

In business, money is either an important thing or it is everything. Monopolists can afford to think about things other than making money.

A monopoly is not necessarily a huge business. Most successful start-ups begin very small but they’re the only ones providing a particular service. In their domains, they are monopolies.

To get there, according to Thiel, you need a definite outlook on the future—having an idea of where things are going. He believes that most of Western society today is indefinite, explaining the prevalence of finance:

Finance epitomizes indefinite thinking because it’s the only way to make money when you have no idea how to create it.

But in an indefinite world, people actually prefer unlimited optionality; money is more valuable than anything you could possibly do with it.

We place far too much importance on “well-rounded” education:

a definite person determines the one best thing to do and then does it. Instead of working tirelessly to make herself indistinguishable, she strives to be great at something substantive—to be a monopoly of one.

This resonates with me, having had to waste time at school studying parochial subjects such as Austrian history and geography or Latin, when I would have been better off just doing math and English, and having the rest of my time for playing with computers.

Another central theme in the books is secrets. A secret is something you know, but nobody else yet knows, or believes. It’s an essential ingredient to a start-up. Discovering secrets might seem daunting, but Thiel is optimistic:

There’s plenty more to learn: we know more about the physics of faraway stars than we know about human nutrition.

Thing Explainer

Bill Gates wrote a wonderful review of Randall Munroe’s “Thing Explainer”, a book that uses illustrations and only the one thousand most used English words to explain difficult concepts and products.

The language is not just a gimmick. Avoiding difficult terminology makes Munroe reduce the inferential distance to his readers, making even difficult concepts easy to grasp. His explanation of turbofan jet engines—sky boat pushers—is much better than the one on the Wikipedia page.

Sometimes the reduction of vocabulary makes banal statements utterly hilarious, like when, describing “bags of stuff inside you”, he writes about the mouth:

It’s where air goes in and out, food goes in, and words come out. Note: Some people don’t like it when you make words come out while you’re putting food in.

Aside from being entertained, I learned quite a few things. For example, in nuclear reactors, there is a room below the core:

If there are problems and everything is on fire, the special metal can get so hot that it starts moving like water. Sometimes, it can get hot enough to burn a hole through the floor. If that happens, this room is here so the watery metal can fall down and then spread out over the floor.

It’s good if the metal can spread out, since when it’s all close together, it keeps making itself hotter. If this room ever gets used, it means everything has gone very, very wrong.

I read the Kindle edition on my iPad, which worked very well, except that the page explaining operating rooms—rooms for helping people—was broken. I hope that’ll get fixed in an update soon.

Rationality: From AI to Zombies

This book is a compilation of Eliezer Yudkowsky’s writings on the topics of rationality and artificial intelligence. It contains a lot of wisdom. As far as I know all of it can be found online, but the book provides a nice structure.

I have highlighted and annotated more passages in this book than in any other by far. This is but a small selection.

Yudkowsky is big on separating the map, your beliefs, from the territory, reality. A lot of “philosophy” and deep-sounding “wisdom” vanishes when you do, like some people’s love of mystery:

But ignorance exists in the map, not in the territory. If I am ignorant about a phenomenon, that is a fact about my own state of mind, not a fact about the phenomenon itself. A phenomenon can seem mysterious to some particular person. There are no phenomena which are mysterious of themselves. To worship a phenomenon because it seems so wonderfully mysterious is to worship your own ignorance.

The ontological argument is just one weird application of this confusion, treating the map (your idea of God) like territory that just lacks existence. It’s as if the shape I drew into the Pacific Ocean part of my globe were a full-fledged continent whose only drawback was that it doesn’t exist.

This carries over to other topics, like in his discussion of quantum mechanics (he’s a firm proponent of the Many-worlds “interpretation” of QM, as he should be):

Quantum physics is not “weird.” You are weird. You have the absolutely bizarre idea that reality ought to consist of little billiard balls bopping around, when in fact reality is a perfectly normal cloud of complex amplitude in configuration space. This is your problem, not reality’s, and you are the one who needs to change.

If we want our map to reflect the territory as best we can, we better use words whose boundaries are boundaries in the real world, too. Hence we should be careful about how we define words:

Your brain doesn’t treat words as logical definitions with no empirical consequences, and so neither should you. The mere act of creating a word can cause your mind to allocate a category, and thereby trigger unconscious inferences of similarity. Or block inferences of similarity; if I create two labels I can get your mind to allocate two categories.

Many chapters on cognitive biases and how to combat them. For example, when confronted with a problem, hold off on proposing solutions, because

Once you can guess what your answer will be, you have probably already decided. If you can guess your answer half a second after hearing the question, then you have half a second in which to be intelligent. It’s not a lot of time.

Some beliefs are very important to us, but yet we might be wrong. How can we even begin to evaluate those beliefs dispassionately?

You would advise a religious person to try to visualize fully and deeply the world in which there is no God, and to, without excuses, come to the full understanding that if there is no God then they will be better off believing there is no God. If one cannot come to accept this on a deep emotional level, one will not be able to have a crisis of faith. So you should put in a sincere effort to visualize the alternative to your belief, the way that the best and highest skeptic would want you to visualize it.

Also,

it is wisest to regard our past selves as fools beyond redemption— to see the people we once were as idiots entire. […] As long as we are making excuses for the past, trying to make it look better, respecting it, we cannot make a clean break.

This obviously applies to non-religious beliefs, too.

A lot of discussion about politics:

Politics is an extension of war by other means. Arguments are soldiers. Once you know which side you’re on, you must support all arguments of that side, and attack all arguments that appear to favor the enemy side; otherwise it’s like stabbing your soldiers in the back—providing aid and comfort to the enemy.

This is why I have a hard time watching documentaries on politicized issues. They are usually done by a proponent of one side of the issue, and they will lie through their teeth to make their side look good. In reality, it’s rarely the case that one proposed solution has no downsides:

If you defend yourself, you may have to kill. If you kill someone who could, in another world, have been your friend, that is a tragedy. And it is a tragedy. The other option, lying down and dying, is also a tragedy. Why must there be a non-tragic option? Who says that the best policy available must have no downside?

A big thing Yudkowsky is wrong about is Bayesianism and positivism. The central idea of positivism:

Above all, don’t ask what to believe—ask what to anticipate. Every question of belief should flow from a question of anticipation, and that question of anticipation should be the center of the inquiry.

As David Deutsch points out, this very idea, that only beliefs about physical observables are in any way relevant, is itself a belief that is not about physical observables, shooting itself in the foot.

The Bayesian comes through here:

Rationality is not for winning debates, it is for deciding which side to join.

Sometimes, all sides are wrong, and only a new idea will help, yet Bayesianism is very quiet on the topic of the generation of new hypotheses. Again, see David Deutsch for more arguments why science is not Bayesian.

Becoming Steve Jobs

Walter Isaacson’s biography was badly received by the cognoscenti, so I didn’t bother reading it and can’t draw any comparison’s to Brent Schlender and Rick Tetzeli’s “Becoming Steve Jobs”, but I can definitely recommend the latter.

This is not a collection of anecdotes about Jobs, although it does have its fair share. It’s the story of a person who learned and changed.

The story starts with the visionary product creator who led the development of the Apple II and the Macintosh but failed to command the then much larger company Apple, ultimately being fired by the board. With NeXT he created yet another technological masterpiece but failed to bring the company to success, for a variety of reasons, while at the same time helping Pixar become what it is today. He emerged having learned from his failures, taking the helm at Apple again, but this time succeeding spectacularly, having become the Steve Jobs he’s remembered for now.

What stayed most with me from this book was Jobs’ ability to completely change his mind, very quickly, and damn the sunk costs:

On the car ride over to the prototype hangar, Johnson told Steve that he thought they’d gotten it all wrong. “Do you know how big a change this is,” Steve roared. “I don’t have time for this. I don’t want you to say a word to anyone about this. I don’t know what I think of this.” They sat for the rest of the short ride in silence.

When they arrived at the hangar, Steve spoke to the assembled group: “Well,” he said, “Ron thinks we’ve designed our stores all wrong.” Johnson waited to hear where this line of thought would go. “And he’s right,” said Steve, “so I’m going to leave now and you should just do what he’s going to tell you to do.” And Jobs turned around and left.

Stephenson doesn’t help by calling the world outside the monasteries “saecular”, when it clearly isn’t secular. It takes a while to get used to his invented words here, but it’s worth it. ↩

Often, when working on private branches with long histories, it so happens that a commit somewhere in between breaks the build, or some tests don’t pass anymore. If you know which commit it is you can fix it with git-rebase. But maybe you don’t know. Maybe you aren’t even sure whether all the commits in your history build and pass the tests.

Let’s say you are on a private feature branch with quite a few commits that you are unsure about. The last one that’s been pushed is last-pushed-commit. To check whether your code compiles and passes all the tests you can run make check, but you don’t want to do that manually for each commit, so let’s automate it:

git polish-history --test="make check" start last-pushed-commit

git-polish-history will now go through all commits since last-pushed-commit and run make check on them. On the first one that fails it will stop and tell you to fix the problem. You can do this any way you want as long as you commit your changes. Typically you’d amend the last commit. Once you’re done, you do

git polish-history continue

and the process continues. It will stop again whenever a commit fails the test, when a merge conflict occurs, or when it’s done. If you want it to do the committing of your changes automatically, use

In my first post on Swift I expressed mild bewilderment at Apple’s decision to make protocols generic not by giving them generic type arguments, like C# does, but via what they call “associated types”. To reiterate the example, the Sequence protocol in Swift is defined as

whereas if Swift had generic type arguments in protocols (as it does in classes), the definition would look more like

protocol Sequence<T> {
func generate() -> Generator<T>
}

When playing around with a set implementation that Alexis Gallagher posted, I came across a problem with associated types: it exposes more type information than the user of a class needs, and worse, more than the user of a class should have.

The type that generate is actually required to return (to adhere to the Sequence interface) is Generator. It’s not possible, however, (as far as I can tell) to declare generate as such:

func generate() -> Generator

The reason being that Generator here is under-specified—it must be a Generator whose associated type Element is KeyType, but that can’t be specified. If Swift had generic type arguments for protocols, it would simply be Generator<KeyType>.

The problem here is that the set type needs to publicly expose an implementation detail, namely the concrete type by which it implements its Generator. If that implementation detail changes, the method’s type changes, and code might break.

This problem is exacerbated by the fact that users of the generator might have to depend on its type. As an example, look at my simplification of Alexis’ code:

The body of the function explains the weird GeneratorType: The dictionary has a keys property that is exactly what I need—a sequence of all the elements in the set. The type that I’m returning now is the generator type of that sequence, instead of my own, and I must specify the concrete return type. The only way to get the concrete type, however, is to look up the type that keys returns, which means that I now depend on an implementation detail of Dictionary, and, making matters worse, I’m surfacing it to my users.

For comparison, let’s look at how this would work it Swift had generic type arguments for protocols. Sequence and Generator would be defined as

InterposeGenerator is a little state machine that alternates between producing either an element of the original generator or the separator. A bit of care is required to make sure that it doesn’t produce the separator again after the generator is exhausted.

One noteworthy thing here is the invocation of self.gen.next() in init. If we try to call next on the constructor argument gen instead of the property self.gen we get a compiler error:

immutable value of type 'G' only has mutating members named 'next'

This makes sense: the Generator is a value type, and it’s passed in here as an argument, which apparently makes it immutable, and next is a mutating function. At least, that is how it would typically happen—the compiler doesn’t seem to have any information to that effect, though. Generator is a protocol, which can be implemented by reference as well as value types:

protocol Generator {
typealias Element
func next() -> Element?
}

As you can see, next is not marked as mutating, either. I suspect there are two things going on here: the protocol definition that Xcode shows is probably incorrect, in that the mutating modifier is missing. The definition that Swift actually uses contains the modifier. In addition, the compiler doesn’t know whether the Generator we are using to create InterposeGenerator is a value type or not, so it must assume that it is, for the sake of safety in the case of mutating functions.

Let’s move on to the Sequence. It doesn’t really have to do anything other than creating an InterposeGenerator when asked to, so apart from all the type fiddling involved it’s rather simple:

Swift is now happy with the type of the function, but unfortunately it doesn’t let us instantiate the array. Instead it thinks we’re trying to subscribe S.GeneratorType.Element, which isn’t possible, because it’s a type and not a value. As far as I can tell, the only way to get around this problem is to introduce a second generic argument, which we can force to the correct type via a where clause:

Now and then we get bug reports of test cases crashing on a specific configuration in an indeterminate manner or frequency, both from our own build bots as well as from our users. One of those I looked at recently involved a runtime test case on Linux/AMD64. The test would only crash infrequently, but luckily I could reproduce it in a VM.

What the program does is run two threads, one of which causes null reference exceptions in a loop over and over again:

Mono generates null reference exceptions by just letting the code trigger a segmentation violation signal, which we catch in a signal handler that then proceeds to throw the NRE. This approach, as opposed to doing a null check whenever we access an object, is that there is no overhead at all in the case the reference is legit, and the code we generate is shorter.

Garbage collection, on the other hand, also involves signals: stopping the world for a collection on Linux is done by sending a signal to each managed thread. Each thread catches its signal, which tells it that it has to suspend until the GC has done its work.[1]

When we look for the cause of a bug like this the first thing we try to do is catch the crash in a debugger, in this case GDB. Unfortunately, the crash was caused by a segmentation violation that our signal handler cannot handle (because it doesn’t occur in managed code), but a regular run of the program also triggers segmentation violations like crazy, so just breaking at each SIGSEGV didn’t get us anywhere.

My first attempt at getting to the crashing SIGSEGV was to write an Expect script that drove GDB to the first SIGSEGV, presumably caused by the null reference in the managed code loop, remember the offending address and then keep continuing until it hit a SIGSEGV at a different address. Sadly (and quite typical of cases like this), running the code that way in GDB never succeeded in triggering the bug, so we had to think of other options.

At first we thought the crash was caused by the interaction of the two signals (the segmentation violation from the null reference and the stop-the-world signal), but we looked into other hypotheses. We tried two different ways of getting rid of the segmentation violation: using explicit null checks to trigger NREs and just throwing an exception manually without accessing a null reference.

Explicit null checks (enabled with the environment variable MONO_DEBUG=explicit-null-checks) makes the JIT emit less efficient and more elaborate null checks whenever an object is accessed, so an NRE can be thrown without going through a SIGSEGV.

The other way to get rid of the SIGSEGV while still retaining most of the code’s behavior is to just throw an exception, like so:

We figured it probably didn’t matter which type the exception was, and indeed it did not: both with explicit null checks as well as with the manual exception throw the crash still occurred. We also looked into whether invoking the garbage collector was critical to the crash, and it turned out it was: just making the other thread sleep instead of calling the GC made the crash go away. Also, despite what the bug report claims, the crash was not dependent on SGen—it occurred with the Boehm collector, as well.

Armed with this new data we set out to finally get a close look at the crash in the debugger. Since the crash still only occurred very infrequently I brought my Expect script to bear again, and after a while we did get the crash in GDB. It did, regrettably, give us less information that we had hoped: instead of crashing somewhere in unmanaged code while dereferencing an invalid pointer, the crash resulted from the program counter being set to an invalid address, namely zero.

There are three main ways of setting the program counter on AMD64: CALLing a function, RETurning from a function and JMPing to an address. Spending some time looking at the stack dump it seemed that the most likely cause was a jump, which did not make our job easier, since it leaves no trace of where it came from.

After some more tinkering we decided to investigate whether the point at which the exception throwing thread was interrupted by the GC played any role. We got lucky—it so happened that every time the test crashed, the last time the thread was stopped was always in the exact same piece of code; even more, at the exact same instruction. This is the code in which the thread suspension occurred, with the instruction where it happened labeled:

This piece of code is executed as the last step of throwing an exception. It jumps to the catch handler, which involves first restoring a bunch of registers, including the stack pointer, and then doing the actual jump. We guessed that this jump to register r11 was the one that landed us at address zero. Observing in a few more crashes that r11 was indeed always zero corroborated that hypothesis.

The code that calls this restore_context() function looks like this:

void
mono_amd64_throw_exception (MonoObject *exception, ...)
{
MonoContext ctx;
... figure out where on the stack the matching catch clause is ...
... set ctx to the values required to end up in that catch block ...
restore_context (&ctx);
}

Note that the context—the data structure that contains all the values for the registers, including the program counter—is a local variable of this function, i.e., it resides in this function’s stack frame. Said stack frame is on the same stack as the catch handler that it eventually dispatches to, but further down, of course. We pass the address of the context to the above restore_context() function.

It seemed obvious now that for whatever reason the value for the program counter was set to zero in the context, so we added an assertion to check for that directly before the call to restore_context(). The assertion never failed, but the program still crashed.

Looking at the assembly code again Rodrigo suddenly proclaimed that he knew exactly what caused the bug: after setting the stack pointer to the catch handler’s frame we did another load from the context, which we weren’t allowed to do. By moving the stack pointer upward the context ended up lying not above the stack pointer, where it used to be, but below:

Most ABIs specify that programs must assume that data below the stack pointer (plus the red zone) is subject to arbitrary change because it’s used by the operating system for invoking signal handlers, for example—exactly what happened in our case with the thread suspend signal.

The correct way to do the context restore, then, is to load everything from the context before restoring the stack pointer. In our case we use the scratch register r8 to hold the value that will end up in rsp:

Looking at our code for x86, after fixing AMD64, we found that we had the same bug there, but nobody had reported it yet, probably because it was harder to trigger.

Reality is a tiny bit more complicated: Sending a signal to a thread is asynchronous, meaning that the GC has no direct way of knowing that a thread has already gotten the signal and reacted to it. Instead, the suspend signal handlers message back to the GC that the point of suspension has been reached, upon which they actually suspend, waiting for yet another signal from the GC, telling them to resume. Note that if the resume signal arrived between the point where the GC is notified of the suspension and the point of initiating the wait for said resume signal, the signal would be lost and the thread would never resume. ↩

Assume two people want to split a pizza in a fair way. In our context, fair means that if both people follow the procedure, each person gets half a pizza, or, if not, only has themselves to blame. The obvious answer seems to be “split it in the middle”, but that leads to all kinds of further questions and complications: Who does the splitting? How is “the middle” determined? Even if we had protocols for that, it would still leave the possibility of one person not being satisfied with how they were carried out—sloppy measurements or sloppy cutting.

The standard way to avoid all such problems is to let one person cut the pizza in two halves, and the other person choose who gets which half. If the cutter thinks they got the smaller piece it’s because they didn’t cut fairly or precisely enough, and if the chooser thinks their piece is smaller, they should have chosen the bigger one.

What if there are three people? Before we try to come up with a procedure, let’s consider people’s motivations first. In the case of two people, whatever one person wins, the other loses. With three people, one person might elect to get a smaller piece for themselves in order to also make one of the other two worse off. Or, more realistically, two people might collude against the third. We’d like a procedure that ensures that even under those circumstances, fairness is guaranteed.

One generalization of the two-people procedure that comes to mind is to let one person—let’s call them “A”—cut, and then let the two others choose, first B, then C. Unfortunately, if A cuts in such a way that there is one large piece and two pieces that are each less than 1/3, and B chooses the large piece, C doesn’t get their fair share (neither does A, but that’s A’s fault). Randomizing the order of cutter and choosers is not a solution, either, because there is still the possibility of somebody getting the short end of the stick.

Here’s a procedure that works: A cuts the pizza, B chooses their slice. C has two options: they can either choose one of the remaining two slices, or they can call shenanigans. In the latter case, B picks a second slice[1], then both slices are put together and C divides them into two halves. B picks one half, C gets the other.

To see why this procedure is fair, let’s consider each person in turn. A will always get the piece that B and C didn’t pick, so if A cut fairly, they get their fair share. B will either get the piece they pick, or at least half of two pieces they pick, so if they pick the two largest pieces, they will get at least one third of the pizza. C will either get the piece they pick, or, in the case of shenanigans, half of the two pieces that B picked. It’s not obvious that they will get at least a third in the latter case. We assume that C only calls shenanigans if the two pieces that are left after B picked one are both smaller than one third—if one piece is at least a third, C should of course just pick it, thereby getting their fair share. The larger of the two pieces will have size 1/3-x, the smaller one 1/3-x-y, where x>0 and y>=0. If the two pieces are the same size, y is zero. The piece that B picked therefore has size 1/3+2x+y. B now picks one of the smaller pieces. If B wants to minimize C’s share, B must pick the smallest piece, i.e.1/3-x-y, which then gets put together with B’s original piece for C to cut. The two pieces together have size 2/3+x, so if C cuts fairly, they will actually get a size larger than their fair share.

I don’t know whether this procedure can be generalized to a larger number of people.

Update: As many commenters have thankfully pointed out, I have solved a problem that smarter people have already solved for more general cases. For a good overview of a few methods, read this article. My personal favorite is the Last Diminisher algorithm. The method I came up with is a variant of the three-person Lone Divider procedure.

It is important that it is B, not C, who picks the second slice when C calls shenanigans. ↩

Introduction

In my last blog post about SGen, Mono’s garbage collector, I wrote about concurrent mark-and-sweep, which makes most of the garbage collection work happen concurrently with the running program.

One thing we cannot do in a concurrent collector, as opposed to a stop-the-world collector, is move objects around in memory[1]. This is typically not a problem in a mark-and-sweep collector, but SGen goes beyond basic mark-and-sweep and prevents fragmentation by evacuating objects when the need arises.

Fragmentation is no less of an issue in a concurrent collector, so being unable to do evacuation is problematic. Our best solution at this point is to stop the world when it happens[2]. The threshold we use to decide when to do evacuation with concurrent mark-and-sweep is higher than on our non-concurrent collector, but there is still a chance that a full collection pause will hit the program when interactive speeds are desired.

The API

We’re introducing a new API in the Mono.Runtime class. The application can use it to ask the GC not to do full major collection pauses:

One possible use case would be a game, where you want the lowest latency possible while the player is in a level, but between levels, or during loads, a longer collection pause is not much of an issue. Note that there are still no guarantees about the lengths of the pauses—this is a best-effort approach, not a real-time garbage collector.

This API will be available in the next Mono release, but only via reflection. It will be a public API on mobile soon.

This is technically possible, but it involves using a read barrier, which would not only be forbiddingly complicated to implement in Mono, but would also probably incur a significant performance hit as well as break our embedding API. ↩

An approach that would keep marking mostly concurrent would be to decide which blocks to evacuate before starting concurrent mark, and then during marking to keep tabs on the references from already marked objects to objects within those blocks. In the finishing collection pause we could evacuate the marked objects in those blocks, but we’d have to update the references to them. Depending on the workload that might still require a significant pause. ↩

Introduction

Like most garbage collectors, SGen is of the stop-the-world variety, which means that when the time has come to perform its work, it stops all managed threads in their tracks and only resumes them once it has finished.

Nursery collections are usually relatively short affairs, and can be sped up even further by using a smaller nursery size. Major collections, on the other hand, can take a long time because they must scan every single live object in the system. Long garbage collection pauses are clearly a serious problem for many applications, be they interactive or on the server, and garbage collector engineers make considerable efforts to shorten pause times.

One approach to this is to do parts of the major collection concurrently with the running program, aka the mutator. SGen’s major collector, mark-and-sweep, consists of two main phases, unsurprisingly called mark and sweep.

Thanks to Zoltan’s recent work—lazy sweep—SGen already does most of sweep’s work outside of major collection pauses. I have been working on the other part of the story for the past few weeks: concurrent mark.

How it works

A major garbage collection starts with an initial marking pause, during which all roots are scanned.[1] Then it restarts the world and in the background a worker thread performs the marking and scanning of major heap objects, in the usual manner. When the concurrent marking phase is done, the collector stops the world again to finish up the collection in the final marking pause.

Dealing with concurrent mutation

Consider the following code, which manipulates a linked list, swapping the order between the two nodes following a:

Let’s assume that right after a->next has been set in line 3, the concurrent worker thread scans and marks the node referenced by a, then follows next and processes c. If at that point our program hasn’t come around to setting c->next to b in line 5, the garbage collector will never encounter b—it will have been lost because for a brief moment that node wasn’t part of the list.

We can recruit the help of our friend the write barrier to deal with this issue, but we need to extend its scope a little: instead of just recording stores that point from the major heap to the nursery, it must now also record stores pointing from the major heap to the major heap. In other words: all reference stores that happen in the major heap, no matter where they point to, must be recorded.[2]

With this new write barrier the store to c->next will record the node c as requiring a re-scan. This re-scanning of all modified major heap objects happens in the final marking pause, so it’s not further disturbed by the mutator.

What if the final marking pause happens before c->next is set in line 5, however? The node b might not be pointed to by any object in the heap, so we won’t find it that way, yet it is still live. It’s live because it’s referenced by the variable b on the stack, so the final marking pause must also re-scan all roots.[3]

Nursery collections

Nursery collections that happen while concurrent mark is running stop the world, but concurrent mark keeps chugging along. It never follows references into the nursery anyway, so there is little potential for conflict.

There is an issue with the card table, however: nursery collections clear the card table, but its data is required during the final marking pause, which might only occur a few nursery collections later. The solution to this is to maintain a separate card table that accumulates the marked bits from the “regular” card table, and to use this union table in the final scan.

Results

This graph shows maximum pause times for a few of the programs in our GC benchmarks suite. All of the pause times are relative to Boehm’s, so values below 1 are better than Boehm’s. Note that the values correspond to the longest stop-the-world pause that occurred when running the program, not the median or some average.

The three SGen configurations are standard mark-and-sweep in blue, concurrent mark-and-sweep in green and concurrent mark-and-sweep in conjunction with cementing, in yellow.

The programs represented in this graph are those whose maximum pause times with Boehm were longer than 10ms. The ones with the longest pause times are on the left—on graph4 Boehm had a pause of 2.5s—which are those that benefit most from concurrent mark, as expected: the longest pause on graph4 is less than 180ms with concurrent mark. I have not yet investigated why graph8 is such an outlier.

For concurrent mark the roots not only include the usual suspects like thread stacks and global variables, but also nursery objects. In fact, the initial marking pause performs a nursery collection, after which only pinned objects will be left in the nursery. Those are treated as roots, or rather the references they contain are. ↩

Since we use a single card table for both kinds of references, this will increase the number of cards marked that don’t have references pertinent to nursery collections. The card scanning overhead will therefore increase, but in practice this doesn’t seem to be a problem. An alternative would be to use two separate card tables, one for references to nursery objects and the other for references to major heap objects. That would be more expensive both in terms of memory usage as well as in write barrier performance. ↩

This entails doing a full mark-scan loop, of course, which might still take a very long time in pathological cases. Imagine a program that keeps a huge linked list around, referenced only by a single object on the major heap. At some point concurrent mark kicks in. The program now loads the reference to the list into a local variable and nulls the reference on the heap. The garbage collector will not be able to mark the list until it scans the stack in the final marking pause, and will now have to traverse the list with the world stopped. ↩