
Hardest bug I ever solved

Most people probably have that one hard bug they spent far too many hours trying to solve. Almost 1.5 years after solving my hardest bug I got nostalgic and thought "Hey, this was pretty cool! I should write about it so I don't forget it". Since it was so long ago there might be things I misremember, but the gist of it should still hold.

Let's jump to it! Presenting: "The curious case of the crashing compiler".

Compiling the following code works in my standard GCC but causes another GCC to crash:

From my IRC logs I know that while debugging this I found and fixed a bug in how our binutils port handled the so-called TLS (thread-local storage) relocations, but I can no longer access the commit since all patches were squashed together in preparation for upstreaming. What I have is this line:

< blueCmd> stekern: the human readable version: I think it was a race depending on which relocation was read first and that TLS and non-TLS locations would change the sreloc pointer in the section back and forth
Alas, now the linker works, but GCC crashes instead.

After some more printouts I arrived at the conclusion that mpfr_cache was being called with a totally bogus value for x, and that the pointer __gmpfr_cache_const_catalan was something very obviously broken (like 0x1). Inside mpfr_const_catalan, however, the values looked fine.

A thread-local variable can be resolved in a bunch of different ways depending on how "close" the compiler can assume the real value is. The furthest one requires a call into our C library, to a function called __tls_get_addr, which resolves the pointer to something that is valid in the currently running thread. Since this only happens when TLS is used, we're now pretty confident that the bug is somewhere in the TLS code.
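To make "valid in the currently running thread" concrete, here is a minimal sketch in C, not taken from the original code. The names tls_counter, worker and tls_copies_are_separate are made up for illustration, and it assumes a C11 compiler with POSIX threads. Whether an access like this ends up as a call to __tls_get_addr depends on the target and on the TLS model the compiler picks (typically the general-dynamic model used for position-independent code).

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical example variable: every thread gets its own copy. */
static _Thread_local int tls_counter = 0;

struct tls_report {
    uintptr_t address; /* where the worker thread's copy lives */
    int value;         /* value the worker saw after incrementing */
};

/* Runs in a second thread and touches only that thread's copy. */
static void *worker(void *arg)
{
    struct tls_report *report = arg;

    for (int i = 0; i < 3; i++)
        tls_counter++;

    report->address = (uintptr_t)&tls_counter;
    report->value = tls_counter;
    return NULL;
}

/* Returns 1 if the two threads used separate copies, 0 if they did not,
 * and -1 if the thread could not be created or joined. */
int tls_copies_are_separate(void)
{
    struct tls_report other = {0, 0};
    pthread_t thread;

    tls_counter = 100; /* the calling thread's copy */

    if (pthread_create(&thread, NULL, worker, &other) != 0)
        return -1;
    if (pthread_join(thread, NULL) != 0)
        return -1;

    /* The worker started from 0, and our copy is untouched. */
    return other.value == 3
        && tls_counter == 100
        && other.address != (uintptr_t)&tls_counter;
}
```

Calling tls_copies_are_separate() from main should return 1: the same source-level name resolves to a different address in each thread, which is exactly the lookup the compiler and C library have to get right.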

Let's have a look at how mpfr_const_catalan is generated. On OpenRISC 1200 the calling convention is to place the first argument in r3. Remember that we want to pass the first argument straight through to mpfr_cache, so we shouldn't touch it. This is how the assembly was generated (I removed a lot of unrelated stuff):

Note: OpenRISC 1200 has a delay slot (indented) after jumps, so you need to think out-of-order: the instruction written right after a jump still executes before the jump lands.
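If delay slots are new to you, the toy model below may help. It is a sketch in C with a made-up three-instruction machine (OP_SET, OP_JUMP, OP_HALT are invented for illustration), not real OpenRISC encoding or semantics beyond the one rule that matters here: the instruction in the delay slot runs before control arrives at the jump target.

```c
#include <stddef.h>

/* Invented mini instruction set, for illustration only. */
enum op { OP_SET, OP_JUMP, OP_HALT };

struct insn {
    enum op op;
    int reg;       /* OP_SET: destination register index */
    int value;     /* OP_SET: constant to store */
    size_t target; /* OP_JUMP: index of the instruction to jump to */
};

/* Executes prog until OP_HALT or the end of the program. A jump first
 * executes the instruction that follows it (the delay slot) and only
 * then transfers control. Only OP_SET is modelled in the delay slot. */
void run_with_delay_slot(const struct insn *prog, size_t len, int *regs)
{
    size_t pc = 0;

    while (pc < len && prog[pc].op != OP_HALT) {
        const struct insn *cur = &prog[pc];

        if (cur->op == OP_JUMP) {
            if (pc + 1 < len && prog[pc + 1].op == OP_SET)
                regs[prog[pc + 1].reg] = prog[pc + 1].value;
            pc = cur->target;
        } else {
            regs[cur->reg] = cur->value;
            pc++;
        }
    }
}

/* Returns the value of r3 as seen at the jump target. */
int delay_slot_demo(void)
{
    const struct insn prog[] = {
        { OP_SET,  3, 1,  0 }, /* 0: r3 = 1                      */
        { OP_JUMP, 0, 0,  4 }, /* 1: jump to 4                   */
        { OP_SET,  3, 2,  0 }, /* 2:   r3 = 2  (delay slot)      */
        { OP_SET,  3, 99, 0 }, /* 3: r3 = 99   (skipped)         */
        { OP_HALT, 0, 0,  0 }, /* 4: target, r3 is the "argument" */
    };
    int regs[8] = {0};

    run_with_delay_slot(prog, sizeof prog / sizeof prog[0], regs);
    return regs[3];
}
```

Here delay_slot_demo() returns 2, not 1: the write to r3 sits after the jump in program order, yet the jump target still sees it. When reading the generated assembly, a write to r3 in a delay slot therefore changes what the callee receives as its first argument.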

At this point I had a clear picture of what was wrong, but as I expressed it a few hours after finding this:

< blueCmd> I honestly have no idea how to solve this
This feeling of helplessness had never struck me before. I had no one to ask; nobody knew the system better than I did. This was new to me, and I was very excited and frustrated at the same time. Fast forward a few days of bouncing ideas around and reading old, seemingly maybe-related-but-probably-not GCC bug reports, and stekern and I arrived at the conclusion that the code was probably emitting the library call to __tls_get_addr at the wrong stage of the code expansion (remember, this is not a normal function call; it is inferred from accessing a specific pointer) and was therefore messing up the register scheduling.

Moving the so-called "address legitimization" to an earlier stage (sounds easy but is a *pain*) and removing some parts where I had tried to be clever fixed the issues and made the generated code use sensible registers. The clever part was that I thought I was helping GCC by telling it "you can use the same register multiple times at this point", which caused GCC to make some weird optimizations that were very broken. Simply letting GCC schedule that by itself solved the problem.

Special thanks to stekern for all the help and idea bouncing through all of this! GCC code is not easy to reason about, or as stekern and I put it:

< blueCmd> gah, gcc's damn .md files - they are confusing as f*ck

< stekern> you don't say

< stekern> I can't count the number of times I've went "ah, now I get it! ...no, I don't" =)

< blueCmd> "error: unrecognizable insn:" *sigh*

When asked "What is the hardest bug you ever solved?" I will probably think of this for a long time.

