Slashdot videos: Now with more Slashdot!

View

Discuss

Share

We've improved Slashdot's video section; now you can view our video interviews, product close-ups and site visits with all the usual Slashdot options to comment, share, etc. No more walled garden! It's a work in progress -- we hope you'll check it out (Learn more about the recent updates).

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

Either are equally bad from the perspective of a software developer who spends a month trying to work out just exactly what is wrong with their code, especially if something like this occurs on a test machine but not on a development machine.

I found out about the division bug as a beginner programmer! I was trying to write the first MMORPG using Quick Basic. I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close. It fixed it, but new programmers shouldn't be forced to deal with stuff like that.

I've preferred AMDs to Intels because AMD was one of the first sponsors to Esports back in 99. Too bad Columbine happened and I suspect they wanted to distance themselves from Quake tournaments. Another thing I like about AMD was that their processors don't melt if they get hot because they have a self preservation shutdown mode. People said Intel had this, but I melted a processor just a few months ago on SWTOR.

Heh. I coded a nice tile based RPG out of it, but I couldn't make it MMOG because there is no socket code in Quick Basic. The trick to making big games in Quick Basic is to write your own Virtual Disk so you can get past the 640k memory limit. Once you have a virtual disk, you can write an interpreted language inside Quick Basic, then your code is simply loaded up in a custom database. I rewrote the whole thing in C/C++ because people told me I could get socket libraries in it, but I gave up on my game entirely when Ultima Online came out because I felt I wouldn't be able to build up a market because my graphics are so bad. I was partially right in thinking there is only enough room for one MMORPG at a time back in 97, but I think I shouldn't have gave up after having coded for thousands of hours with things like Farmville succeeding today.

What is the problem with Quick Basic? It came for free and it was quite ok.

NO it did NOT. What came free was QBasic, which was a stripped-down version of Quick Basic. The full Quick Basic did not have the 640k memory limitation, was able to fully link/compile stand-alone executables, and had a host of other Professional features that QBasic lacked.

Don't get me wrong- QBasic was great for a free environment (at the time). But it was severely limited, and all the references to "quick basic" in this thread appear to be referring to shortcomings in QBasic, which were not present in Qu

This is not insightful; it's wrong. Floating point on any modern system conforms, or at least is intended and assumed to conform, to IEEE 754. There are exact answers specified for every basic arithmetic operation and non-transcendental functions. Of course there are decimals that have no representation in binary, but 4.0 is not one of them.

Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is
(1)^s × c × b^q
where b is the base (2 or 10). For example, if the sign is 1 (indicating negative), the significand is 12345, the exponent is 3, and the base is 10, then the value of the number

I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

I remember division not being exactly accurate, so the solution I needed to use was to round up and down results that are really close.

Just so you know, division is never accurate in floats, even when the CPU doesn't have bugs. If you're using doubles you'll get better accuracy, but with a 32 bit floating point number, you shouldn't be surprised to find errors in the third digit after the decimal point.

Just to be clear its not limited to division. Hell, errors can creep in just by converting a decimal number to floating point. This is why calculators use decimal arithmetic, well some of them - like Perpenso Calc for iPhone iPad [perpenso.com] RPN Scientific Stats Business Hex. Try "0.5 - 0.4 - 0.1" in your favorite calculator app, it might indicate whether the app is using the FPU or decimal arithmetic. Of course the app may be doing something naive like the "BASIC MMORG", rounding results. Its naive because it is anoth

Division is division, regardless of the base used. The issue is that in base 10 (aka decimal numbers), division by 2 and 5 always comes out to a finite decimal; in binary numbers only division by 2 comes out to a finite decimal. Dividing by any primes other than 2 and 5 (and numbers involving those primes) will require rounding in both bases (and they may not necessarily round the same way). That is, unless you're only dividing by combinations of 2 and 5, there really is no preferred base.

Just because you find an error in a division when you were programming your MMORPG in visual basic doesn't mean you've found the pentium bug. If you noticed it happening a lot, it probably wasn't the bug, just normal IEEE precision issues.

Anyone who's programmed long enough has found unexplainable bugs that are eventually traced down to some bad hardware.:)

I've preferred AMD over Intel for years. Long ago, in a distant computer store, far away.... We sold 386s, 486s, and Pentiums (or their reasonable clone) from Intel, IBM, AMD, and Cyrix. At the time, I didn't really care who made the chip, they were just built out for the customer.

Over the years, I learned to prefer AMD for both the price and performance. Plenty of people will argue "but this Pentium is faster than that AMD". Well, it's all nice, but I don't *have* to stay bleeding edge. I never liquid cooled my CPU, video card, and memory. Friends did. I was always impressed with how much they wasted. I'd just wait 6 months or so, and get something better, faster, and cheaper.:) I do like having a high performance computer, so I upgrade every year or so.

For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99. For the difference in price, I could build out a 3rd server, and still have money left over. Toms hardware suggests the i5-2500K (4 core, 3.7Ghz turbo) for $224.99 or i7-2600K (4 core 3.8Ghz turbo) for $324.99 as comparable. If I wanted to spend a little more, I could have gone with the AMD FX-8150 (8 core, 4.2Ghz turbo) for $249.99. Was $50 for.2Ghz worth it? Not really. Something bigger, better, and faster will be out next year, and the year after, and then I'll buy something new.

I used newegg.com for all the prices, so it would be fairly even.

The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

My desktop/gaming machine still has a Phenom IIx6 1100T in it. All the games I play, I can leave all the settings turned all the way up. Maybe if I ran benchmarks, I'd see something else gets a slightly faster frame rate, but I can't see any difference. As we all know, various benchmarks show different things.

Pardon my curiosity, but it sounds like you're building ~$400 servers out of basic desktop components. What kind of workload are you putting on these boxes that scales so well, yet doesn't justify the added expense of high-end server class hardware ? Maybe I'm at the other end of the spectrum, but I wouldn't dream of running a server without redundant power supplies and premium boards that have been built and tested to rigorous specs. The added hardware expense more than makes up for decreased maintenanc

Like a RAID, but then a RAIS (Redundant Array of Independent Servers). Load distribution may be an issue as it has to seamlessly reassign tasks when a server is down for whatever reason. But for sufficiently large operations (five servers or more) this sounds to me like the way to go. Instead of trying to make every individual server highly reliable, go with the still very reliable user-grade stuff and get your reliability by redundancy. And companies like Google need more than one server anyway.

I used to work for a guy who built servers out of whatever spare parts he had lying around - obsolete desktops, refurbs, ebay junk, whatever. For a while, we were spending at least 10-15 hours a week keeping those things up, or driving down to the datacenter to physically reboot them.

YMMV but I've got junker machines which have been running as servers 24/7 for years without a glitch. I'm just about to replace a few of them with Intel Atom boxes to save power/eardrums and I'm worrying about the reliability of the new machines. I'm going to keep the old machines lying around for at least a couple of months.

Well, $604.94/ea. The memory came with an 8GB Class 4 micro SD, and we got a $10 newegg gift card each. I forget what that was bundled with. If you consider a gift card as cash, they were under $600/ea.

All of those are quantity 1, except the hard drives. They are 8 core 4Ghz (always running in Turbo mode), with 16GB ram, and RAID 1 on the drives. I opted to go more like Google's topless server. I used cable ties to mount up everything on wire racks from Home Depot. Ya, the same plastic/rubber coated ones you'd use in your closet. This is serving out of my house on a business FiOS line, so no one at a datacenter can complain.:) They're running amazingly cool. Because there's nothing interrupting normal convection air currents, all the heat sinks and drives are cool to the touch. They're a bit quieter than my desktop PC, because I don't require an extra fans to pull the hot air out of the case. My regular desktop has a 250cfm fan on it to keep it cool. Without it, and with the side on, it can overheat in a few minutes when gaming.

The room does have an air conditioning return in it, which helps keep the room cool. The only fan I added was a HEPA filter. It's oversized for the room, but it'll help keep dust off the machines. The room is the same temperature as the rest of the house, so I'm happy with it. It serves no purpose for cooling the machines, since it's not even pointed at them.:)

I have some pretty low load servers. Rather than buying a dozen of anything, I opted for using virtual machines. These two servers are hosting 4 VMs at this time, and there will be more. It's a young setup, and I have a lot of work to do on it. I opted to use VirtualBox. It works very well. I had intended trying VMWare ESXi or Citrix XenServer. unfortunately, neither would use the crappy software RAID that the boards provide, and I wasn't willing to drop money on real RAID controllers. I looked around a bit, and it seems that you can try to use some workarounds, but I didn't have the time or inclination to do it, where I could have VirtualBox going in less than an hour.

The VMs are redundant between servers. Further on, you can read more about how I did it in the past between physical boxes. So if a single VM crashes, who cares. If a VM host crashes, well, it's reduced redundancy, but I'm still operating. I'm going to put out more VM hosts, and increase the redundancy. 4 machines with 6 VMs each is like 24 physical boxes. That's a serious savings, especially where the VM host costs about $600.

Let me give you a little history.:)

Long before Google made the pictures of the way they do servers, the company I was at was using COTS parts. That was voyeurweb.com (NSFW). They were hosting with a company not to be named (as in, I can't remember), who sold them on a $50k investment of a Sun server. They promised it was more power than anyone could ever want. That lasted about 3 days. It was after this, I got involved with them. We dropped about $15k on 10 servers. They were fairly cheap machines. Asus gaming motherboards, AMD K6/2 300 CPU, 512MB RAM, 8GB and 20GB IDE drives. The most expensive part at the time was the cases. It was pretty much what you'd be using at home at the time.

We had the occasional failures, but they were usually due to load or CPU fan failures. At the time, they had under 1 million daily viewers, so we could handle that load on 4 of the 10 machines. Load balancing was done with DNS round robin. I know people say it's a poor system, but it worked well. There was typically a 3 second delay if you happened to hit a bad server, and then you'd roll off to the nex

It got more complex as it grew. This is all from memory, so there may be some inconsistencies with what I'm saying.

The main site primarily used 3 subdomains. There were also 3 pay sites. For a while there were 6 machines for video streaming. The video streaming site went away, mostly due to cost vs income. We added free hosting, which was small to start, and picked up popularity and grew to dozens of servers. There were side projects launched on their own servers. Some worked. Some were dropped.

Masterstats.com was an example of that. In function, it was similar to Google Analytics. It started as 1 DB and 2 web servers. Because of the load thrown at it, it became 3 DB, 4 web, and 2 offline calculation servers. The offline machines were just to process stats that were too intensive to do live, and created an undue load on the web servers.

Another example was our backups. If you go digging through my journal posts, you'll find me talking about multi TB arrays, back when the largest drives on the market were 250GB. Back in the first iteration, it was fairly simple to keep backups. That grew as we kept throwing more into it.

These are the rough counts just for the main sites.

Before I took over, there were 3 subdomains (www, voy, and ww3.) Each was on one server. If one server failed, that part of the site stopped. Needless to say, that was bad if (and when) www. went down.

4 servers in one city. All 4 machines had all 3 subdomains.

8 servers in two cities. That's 4 machines in 2 cities. Each set could handle the full load, in the event we lost a city. That meant each city handled 50% of the traffic normally, or 100% in a city failure. We could operate on 2 per city, but the extra 2 provided for redundancy.

When we scaled out to 3 cities, we split the 3 subdomains between groups of servers in each city. We also divided up the load equally, so each city got 33% of the traffic. Having a city fail increased the traffic to the other cities by 17%. The sites data existed on all the servers, so in the event of a failure of one server, we could distribute that load. So now we had 3 to 5 servers per group, in 3 cities. Newer hardware was faster, so those would have combined duties. Clearly a dual 1.4Ghz machine could handle more than a single 350Mhz machine. At this point, we had retired almost all of the 350Mhz machines, except for a few that were recycled to do DNS and other low-load tasks.

When we got to 5 cities (New York, Los Angeles, San Diego, Tampa, and Frankfurt), the loads got divided up differently. Frankfurt had one 100Mb/s circuit. San Diego had one 100Mb/s circuit. New York had two GigE circuits. Los Angles had 2 GigE circuits, which grew to 3 when we retired San Diego. Tampa had 2 GigE circuits. Frankfurt was retired, and the traffic was just added to New York. That barely made a blip on the bandwidth graphs.

Members sites had a warm friendly map so they could pick their city to view from. We looked at other options, but it was simple to give them a map to choose where to serve from, and they could pick another city any time they wanted. Forums had people discussing which ones were "faster". It was very subjective. People in the West would pick Los Angeles or San Diego, with most preferring Los Angeles. People in the East picked Tampa or New York. People in Europe said Frankfurt was too slow, and they had better access via New York or Tampa.

Each city had different load characteristics. Free hosting servers were deployed in 3 cities. There were some special case servers too. For example, where someone had a very high load "free hosting" site that made a lot of money via the AVS, could get their own server or servers.

For example, I just set up a couple servers from COTS parts. They used AMD FX-8120's (8 core, 4.0Ghz turbo) for $199.99/ea. It seems the comparable Intel is the i7-980 (6 core, 3.6Ghz), which is selling at $589.99.

Modded informative? Only on slashdot... Also you compare turbo speeds (and GHz is silly anyway due to the difference in IPC), yet say:

The servers actually use as many cores as I can throw at them, so it's extremely beneficial to have more cores at high speeds.

If all cores are 100% loaded, you're not going to get anywhere close to max turbo. That's the extra boost it can give if only one core is working.

It's too late in the evening for me to go chase down the Tom's Hardware link. I closed that tab a while ago.

As for the rest... It works. It works well. It's cheaper. I don't upgrade desktops or servers every day. No one does, unless you're filthy rich and don't know where to spend your money. If that's the cast, you can send me some of that via PayPal on my site.

This round of upgrades is replacing dual Opteron 1.4Ghz boxes, putting their respon

A floating point precision error. Floating points cannot represent quite a diverse collection of numbers, this is especially problematic when you're doing intersections with small objects. Say a ray projected from an object will, because of the minute errors in floating point, collide with the same object (which produces some cool patterns).

Floating points are kind of crappy. Not that I have a better option with viable performance on a desktop machine. That's not a division bug, that's just the nature of representing numbers in binary with a fixed number of bits.

Except I very much doubt that would solve whatever "problems" this guy was having. As a newbie programmer, it's entirely understandable that he wouldn't know about the fun you can (or can't) have with floating point operations. However, I very much doubt that sheer accuracy was the issue, rather he was probably making assumptions such as 1.0 - 1.0 == 0.0, when in reality the result isn't necessarily exactly 0.0. Considering it's an MMO, he probably had something like "Why is this guy not dying, he has 4 HP left and this attack does exactly 4 damage? Must be a bug!".Really, it doesn't matter a huge amount, if such "accuracy" is important to your game then instead of doing "if(Health is less than 0.0)/* die */", you do something like "if (Health is less than 0.0 + epsilon)/* die */", with "epsilon" being a very small number (such as 0.00000001).The real fun with floats, however, is that each platform does something different. It's possible that the OP ran the game on Intel hardware and got one result (which may have seemed more "correct"), then ran it on an AMD machine and got a different (seemingly less-correct) result - you can see why he naturally jumped to the conclusion that the AMD system had a bug.In reality, chances are both systems were "wrong" anyway, they just happen to use different implementations for floating-point logic. To solve this, once again higher rates of calculations aren't the answer, but rather there's a compiler switch (/fp:strict in VS) that will use the ISO standard floating point model. It's not as fast as the other methods, but you will at least game the same results across different platforms (assuming that CPU has implemented the standard correctly which these days is almost certain).

The same binary on the same platform (x86 in this case) will render the same results. That's part of the contract of the x86 ISA. However, unless you're using/fp:strict or equivalent in your compiler, just recompiling with ANY kind of changes can alter behavior.

No, you are mistaken because you didn't read my post (or any of the posts above it). We're talking about floating point rounding irregularities that are present in ALL modern processors, not the floating point bug you're referring to.

I did not take issue with the floating point irregularities. In fact, I also believe that the issues he experienced were not due to the FDIV problem he believed to be the cause. I probably would have used the fact that the last release of QuickBasic was in about 1989, before the widespread inclusion of FPUs in PCs, and that QuickBasic would almost certainly use software emulation for floating point arithmetic. It therefore would not have triggered a bug with the FDIV instruction.

I'm pretty sure it was with the introduction of the Pentium (which had the famous FDIV bug) that John Carmack officially made the switch to single precision FP for most things because it was finally fast enough. FP wasn't cheap, per se, but the simplification it brings over keeping track of binary points and precision/range tradeoffs in integerized algorithms should not be underestimated either.

For example, if I want to do a floating point multiply and add, I just say: f3 = f0 * f1 + f2. Before I even start writing a fixed-point multiply and add, I need to ask what the Q points (binary points) are for each of the terms, what Q point you'd like for the result, and what sort of rounding (if any) the result requires for stability. You can end up with a monstrosity like this, assuming all four numbers are at the same Q point:

x3 = (int)(((long long)x0 * x1 + (1LL > Q) + x2;

Ok, maybe you hide that behind a macro, but what about cases where some of the terms are at different Q points? A fully general macro (which is no fun to write, BTW) would also have a ton of arguments, and only reduce you to something like x3 = FXMULADD(x0, Q0, x1, Q1, x2, Q2, Q3); which won't win you any awards in the clarity department.

And look at the operations themselves, too. You have type promotion, extra adds and shifts... the instruction sequence itself isn't super efficient. It pays off when floating point takes 10s and 100s of cycles, but is a dubious win when most of the core FP starts coming down into the single digits. With the Pentium's dual pipes and the fact you could keep integer instructions flowing in parallel to the float, that's effectively what happened. And notice we haven't even talked about dynamic range and overflow errors and how they screw you up. If you have to add tests for that... yuck. With floating point, you degrade gracefully if your dynamic range spikes a little higher than you expect.

Anyway, getting back on topic: This isn't the first time an x86 has had a stack-pointer related bug. I remember the 80386s that had the so-called "POPAD bug". [indiana.edu] That one was a bit easier to hit.

Hopefully, AMD will be able to publish a microcode update or something to work around theirs. That's one thing modern x86s have over their predecessors: A good number of CPU bugs can be patched around with microcode updates. I believe Intel added that with the Pentium Pro, and AMD followed suit. I believe my Phenom is one of the affected parts. I guess I'll have to keep an eye out for such a patch.

Crash bugs are frustrating, but nowhere NEAR as scary as a bug that results in an incorrect but plausible computation. If the program crashes, you KNOW it crashed and you know the runs before that didn't crash are OK.

Note that IRL the two cases can overlap. That is, a bug that might trigger a crash or might trigger an incorrect computation that might be plausible depending on luck of the draw.

Imagine, there is a tiny bug that makes your floating point results just slightly wrong once in 1000 times. You run an iterative dynamic simulation of a bridge under load that runs for a million cycles. The results LOOK right...

Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

That might have been true on 386s.

But currently we're in 2012 and the most widely used instruction set for Linux on AMD processors is x86_64. Because these 64bit processors feature a big number of registers, the two arguments will be passed as registers, not on the stack. So the sequence of instructions isn't indeed common.

Not true. pop, pop, ret is a sequence that you are likely to see in any function that makes use of two or more callee-save registers - it will push them before using them and then pop them at the end. If you're lucky, the register allocator will have done some peephole optimisation and moved the pops earlier...

Nonense. It doesn't happen every time, or it would have been trivial to find. It happens under load, so it is most likely dependent on some other factor (e.g. CPU temperature, cache miss ratio, context switch timing) which may not have appeared in testing.

Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

The target function of the call has no business pushing or popping its arguments, ever. It doesnt work. Never has. The caller pushes the arguments and then in some calling conventions (such as STDCALL) the target function removes them from the stack using the return instruction itself ("ret 8" will remove 8 bytes of parameters) while in others the caller itself is responsible for removing the parameters (such as CDECL)

Let me repeat that what you are describing is not possible. When the target function be

The pushes and pops involved are for call-saved registers, not for arguments. Over the years GCC has kinda flip-flopped over the best way to handle that... whether to use PUSH and POP or to use SUB/MOV/MOV/MOV/... the MOV sequences produce much longer instructions, so if you are space-concious (e.g. -Os), you are more likely to get PUSH/POP.

Intel and AMD cpus, over the years, have been better or worse at optimizing instructions which adjust the stack pointer. These days PUSH/POP sequences should be as

Who is this "we".. are you, the anonymous coward, teamed up with icebike (68054)? Clearly you shouldn't be, since he most definitely stated his belief that a two-parameter function would pop its two input parameters near its final return statement.

anything the function has pushed onto the stack will be on the top of the stack, before the return functions.

What you are saying is not news to me. The problem with your argument is that in the x86-64 calling conventions (which is what the article is talking about) there are plenty of volatile registers to use. To be specific there are 7 general purpose registers (64-bit) as well as 6 SSE registers (128-bit) that are considered volatile. If a function really uses so many registers that it requires saving a few of the non-volatile registers, then the function is also most often going to be so non-trivial that it must maintain 16-byte stack alignment.

Only leaf functions can safely violate the 16-byte alignment rule and are allowed to push and pop willy-nilly, but leaf functions also dont need non-volatile registers themselves because they arent calling anything that might destroy the registers they use. So we are talking about a very narrow situation where the function is (a) A leaf function and (b) Takes many parameters (more than 4, certainly) in order to create the register pressure required to need to spill some of them onto the stack someplace other than the mandatory scratch stack space (for the first 4 arguments) required by the calling convention.

And it sounds like the sequence of instructions that causes it is not commonly found.

Really?Pop two off the stack and ret to the calling routine seems fairly common to me. Lots of functions use two arguments and are called with near calls in various programming languages.

So how come AMD systems aren't crashing all over the place? Why did it take him a year to be able to reproduce it reliably? I think my desktop machine has one of those chips in it but it's never had an unexplained crash.

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

I can well understand the guy who found it being all excited. The CPU is the last place you'd look for a bug, and finding one is pretty impressive, especially a really elusive one like this.

Actually, it could be occurring in other places/programs that aren't crashing but are [silently] producing bad results. The floating point bug, once isolated, could be probed for, and compensated for.

From what I can tell from reading the assembly code, the function is unremarkable except for the fact that it's recursive. It isn't doing anything exotic with the stack (e.g. just pushes at prolog and pops at epilog). The epilog is starting at +160 and the only thing I notice is that there are several conditional jumps there and just above it is a recursion call with a fall through. But, from the AMD analysis, it appears that it's the specific order of the push/pops that is the culprit. In this instance, it's r14, r13, r12, rbp, rbx

The workaround for this bug might be that the compiler has to put a nop at the start of all function epilogs (e.g. a nop before the pop sequence) on every function because you can't predict which function will be susceptible. Or, you have to guarantee that the push/pop sequence doesn't emit the sequence that causes the problem (e.g. move the rbp push to the first in sequence as I suspect that putting it in the middle is what is causing the problems)

Oh, I've found CPU bugs before. But I never found one others hadn't already found. The 16MHz 80386 had a bug with counters. If you did a REP MOVSW or similar instruction in a 16 bit mode, starting on an odd address, and you made the pointer registers roll over, the CPU would lock up. Couldn't handle the transition from 0xFFFF to 0x0001 in either direction. That was fixed in all the faster 386's. As I recall, there were about a dozen bugs in the 386. Of course later processors were all checked for those specific bugs, so they never happened again.

Then there's unintended features such as pipeline oddities. If you have self modifying code, and it changes the destination of a jump instruction immediately before executing it, the computer will jump to the old address. Step through those same instructions in a debugger, and it will jump to the new address. Strictly speaking, jumping to the old address is incorrect, but it doesn't break any good code and fixing it would wreck pipelining. This behavior has been known for a long time, and every CPU from at least the 386 to the Pentium 4 behaves this way. It wasn't an important problem because so little code was self modifying. Wasn't any good as a copy protection method either, as only an amateur would be fooled by it. I think it's been resolved in at least 2 ways. First, by amending the documentation for the instruction set to expressly state that behavior is undefined in such a case, and second, by proving that there is never any need for self modifying code. And making the separation between code and data explicit. Now we have No eXecution bits.

There are sometimes even Easter eggs. For some processors, a few unassigned opcodes performed a useful operation. It wasn't by design. Is that a bug? Another case was the use of out of bounds values. For instance, the ancient 6502 supports this packed decimal arithmetic mode, in which 0x99 meant 99. So what happened when some joker gave it an illegal value such as 0xFF? 0xFF was interpreted as 15*10+15 = 165, and one could perform some math on it and get correct results. Divide 0xFF by 2 (shift right), and it would compute the correct result of 0x82. That sort of thing makes life tough for emulators, and I have yet to find an Apple II emulator that reproduces that behavior faithfully.

There is when you're on a memory-constrained platform, which admittedly the PC is not. Selfmod code is still used in demo coding, especially with 256-byte and 4096-byte competitions, but that is exclusively an academic exercise.

On an embedded system with just a few kbytes of memory, like say an ARM-powered gadget, self-modifying code is still relevant, even in 2012. Just because we can put 4 gigs of Ram in a toaster doesn't mean we should.

Most of the undocumented op-codes on older CPUs were down to the fact that they were designed by hand rather than having the circuits computer generated. A computer will make sure all illegal op-codes are caught and generate an exception, but human beings didn't bother. Designers put in test op-codes as well which were usually just left in there for production. Even the way humans design circuits makes them more likely to produce useful undocumented op-codes and side-effects.

It was somewhat risky to use them though because the manufacturer might decide to change CPU. The Z80 design was licensed out and any number of companies could supply them, all with their own unique bugs. Some games like to used these features for copy protection and then broke when the producer switched supplier.

Though, it's still very serious. At least it generally causes your program to crash rather than spitting out a wrong answer. And it sounds like the sequence of instructions that causes it is not commonly found.

It may be uncommon to be found... but that doesn't equate to not exploitable

CPUs have plenty of bugs. It's not necessarily the last place to look, especially for less popular processors. The only reason it's rarer with Intel and Intel-copying CPUs is because the market is so much bigger and therefore the resources for QA. Actually the bigger and more complex the processors are becoming the more likely it is to have bugs. Of course most are things people don't worry about or that can be worked around by following advice in the errata.

In fact enough people assume CPUs have bugs only in the rarest of cases makes it hard to convince others that you have actually found a bug that's not in the errata. The same thing happens with compilers, you tell people that the bug must be in the compiler and they roll their eyes at you.

You try and find something that "the other guy" had a problem with and bring it up as worse so as to try and "protect" the thing you are a fan about? Because I see nothing about the FDIV bug anywhere but your post.

Oh and you know what that bug applied to, right? The Intel Pentium, the ORIGINAL Pentium. Not the Pentium MMX, not the Pentium Pro, not the Pentium II, not the Pentium III, not the Pentium 4, not the Core, not the Core 2, not the Core i, not the second generation Core i. And yes, that's how many major processor versions from Intel there have been since then (with another to launch in the next couple weeks). The original Pentium chips that had this problem came out almost 2 decades ago, 1993.

So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

No it doesn't. The story is about the AMD chips, nobody gives a shit about the FDIV bug and I'll wager there are people reading Slashdot who weren't alive when it happened.

The good news for AMD is that processors can often patch around this shit in microcode these days so a recall may not be needed. Have to see, but the potential is there for a software (so to speak) fix.

I don't disagree with your rant, however it is just not a good idea to dismiss a processor bug as "happened a long time ago". The point is, it happened. Processor bugs happen. And here is one that happened last year [ibm.com] if you must.

So seriously, leave off it. I get tired of any time there is a problem with $Product_X fans of it will point out how $Product_Y had a similar or worse error way back in the day and that somehow changes things.

Unless the $Product_X fans also point out that the maker of $Product_Y paid an awful lot of money to replace the broken CPUs.

> I get tired of any time there is a problem with $Product_X fans of it will point> out how $Product_Y had a similar or worse error way back in the day and that> somehow changes things.

The FDIV bug was really in a class of its own as CPU bugs go; it was trivially user accessible. You could test for the presence of the bug using a *spreadsheet*. This differs from pretty much every other CPU bug where you pretty much have to be cranking out some odd code before you see anything.

But there is a REALLY important question that needs to be asked here, which is "What are the REAL odds that a normal user will hit this?" I mean is it like the Pentium bug back in the day where unless you were working on some very specific math problems you'd never hit it in a million years, or is it like the Phenom I where even on BE chips you'd often have Core 2 BSOD the system if you went past 2.4GHz?

Because with all the problems AMD has had getting GloFlo up to speed the LAST thing they need is a chip r

What's really amusing is that I've been on the scene for so long if you google my name 'Matthew Dillon', the first entry is actually... me! And not the actor(s). I'm sure that grinds a bit but I do bask in the occasional fan mail reaching my inbox, just before I hit the 'delete' key.

In recent years its started to flip back and forth, and I expect Hollywood will again take over the top spot after things die down again:-)

Matt Dillon is a rather famous programmer (as programmers go). I assume that's why they mention him by name. I think a very large percentage of old Amiga hackers know who he is. He's also done work on the Linux kernel. Despite all that, he's best known for his work on FreeBSD and on his DragonflyBSD project. While a lot of old timers will know that, not everyone else will.

I can only imagine the time and effort spent on tracking down this problem - a rare CPU condition is exponentially more difficult to narrow down than most programming mistakes. A lot of progress in IT depends on engineers like this, who obsessively solve problems even when it's much easier to just ignore them, try to hack around them or pass the buck around. Kudos.

I agree but remember that must engineers are working on company time. For most companies it wouldn't be rational to have an engineer working months to isolate/reproduce this CPU bug. After all, this work will particularly benefit this company over all the other companies and at any rate it would be much cheaper to just do the workaround (which might be necessary anyway).
However, a good engineer probably couldn't resist looking into this in his free time (and maybe in company time with nobody looking!) at

A pertinent addition to the submission would be which CPUs have been found to be affected.The second link says Opteron 6168 and Phenom II X4 820. For a second I thought that bulldozer hasn't managed to do anything right, but these two examples are pre bulldozer.No doubt this is not an exhaustive list.

Does that mean that the kernel uploads the new microcode on boot ? How does it get it ?

The microcode module loads the microcode for the cpu from/lib/firmware/amd if it's newer than the one on the cpu. You can download and place new microcode updates from amd in this directory if needed or just let your distro provider update the microcode files when they push new packages out.

Linux in 64bit mode use register to pass arguments to functions.So the most common sequence at the end of a function isn't "bunch of pops then a (near) return", but "move the results into target registers and the return".Thus the bug sequence doesn't happen that often.

Indeed, even WIN64 uses the same essential calling convention because the hardware itself is designed with it in mind.. specifically the 64-bit structured exception handling requires 16-byte aligned read and write operations by design.

Presumably AMD will announce affected CPUs fairly soon, after they get done testing. This isn't the kind of thing they would be able to sit on, even if they wanted to. If your CPU has been working for you in general it isn't like it is going to suddenly go and beat up your cat or something, it'll be fine for a bit longer while AMD figures out which ones are all affected and figures out how to fix it.

As I noted in another post, depending on it may be possible to fix it via microcode. CPUs aren't "pure" hardw

I have to worry about stack smashing bugs here... can there be a way for (say) a data pattern in a media file, or carefully crafted javascript or java code that's been JIT-compiled, to break out of its sandbox? What about a hostile OS kernel running inside a VPS container taking over the hypervisor or bare iron? Hmm.

AMD has indicated to me that the Bulldozer is not effected, which is a relief.

I guess I should have realized this would get slashdotted. In anycase, it took quite a bit of effort to track the bug down. It was very difficult to reproduce reliably. It isn't a show stopper in that it really takes a lot of work to get it to happen and most people will never see it, but it's certainly a significant bug owing to the fact that it can be reproduced with normal instruction sequences.

I began to suspect it might be a cpu bug last year and after exhaustive testing I posted my suspicions in December:

Older versions of GCC were more prone to generate the sequence of POP's + RET, coupled with a deep recursion and other stack state, that could result in the bug. It just so happened that DragonFly's buildworld hit the right combination inside gcc, and even then the bug only occurred sometimes and only one a small subset of.c files being compiled (like maybe 2-3 files). The bug never manifested anywhere else, doing anything else, running any other application. Ever.

In particular the bug disappeared with later versions of GCC and disppeared when I messed with the optimizations. We use -O by default, not -O2. The bug disappeared when I produced code with gcc -O2 (using 4.4.7).

It is really unlikely that Linux is effected... the sensitivity to particular code sequences laid out in the compiler is so fine that adding a single instruction virtually anywhere could make the bug disappear. Even just shifting the stack pointer a little bit would make it disappear.

In anycase, for a programmer like me being able to find an honest-to-god cpu bug in a modern cpu is very cool:-)

Since the cat is out of the bag some further clarification is required so I will include some more of the email I received. I didn't quite mean for it to explode onto the scene this quickly, but oh well.

Again, note that this is *NOT* an issue with Bulldozer. And they will have a MSR workaround for earlier models.

>> quote"AMD has taken your example and also analyzed the segmentation fault and the fill_sons_in_loop code. We confirm that you have found an erratum with some AMD processor families. The specific compiled version of the fill_sons_in_loop code, through a very specific sequence of consecutive back-to-back pops and (near) return instructions, can create a condition where the processor incorrectly updates the stack pointer.

AMD will be updating the Revision Guide for AMD Family 10h Processors and the Revision Guide for AMD Family 12h Processors, to document this erratum. In this documentation update, which will be available on amd.com later this month, the erratum number for this issue will be #721. The revision guide will also note a workaround that can be programmed in a model-specific register (MSR)."
end quote

They go on to document a specific workaround when the MSR is not programmed, which is basically to add a nop for every five pop+return instructions (though I'm not sure if the nop must occur between sequences or within the sequence). I will note that just the presence of 5xPOP + RET does not trigger the bug alone, it requires a very specific set of circumstances setup prior to that (that gcc's fill_sons_in_loop() procedure was able to trigger when gcc 4.7.x was compiled -O, when compiling particular.c files).

As I said, this bug was very difficult to reproduce. It took a year to isolate it and find a test case that would reproduce it in a few seconds. Until then it was taking me upwards of 2 days to reproduce it on a 48-core and much longer to reproduce it on a 4-core.

Since the bug was stack pointer address is sensitive the initial stack randomization that DragonFly does multiplied the time it took to reproduce the bug. But without the stack randomization the bug would NOT reproduce at all (I would never have observed it in the first place). In otherwords, the bug was *very* stack address sensitive on top of everything else.

I was ultimately able to improve the time it took to reproduce the bug by pouring over all my previous buildworld runs and finding the.c files that gcc had compiled that were most statistically likely for gcc to seg-fault in. Then once I isolated the files I iterated all possible starting stack offsets and eventually managed to reproduce the bug within 10 seconds using a gcc loop (10-20 gcc runs on the same file).

Changing the stack offset by a mere 16 bytes and the bug went away completely. The one or two particular stack offsets that reproduced the bug could then be further offset in multiples of 32K and still reproduce the bug at the same rate. Using a later version of gcc and the bug disappeared. Compiling with virtually any other options (turning on and off optimizations)... the bug disappeared.

On the bright side, I thought this was a bug in DragonFly for most of last year and set about 'fixing' it, and wound up refactoring most of DragonFly's VM system to get rid of SMP bottlenecks and making it perform much better on SMP in the face of a high VM fault rate. So even though we wound up not doing the 2.12 release the eventual 3.0 release (that we just put out recently) has greatly improved cpu-bound performance on SMP systems.

I'm wondering if they will. This seems like a very odd timing issue that may be a problem in the electronics. Of course, I suppose they could just put in some microcode to wait after certain operations to make sure things settle and so avoid the hardware bug.

Their other options are to do a processor recall (like Intel + the infamous Pentium bug), or inform the compiler manufacturers that 'there be the dragons' (special case inserted into the code for the affected processor / architecture to bypass the affliction).

Intel has had quite a few serious chip bugs too, all in errata. A number of new cpu bugs in both AMD and Intel chips always appears in new generations, but both companies have very large test suites and the number of new bugs goes down in every generation.

Don't forget that Intel had to recall a sandybridge chipset early in the sandybridge cycle, which cost them something like a billion dollars because the related motherboards had to be thrown away and replaced. That was due to internal on-chip circuitry related to a SATA port burning out.

Right at this moment AMD has two issues facing it in order to compete on workstations: (1) Power and (2) Performance. Their initial bulldozer release clearly depends too much on compiler optimizations to make full use of the architecture. They will clearly have to bulk-up some of the simplifications they made that made their cpu cores a little too sensitive to instruction sequences generated by compilers and I hope their next few releases will do better.

On power consumption it comes down to the Fab as much as anything else. Their dependence on the Fab is clearly a problem and they've made a break for it to try to solve it, even though it is costing them dearly. At the same time Intel has made some major advances in their three fabs, to the point where Intel can do their entire production on just two of those three fabs now but they decided to keep the third fab because they think they can 'grow into' it.

So AMD definitely has some work ahead of it, and I am hoping they reserve some of their focus for the high-end and don't concentrate entirely on laptops. I always like to say that I love AMD, but in the stock market I invest in Intel. That's just business. But I got on the AMD bandwagon big-time when they got to 64-bit first and I stuck with them all the way through the Phenom II.

Now, at this moment, Intel's SandyBridge has the best value and AMDs bulldozer is quite far behind, so new purchases for me right now are Intel. That may change in the next year or two and when it does my new purchases will happily be in the AMD camp again. Frankly, AMD only has to get within shouting distance (~8%) of Intel and I will happily use AMD. AMD doesn't have to beat Intel.

I think there are a number of things AMD can do right now to compete better with Intel. One of the biggest is in the mini-server department (albeit clearly with lower volumes than their current focus on laptops & integrated graphics). AMD consumer cpus (aka Phenom II) always had ECC support but very few motherboards actually supported it, which made it difficult to use AMD for mini-servers and avoid the Intel Xeon tax to get ECC. If AMD worked on the mobo vendors to ALWAYS support an ECC option that would allow them to compete against Intel Xeons on price, even if they are unable to compete on performance.

On the opterons AMD clearly has the right idea going with high-core-count cpus, but the memory subsystem is lagging too much to really be able to make use of all those cores. That seems to be low-hanging fruit to me, something which should be readily addressable by AMD. The opterons still have a lot of value and potentially can have a radical improvement in value with Bulldozer, but only if AMD can push the core count and improve the memory subsystem.

On large multi-core boxes AMD also needs to improve CMPXCHG and other atomic instructions in situations where contention is high. Right now multi-chip opteron systems seriously lag Intel on contended latency due to cache coherency inefficiencies. Will Bulldozer fix those latency issues? I don't know.

AMD only needs to get within shouting distance of Intel for me to buy their chips, and work their mobo producers a bit more to get better overall support for their chip's capabilities. They don't have to beat Intel.

Once in the late 1990's we had a weird bug where FTPing or RCPing a particular file between two offices would often result in a corrupt file on the other end. We kept scratching our heads trying to figure out what could possibly be corrupting the file. FTPing it anywhere else succeeded... no corruption. Everything else between the offices seemed to work ok.

It wound up being a hardware issue with the T3 between the two offices. The hardware would corrupt the bitstream in a manner that tended to PASS the