Last week’s failure of a Federal Aviation Administration computer system handling air traffic in the southwestern US was caused by an error that filled the system’s available memory, Reuters reports. Security experts who evaluated the system told Reuters that the flaw could have been used by attackers to shut down the FAA’s En Route Automation Modernization (ERAM) system, stopping air traffic nationwide. However, creating the conditions needed to exploit the error would be difficult.

ERAM, a system designed for the FAA by Lockheed Martin, has a capability called “look-ahead” which searches for potential conflicts between aircraft based on their projected course, speed, and altitude. Because of the computing requirements for handling look-ahead for all of the flights within a given region of controlled airspace, Lockheed Martin designed the system to limit the amount of data that could be input by air traffic controllers for each flight. And since most flights tend to follow a specific point-to-point course or request operation within a limited altitude and geographic area, this hasn’t caused a problem for ERAM during previous testing.
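The core of such a look-ahead pass can be sketched in a few lines. This is purely illustrative Python; the class, separation thresholds, and time step are my assumptions, not ERAM's actual design:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Flight:
    callsign: str
    x: float          # position east, nautical miles
    y: float          # position north, nautical miles
    vx: float         # velocity east, nm per minute
    vy: float         # velocity north, nm per minute
    altitude: float   # feet

def project(f: Flight, minutes: float) -> tuple[float, float]:
    """Dead-reckon the flight's position `minutes` ahead."""
    return (f.x + f.vx * minutes, f.y + f.vy * minutes)

def conflicts(flights: list[Flight], horizon_min: int = 20,
              lateral_nm: float = 5.0, vertical_ft: float = 1000.0):
    """Flag every pair predicted to lose separation within the horizon."""
    out = []
    for t in range(horizon_min + 1):
        for a, b in combinations(flights, 2):
            ax, ay = project(a, t)
            bx, by = project(b, t)
            if (abs(a.altitude - b.altitude) < vertical_ft and
                    ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < lateral_nm):
                out.append((t, a.callsign, b.callsign))
    return out
```

Even this toy version makes the scaling problem visible: the work is quadratic in the number of aircraft and linear in the number of projected states per aircraft, which is why bounding the per-flight data matters.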

A Lockheed Martin video describing the "En Route Domain" of the FAA's traffic control system and how ERAM works.

A flaw in the system was exposed when a U-2 spy plane entered the air traffic zone managed by the system in Los Angeles. The aircraft had a complex flight plan, entering and leaving the zone of control multiple times, according to Reuters’ sources. On top of that, the data set for the U-2 flight plan came close to the size limit that ERAM’s design imposes on flight plan data. The flight plan also lacked planned altitude data, so an air traffic controller manually entered it as 60,000 feet.

However, the system ignored the manually keyed altitude and began evaluating all possible altitudes along the U-2’s planned flight path for potential collisions with other aircraft. That caused the system to exceed the amount of memory allotted to handling the flight’s data, which in turn resulted in system errors and restarts. It eventually crashed the ERAM look-ahead system, degrading the FAA’s conflict handling for all the other aircraft in the zone controlled out of its Los Angeles facility.

I'm getting on a plane in a few days. This is slightly unsettling. I know backup systems exist (including human intervention). I just don't want to be the example that causes public hearings and drives extra funding.

Edit: I've done some previous FAA consulting - they generally have their shit together. Still slightly unsettling, knowing the feds and how things tend to get done: by the lowest bidder.

I'd just love to know how old this software is, and which language/system it's built on. The FAA is notorious for being stuck in the 50s. What they're probably not saying is that the vacuum tubes popped.

Err, the fact that it's a Modernization program suggests it is, in fact, a newer solution to that very problem. A quick wiki check shows it first went fully operational in 2009, with an earlier system in 2006. I don't know the exact language, but I doubt it's too esoteric.

The construction of any non-trivial system is hard, and will never stop being hard.

I'd be terrified to write flight control software.

I agree, but on the other hand this seems like an unhandled exception. If they purposely limited the amount of memory that could be dedicated to the task, I would think one of the very first items on their list should have been how to gracefully handle the case where an overflow occurs.
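A guard along these lines (entirely hypothetical, not the actual ERAM fix) is what that commenter is describing: turn a fatal overflow into a rejected flight plan so the rest of the system stays up.

```python
class LookaheadBudgetExceeded(Exception):
    """Raised when a flight plan would exceed the look-ahead memory limit."""

MAX_RECORDS = 10_000  # illustrative cap, standing in for the real allocation

def add_flight_plan(records_needed: int, store: list) -> None:
    """Admit a plan only if it fits inside the remaining budget."""
    if len(store) + records_needed > MAX_RECORDS:
        raise LookaheadBudgetExceeded(
            f"plan needs {records_needed} records; "
            f"{MAX_RECORDS - len(store)} available")
    store.extend(range(records_needed))  # stand-in for real records

def safe_add(records_needed: int, store: list) -> bool:
    """Reject the oversized plan and keep running instead of crashing."""
    try:
        add_flight_plan(records_needed, store)
        return True
    except LookaheadBudgetExceeded:
        return False
```

The design point is that the limit check happens before any allocation, so an oversized plan never corrupts or exhausts the shared store.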

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

Except if their crappy work the first time has catastrophic effects, they're not going to get any work anywhere ever again, so there's a good incentive to do it right. Would you contract work out to someone who designed the system responsible for causing a fatal plane crash, or cheaped out on building a bridge resulting in its collapse during rush hour, etc.?

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

Except if their crappy work the first time has catastrophic effects, they're not going to get any work anywhere ever again, so there's a good incentive to do it right. Would you contract work out to someone who designed the system responsible for causing a fatal plane crash, or cheaped out on building a bridge resulting in its collapse during rush hour, etc.?

They hit a low point in February of around $30. But it appears the stock market, at least, doesn't believe that utter incompetence matters in the long term. Granted healthcare.gov wasn't _directly_ life-or-death, but their failure to get it working on time was so total, you'd think their name would be poison in the future. During the healthcare.gov debacle, there were a bunch of articles on the class of companies like CGI who excel at getting government contracts, even if their actual output sucks.

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

You would be surprised what governments sometimes do to contractors that extract more than they deserve...

Edit: I've done some previous FAA consulting - they generally have their shit together. Still slightly unsettling, knowing the feds and how things tend to get done: by the lowest bidder.

Lowest bidder? This is Lockheed Martin. They don't do "low".

Depends on what you consider low. Also, Lockheed could be only the prime contractor. You don't know whether this is all in-house or not, regardless of what their website says (as long as they have a contract or own the IP). I've consulted on plenty of contracts that had 5 or 6 names on a single task, and that's only going one sub-layer deep. And a Lockheed-developed system doesn't necessarily mean Lockheed handles implementation, production, and/or support. I'm not in a position to say whether Lockheed was the lowest bidder on this project, but I'll assume they were following the law. That means Lockheed was the lowest bidder that the FAA thought could actually execute well.

I'm sure the math and the software for this kind of system are quite complex and demanding, but in a world where we have access to systems with silly amounts of memory and computing power, is it really that far a stretch to support additional info and complicated flight plans?

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

Except if their crappy work the first time has catastrophic effects, they're not going to get any work anywhere ever again, so there's a good incentive to do it right. Would you contract work out to someone who designed the system responsible for causing a fatal plane crash, or cheaped out on building a bridge resulting in its collapse during rush hour, etc.?

They hit a low point in February of around $30. But it appears the stock market, at least, doesn't believe that utter incompetence matters in the long term. Granted healthcare.gov wasn't _directly_ life-or-death, but their failure to get it working on time was so total, you'd think their name would be poison in the future. During the healthcare.gov debacle, there were a bunch of articles on the class of companies like CGI who excel at getting government contracts, even if their actual output sucks.

Great example. Nearly all the major contractors have had their fair share of screw-ups over the decades, sometimes even costing lives, or at the very least causing massive cost overruns, and they've pretty much always survived. Booz Allen is back to fine now (its highest stock value since it entered the market), despite alarmist articles all over the press and cocktail-party talk in the consulting world saying the company would never get federal work again after Snowden and that its reputation was tainted forever.

One notable exception is Arthur Andersen, which never recovered after Enron; as a convicted felon it also had to give up its CPA license (a conviction SCOTUS later overturned on what comes across to me as a BS technicality). But such cases are few and far between.

If anything, the failure mode is indicative of numerous problems within the software system. If the system had known limits, there should be code that monitors those limits and prevents any kind of cascade failure from hitting them.

I have to wonder about their QA processes, because based on even the basic explanation of the software I can immediately think of at least one test case that should have caught this bug. Essentially: enter the maximum values for everything, then, one by one, leave a single attribute undefined.
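That test-generation strategy is mechanical enough to write down. A minimal sketch, with entirely hypothetical field names and limits (the real flight-plan schema is not public in the article):

```python
# Hypothetical flight-plan fields at their maximum-size values for testing.
MAX_VALUES = {
    "waypoints": ["WP%03d" % i for i in range(100)],  # longest allowed route
    "altitude": 60000,                                # highest allowed level
    "speed": 500,
    "remarks": "x" * 256,                             # longest allowed string
}

def boundary_cases(max_values: dict) -> list[dict]:
    """All-fields-at-max, plus one variant per field with that field missing."""
    cases = [dict(max_values)]          # everything at its limit at once
    for field in max_values:
        case = dict(max_values)
        case[field] = None              # one attribute left undefined
        cases.append(case)
    return cases
```

Feeding each generated case through plan validation would have produced exactly the U-2 scenario: a near-maximum plan with the altitude left undefined.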

It was purely the result of incompetent programmers, testers, and management by a company that placed profitability far, far above the safety of human lives in a critical component of this nation's infrastructure.

People assume, for no good reason at all, that airline software must be safe. They assume the same things of the software that runs prisons, nuclear power plants, etc.

No, this software is not safe. It was created in a way that specifically undermines its safety: lowest-bidder contracting. Anyone bidding a price that included actually employing the experts necessary to create secure, reliable software, and giving those experts the control they needed to get the project done right, would be rejected immediately. Companies like Lockheed Martin, who simply lie and claim they can achieve the same level of security and reliability for a lower price by employing mostly junior-level developers, win. And then the projects go over budget and over schedule, until those on the government side of the contract cave to political pressure and simply sign off on the project, accepting whatever half-assed, unworking solution is presented. Then they get in-house people to get it into a state where it can actually run.

The only reason major infrastructure hasn't been destroyed or disabled by hackers is simple: hackers don't want to do it. If they wanted to, all planes would be grounded, prison doors would spring open, and ATMs would regurgitate their currency into the streets by the end of the day. None of these systems are safe or well-designed. Hit up YouTube and watch some video presentations from the hacker conventions; I recommend the Chaos Communication Congress in particular. Within the past few years they've had very good presentations on the vulnerabilities in airlines, prison systems, power plants, etc.

Apparently society is going to wait until an actual real-life supervillain emerges and uses these systems against many people before they're going to demand that those in charge of the systems actually understand them and do what is necessary to make them competent. It's not like we don't know HOW to make solid software. We do. It's just expensive, and requires employees who are not just your average software engineer. It requires people who are experts in their field, who are absolute hard-asses about what is and is not safe, and who will never bend simply because the president or CEO or some other dipshit authority figure thinks that their political or economic motives can somehow force the computers to behave as they are needed to.

If you're a company that wants to do this, ask your job applicants what the last academic security paper they read was. If they don't have one, boot them out instantly. If they don't know what differentiates a Turing-complete protocol from a non-Turing-complete one, boot them immediately. The only exception might be a math-centric cryptography researcher or somesuch who is branching out into computer science. CS is a big field, and a person with a CS degree may never have even heard of the things that are most important to your system. If they haven't, they've at least got to understand the theoretical basis of the work so they can learn it. If they spent the past 10 years maintaining software and they can't give you a laundry list of critical security vulnerabilities that would enable you to destroy that system in minutes, they're not someone you're interested in. Every system has those, and if they didn't pick up on them, they won't pick up on yours.

The system came into use in 2009, after three years of testing, and a major bug wasn't discovered until five years later. I would say that's pretty smooth for such a complex system.

You must be a manager IRL.

How I see it:

1. Why did the system allow input of a flight plan without an altitude? A plane cannot fly without an altitude, so that field must have an entry, and it must be non-zero.
2. Why didn't the system accept the manually input altitude? Read another way: why didn't the back-up option work?

I'm disturbed that the system ignored the 60,000 ft altitude entered by the controller. It shouldn't have ignored user-entered data. But at least we know about this vulnerability now, while no one got hurt. It can now be fixed.

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

Except if their crappy work the first time has catastrophic effects, they're not going to get any work anywhere ever again, so there's a good incentive to do it right. Would you contract work out to someone who designed the system responsible for causing a fatal plane crash, or cheaped out on building a bridge resulting in its collapse during rush hour, etc.?

They hit a low point in February of around $30. But it appears the stock market, at least, doesn't believe that utter incompetence matters in the long term. Granted healthcare.gov wasn't _directly_ life-or-death, but their failure to get it working on time was so total, you'd think their name would be poison in the future. During the healthcare.gov debacle, there were a bunch of articles on the class of companies like CGI who excel at getting government contracts, even if their actual output sucks.

I'll betcha some deaths could be dug up in Oregon after Oracle's totalfail with Cover Oregon. I know I personally had the system fail me rather unbelievably when I needed it, and I ended up spending, at the time, 13% of my entire liquid assets to get one (1) mandatory prescription refill because of the incompetence of Oracle and their state government minders. Oracle has a history reaching back decades (when the 'Enterprise' software division was still PeopleSoft, unmolested by Larry Ellison) of huge failures resulting in millions lost and big lawsuits... and people still keep paying them money.

I presume it's lobbyists and bribes. I mean, what else could it be, other than sheer ignorance and a failure to do due diligence on a (say) $200M system contract? Both are quite plausible.

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

You would be surprised what governments sometimes do to contractors that extract more than they deserve...

I very much doubt we should regard "modernization" as anything close to what we consider modern. If you saw the software used to input flight plans, receive weather information, or get NOTAMs, you'd think Windows 95 was the height of modern interface design. Think teletype. Routing isn't that complicated either. There are airways (think point-to-point freeways) combined with zones around airports that intersect them. They aren't that complicated. And checking at all altitudes is just a lazy programmer not throwing an error and rejecting the plan. The only mitigating factor is that SoCal and New York are probably the two busiest airspaces in the world.

This will just be like every other time a government contractor screws up on a project. They'll get paid millions more to fix the problems. They actually have an incentive to do crappy work because it's more profitable to get paid to fix all of the problems than if they just did it correctly the first time.

Except if their crappy work the first time has catastrophic effects, they're not going to get any work anywhere ever again, so there's a good incentive to do it right. Would you contract work out to someone who designed the system responsible for causing a fatal plane crash, or cheaped out on building a bridge resulting in its collapse during rush hour, etc.?

Um... the LCS, the F-22, the F-35, the Ford class of aircraft carriers? Do any of them ring a bell? These are all projects, concluded or in the works, that have come in substantially above their original bids, with more than half of them (the LCS, the F-22, and the F-35) having design flaws that the contractors are being paid billions to fix.

And these contractors keep getting more contracts.

Not to be pedantic (well, not too much) but screwing up and then being paid millions or billions of dollars to fix it is SOP for government contractors. Though an unpopular point of view, the fact is, this is how it usually works. Having a government contractor fired for incompetence is the extremely rare exception.

Just thought I'd make that cynical point.

That the software apparently ran well for a long time and needed such an unusual event to trigger a collapse indicates that the contractor wasn't incompetent. I don't think this event merits firing them, but they need to revamp their stress testing to ensure such things don't happen again (for example, by accepting controller input as the system should have, and by handling memory-allocation errors more gracefully). Considering the complexity of the system, that may take some doing.

Oddly, system complexity seems to be the problem behind most of the cost-overruns in most government contracts. Sometimes, simpler is better. Or if not better, at least more reliable.

I wrote about an hour ago "A bad SQL Join/intersection, probably could have been followed by injection of code. LoL. "

Since I got a whole mess of downvotes for posting something technical, and this is Ars, I thought I'd explain my comment better for those without experience programming games, in which targets move around at various speeds and you check for detections and collisions.

I'm guessing their software is similar to what I've worked with. It's basically a loop with a SQL statement that's concatenated with the fields that are input. Using several SQL tricks I'm not going to discuss, you construct a statement on the fly that calculates all future positions, say, one hour into the future, in certain time increments, say, every 10 seconds. Certain fields may allow complex input (with certain controls: i.e., only sysadmins or managers can enter that info). Say ">60000" (I take what the article says with a grain of salt) was entered for the altitude. The code might run the SQL as it advances time, and the query starts at 60,000 feet in 1,000-foot increments going up to infinity, with each of those also being checked against all other tracks within a certain range (speed and visibility factor into the size of the range check). The oops being that the entry should have been 0-60,000 feet, which would have bounded the possible paths the U-2 was flying when checking for intersections with commercial airliners.

I'm guessing the backend is Oracle, and the program has a field allowing input of a string of a certain length. Code injection is a possibility at that point, depending on the patch status of the server and any workarounds. Given that the FAA is fully funded, I'll say they are probably up to date, or at least I should hope so.

Except that it's almost undoubtedly not implemented in SQL or using a database backend. It's a custom application developed specifically for looking at a specific set of current plane locations and flight plans and checking for potential collisions. For an application like this, you'd be looking for the efficiency of optimized C code, not hacked-together SQL queries.

The construction of any non-trivial system is hard, and will never stop being hard.

I'd be terrified to write flight control software.

I made a career decision while still in my teens to go into game development instead of defense contracting (I actually had both offers in front of me at the same time) because I'd rather a bug I wrote not kill someone. Many years later I'm still not interested in doing defense contracting, industrial control systems, or certain types of medical equipment. Unfortunately, I'm sure there are plenty of people rushing into those jobs without ever having given much thought to the ramifications.