First time accepted submitter biodata writes "The BBC is reporting that hundreds of UK commercial air flights have been delayed for most of Saturday due to an internal telephone systems problem in the National Air Traffic Control Service, and delays are likely to continue into the evening. A spokesperson said that it was a different software bug from the one which grounded flights in the summer."

Sure. It was really not a simple bug to put in, but the programmer who wrote it had already grounded flights in the summer, and thanks to that experience he also managed to put this bug in, despite all its difficulty.

British Telecom has had an issue (which has happened a number of times) which led to a minor timing glitch in one of their systems. When this happens, the data reliability on the FARICE line to Iceland drops and you start getting corrupted flight messages. Shanwick was alerted to the problem and both sides consulted and decided that the best solution in the interim would be something that had been done previously: disconnecting FARICE and thus forcing all connections through the backup line, DANICE, which appeared to be operating normally.

Unfortunately, the problem was even worse on DANICE. What appeared to be normal operation was only normal up to the data logger. Once it actually got to the flight tracking software, the messages were being refused, and corrupted messages were being sent in the other direction. So while BT was working on getting their system fixed, flight control managers were being forced to basically manually dig up ATC messages and copy-paste them off to the air traffic controllers (as much was handled through voice as possible as well).

But it got even worse. A totally unrelated communications network, Datalink, decided to misbehave during all of this, which may or may not have been due to the Shanwick problems. On the Iceland side, the general solution is to force a switchover to the backup system. Which was done... except a critical component on the backup system immediately crashed. Repeated attempts to switch and ultimately switch back caused even more problems for the air traffic controllers.

Eventually the fixed FARICE line was brought back up and Datalink came back online (with the switchover-crash problem postponed to be investigated during a low-traffic time period).

It's terrible that there were so many delays, but these are extremely complicated systems with a challenging task, built up over decades with tons of computer components, protocols, lines, routers, radar systems, transmitters, and on and on, scattered all over the world. And this was on a weekend. Everyone was scrambling and doing their damnedest to fix it as soon as possible. It should also be noted that it was never a safety issue - even in the absolute worst case, air traffic control could go all the way back to the old paper-and-pencil method. What the systems give is, primarily, speed, and thus when there are big problems, there are delays.

Of course, I might be entirely off base here, but below is the first impression I got.

Wouldn't a "fix" be as simple as routing all that junk encapsulated over a point-to-point ssh connection between two routers? Doesn't almost any router let you pack up all of the disparate kinds of traffic and push it over a "safe" pipe that doesn't give a flying fuck about datagram corruption? Wouldn't a solution here be, quite literally, two router boxes from any major vendor? Yeah, it may not perform all that great when

Mm, you do wonder if the fault was in using a modern TCP/IP link rather than an old-school, error-corrected-up-the-wazoo X.25. One of the problems with OSI was that it had a lot of error correction and was less efficient than TCP/IP.

The issue is, you deal with the system you're with, not the situation you wish you had.

We can't change a transmission protocol or route data over arbitrary connections. This is a collection of everything from very old hardware to brand new, protocols from very old to brand new, in every country in the world, and you can't just arbitrarily rework them. It's the same in the air, too. And when new protocols are made, they're generally in addition to existing ones, not replacing them. I'm not aware of any with

Ultimately, there are routers or modems involved, and they push some legacy protocols, and there's a lot of providers out there who offer modules for modern routing hardware that take those old protocols and push them quite transparently over modern data pipes. It's a reasonably well understood problem. It would not require reworking the whole thing, that's the whole point - you take what you have and push the data around using modern hardware that can ensure that the data is safe.
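To make the "safe pipe" idea concrete, here is a minimal sketch of what encapsulating a legacy frame with an integrity check could look like. It is only an illustration of the general technique: the names `encapsulate` and `decapsulate` and the header layout are made up for this example, and nothing here reflects what any ATC system or router vendor actually does.

```python
import struct
import zlib

# Hypothetical header for this sketch: payload length (2 bytes) and
# CRC-32 of the payload (4 bytes), both in network byte order.
HEADER = struct.Struct("!HI")


def encapsulate(payload: bytes) -> bytes:
    """Wrap an opaque legacy frame with its length and CRC-32."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decapsulate(frame: bytes) -> bytes:
    """Return the original payload, or raise ValueError if it was damaged."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    length, expected_crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch")
    return payload
```

The receiving end never has to understand the legacy protocol. It only has to refuse a damaged frame so the sender can retransmit, instead of passing corrupted messages on to the software behind it.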

I'll assume that it was only because you were overworked that you missed the humour in my comment. What I did was to give a possible interpretation which would have made the erroneous sentence correct. Of course I didn't mean to imply that someone really added bugs intentionally. At least one person understood it and gave me a "Funny" mod.

But anyway, your comment was full of interesting information, so it was the rare case of a productive Whoosh. Thank you for sharing that information.

This happens enough that I often wonder whether the editors are really that careless, or whether they intentionally insert errors like that in order to provide fodder for those who so enjoy writing posts correcting the article and complaining about the lack of editing. Thorough proofreading would kill one of the memes that makes slashdot what it is.

This wouldn't -- no, couldn't -- have happened in the days before computers.

Eventually, I think centralised computer control is going to go the way of semaphore. It's too easy for a centralised computer system to glitch, break, or be shut down, and then screw up the lives and functions of millions.

What we should see is decentralised systems run using independent computer systems.

The efficiency gains from centrally controlled, fully integrated computer systems simply dwarf any benefits you might get from time to time with a distributed system.

A central computer with occasional downtime is acceptable when the alternative is a stupid, slow clerical system every day, all the time. "Clerical" is what disparate, independent systems always break down to because of the amount of human effort required to keep them working together.

What efficiency gains? Airlines would be far more efficient if they could fly direct from A to B, rather than being funneled into narrow corridors. Pretty much since the advent of GPS, people have been trying to get rid of 'air traffic control' and replace it by direct communication between aircraft which know where they're going and where they want to go.

They do operate it as a 24/7 operation. However, at night time there are fewer planes in the sky, so each traffic controller is given a bigger area to work on and there are fewer of them on duty. During day time, these areas are subdivided into smaller areas and more controllers are brought on-line to work on the larger number of areas. It was this switch-over that failed.

`One of the key changes involves improving the warning messages that flash on the air traffic controllers' screens when an aircraft moves out of their area of control and responsibility. The aim is for a warning to flash on the display to remind the controllers to ensure that they have completed all their co-ordination checks before an aircraft leaves their screen and becomes the responsibility of others.

"There is a quirk over whether it flashes or not," says Chisholm. "We want it to work in 100% of case

I was traveling from Heathrow to Beijing via Helsinki (5.5 hour layover) on a flight that was supposed to leave LHR at 7:30 but was delayed until 9:00... The estimated departure moved again backwards and forwards once (after we got on the plane), but it seemed to be a minor delay from my point of view.

The most annoying thing was that the online systems weren't showing the disruption. I was looking at the departure board at LHR and it was showing the delay (though it took a while), but the online web page and the 'Heath