Posted by samzenpus on Friday August 10, 2012 @02:06AM from the distant-support dept.

CWmike writes "Picture doing a remote software upgrade. Now picture doing it when the machine you're upgrading is a robotic rover sitting 350 million miles away, on the surface of Mars. That's what a team of programmers and engineers at NASA are dealing with as they get ready to download a new version of the flight software on the Mars rover Curiosity, which landed safely on the Red Planet earlier this week. 'We need to take a whole series of steps to make that software active. You have to imagine that if something goes wrong with this, it could be the last time you hear from the rover,' said Steve Scandore, a senior flight software engineer at NASA's Jet Propulsion Laboratory. 'It has to work,' he told Computerworld. 'You don't want to be known as the guy doing the last activity on the rover before you lose contact.'"

That is why I do not understand why the NASA engineers want to take such a risk.

Unless it is a totally fatal software bug - that is, if they do not upgrade the software, the Curiosity rover is going to be bricked - I do not think taking the risk of bricking the rover for a regular software upgrade is worth the danger of bricking the rover, which is, as TFA has stated, 350 million miles away.

Similar to some devices here on Earth, the rover should have an automatic revert solution. For instance, a non-updatable software running on a separate processor detects specific conditions (like no signal from Earth for a while) and flashes back the updatable software to its original version when that condition occurs.
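A rough sketch of that kind of revert logic, just to make the idea concrete. Everything here is made up for illustration (the `Watchdog` class, the `silence_limit_s` threshold, the image names); it says nothing about how the rover's actual flight software works.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Hypothetical non-updatable supervisor running on its own processor."""
    golden_image: str        # original software, never overwritten
    active_image: str        # updatable software currently running
    silence_limit_s: float   # assumed threshold for "no signal for a while"
    last_contact_s: float = 0.0

    def heard_from_earth(self, now_s: float) -> None:
        # Any valid signal from Earth resets the silence timer.
        self.last_contact_s = now_s

    def tick(self, now_s: float) -> bool:
        """Flash back the original image after too much silence.

        Returns True only when a revert actually happened.
        """
        silent_for = now_s - self.last_contact_s
        if silent_for > self.silence_limit_s and self.active_image != self.golden_image:
            self.active_image = self.golden_image
            self.last_contact_s = now_s
            return True
        return False
```

The point of the design is that `tick` depends only on the clock and the last contact time, so it keeps working even if the updated software is the thing that broke communication.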

Get a 10-foot 4X4 piece of lumber. Drop it flat on the ground. Walk from one end to the other like a balance beam. I'll bet you can do it. I'll bet you can do it blindfolded, walking backward. I'll bet you can do it reciting the alphabet backward. I'll bet you could do it drunk.

Take that same 4X4, suspend it 20 stories in the air between a couple of cranes. Put a bunch of razor sharp, rotating propellers on the ground beneath it. Intersperse the propellers with oil drillbits pointed up, not down for once. Have a bunch of trained turkey vultures flying around to watch you fall. Take your wife, kids and your momma, put a gun in their mouths while the Joker cackles that when you fall, he's gonna blow their heads off. Bring in the television cameras and monitors so the whole world can watch and you can watch them watch. Have some intern read the tweets and comments sections about your plight over the loudspeakers.

Now, there are a few ice-blooded "Licensed to Kill" Double-O men who could keep it together and walk that beam under that kind of pressure. Mary Lou Retton and Nadia could, no doubt. I seriously doubt I could.

Is it a big deal to do a software upgrade under such tightly controlled conditions? Not really. But try doing that software upgrade when billions of dollars and your career are on the line, with the whole world watching. The guy who screws that up is gonna be a punchline and a byword for a few decades, a real Wilson if you've read that book. :-) You'll be known as the guy who screwed up Mars.

Tell me there wouldn't be maybe one or two drops of sweat on the keyboard...

"I do not think taking the risk of bricking the rover for a regular software upgrade is worth the danger of bricking the rover..."

I guess it all depends on (A) what the perceived value of the upgrade is, versus (B) the perceived risk.

It's probably a safe bet that they learned from the Surveyor issue, and built in better tests and safeguards. I imagine -- although I don't really know -- that they have implemented something like the "rolling upgrades" that are common now, which allow processes to be replaced on the fly one at a time, without reboot, and with a failsafe revert that runs at a higher level than any of those processes if anything goes wrong.

It isn't like Windows, in which just about every time you install or upgrade something you have to make all the changes then "reboot". They get done one at a time, and they are tested individually after they are made.

It sounds complicated but conceptually it's pretty simple: you have a top-layer monitor program that accepts commands to replace lower-level processes. All it needs to be pretty "fail-safe" is to wait for a specified period of time for an "okay" signal from Ground Control. If it doesn't receive one in the specified time, it automatically reverts the process back to the old version. It's a little more involved than that, but that's the idea.
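To put the monitor idea in code, here is a minimal sketch. The `Monitor` class, its method names, and the process names are all hypothetical, and a real system would be a lot more involved, but the replace / wait for "okay" / revert cycle looks roughly like this:

```python
class Monitor:
    """Hypothetical top-layer monitor that swaps lower-level processes."""

    def __init__(self, versions):
        self.versions = dict(versions)  # process name -> running version
        self.pending = {}               # process name -> (old version, deadline)

    def replace(self, name, new_version, now, timeout):
        # Remember the last confirmed version so we can fall back to it.
        if name in self.pending:
            old = self.pending[name][0]
        else:
            old = self.versions[name]
        self.pending[name] = (old, now + timeout)
        self.versions[name] = new_version

    def okay(self, name):
        # "Okay" from Ground Control: the new version becomes permanent.
        self.pending.pop(name, None)

    def tick(self, now):
        # Revert every process whose "okay" did not arrive in time.
        reverted = []
        for name, (old, deadline) in list(self.pending.items()):
            if now >= deadline:
                self.versions[name] = old
                del self.pending[name]
                reverted.append(name)
        return reverted


m = Monitor({"nav": "1.0", "comms": "1.0"})
m.replace("nav", "2.0", now=0, timeout=60)
m.replace("comms", "2.0", now=0, timeout=60)
m.okay("nav")   # only nav gets confirmed
m.tick(60)      # comms goes back to 1.0, nav stays at 2.0
```

Note that the default outcome is the revert: doing nothing (or losing the link) puts the old version back, and only an explicit "okay" keeps the new one.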

Unbelievable, this is so stupid... WHY NOT INCLUDE A SECOND BIOS, or whatever the fuck they are using? If it's so precious and easily broken, why not use backup hardware? It's not like it would add another half kilo of weight! The risk is TOO BIG not to do that. A few grams => problem solved.

You have a separate "supervisor" board that moderates among the computers.

And then that board becomes a single point of failure.

In a case like that, you only need 3 for Damned Good Redundancy.

3 computers and a supervisor? That's already 4 components.

If you want to handle t arbitrary node failures, then you need at least 3t+1 nodes in total. Whether you call the nodes computers or supervisor boards doesn't change that fact. If you have t failures among 3t or fewer total nodes, then the failures can happen in a way that causes the functional units to receive such inconsistent information that they are unable to do anything meaningful. It is a case of Byzantine agreement.

Any system designed to handle failures of one third or more components is making assumptions about how the failed components behave. If the failed components behave differently than the assumption, it takes even fewer failures to break the entire system.
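For anyone who wants the arithmetic spelled out, here is the 3t+1 bound as two small helpers. The function names are mine, and this is just the classic bound for Byzantine agreement without signed messages, not a model of any particular spacecraft.

```python
def min_nodes(t: int) -> int:
    """Fewest nodes needed to tolerate t arbitrary (Byzantine) failures."""
    return 3 * t + 1


def max_faults(n: int) -> int:
    """Most arbitrary failures that n nodes can tolerate under that bound."""
    return (n - 1) // 3


print(min_nodes(1))   # 4: three computers plus a supervisor is the minimum
print(max_faults(3))  # 0: three nodes cannot mask even one arbitrary failure
```

So "3 computers and a supervisor" is exactly the n = 4, t = 1 case, and dropping to three nodes in total only works if you assume the failed one fails in a well-behaved way.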

First off, shielded hardware is NOT a few grams. A second system adds a significant amount of weight. Each gram added to the rover is several hundred kilos more propellant required. In any case, they DID add a second system, which will take over in the event of an emergency. However, even then, an update is quite perilous, because you could theoretically brick the one system, and if something else goes wrong, you now have no backup.

That reminds me... I have sometimes wondered what security protocols NASA (and their Russian counterparts) have in place for their probes, from now all the way back to the 1970s, when security wasn't nearly as advanced as it is today.

Is it possible that someone with a large directional backyard antenna can hack some of the probes? To be remembered as the man who killed Voyager 2 might be attractive for some people. And who's to say that this hasn't already happened? There are non-responding probes out there, with no evidence for why they failed.

They do indeed have systems like that. If you're interested, it's worth looking into how they dealt with the Sol 18 Anomaly on Spirit. Of particular note is the "Shutdown Dammit" command that they used to override everything else the rover was doing so it would stop wasting battery overnight.

Seeing as they were able to update the software on a device that wouldn't even finish booting, I imagine the procedures for doing it on a functioning device are pretty robust, even if they're still nailbiting.