Flash memory issue forces Curiosity rover into safe mode

Stray cosmic ray may have corrupted a file.

The Mars rover Curiosity is switching over to a backup computer after a corrupted file caused the primary "A-side" computer to glitch. On Wednesday, February 27, the rover failed to send its daily data dump back to Earth and switch into sleep mode. Mission Control made the decision to switch Curiosity over to its backup computer and suspend its scientific work for a few days.

"Don't flip out: I just flipped over to my B-side computer while the team looks into an A-side memory issue," NASA posted on the rover's Twitter account.

Like most spacecraft, Curiosity has two computer systems on board. The A-side computer is used for daily operations and the B-side is used as a backup. Until the B-side computer has been updated with the data necessary to assume control of the rover, Curiosity will sit on the Martian surface in "safe mode."

Configuring the B-side computer to take control of the rover may take a while, according to NASA. "We have probably several days, maybe a week of activities to get everything back and reconfigured," Richard Cook, Curiosity project manager at NASA's Jet Propulsion Laboratory, told Space.com.

The cause of the corrupted file in the flash memory of the A-side computer is unknown, but it could be a stray cosmic ray. "The hardware that we fly is radiation tolerant, but there's a limit to how hardened it can be," Cook said. "You can still get high-energy particles that can cause the memory to be corrupted. It certainly is a possibility and that's what we're looking into."

Once the B-side computer has assumed control of the rover, the team will attempt to get the A-side up and running again as a fully functional backup.

Eric Bangeman
Eric has been using personal computers since 1980 and writing about them at Ars Technica since 2003, where he currently serves as Managing Editor. Twitter: @ericbangeman

Does this mean the controllers have to put up with "SAFE MODE" in all four corners of the screen and basic VGA graphics until the driver is rolled back?

More seriously, I understand they can't run diagnostics on the A-Side until the B-Side is up and running... is the B-Side kept in a pristine as-launched condition, or does it receive updates maybe two or three iterations behind the A-Side? In other words, when the A-Side becomes the new backup, will it be rolled all the way back to as-launched condition?

I would imagine the state machine updates between side A and side B would be closer. Perhaps one update earlier? I had assumed the backup computer would fire up within hundreds of nanoseconds after the primary failed. It's not as live as I thought. But then I don't know the full mission parameters ...

Gosh, these systems must have to be robust to an unholy degree. I remember reading somewhere that NASA's software practices are extremely good and they were able to reprogram the Voyager probes remotely even after significant system failures.

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

Does this mean the controllers have to put up with "SAFE MODE" in all four corners of the screen and basic VGA graphics until the driver is rolled back?

More seriously, I understand they can't run diagnostics on the A-Side until the B-Side is up and running... is the B-Side kept in a pristine as-launched condition, or does it receive updates maybe two or three iterations behind the A-Side? In other words, when the A-Side becomes the new backup, will it be rolled all the way back to as-launched condition?

I'm guessing that:

(1) The bandwidth and latency they experience transmitting data slow any work they do on both computers.
(2) When they experience an issue, they take things very slowly and do only one thing at a time, to prevent other issues from arising.
(3) The B side probably performs some other functions (other redundancies), so it's better to have it do that than to have the convenience of just flipping the switch and carrying on when something goes haywire with A.

This gives you an idea of the design restrictions the NASA engineers have to work with. The rover has to be radiation hardened, has to be able to resist planetary entry, has to resist a very harsh environment, but it also has to be light.

Is it because there's no atmosphere on Mars that cosmic rays can get to the Martian surface and break drives?

Atmospheres and magnetospheres both help, and it seems they are also mutually supportive.

Mars also does not have as robust a magnetosphere as the Earth. The current theory is that it once did, but a collision with another proto-planet disrupted its magnetic core. Mars has only pockets of magnetic field, and this is a problem because solar radiation and the solar wind have stripped away Mars's atmosphere over time. Terraforming Mars isn't as simple as cranking out greenhouse gases, as they may just get lost. Mars would need to be bombarded with comets and the like to replace the lost material.

That, or we build Mars a magnetosphere, but the issue of adding more atmosphere would still need to be addressed.

What amazes me most is the fact that this human-designed vehicle will still be on Mars when mankind has long since gone into oblivion. In such a dry climate Curiosity can easily survive 100,000 years or more. Kind of a witness to a long-gone species.

"Sir, from the symptoms you're describing it sounds like reinstalling your system's OS will fix the problem. We can do that for $99, but it's going to erase everything you had on there. We also require the Windows license key sticker that should be on the case somewhere. Oh, you have a remote backup? That'll save you some money then. So just bring the system in and we can get right to work on it."

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

Well, but at the same time, these companies' livelihoods will depend on success. They will have a much stronger motivation to make sure that nothing fails, because a failure could very well mean the end of the company, and potentially set the entire industry back many years or decades. While NASA has always been a bit of a whipping boy in the government, a failed mission doesn't risk the very existence of NASA itself.

But time will show how things go. We're likely still several years away from seeing much of anything beyond test flights from the private sector. If they're cutting corners, it'll likely become apparent very quickly.

Gosh, these systems must have to be robust to an unholy degree. I remember reading somewhere that NASA's software practices are extremely good and they were able to reprogram the Voyager probes remotely even after significant system failures.

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

The Chinese have been caught red-handed trying to steal these hardened parts. Not only are they extremely expensive, they are also tightly controlled. http://www.americaspace.com/?p=14543

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

Don't underestimate the motivation of money. Sure, NASA have had costly mistakes before, but they haven't had a CEO screaming down the line about profits. Money lost by NASA is a tragedy, considering what it is spent on and what can be learned. But losing profits? Someone would get shot.

What exactly do they mean by corrupted in this case, bits flipping or an actual physical failure? I would assume that the A-side computer at the very least has ZFS or some other system with data-redundancy and integrity checking, or would it lack the space for more than two drives?

Just interested what exactly the B-side is going to do; can it actually resolve the A-side's problems and get it running again, or does this mean they're now relying on the B-side to run the rover? It sounded like the B-side computer may not be exactly the same as the A-side, possibly not as powerful and really only for redundancy and nothing else?

While the Space.com link has a bit more information, I was kind of hoping for more of a description of what protections the rover already includes, e.g., error-correcting RAM, data integrity checks, etc. I would have thought that the worst thing to happen would be that the rover would just need to restart its OS and might lose some of the data it was preparing to send (if it couldn't be saved properly).

It's interesting news, but I normally expect more technical detail from Ars beyond "it has two computers and one of them is currently unhappy face". Even if there aren't enough details of what NASA is using it'd be nice to see some discussion of relevant technologies and why conditions on Mars may have caused them to fail or bypassed the protection they would offer.

What exactly do they mean by corrupted in this case, bits flipping or an actual physical failure? I would assume that the A-side computer at the very least has ZFS or some other system with data-redundancy and integrity checking, or would it lack the space for more than two drives?

ZFS is probably too complicated for them to feel it's appropriate under the circumstances. Adding redundancies outside of the filesystem itself isn't that difficult, but tracking down whether a problem is caused by a stray bit or an actual bug is important. If it's a bug you need to eliminate it before that version ends up on the backup computer and you end up with 2 untrustworthy computers.
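As a rough illustration of what "redundancy outside of the filesystem" could look like, here is a minimal Python sketch that keeps three copies of a record and repairs it with a bitwise majority vote. Nothing here is taken from the rover's actual software; the function names `majority_vote` and `disagreeing_copies` are hypothetical, and the three-copy scheme is simply one common way to do this.

```python
def majority_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Rebuild a record from three stored copies, bit by bit.

    Each output bit takes the value held by at least two of the three
    copies, so a flipped bit in any single copy is outvoted.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all three copies must have the same length")
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def disagreeing_copies(a: bytes, b: bytes, c: bytes) -> list:
    """Return the indexes (0, 1, 2) of copies that differ from the voted result."""
    voted = majority_vote(a, b, c)
    return [i for i, copy in enumerate((a, b, c)) if copy != voted]


# Illustrative data only: one copy has a single flipped bit.
original = b"\x10\x20\x30"
damaged = bytes([0x10 ^ 0x04, 0x20, 0x30])

repaired = majority_vote(original, damaged, original)
suspects = disagreeing_copies(original, damaged, original)
```

A single copy disagreeing with the other two is at least consistent with an isolated bit flip, whereas a software bug would more likely write the same bad data to every copy. That is only a heuristic, but it shows why this kind of bookkeeping can help with the stray-bit-versus-bug question.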

Sorry, I really meant any ZFS-like system. The quotes make it sound like they're gearing up the B-side computer to take control of the Rover but it's not clear if that merely means the B-side will start running diagnostics and/or sending back copies of the A-side's current files/memory/state so they can find out if the A-side can be repaired and attempt to do so, or if it will be taking over operation of the rover entirely.

Also not mentioned is how this switching went on: is there a third system in there for switching between A and B, or was B just sitting there with enough activity to take over when told to do so?

There's just not enough information to satisfy my own curiosity (sorry).

The Chinese have been caught red-handed trying to steal these hardened parts. Not only are they extremely expensive, they are also tightly controlled. http://www.americaspace.com/?p=14543

According to all the headlines about Chinese hacking activities, it seems that we are either really bad at keeping secrets or on some subconscious level want them to have our secrets. However, if they are that adept at stealing our secrets, why don't we avoid becoming the RIAA fighting torrents and just sell them what they want? It'd be more profitable anyway.

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

Well, but at the same time, these companies' livelihoods will depend on success. They will have a much stronger motivation to make sure that nothing fails, because a failure could very well mean the end of the company, and potentially set the entire industry back many years or decades. While NASA has always been a bit of a whipping boy in the government, a failed mission doesn't risk the very existence of NASA itself.

But time will show how things go. We're likely still several years away from seeing much of anything beyond test flights from the private sector. If they're cutting corners, it'll likely become apparent very quickly.

This line is brought up as if it's a foregone conclusion that a private company would never fail where a government entity might because 'money'. First, companies are often run by shortsighted, greedy people and the cost of bad design is not always collected on the same time scale that bonuses are doled out. Second, the people that work at NASA and design these systems are not immune from having their careers ruined just because NASA isn't subject to the same market dynamics as a private company. Third, we've seen that problems can occur in private aeronautics firms, e.g. Boeing's Dreamliner, even though 'money'. The people that choose to work at government labs are often taking a hit in earnings relative to what they would make with the same skills in private industry in order to work on things they feel are worthwhile or particularly interesting. To pretend that these people would be less than diligent with something they've dedicated years of their life to because they don't have to worry about their employer filing for bankruptcy is absurd.

When I hear about spaceflight in the private sector I can't help but to shudder a little. Does anyone think that a company like Delta is going to be conscientious enough to perform the kind of redundant multi-multi-level quality assurance necessary to provide some reasonable expectation of safety? (Even NASA has had its lapses in judgement, putting political expediency ahead of engineering, and it paid with the loss of two shuttle crews.)

You think this because Delta doesn't currently buy equipment with fail safe redundancies built-in? Hell, even with the Dreamliner problems, the fail safe systems worked and no one was hurt. If space travel ends up working like air travel, it will be safer than the car ride to the space port. But don't let facts get in the way of your carefully considered anti-corporationism.

ZFS is probably too complicated for them to feel it's appropriate under the circumstances.

ZFS is simply the best-known filesystem that includes block-level checksums. I don't believe anyone meant ZFS specifically, but rather any software that would verify data on the flash before loading it, so that a flipped bit would be apparent. The added software and computational complexity to do so could be absolutely minimal.
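To give a sense of how small "verify before loading" can be, here is a minimal Python sketch that stores a CRC32 alongside each record and refuses to load a record whose checksum no longer matches. This is only an illustration under my own assumptions: the helper names (`seal`, `load_verified`, `flip_bit`) and the record layout are made up, and nothing is known here about what the rover's flight software actually does.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 to the payload before it is stored."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def load_verified(stored: bytes) -> bytes:
    """Return the payload if its CRC32 matches, otherwise raise ValueError."""
    if len(stored) < 4:
        raise ValueError("record too short to hold a checksum")
    payload, recorded = stored[:-4], int.from_bytes(stored[-4:], "big")
    if zlib.crc32(payload) != recorded:
        raise ValueError("checksum mismatch: record is corrupted")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit upset by flipping one bit of the stored record."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Illustrative payload only.
record = seal(b"drive to waypoint 7")
clean = load_verified(record)

try:
    load_verified(flip_bit(record, 5))
    detected = False
except ValueError:
    detected = True
```

A checksum like this only detects the corruption; it cannot repair it. Recovery would still need a second copy or an error-correcting code, which is where a backup computer comes in.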