LightSail Glitch: Hoping for a Reboot

by Paul Gilster on May 27, 2015

The Planetary Society’s LightSail won’t stay in orbit long once its sail deploys, a victim of inexorable atmospheric drag. But we’re all lucky that in un-deployed form — as a CubeSat — LightSail can maintain its orbit for about six months. Some of that extended period may be necessary given the problem the spacecraft has encountered: After returning a healthy stream of data packets over its first two days of operations, the solar sail mission has fallen silent.

Jason Davis continues his reporting on LightSail, with the latest update on the communications problem now online. We learn that the suspected culprit for LightSail’s silence is a simple software glitch. Everything else looked good when communications ceased, with power and temperature readings stable. Davis explains that during normal operations, LightSail transmits a telemetry beacon every 15 seconds. The Linux-based flight software writes data on each transmission to a .csv file, a spreadsheet-like record of ongoing procedures.

This file continues to grow, and when it reaches a certain size, trouble can happen:

As more beacons are transmitted, the file grows in size. When it reaches 32 megabytes—roughly the size of ten compressed music files—it can crash the flight system. The manufacturer of the avionics board corrected this glitch in later software revisions. But alas, LightSail’s software version doesn’t include the update.
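A quick calculation shows how fast a 15-second beacon cadence can fill a 32-megabyte file. It is written in Python. The post doesn't say how many bytes each .csv row takes, so the row sizes below are hypothetical, and "32 megabytes" is read as 32 MiB. Treat the numbers as an illustration of scale, not a reconstruction of what happened aboard LightSail.

```python
# Back-of-the-envelope: how long until an append-only beacon log hits a size cap?
# ASSUMPTION: row sizes are hypothetical; the post does not give the size of
# each .csv row LightSail wrote. "32 megabytes" is read here as 32 MiB.

BEACON_INTERVAL_S = 15            # one beacon every 15 seconds (from the post)
CAP_BYTES = 32 * 1024 * 1024      # assumed interpretation of "32 megabytes"
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_until_cap(bytes_per_entry: int,
                      interval_s: float = BEACON_INTERVAL_S,
                      cap_bytes: int = CAP_BYTES) -> float:
    """Time for a log that gains one row per beacon to reach cap_bytes."""
    if bytes_per_entry <= 0:
        raise ValueError("bytes_per_entry must be positive")
    rows_needed = cap_bytes / bytes_per_entry
    return rows_needed * interval_s


if __name__ == "__main__":
    print(f"beacons per day: {SECONDS_PER_DAY / BEACON_INTERVAL_S:.0f}")
    for row_size in (500, 1000, 3000):  # hypothetical bytes per row
        days = seconds_until_cap(row_size) / SECONDS_PER_DAY
        print(f"{row_size:>5} bytes/row -> cap reached after about {days:.1f} days")
```

At one beacon every 15 seconds the log gains 5,760 rows a day. With 1,000-byte rows, for example, the cap arrives in under six days. Larger rows get there proportionally sooner.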

Late Friday, the team received a heads-up warning them of the vulnerability. A fix was quickly devised to prevent the spacecraft from crashing, and it was scheduled to be uploaded during the next ground station pass. But before that happened, LightSail fell silent. The last data packet received from the spacecraft was May 22 at 21:31 UTC (5:31 p.m. EDT).

Let’s hope we’ll still see a deployed LightSail. But anyone who has stared at a PC frozen into immobility knows the feeling that LightSail’s ground controllers must have experienced. The machine is not responding, which means it’s time for a reboot. With a manual reboot out of the question, the team has to rely on reboot commands sent from the ground, and more than one has already gone up. In fact, Cal Poly has been transmitting a new reboot command every few ground station passes. So far, no luck.

A fix may still come from a natural source. First, though, the situation produced a bit of humor: an email Davis received, which he shared on Twitter.

Davis also suggests a LightSail successor called BourbonSat, a flight spare that sits in each team member’s kitchen to offer quick stress relief. The humor is edgy, but that’s because we may now be relying on a hands-off fix: charged particles striking an electronic component in just the right way to cause a reboot. If that sounds extreme, be aware that the phenomenon is not unusual in CubeSats. In fact, Cal Poly’s experience suggests that most CubeSats reboot within their first three weeks of operations. Set against the 28-day sail deployment timeline, that means we might come out just fine.

What happens next depends upon when — and if — that reboot occurs, assuming the continued reboot commands from the ground are not effective. Various software fixes are being tested to see which could be inserted after contact is restored, so that the troublesome .csv file doesn’t cause further problems. Davis also says that when LightSail comes back online, the team will probably begin a manual sail deployment as soon as possible. Let’s make sure, in other words, that when we have a communicating spacecraft, we do what we sent it out there to do.
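One common way to keep an append-only log from growing without bound is size-based rotation. The Python sketch below is not the LightSail team's patch, which the post doesn't describe. The function name, file names and cap are hypothetical and were chosen only to show the idea.

```python
import os
import tempfile

# Size-capped log rotation, a minimal sketch. Function, file names, and cap
# are hypothetical; this is not LightSail's flight software.


def append_with_rotation(path: str, line: str, max_bytes: int, backups: int = 1) -> None:
    """Append one line to `path`, rotating first if the write would exceed max_bytes.

    Rotation shifts path -> path.1 -> path.2 ... keeping at most `backups` old
    copies; the oldest is overwritten, so total disk use stays bounded.
    A single line longer than max_bytes is still written to a fresh file.
    """
    if backups < 0:
        raise ValueError("backups must be >= 0")
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    current = os.path.getsize(path) if os.path.exists(path) else 0
    if current > 0 and current + len(data) > max_bytes:
        if backups == 0:
            os.remove(path)  # keep no history: start a fresh file
        # Shift older copies up by one; os.replace overwrites the destination.
        for i in range(backups, 0, -1):
            src = path if i == 1 else f"{path}.{i - 1}"
            if os.path.exists(src):
                os.replace(src, f"{path}.{i}")
    with open(path, "ab") as f:
        f.write(data)


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log_path = os.path.join(log_dir, "beacons.csv")
    for n in range(1000):
        append_with_rotation(log_path, f"{n},beacon", max_bytes=4096, backups=2)
    print(os.path.getsize(log_path), sorted(os.listdir(log_dir)))
```

If a fixed number of backups is kept, total disk use stays at roughly (backups + 1) times max_bytes. The trade-off is that the oldest beacon history is thrown away. In practice the cap would be set far above the size of a single row.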

Comments on this entry are closed.

beermotor, May 27, 2015, 8:49

Why in the world would they not send up two CubeSats (or more) in the first place? Wasn’t that the whole point of a CubeSat to begin with (cheap, can be made redundant, etc.)? This just kinda seems like a mega derp tbqh.

Software glitch? Sounds more like a design flaw. Since as long as I can remember, even the cheapest embedded processors have had a hardware feature called a watchdog timer. Properly functioning software periodically resets the timer. If the software ever goes off into the weeds and fails to reset the timer, the timer times out and reboots the system. Waiting for a cosmic ray? That’s no way to do fail safe design. Why wasn’t a watchdog timer used on this system?
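The watchdog the commenter describes needs very little logic. The Python sketch below simulates one with integer "seconds" standing in for a clock. On real hardware the countdown runs in an independent peripheral and the reset is a hardware signal, so the class and function names here are illustrative only and don't belong to any particular board's API.

```python
# A software simulation of a watchdog timer, for illustration only. On real
# hardware the countdown runs in an independent peripheral and the reset is
# a hardware signal; here integer "seconds" stand in for a clock.


class Watchdog:
    def __init__(self, timeout_s: int, now: int = 0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s
        self.resets = 0

    def kick(self, now: int) -> None:
        """Healthy software calls this periodically to push the deadline out."""
        self.deadline = now + self.timeout_s

    def tick(self, now: int) -> bool:
        """The 'hardware' side: returns True if it forced a reset at `now`."""
        if now >= self.deadline:
            self.resets += 1
            self.deadline = now + self.timeout_s  # rebooted system gets a fresh window
            return True
        return False


def simulate(hang_at: int, duration: int, timeout_s: int = 10) -> list[int]:
    """Run a main loop that wedges once at `hang_at`; return the reset times."""
    wd = Watchdog(timeout_s)
    hung = False
    reset_times: list[int] = []
    for t in range(duration):      # one iteration per simulated second
        if t == hang_at:
            hung = True            # the software stops making progress
        if not hung:
            wd.kick(t)             # only a healthy loop pets the watchdog
        if wd.tick(t):
            reset_times.append(t)
            hung = False           # the forced reboot clears the wedged state
    return reset_times


if __name__ == "__main__":
    # prints [39]: ten seconds after the last kick at t=29
    print(simulate(hang_at=30, duration=100))
```

The key design point is that the reset needs no help from the software that has failed. Healthy code only has to keep pushing the deadline out. Code that has hung stops doing so, and the reset then happens by default.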

This incident especially caught my eye. I’m a faculty member at Vermont Technical College currently working with a small group of students writing CubeSat software. We successfully launched our first CubeSat in November of 2013, and are now working on a general purpose software infrastructure for future CubeSat missions.

We are using SPARK, a dialect of Ada that allows for the formal verification of programs. We aim to produce software that we can prove mathematically is “uncrashable.” With SPARK this is a realistic goal.

If Lightsail-1 cannot be rebooted, is there any critical information needed that could delay their second launch next year, or is this a shakedown that identifies potential failures only? In this case it looks like the software failed and they may not be able to test the deployment.

@Peter Chapin
Good to see that some groups are using Ada to develop more robust code. The extra advantages of using Spark are worthwhile too. How easy has it been to get/find good SW engineers who know the language?

…Shades of Telstar-1’s problem from the Project Starfish nuclear detonation radiation, which shortened its life; the engineers had to periodically “rest” the transistors and resort to “notched zeroes” to send commands to the satellite. Also:

Perhaps LightSail-A’s mission controllers won’t have to wait for a fortuitous Cosmic Ray. If the LightSail-A spacecraft were illuminated with microwaves from a Deep Space Network station (or from a radar astronomy dish), that might induce enough of an electrical charge in the spacecraft “bus” to have the same effect as a Cosmic Ray. (Had the original Cosmos-1 solar sail succeeded, there was a plan to test microwave sailing with it, by “beaming” it from the Goldstone DSN station; maybe that plan could “snatch their fat out of the fire” now?)

Unfortunately, we learn NOTHING from this incident! Watchdog timers and log file management are old, old news in computer systems. Nobody needed to pay launch costs or lose a mission to understand this failure mode.

It makes one wonder what other foreseeable design flaws exist in this device.

I have been working in software development and IT for twenty years now. It has been increasingly frustrating to watch people repeat the same basic mistakes over and over again.

Mark, you are right: there is nothing new in the kinds of errors we are discussing. Unfortunately, these fundamental lessons need to be taught over and over again.

I was surprised to read Peter Chapin’s comments about teaching Ada and verifiable programs. Uncrashable software is a good idea, but would it have really helped here? The defects are more system design errors rather than software defects.

@Matt: “Uncrashable software is a good idea, but would it have really helped here? The defects are more system design errors rather than software defects.”

Since the problem in this case caused a crash, the verification process could have potentially exposed the issue. Correcting it could then have pointed the way to the fundamental design flaw. The developer reasons like this: “Why is a crash possible here? It’s possible because I might overflow such-and-such a value. How could that value overflow? Because I’m not managing my log files properly.”

@Mark S: “Provable programs are dependent on specifications for their behaviour. The proofs are about what you intended, not that what you intended was correct.”

There is no doubt that proving freedom from runtime error is only one step in a larger process. Ideally formal analysis could also be used to prove that the program implements the design and that the design implements the requirements. That is a tall order, and even this leaves open the problem of incomplete or incorrect requirements.

In most cases, formal methods can’t yet reasonably be used for every stage of the software development life cycle (although the technology is constantly improving). Using formal methods therefore won’t produce a program that is absolutely guaranteed to be correct. They only increase confidence in a program’s correctness. However, the sad reality is that many programs have errors that could easily have been avoided with formal methods that are feasible today. What is needed is software engineers who understand how to use the technology already available.

Although this is great news, does anyone else find it worrying that cosmic rays can cause reboots? I mean, what if we want this thing to go to Jupiter! It also points to a problem with radiation effects on small electronic packages, especially nanotech.

I don’t doubt that formal methods are a good idea. I wish we could see more of that. I despair somewhat in a world where “devops” and sloppy design practices are often considered a virtue… :(

I don’t “worry” about cosmic rays. That cosmic rays cause computer errors is also old news. You design your systems so that they recover properly from incorrect operation caused by cosmic rays. If the computer reboots, the flight software just has to recover and carry on with whatever it was doing, or call home for instructions. Reboots are not the only issue, though; many other kinds of errors can also be recovered from by forcing a reboot once they are detected.
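To "recover and continue" after a reboot, the software generally has to store its progress somewhere that survives a reset. Below is a minimal Python sketch built on that assumption. The phase names and state-file name are hypothetical. The write-then-rename pattern is one common choice, and the post does not attribute it to LightSail.

```python
import json
import os
import tempfile

# Persist-and-resume after a reboot, a minimal sketch. Phase names and the
# state-file path are hypothetical, not taken from LightSail.

PHASES = ["detumble", "checkout", "sail_deploy", "imaging", "done"]


def save_state(path: str, phase: str) -> None:
    """Persist the current phase: write a temp file, fsync, then rename over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"phase": phase}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on POSIX: readers see old or new, never half a file


def load_state(path: str) -> str:
    """Return the committed phase, falling back to the first phase if state is missing or corrupt."""
    try:
        with open(path) as f:
            state = json.load(f)
    except (OSError, ValueError):  # missing file, unreadable file, or bad JSON
        return PHASES[0]
    phase = state.get("phase") if isinstance(state, dict) else None
    return phase if phase in PHASES else PHASES[0]


def advance(path: str) -> str:
    """Mark the current phase complete and commit the next one."""
    current = load_state(path)
    nxt = PHASES[min(PHASES.index(current) + 1, len(PHASES) - 1)]
    save_state(path, nxt)
    return nxt


if __name__ == "__main__":
    state_path = os.path.join(tempfile.mkdtemp(), "state.json")
    advance(state_path)            # detumble -> checkout
    advance(state_path)            # checkout -> sail_deploy
    # Simulated reboot: nothing in memory survives, so re-read from disk.
    print(load_state(state_path))  # prints sail_deploy
```

If the state file is ever corrupted, falling back to the first phase is the conservative choice. A real mission might prefer to hold position and call home for instructions.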

The fact that they used a Linux board for a flight computer suggests they were more concerned about low cost than high reliability.

This is a very low standard of engineering. However this is a low cost test flight, so it may be adequate. Remember the purpose of this flight is to verify that they can deploy the sail. I assume this is also a learning opportunity for many of the people involved.
