Yeah. I'm always impressed by the amount of foresight that must go into designing the hardware and firmware on the ground, as well as the ingenuity in making it do things in space that they never thought of in advance.

When it ends up out of control, pointing the wrong way, half-frozen, with no power and limited communications, there's enough flexibility and fail-safe behavior designed into the firmware that they can remotely unbrick it from Earth. That's impressive; I have enough trouble unbricking the embedded systems I'm playing with on my desk.

Typically, during the various phases of spacecraft design, there will be some engineers who don't concern themselves with how things work. Instead, they concern themselves with how things fail, and they implement design changes to mitigate both the risk and the impact of those failures. It can be a slow and expensive process, though; developing a system that works is just the first step.

It's been a long time since I've been in that world, so I don't know what the current state of the art is. In my day, it was all manual, and for hardware, component by component.

Interesting anecdote: I was working on a complex timer module, which used crystal oscillators. The vendor I used asked me what kind of fill gas I wanted. Knowing nothing about crystals, I asked about options, and he explained the various typical fill gases: dry nitrogen, helium, etc. I asked what would happen if the seal failed in a vacuum. He told me the frequency would shift significantly. I reasoned that a slow leak might never be noticed at ambient (diffusing very slowly with ambient air). I asked him if they could make crystals with evacuated enclosures. He said they could. I reasoned that with those, a leak at ground level would show up quickly, and in orbit would be benign. I spec'ed evacuated-enclosure crystals for the module and explained my reasoning to the prime contractor. They thought it was such a good idea that they might make it the spec system-wide.

TL;DR My naive questions as a junior engineer may have changed an aerospace standard.

Just graduated with a degree in electrical engineering, slow market, took what I could get. It wasn't what I really wanted, and I left after five years. It was pretty dull, high overhead for design, lots of paperwork, very conservative because of the high consequences for failure. I got into data communications, and that was way more fun.

I did come out with a bunch of souvenirs, like an operations manual for an Agena vehicle (https://en.wikipedia.org/wiki/RM-81_Agena) and a tetrahedral beryllium satellite frame, among others. Never did see a launch.

When I started in security software, I made it my unofficial job to ask every dumb question I could think of. And that's how you bring value to a company: by asking the questions no one has bothered to formulate an answer to for a long time.

SOHO broke 1 year after its primary mission, not 8 years. It was still a priority then, and a lot of resources were put into its recovery. It also cost several times more than Deep Impact in the first place. If there had been more resources to run Deep Impact, this would have been caught on the simulator and would never have happened on the real spacecraft in the first place.

"Basically, it was a Y2K problem, where some software didn't roll over the calendar date correctly," said A'Hearn. The spacecraft's fault-protection software (ironically enough) would have misread any date after August 11, 2013, he said, triggering an endless series of computer reboots aboard Deep Impact.

What's significant about August 11, 2013 with regard to computer date data?

Ah, but it's hours (first), minutes (second), and seconds (third). Positionally, the one after the seconds would be fourth. I don't believe that "third" as a third base-60 decimal place ever made it into the English language.

Hour, minute, second nomenclature is used outside of timekeeping. Cartographers divide the planet into hours; this is the zeroth division, based on cosmological fact, not arbitrary divisions. Then they divide by sixty to get a minute portion. This is the first division by sixty. Rather than create a word meaning a minute's minute, we use "second".

When referring to even more minute values, we can extend the thinking and call them thirds, or we can throw the whole thing out and use a modern measure, like defining the second by how long it takes some physical process to happen, then moving the decimal point around. Yay, metric.

I'm a CS noob in uni and I don't understand a thing you said or how you came to that conclusion. Is there anything I can read regularly to help me understand stuff like this in this subreddit? Everyone else seems to have an okay idea of what you typed :(

Unix time is represented in seconds; that's just an arbitrary convention you'll pick up on if you spend any time using *nix-based operating systems. What they're doing here is calculating the number of seconds that have passed since January 1st, 2000 at 0:00.

(unix "20130811" - unix "20000101")*10

Multiply by 10 to get the number of tenths of a second that have passed since January 1st, 2000.

logBase 2 $ (unix "20130811" - unix "20000101")*10

I don't recognize this specific syntax, but it has "logBase 2" in it, so I assume it's taking the log (base 2, of course) of the result; that will tell you how many bits of storage you need to write the number in memory.

31.999992174768167

That number looks like it's on the verge of rolling over to 32 bits of memory required. It's very common for integers to be represented as 32-bit data structures, which means that if you use up all 32 bits you "overflow" and go back to 0.

Integer overflow is a very common issue, so regardless of anything else the poster wrote in that script, you can look at the 31.999, assume it's an integer overflow problem, and nod sagely to yourself about how you totally wouldn't have made the same mistake if you were working for NASA.

As for how he came to the conclusion that NASA was counting tenths of a second? A Y2K-style issue is generally one caused by time overflowing its memory storage. He probably just guessed by trying different time increments and seeing how much memory each would require to store. We don't know for sure that's what happened, but it looks very damning when pointed out that way.
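If you'd like to reproduce the arithmetic without Haskell, here's a rough C sketch of the same calculation (the unix calls above are pseudocode; timegm() below is a widely available but non-standard libc extension, so treat this as a sketch rather than a portable program):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct tm t2000 = { .tm_year = 100, .tm_mon = 0, .tm_mday = 1 };  /* 2000-01-01 */
        struct tm t2013 = { .tm_year = 113, .tm_mon = 7, .tm_mday = 11 }; /* 2013-08-11 */

        double seconds = difftime(timegm(&t2013), timegm(&t2000));
        double tenths  = seconds * 10.0;   /* tenths of a second since 2000-01-01 */

        printf("tenths of a second: %.0f\n", tenths);     /* ~4.29e9    */
        printf("bits needed:        %f\n", log2(tenths)); /* ~31.99999  */
        return 0;
    }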

I don't recognize this specific syntax, but it has the word "logBase 2" in it

It's Haskell. The $ operator means function application but with a lower precedence. In Haskell function application is implicit (no parentheses, no commas) but it has high binding power, so you use $ to avoid parenthesizing sub-expression arguments.

That's because function application has higher precedence than any other infix operator. ($) has precedence zero and is right-associative, so you can sort of think of it as wrapping parentheses around everything to the right of it.

Implicit function application means when you write something like foo bar, Haskell calls the foo function with bar as an argument (you don't have to write foo(bar) or anything similar). It also curries functions automatically, so foo bar baz means both "call foo on bar, and then call the result of that on baz" and also "call foo on both bar and baz," since those two are considered equivalent if foo takes at least two arguments.

The dollar sign means the same as no operator, but has a low precedence, in much the same way as + has a lower precedence than * in most languages (i.e. a + b * c is equivalent to a + (b * c)).

Most programming jobs don't require you to know this. Knowing it makes you a better programmer, though, especially knowing about 32-bit overflows and being able to see and predict them in source code without having to debug, but in most cases you don't need it.

It really depends on the context of the situation. Single variable that will reside in memory only in 1 location at a time ever? Kick it to 64 bits no problem; it's just 4 extra bytes of RAM. SQL table that will have 300 million rows, and will have hundreds of simultaneous queries running against it at any given time? Might want to figure out another way.
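For a rough sense of scale (my own arithmetic, not the parent's): widening one column from 32 to 64 bits across 300 million rows costs about 300,000,000 × 4 bytes ≈ 1.2 GB of extra storage before indexes, and every query that scans that column has to move that much more data.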

Yes, increasing the bit width for our variable (usually by doubling) is one way to avoid this, but if you KNOW you will never need more than, say, 33 bits on custom hardware, why waste another 31 wires on the PCB to accommodate?

The thing with log base 2 is just a trick for getting the number of binary digits from a decimal number. For example:
42 in decimal is 101010 in binary
logbase2 of 42 = 5.39231742278
Round up to get 6, and that's the same number of digits in 101010
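If you'd rather avoid floating-point rounding entirely, a plain integer loop does the same job. A tiny C sketch (my own illustration; note the edge case that for exact powers of two you want floor(log2(n)) + 1, not a straight round-up):

    #include <stdio.h>

    /* Count how many binary digits are needed to write n (n > 0). */
    static unsigned bits_needed(unsigned long long n) {
        unsigned bits = 0;
        while (n > 0) {   /* shift right until nothing is left */
            n >>= 1;
            bits++;
        }
        return bits;
    }

    int main(void) {
        printf("%u\n", bits_needed(42));            /* 6  -> 101010 */
        printf("%u\n", bits_needed(4294944000ULL)); /* 32 -> the tenths-of-a-second count above */
        return 0;
    }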

Ah thank you! That was exactly what I was looking for!

I do have the book "Code" by Charles Petzold which I may need to break open a bit more, he has some good chapters on memory.

Put another way, if you represented time as the number of tenths of a second since midnight on January 1, 2000, then you would hit 4294967296 tenths of a second on August 11, 2013. 4294967296 is significant because it's 2^32, which is the smallest number that can't be represented as an unsigned 32-bit integer. Generally this will wrap around to 0 (as in, calculating 4294967295 + 1 will give you 0).
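A two-line demonstration of that wraparound in C (unsigned 32-bit arithmetic is defined to wrap modulo 2^32, unlike signed overflow):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t t = 4294967295u;  /* 2^32 - 1, the largest 32-bit unsigned value */
        t = t + 1;                 /* wraps around to 0 */
        printf("%u\n", t);         /* prints 0 */
        return 0;
    }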

I would argue that Python with pylab is a bit better suited, but it's only because the libraries are more mature, and the duck typing makes things a bit easier when you're only worried about on-the-fly calculations.

Or, you know, use the dedicated libraries that have been written and standardised against in every language under the sun, specifically to avoid communication and overflow errors.

The first thing they need to teach people after teaching them to represent time, write a container, or write a security module, is to never ever do it again, unless they want to make that their full-time job as part of a solely dedicated team.

That's being too generous to goto. As Dijkstra argued, the problem with goto is that it is a very general, low-level tool, and it forces you to think about order of execution to understand program behaviour instead of being able to use more static, source-code-based reasoning. Additionally, saying that "the problem is with the programmer" is a "no true Scotsman" argument that can be used to defend even the most absurd techniques.

That said, I think it's kind of sad that people ended up demonizing goto instead of giving it a fairer treatment. Blindly getting rid of gotos often means needing to add extra flag variables or other sources of noise to the program. Additionally, goto can be used to build some things yourself if the language doesn't give you support for them: exceptions, state machines, breaking out of loops, ...

That said, I think it's kind of sad that people ended up demonizing goto instead of giving it a fairer treatment. Blindly getting rid of gotos often means needing to add extra flag variables or other sources of noise to the program. Additionally, goto can be used to build some things yourself if the language doesn't give you support for them: exceptions, state machines, breaking out of loops, ...

But that's Dijkstra's entire point. Instead of gotos, higher-level abstractions have been invented to solve those same problems in, at least theoretically, more expressive, less error-prone ways. We have try-catch blocks, break and continue, switch statements, etc.

Of course, in languages where those primitives aren't available, as you point out, goto is darn handy (if I ever use it these days it's to simulate exceptions in C), and god knows those primitives aren't flawless. But discouraging the use of goto and instead focusing on developing better control flow primitives was, IMO, the right thing to do.

There's been one use of goto I've seen where I think it's "ok" (and one where I'd like to know the reasoning). The "like to know the reasoning" one was a very quick implementation of a state-machine-like algorithm that parsed something like an IP address character by character: if you read a ":" you would go to a COLON_FOUND label, which would then increment num_colons, or something along those lines (it's been quite a while). I'm sure it was done as an optimization (and it was really old code), but it was quite hard to read and I'm sure really difficult to debug.

The "ok" instance was an auto-bail mechanism: there was only one label per function, and it was ALWAYS at the end; you never used goto to jump up the function. Basically, BAIL_ON_FAILURE([expression that returned an HRESULT]) would jump to a CLEANUP label at the end of the function, which would clean up any state needed if the function failed in any way. Not that I think it's the best way to do this, but it was kind of neat and the first "not horrible" use of goto I'd seen.
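A minimal sketch of that single-cleanup-label pattern in C (the names, including BAIL_ON_FAILURE itself, are my own illustration, not the original code):

    #include <stdlib.h>

    /* Hypothetical macro: on a non-zero ("failed") result, jump to the cleanup label. */
    #define BAIL_ON_FAILURE(expr) do { hr = (expr); if (hr != 0) goto CLEANUP; } while (0)

    static int open_sensor(void)    { return 0; }  /* stand-ins for real HRESULT-style calls */
    static int start_sampling(void) { return 0; }

    int run_capture(void) {
        int hr = 0;
        char *buffer = malloc(4096);
        if (buffer == NULL) { hr = -1; goto CLEANUP; }

        BAIL_ON_FAILURE(open_sensor());
        BAIL_ON_FAILURE(start_sampling());

        /* ... normal work ... */

    CLEANUP:  /* the only label, always at the end; never jumped to from below it */
        free(buffer);
        return hr;
    }

    int main(void) { return run_capture(); }

The appeal is that every early exit flows through the same cleanup code, so resources can't leak just because one error path forgot to free them.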

Today we have a pretty good understanding of what a (low-level|high-level|scripting) language needs. C definitely needs goto in some constructs and cases for the most elegant/understandable code, but newer low-level languages like Rust and Go wouldn't necessarily need it (I think Go has it, but Rust doesn't).

I can pretty much guarantee there is now at least one person, probably closer to 10, in NASA whose sole job is to unfuck the time code, and they are probably going to do it by pulling up some implementation of ctime.h and testing the crap out of it.

Or, you know, use the dedicated libraries that have been written and standardised against in every language under the sun, specifically to avoid communication and overflow errors.

Try embedded development, where you might only have 1 MB of program memory. I doubt they put a full-on x86 computer on the spacecraft, or if they did, it probably wasn't doing everything. Microcontrollers/microprocessors were more than likely used, and it's a bit different from just writing all the bloat you want on a desktop.

The general point about embedded systems development is valid, but NASA's unmanned spacecraft use the RAD750 microprocessor, which is a radiation-hardened "PowerPC G3," and the VxWorks OS. A competent Mac OS X developer from a few years back could develop for it (and several did; it's where Clozure Common Lisp came from, for example).

It's been a long time since I was designing MCUs, but terrestrial industry was moving pretty heavily from 8-bit into 16- and 32-bit even at the mid-to-low end almost two decades ago. I know there are all sorts of temperature- and radiation-hardening things to do ... but I'd be surprised if they were limited to 1 MB of program memory. Deep Impact didn't launch that long ago.

Of course, this is based on old data and rampant speculation ... so I could be wrong. Like ... really wrong.

It is true that memory utilization is still a big concern in embedded systems.

Well, sure. I certainly don't use much in terms of paging or any sort of virtual memory (even if a TLB is a supported module on the CPU) in embedded systems. I usually don't even use any sort of threading. Lots of events and async I/O, running the loop manually.

I guess the point is that the last time I looked, power-issues aside (and they do exist) ... a satellite is pretty darn big and can hold a lot of modern electronics. Maybe shielding is an issue with small transistor geometries?

Or, you know, use the dedicated libraries that have been written and standardised against in every language under the sun, specifically to avoid communication and overflow errors.

It's not that simple. Time was when NASA or its contractors would build every little part of a space mission themselves, down to the nuts and bolts. Today, they are put together using as many off-the-shelf subsystems as possible (that's how we are able to put up these missions for a few hundred million, rather than a few tens of billions).

That's when the trouble creeps in - some component has been tested in various scenarios (and I'm sure this was very thoroughly tested before 2013 :-)), but unless the team putting together the mission personally examines every line of code inside every component sourced from the outside (something which may not even be possible), you'll get things like this that slide past.

Depends when they started counting from and how precise the clock(s) were.

The issue may be "like the Y2K problem," but it's probably more akin to the end of the Unix epoch, which is when a 32-bit int will no longer hold the number of seconds since Jan 1, 1970.

Based on that theory, I did a little math (thanks, Wolfram Alpha and Google calculator) and figured out that a 48-bit int will hold 2.8147498e+14.

2.8×10^14 microseconds before August 11th, 2013 is around September 4th, 2004. Deep Impact was launched in January of 2005, so this fits with the craft being initialized in a lab, assembled, packaged, and sent to the launch pad in ~3 months.

Let's try this one on for size (I don't think the issue has been disclosed yet, so this is just speculation...)

How many seconds since Jan 1, 2000? Approximately 4.295 × 10^8. Let's suppose the OS uses a 100 ms tick rate. How many ticks would that be... OK, 4.295 × 10^9. That number is basically 2^32.

CPU is a 32-bit architecture. So if the tick count is stored as the number of ticks since its own epoch (Jan 1 2000) in a uint32_t, we basically have it overflowing & wrapping around right around August 11, 2013.

I've debugged something just like this, when lots of very expensive high-availability equipment started resetting in the field. Turns out that with a 5 ms tick, you only get about 250 days before your 32-bit counter rolls over.

High-availability equipment (I'm talking about embedded systems, not servers or data-processing equipment) should be able to stay up that long without even breathing hard. And it did, until the tick-count bug surfaced...

Not hard to see how it wasn't found during test. This should have been caught during design/code review.
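A quick back-of-the-envelope check of both tick rates (my own sketch, assuming an unsigned 32-bit tick counter):

    #include <stdio.h>

    int main(void) {
        const double wrap = 4294967296.0;  /* 2^32 ticks: one full wrap of a uint32_t counter */

        /* Deep Impact theory: 100 ms (0.1 s) per tick */
        printf("100 ms tick: %.1f years\n", wrap * 0.1 / 86400.0 / 365.25);  /* ~13.6 years */

        /* The high-availability gear: 5 ms per tick */
        printf("5 ms tick:   %.1f days\n", wrap * 0.005 / 86400.0);          /* ~248.6 days */
        return 0;
    }

Counting 100 ms ticks from Jan 1, 2000, 13.6 years lands you in August 2013, which lines up with the article's date.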

And if the mission had had more funding, this would have been caught on the simulator (which had the same problem). But as it is, the spacecraft did its original mission, then an extended mission, and then kept observing on very limited resources even though NASA did not extend it for a third mission.

The CPU on Deep Impact is a RAD750 (http://en.wikipedia.org/wiki/RAD750), but I'm sure the clock is a separate chip. I note several other functional spacecraft on the list of deployments after Deep Impact. I hope they don't have the same problem or, if they do, that it can be fixed in software before they have issues.

I still don't understand why you think it's a hardware problem, when they admitted it is known to be a software problem. The hardware behaved correctly, and rolled over to 0, but the software didn't like the new value, so the whole thing crashed.

"People of Earth, we are greatily sorry, but we seemed to have made a small bug in the software that reads the date of the message. We would fix it next time, only the problem is, there won't be. Sincerely, NASA."

Take this with a grain of salt; I'm just making educated guesses based on the article.

They do, but it seems the bug was in the fault-recovery software, at least according to the article. My guess is that once the fault-recovery software started seeing an overflowing date, it thought something had crashed and started a system reboot. Thing is, after the reboot the time probably wasn't reset, so the bug was still there, and we get a cycle of reboots.

Without the system being fully online, there's no way it's going to be able to receive anything from Earth, let alone fixed code. Remember that something on-board has to aim the antenna at earth, receive and process the signal, and do the reprogramming. Most likely, the whole process requires the CPU. According to the article they did keep sending such "reprogramming signals" to it but it never answered, probably because of the constant reboots.

Then, you have to remember that during a system reboot the control systems are offline, and so the solar panels get misaligned from the sun and stop generating enough power. Once the batteries are emptied, the system shuts down completely and any hope of recovery is lost.

The Curiosity rover, for example, had a glitch in its software, but NASA was able to fix it remotely by reprogramming it, since the redundancies worked: it automatically switched to a different CPU with a different code chip that could manage the antenna/reception/reprogramming. Thing is, if the glitch had also been a date overflow, then the redundant CPU would most likely have failed as well, since the date wouldn't be reset.

The spacecraft accomplished its primary mission a few months after launch. It then did an extended mission for several years after that. And then NASA did not renew it for a third mission, although the spacecraft kept observing on very limited funds. So if it had been on its way to destroy a doomsday comet, that comet would be dead and then some.

Considering it's lost power, all you can hope for is for the CMOS battery to drain and the probe's orbit to shine some light on it, so it boots up again with a cleared date. That is, if the date wasn't written to any non-volatile memory.

No - it's been without attitude control for too long. Even if it got power on itself and the hardware wasn't ruined from the cold/fried from the sun, there are still other factors that might prevent it from being able to communicate with Earth.

Right. IF the probe ever wakes up, it'll have no way of knowing where it is or where to point any sensors or antennas. There's no way NASA can justify spending more resources on this project hoping blind luck will bring it back to life.

Actually, in the miracle case that the batteries don't get destroyed and there's enough fuel and the hardware is alive, it could actually get picked up again. If it successfully went into safing mode, and we were still looking out for it, it could be recovered. There are safing procedures designed to allow us to communicate with it even if it has no idea where it is.

This wasn't a spacecraft that was supposed to last for decades. Its primary mission was finished within a few months of launch. I am not sure that any Discovery mission (the cheapest class of space exploration mission for NASA) has used RTGs.

Maybe I'm missing something... But why does the probe need to know the date at all? Seems like it would be smarter to just tack on a time stamp here on Earth whenever we received data. Adjustments could be made to account for the distance of the probe.

Even with that, the computer doesn't need to know the exact date. Why not set the computer's clock to think it's January 1st, 1970? Then the time just starts at 0 and you don't have to deal with any counting problems. You'd still have a relative time stamp and would just add the time difference when the data is received on Earth.

Because nobody thought this issue would exist. There was supposed to be a 64-bit variable counting the time, but somewhere along the line it was converted to 32-bit, which caused the problem. They did try to reset the computer (the reset switch does not depend on the main flight software), but by that point a number of other possible problems had probably happened (there are estimates for the likelihood of each of these) that prevented this from working, anything from having no power, to batteries exploding, to circuits breaking from too-low temperatures because the heater wasn't working.
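A minimal sketch of how that kind of silent 64-to-32-bit truncation can happen in C (the names and values are my own illustration, not the actual flight code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Intended design: a 64-bit count of tenths of a second since the epoch. */
        uint64_t tenths64 = 4294967296ULL + 1000;  /* a little past 2^32 */

        /* Somewhere along the line the value lands in a 32-bit variable.
         * Many compilers only warn about this implicit narrowing, if they warn at all. */
        uint32_t tenths32 = tenths64;  /* silently keeps only the low 32 bits */

        printf("64-bit: %llu\n", (unsigned long long)tenths64);  /* 4294968296 */
        printf("32-bit: %u\n", tenths32);                        /* 1000       */
        return 0;
    }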

I think you're looking at it wrong. The question is not why they need to use a timestamp, but why a bug in the timestamp would cause the spacecraft to get lost. Endless reboots over an incorrect time doesn't sound like a great idea, but what do I know.

Watchdogs don't run (shouldn't run) on timestamps, they (should) run on relative counters only. Even if the period is days, the watchdog has no need to know what actual day it is on the Gregorian calendar.
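For what it's worth, the usual wraparound-safe idiom for that kind of relative check in C looks like this (my own sketch, not anything from the article): as long as the real interval fits in 32 bits, unsigned subtraction gives the right elapsed count even after the tick counter wraps past zero.

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for a free-running hardware tick counter that wraps at 2^32. */
    static uint32_t ticks_now(void) { return 42; }  /* pretend the counter has already wrapped */

    /* Wrap-safe: (now - started) is computed modulo 2^32, so it stays correct
     * across a rollover as long as the real interval is under 2^32 ticks.   */
    static int timed_out(uint32_t started, uint32_t timeout_ticks) {
        return (uint32_t)(ticks_now() - started) >= timeout_ticks;
    }

    int main(void) {
        uint32_t started = 4294967000u;            /* shortly before the counter wrapped */
        printf("%d\n", timed_out(started, 1000));  /* 0: only 338 ticks have elapsed     */
        printf("%d\n", timed_out(started, 100));   /* 1: more than 100 ticks have passed */
        return 0;
    }

Comparing against an absolute deadline (if (now >= deadline)) is the version that breaks at rollover.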

No, it was due to the fault protection code. That code has checks to decide when to reboot things based on some scripted actions. If the timestamp gets messed up, the fault protection goes haywire as well.

As you say, it's probably to calculate delays so as to account for the distance of the probe when executing commands from Earth. Sure, you could work around it, but then again you could also just allocate some extra bits to not have the problem in the first place.

They also mentioned that without the date it didn't point its solar panels in the correct direction for power. It probably has a way to calculate what orientation the panels need to be in depending on location and date.

It has sun sensors and uses those to orient, not the date. The problem with the solar panels is that if the computer restarts every time it starts up, it can't issue any commands like "orient panels toward the sun based on where the sun sensors say it is".

A ton of reasons. A very simple one is "shit, I haven't heard any pings from Earth in three days, time to go into safe mode since obviously something isn't working right". Or "I expected acknowledgement of my transmission within <x> minutes, and didn't receive it, so I'd better re-send it".

The way you know where everything IS in space is by knowing what date it is (you can use any accurate date system). You can set that clock once and let it tick for a million years, and as long as it's accurate you will know where all the planets, moons, comets, etc. are relative to one another, because they move in predictable patterns. If you completely lose the date, or it gets wildly skewed, then your little computer brain's model won't match reality and you will happily go off in the wrong direction. Now, I am not a NASA programmer; I am a more typical programmer who also studied some astronomy. But I am completely unsurprised that date wraparound would cause the probe to quickly lose its ability to communicate. It could have a directional antenna and be facing completely the wrong way, etc.

My speculation/blind guess: if they had some base mission plan that was part of the code (instead of controlling the probe remotely), then it's easier for humans to verify/read it when scheduled events are declared with human-readable dates.