The major guidelines in such a case
Think first, then think again and think again before you act (and keep on breathing calmly)
By the way: trust the design; we made it to withstand space hazards
  Each unit watches over itself
  All have been tested extensively
  HIFI has been in this state now for 20+ hours; a few more can't hurt
Something is not working: any action will have consequences
  Be very careful and do one step at a time
  Get all experience on board
  First agree before you make a change... and then think three more times before you do the change
Do not let yourself be hurried
  ...even if the scientists want us back fast!
  ...and even (certainly?) if the bosses want us back fast
SRON, we have a problem, CSI in space - NVR - 6/11/2013

The history of things getting back on-line
The LCU anomaly happens: T22:43:00Z
We switch off components to keep the instrument safe
  LCU switch-off: T15:20:00Z
  HIFI switch-off: T15:15:00Z
An initial evaluation of LCU power consumption
  A DC/DC converter is broken... but maybe not?
A HIFI restart is tried: T15:30:00Z
  Still nothing works; start detailed analysis
Reconstruction of the timing: crucial for the understanding
Key features/symptoms of the anomaly
The sequence of events is understood
  We can cure it, but only for the redundant system, so let's be absolutely sure!

The main components in the play
What is still working
  The controller: ICU
  The backends: HRS/WBS
  The focal plane unit controller: FCU
...and what is not?
  The LO controller: unknown
  The LO source unit: unknown
  The LO multiplier chain: unknown
  The mixers: unknown
But are they broken?
  Detailed inspection of signals: voltages, currents, temperatures, status words, etc.

Initial evaluation: LCU power consumption
After the LCU switch-off, lab tests were conducted on QM/IMD-3
  Identify power consumption at module level
  Only one DC/DC converter (HRS-4), with the LCU in standby state, could explain the observed low power consumption
Switch-on with a (simulated) failure in HRS-4 predicts:
  Communication would be restored
  Low current consumption would stay the same
  Unit would be found in standby
  Analogue HK of the LCU would show zero readings
With this knowledge a HIFI switch-on attempt was made on OD89 / T15:30:00Z

Timing reconstruction: zooming in
Therefore the LCU fault (it stopped talking) occurred between the following two times:
  (HL_frequency acquisition time)
  (HL_LSU_f main acquisition time)
This confines the time interval of the fault to 6 ms
  Connected to LCU command 0xF30ACC7A and/or 0xB33A
  ...but we don't know the exact time when the system crashed
Loss of communication is known within 6 ms w.r.t. other actions; what can we say about the physical state of the LO subsystem?
Inspection of the science data (= IF power) for more information
  IF power (indirectly) depends on LO power
  A drop of IF power tells something about LO power
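The bracketing argument above can be sketched in a few lines: the fault must lie between the last acquisition that still succeeded and the first one that failed. This is only an illustration; the function name and the timestamps are hypothetical, chosen so that the window comes out at the 6 ms quoted on the slide, and are not taken from HIFI telemetry.

```python
from datetime import datetime, timedelta

def fault_window(last_ok: datetime, first_bad: datetime) -> timedelta:
    """Width of the interval in which the fault must have occurred."""
    if first_bad < last_ok:
        raise ValueError("first_bad must not precede last_ok")
    return first_bad - last_ok

# Hypothetical timestamps (NOT real telemetry), 6 ms apart.
last_ok = datetime(2000, 1, 1, 0, 0, 0)
first_bad = last_ok + timedelta(milliseconds=6)

window = fault_window(last_ok, first_bad)
print(window // timedelta(milliseconds=1), "ms")
```

The sketch only bounds when communication stopped; as the slide notes, it says nothing about when the system actually crashed.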

Key features/symptoms
During the sequence of events eventually leading to the LCU failure, communication was lost, and only 1.6 s later the RF power was lost.
After the anomaly the LCU was found in a state with low power consumption, drawing 0.36 A primary supply current.
After the anomaly the LCU was found in a state where the standby relay had been switched.
The reconstruction of the anomaly shows that the failure scenario must be consistent with all of these key symptoms... so what is broken?

We still need a conspiracy
A power spike on the DC/DC converter can break a component
  Power comes from the spacecraft, and that is stable
Spacecraft power is monitored by the LCU for changes
  On an unexpected power drop, the LCU software activates the standby relay to disconnect from the power, just to be safe
  What if the LCU disconnects while spacecraft power is stable? Switch ripple with overshoot!
How can the LCU software think it needs to disconnect?
  Bad software
  A jump to boot, because the disconnect is also executed at boot, just to be safe
And what could the system be doing those first 1.6 seconds?
  Doing a drunkard's walk through its program memory!

Anomaly scenario
1. A single event upset corrupted the LCU memory.
2. The bit-flip brought the microcontroller into an incommunicado condition.
3. The microcontroller jumps to an erroneous program location and executes program code not meant for use in normal operation.
4. After 1.6 s of random code it arrives at program counter 0.
5. The boot sequence starts: the standby relay was switched and the unit went instantly from fully operational to standby.
6. The power drop generates a voltage transient on the internal 28 V bus, fatal for a secondary rectifier diode in the HRS4 DC/DC converter.
7. End result: instrument in standby
   no LCU-ICU communication
   LCU drawing ~0.36 A (nominal is 2.8 A)
   significantly decreased power dissipation: unit/panel temperature drop
Important note: NO other scenario explains all observed behavior
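To put the current figures in the end result into proportion, the short calculation below compares the post-anomaly current with the nominal one. The two currents are from the slide; the 28 V value used for the rough power estimate is an assumption (the slide mentions an internal 28 V bus but does not state the primary supply voltage), and the variable names are illustrative.

```python
NOMINAL_CURRENT_A = 2.8       # nominal LCU current (from the slide)
ANOMALY_CURRENT_A = 0.36      # current drawn after the anomaly (from the slide)
ASSUMED_BUS_VOLTAGE_V = 28.0  # ASSUMPTION: supply voltage used for the power estimate

# Fraction of the nominal current still drawn after the anomaly.
fraction = ANOMALY_CURRENT_A / NOMINAL_CURRENT_A

# Rough drop in electrical power, valid only under the voltage assumption above.
power_drop_w = (NOMINAL_CURRENT_A - ANOMALY_CURRENT_A) * ASSUMED_BUS_VOLTAGE_V

print(f"current after anomaly: {fraction:.1%} of nominal")
print(f"estimated power drop:  {power_drop_w:.1f} W (assuming 28 V)")
```

A drop of this size is at least qualitatively in line with the unit/panel temperature drop listed in the end result.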

Scenario consequences: how can we prevent it?
Prevent an instantaneous switch from nominal to standby
  Disable or eliminate the function switching the standby relay
    "Lobotomise" the LCU: make it forget how to switch the standby relay
  Add a dissipative mode to eliminate switching between tunings (keeping RF power on)
Introduce regular checks for LCU integrity
  Check whether the LCU program patch to disable the relay is applied
  Checksum (~few times per OD) to monitor LCU memory integrity
  ICU-LCU interaction to detect loss of communication
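The checksum idea can be illustrated with a minimal sketch: compute a checksum over a memory image, store it as reference, and compare against it on each later check. The slides do not say which checksum the LCU uses; CRC-32 from Python's standard `zlib` module is used here purely as a stand-in, and the function names and the dummy memory image are hypothetical.

```python
import zlib

def memory_checksum(image: bytes) -> int:
    """CRC-32 of a memory image (stand-in for the real, unspecified checksum)."""
    return zlib.crc32(image) & 0xFFFFFFFF

def integrity_ok(image: bytes, reference: int) -> bool:
    """True if the image still matches the stored reference checksum."""
    return memory_checksum(image) == reference

def flip_bit(image: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of the image with one bit inverted, mimicking an SEU."""
    corrupted = bytearray(image)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)

# Dummy "program memory" (NOT real LCU contents).
program = bytes(range(256)) * 4
reference = memory_checksum(program)

upset = flip_bit(program, byte_index=100, bit=3)
print("pristine image ok:", integrity_ok(program, reference))
print("upset image ok:   ", integrity_ok(upset, reference))
```

A CRC detects every single-bit flip, so a check of this kind run a few times per OD would flag an upset in the checked memory before the next observation relies on it.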

Next steps
Design and implement software changes: LCU, ICU and ground operational software
Test these all on the ground lab setup
Start reviving HIFI step by step (redundant side)
1. ICU SW upload and sanity test
2. LCU SW upload and sanity test
3. Activate a mixer chain and see if it still sees something
4. Test if anything else in the system was affected, e.g. mixers!?
5. Re-commission the system
6. Do science operations verification
...and start to catch up with the other instruments!

CSI in space... and after 5 months of hard work
Deep investigation by a large team
Cause fully understood based on a minute amount of telemetry: a cosmic ray hit!
Mitigating measures put in place, and the instrument could be restarted
Required a huge amount of background: full instrument description, full trace of parts, how things were made/tested, etc.
[Figure: timeline of the anomaly (t = 1.58 s) showing WBS integration, average IF power per dataframe, HRS integrations, communication loss, effective LO loss, periodic HK timestamp, WBS and HRS HK sample/report times, HRS-H 3.3 V supply current, LCL 53 current]
Cheers!! GO!

Consequences for operations
For nominal operations: virtually none!
  AOTs and engineering modes not at all affected; AOTs are per subband already
  LCU integrity checks add of order seconds to standard AORs
  No change in physical characteristics, e.g. temperatures not affected
  Mission planning constraint to minimize band switches
    Number of ODs for contiguous scheduling not important: use dissipative modes
For non-nominal conditions: loss of observations/ODs
  An SEU can still upset the LCU; this leads to disabling the LCU but will not lead to a standby switch: no damage
  LCU integrity check fails: loss of current OD, likely loss of next OD if HIFI is scheduled
  Note: statistics for LCU integrity issues currently totally unknown (1 SEU/5d?)
FDIR unchanged
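As a rough feel for what the tentative "1 SEU/5d?" figure would mean, the sketch below computes the chance of at least one upset within one OD. It rests on several assumptions that are not in the slides: upsets arrive as a Poisson process, the rate really is one per five days (the slide marks this as totally unknown), and one OD lasts about one day.

```python
import math

def prob_at_least_one(rate_per_day: float, days: float) -> float:
    """P(at least one event) for a Poisson process with the given mean rate."""
    return 1.0 - math.exp(-rate_per_day * days)

ASSUMED_RATE_PER_DAY = 1.0 / 5.0  # ASSUMPTION: the slide's tentative "1 SEU/5d?"
ASSUMED_OD_DAYS = 1.0             # ASSUMPTION: one OD lasts about one day

p = prob_at_least_one(ASSUMED_RATE_PER_DAY, ASSUMED_OD_DAYS)
print(f"P(at least one upset per OD) = {p:.3f}")
```

Not every upset would necessarily trip the integrity check, so this is an upper-end illustration rather than a prediction of lost ODs.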

Refinements in the root cause
New finding w.r.t. Schottky diode properties (i.e. the failed regulator)
  Devices have a threshold with respect to stress: below threshold the device functions ad infinitum, above threshold the device dies
  Appropriate screening filters out low-threshold devices (by breaking them)
  Screening was apparently not applied for HIFI (CPPA) flight articles
Implication for the HIFI LCU anomaly
  With the standby switch, indeed the diode with the lowest threshold died
  Redundant chain diodes had ~100 subband0 switches (IST, TB/TV) without failing: they will not fail if they do not get stressed higher
Consequences if this applies
  Scenario uncertainties significantly reduced: the missing link
  Limiting the number of band0 switches probably becomes less important
  All mitigating measures still apply; no new mitigating measures implied
  Some mitigating measures can possibly be relaxed
    Specifically applies to scheduling 1 band per OD
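The threshold picture above can be written down as a toy model: a device dies only when the applied stress exceeds its own threshold, and screening is simply a deliberate stress applied before flight. The threshold values, stress levels and function names below are invented for illustration, in arbitrary units, and are not measured diode data.

```python
def failures(thresholds, stress):
    """Thresholds of the devices that die under the given stress."""
    return [t for t in thresholds if stress > t]

def screen(thresholds, screening_stress):
    """Thresholds of the devices that survive screening at the given stress."""
    return [t for t in thresholds if t >= screening_stress]

# Invented thresholds in arbitrary units (NOT real device data).
thresholds = [0.6, 1.4, 1.1, 2.0]
operational_stress = 1.0

# Unscreened batch: the lowest-threshold device dies at the first such stress.
print("unscreened failures:", failures(thresholds, operational_stress))

# Screened batch: weak devices were broken on the ground; the rest survive in use.
screened = screen(thresholds, screening_stress=1.2)
print("screened survivors: ", screened)
print("screened failures:  ", failures(screened, operational_stress))
```

In this model, repeating the same stress any number of times never kills a survivor, which mirrors the argument that diodes which withstood ~100 switches will not fail unless they are stressed harder.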

Root cause: cosmic ray Single Event Upset... they continue to happen
Redundant unit with updated SW is more robust
Regular hits seen: handled routinely with operational procedures
PACS and SPIRE also suffer from cosmic ray hits
By the way, was there anything special in OD 81? Actually no: radiation-wise a boring day... we were just unlucky

We got HIFI working again, ready for three years of great science!
