Causal models with binary variables

An important qualitative model form uses binary variables, usually represented as nodes in a directed graph (“digraph”). The nodes in these “causal digraphs” (“CDG”) are connected by directed arrows (arcs) indicating causal links. The same content can be stated in the form of rules. For instance: “if (cause) C1 is true, then (effect) E is true”, and “if C2 is true, then E is true”. This would be represented in the causal digraph model with nodes for C1, C2, and E, and with a link from C1 to E and from C2 to E:

We say that “C1 causes effect E”, and “C2 causes effect E”. The arrow directions match the normal notation of logical implication, where we write C1 ==> E and C2 ==> E. We also often read the diagram as “C1 implies E”, and “C2 implies E”. Another equivalent interpretation of this diagram is more relevant in the diagnosis case, where E is a symptom known to be true. Then we can read the model as “C1 is a possible cause of (symptom) effect E”, and “C2 is a possible cause of effect E”. That is, at least one of the causes C1 or C2 must be true. Additionally, by the “contrapositive” property of IF/THEN implication, NOT(E) implies NOT(C1) and NOT(C2). That is, if (symptom) effect E is known to be false, then we can conclude that possible causes C1 and C2 are false.

When true values correspond to abnormal conditions/faults, this is called a fault propagation model. A “fault” or “problem” in these models has a value of true when the abnormal condition exists, and false when it does not exist. So “true” is the “bad” value in a fault propagation model, and false means “good” or “OK”. The model also has to represent “unknown”, since there might not be enough observations (data) to determine if a problem exists or not. In the examples below, we indicate true (bad) values with a red color, false (good) values with a green color. No color means that the value is unknown.

The following digraph is an example of a small part of a cause/effect model of problems for vehicles:

The boxes represent problems, and the links represent cause and effect. (The arcs drawn here that don't connect two problems are just meant to suggest that other problems are present. All arcs connect two nodes.) If the fuel pump is weak, the diagram predicts that this causes a fuel pressure gauge to show low pressure and that the fuel system will provide inadequate fuel. The inadequate fuel will in turn cause the engine to experience power loss when driven. Conversely, if the O2 sensor reads high, an exhaust system leak or inadequate fuel are possible causes. In turn, the possible causes of inadequate fuel include a weak fuel pump and clogged fuel injectors. The three problems at the left side are possible root cause problems - they have no inputs. Diagnosis means starting from observed symptoms like high O2 readings or loss of engine power, working back to determine the most likely root causes.

In many practices of "root cause analysis", problems are associated with functions such as "providing cooling" for a nuclear reactor. Problem nodes can represent these functional failures, which in turn might be caused by specific equipment or procedural failures. The "fuel system: inadequate fuel" node is an example of a subsystem failing to provide its normal function.

As with the case of rules, the nodes can have “AND” logic or “OR” logic. The default for most fault propagation models is “OR”, so that if any of the inputs are true, the output becomes true and a fault effect is propagated. “AND” logic is sometimes useful. For instance, if both a primary and backup pump have to fail before flow is lost to a process unit, an “AND” gate is needed. More elaborate gates can be used as well, such as an “m of n” gate that concludes true if m inputs out of the n available are true.

For most practical applications, the logic associated with fault propagation models is not strictly binary. It is called “3-valued” logic because it must account for “unknown” as a value. For instance, for an OR gate: if at least one input is true, the output is true, even if other inputs are false or unknown; if all inputs are false, the output is false; otherwise (no input true, at least one unknown) the output is unknown. For an AND gate: if any input is false, the output is false; if all inputs are true, the output is true; otherwise the output is unknown.
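As a sketch, the 3-valued gate logic can be written with `True`, `False`, and `None` standing for unknown (the function names are illustrative, not from any particular product), including the “m of n” gate mentioned earlier:

```python
def or_gate(inputs):
    """3-valued OR: true if any input is true; false only if all are false."""
    if any(v is True for v in inputs):
        return True
    if all(v is False for v in inputs):
        return False
    return None  # unknown

def and_gate(inputs):
    """3-valued AND: false if any input is false; true only if all are true."""
    if any(v is False for v in inputs):
        return False
    if all(v is True for v in inputs):
        return True
    return None

def m_of_n_gate(inputs, m):
    """True if at least m inputs are true; false once that becomes impossible."""
    trues = sum(v is True for v in inputs)
    possible = sum(v is not False for v in inputs)  # true or unknown
    if trues >= m:
        return True
    if possible < m:
        return False
    return None

assert or_gate([True, None, False]) is True
assert and_gate([False, None]) is False
assert m_of_n_gate([True, True, None], 2) is True
```

Note that a single true input settles an OR gate, and a single false input settles an AND gate, even when other inputs are unknown.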

The fault propagation model can be used for “prediction” and “diagnosis”, as explained below. This means that data is entered for some nodes (based on automated measurements or human input), and the run-time “engine” draws as many conclusions as possible by propagating information through the model. Some of the nodes may periodically receive new values automatically based on automated measurements. Others may get values only when requested, as part of an automated test strategy or question to a human user. Other nodes might not have any direct test or automated measurement associated with them. For instance, a problem such as “high temperature in drum 5” may have a temperature sensor that can automatically be used to determine a value of true or false. Or, a person might enter a true, false, or unknown value based on walking out to that drum and performing a manual test. Or, there might be a root cause problem such as “distillation tower tray plugged” that has no practical, direct test while maintaining operations. A true or false value for those problems can only be inferred indirectly based on the fault propagation model.

Time delays between cause and effect can be modeled. In this case, the time delay is stored as a property of the arc representing the causal link. Up through the section on time delays, the examples below are generally taken to have no time delays unless otherwise mentioned. This is referred to as a “static” causal model.

Prediction using fault propagation models

Following the arrows to look “forward” or “downstream” from a known value of C1 or C2 to determine E is referred to as using the model for “prediction”, or “simulation”. In the following four examples of prediction, the initial states of all cause and effect blocks are unknown. Then, a single measurement or human input is entered for either C1 or C2. “OR logic” is assumed in the examples below.

Examples of predictions using causal models

Color coding

In the first prediction example, C1 is observed to be true. Using the model, we can predict that both E1 and E2 must be true. We cannot conclude anything about C2, so it remains unknown. We use a pink color here to emphasize that E1 and E2 are not directly observed, just predicted. The distinction matters because there might be an error in the model, and there might be a time delay (discussed later). For now, we assume no time delay between cause and effect. These models are “static”.

The second prediction example shows an entry of true for C2. We can predict that E2 should be true as well, but cannot predict values for C1 or E1. Neither of those nodes is downstream from C2.

In the third example above, C1 is observed to be false (good). We can then predict that E1 is false. This is because E1 only has one input, so either OR or AND logic collapses to say that the E1 output can be predicted to be the same as the input. There is also the assumption that the model is complete - there are no other inputs to E1. We cannot predict a value for E2. Given that C1 is false, the value of C2 must match the value of E2, but C2 is unknown.

In the fourth example above, C2 is observed as false. We can make no predictions based on this. The values for E1 and E2 will both match C1, which is unknown.
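The four prediction cases can be reproduced with a small forward-propagation sketch over the example digraph (E1 has the single input C1; E2 has inputs C1 and C2; OR logic, with `None` for unknown):

```python
# Digraph from the examples: C1 -> E1, C1 -> E2, C2 -> E2 (OR logic).
PARENTS = {"E1": ["C1"], "E2": ["C1", "C2"]}

def predict(observed):
    """Forward 3-valued propagation from observed cause values."""
    state = {"C1": None, "C2": None, "E1": None, "E2": None}
    state.update(observed)
    for node, parents in PARENTS.items():
        values = [state[p] for p in parents]
        if any(v is True for v in values):
            state[node] = True
        elif all(v is False for v in values):
            state[node] = False
    return state

# Case 1: C1 true -> E1 and E2 predicted true, C2 unknown.
assert predict({"C1": True}) == {"C1": True, "C2": None, "E1": True, "E2": True}
# Case 3: C1 false -> E1 predicted false, E2 stays unknown.
assert predict({"C1": False}) == {"C1": False, "C2": None, "E1": False, "E2": None}
```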

Prediction to see the downstream effects of a problem is useful as part of the overall fault diagnosis process. One reason is to assess the impact of a problem, to help prioritize repair work, and to warn people about further complications that are coming (usually after some time delay, whether modeled or not). Prediction also helps in alarm correlation: recognizing that additional events, when seen, are associated with an existing problem, and also helping confirm it if there are possible errors in the observations.

For the case of static models (no time delays), the predictions are “instant”. That is, when the cause value is changed, the predicted effect values are immediately changed. In the case of time delays, the values change only after the specified time delay. In the diagram above, if both C1 and C2 were set true (at the same time or at different times), then the predicted value of E2 would change to true at the minimum predicted time based on each of the two causes.

Causal models keep no history inside the nodes -- each node just has a current value, which can be changed based on its inputs. So (unless there is a loop of directed arcs), the entire state of the system can be predicted based on the set of nodes with no inputs - the root causes in diagnostic terminology. One implication of this is that the resulting predictions do not depend on the previous history of the root causes - only on their current values. So, in the examples above, suppose C1 is set to true, resulting in predicted true values for E1 and E2 as in case 1. If C1 is then set to false, the previous predicted values of E1 and E2 must be changed. The results should now look like case 3. E1 is now predicted false. There is no longer any support for a conclusion of either true or false for E2, so it must fall back to unknown. If the value of C1 returns to “unknown”, all values in that example return to unknown -- there is no support for any other conclusion. Even in the case of time delays (with no loops), the results will eventually reach their “steady state” values as shown in the examples.
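Because nodes keep no history, the entire predicted state can always be recomputed from scratch given only the current root-cause values. A sketch, restating the two-cause example digraph:

```python
# Same example digraph: C1 -> E1; C1, C2 -> E2 (OR logic, None = unknown).
PARENTS = {"E1": ["C1"], "E2": ["C1", "C2"]}

def recompute(root_values):
    """Rebuild all predictions from scratch from the root-cause values alone."""
    state = dict(root_values)
    for node, parents in PARENTS.items():
        values = [state[p] for p in parents]
        if any(v is True for v in values):
            state[node] = True
        elif all(v is False for v in values):
            state[node] = False
        else:
            state[node] = None
    return state

# C1 set true: E1 and E2 predicted true (case 1).
assert recompute({"C1": True, "C2": None})["E2"] is True
# C1 changed to false: E1 falls to false, E2 falls back to unknown (case 3).
assert recompute({"C1": False, "C2": None})["E2"] is None
# C1 back to unknown: everything returns to unknown.
assert recompute({"C1": None, "C2": None})["E1"] is None
```

The predictions depend only on the current root-cause values, never on their history.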

If loops are present, then if starting from all unknown values, an initial estimate or observation of the value of one variable in each loop is needed to determine a value other than unknown for nodes in the loop.

Diagnosis using fault propagation models

Diagnosis involves reasoning using the above model in the opposite direction. Suppose we observe that E2 is true, based on a measurement. We might read the model as “C1 is a possible cause of E2”, and “C2 is a possible cause of E2”. If C1 and C2 are unmeasured, we don’t know if C1 or C2 is true, or if both are true. But we know that at least one of them must be true. Fault isolation will require additional information, such as measurements to determine values of other nodes such as C1, C2, or E1. For this case, we say that C1 and C2 are in an “ambiguity group” -- we know that at least one of them must be true, but we don’t yet know which one. We also refer to them as “suspects” - we suspect them of being true. For systems where multiple faults are present, there may be multiple ambiguity groups while diagnosis is in progress. “Suspect” is a variation of “unknown” in the 3-valued logic of causal models. The value of a “suspect” is still unknown, but the diagnostic process has created an additional constraint beyond the causal diagram: at least one member of each ambiguity group must be true. We indicate “suspects” with a yellow color in the examples. The value of “suspect” is tied to a particular ambiguity group.

The general process of assigning possible causes to observed effects is also called abduction. Abduction finds root causes (nodes with no inputs, inferred to be true) that are consistent with the observed values for nodes whose values were found by measurement, test, or manual input. The usual goal is to find the smallest set of root causes that can explain the observed values. A complete system views the causal diagram as a set of constraints -- all observed and estimated values of true/false/unknown should be consistent with those constraints, or as consistent as possible if allowances are made for errors in models or observations. Given the observed values, the system estimates as many node values as possible. This estimation is typically done by propagating binary variable values upstream and downstream from observed values.

The four examples below each start with all values unknown, and illustrate entering one observed value to start a diagnosis. In some cases, a problem can be diagnosed, and its value of “true” can then be propagated for further predictions.

In the first diagnosis example above, E1 is observed to be true. Then we can infer that C1 must be true. This is because E1 only has a single input. Under the assumption that the model is complete, the only possible cause for E1 being true is that C1 must be true. Now that we have inferred that C1 is true, we can also predict that E2 is true. (We say “infer” when we look upstream, and “predict” when we look downstream. This distinction matters mainly when there are time delays. Strictly speaking, we infer that C1 “was true” if there is a time delay from C1 to E1). We cannot conclude anything about C2 unless we make a single-fault assumption. In fact, if there is no direct measurement to determine C2, we cannot determine if C2 is true or false. We would say that C2 is “masked” by the failure of C1, because a change in C2 could not be observed. C2 is not even a “suspect”, because there was no observed evidence (tests of C2 directly or from nodes downstream of it) to suggest that C2 was true.

In the second diagnosis example above, E2 is observed true. We can conclude that C1 and C2 are “suspects” in the same ambiguity group. At least one of them must be true, but we don’t know which one, or if both might be true. We can determine nothing about E1 without further measurements.

In the third diagnosis example above, E1 is observed false. We can conclude that C1 must be false. This follows from the “contrapositive” property of if/then logic: The statement “C1 implies E1” also means that “NOT(E1) implies NOT(C1)”. It also happens that in this example, there is only a single input to E1, so C1 could not possibly be true for that reason as well. We cannot determine C2 or E2 without further information. Based on the one “good” observation, there is no evidence to suggest that there is a fault, so there are no “suspects”.

In the fourth diagnosis example above, E2 is observed false. By the contrapositive, both C1 and C2 must be false. In this case, since E1 has only a single input, we can also predict that it must be false as well.
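The backward step in these four diagnosis cases can be sketched for the same digraph: observing an effect false forces all its causes false (the contrapositive), and observing an effect true either diagnoses a single-input cause or creates an ambiguity group of suspects. A real engine would generalize this; the sketch below handles only one observation at a time:

```python
# Example digraph: E1 has the single input C1; E2 has inputs C1 and C2.
PARENTS = {"E1": ["C1"], "E2": ["C1", "C2"]}

def diagnose(effect, value, state=None):
    """One backward-inference step for OR-logic fault models (None = unknown)."""
    state = dict(state or {})
    state[effect] = value
    parents = PARENTS[effect]
    if value is False:
        for p in parents:  # contrapositive: effect false -> every cause false
            state[p] = False
    elif value is True:
        if len(parents) == 1:
            state[parents[0]] = True  # single input: cause is diagnosed
        else:
            # Ambiguity group: at least one of the unknown causes is true.
            state["suspects"] = {p for p in parents if state.get(p) is None}
    return state

# Diagnosis example 1: E1 true -> C1 inferred true.
assert diagnose("E1", True)["C1"] is True
# Diagnosis example 2: E2 true -> C1 and C2 become suspects.
assert diagnose("E2", True)["suspects"] == {"C1", "C2"}
# Diagnosis example 4: E2 false -> both C1 and C2 false.
assert diagnose("E2", False) == {"E2": False, "C1": False, "C2": False}
```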

Masking and the results of a single fault assumption

Even the simple examples above demonstrate the effects of masking, and the choice of making a single fault assumption. In the first example (E1 observed true), unless C2 can be directly observed, we cannot determine a value for C2, because E2 (the only observed effect of C2) will always be true given that C1 has been diagnosed as failed. C2 is masked by a failure of C1. We could make an assumption of only one fault at a time to conclude that C2 is false. However, such an assumption is not necessary here, because C2 was never a suspect.

In the following example, both E1 and E2 were observed to be true:

An example showing masking, without or with a single fault assumption.

As in the earlier “Examples of diagnosis” cases, C1 is diagnosed as true. With this model, only a direct test of C2 could determine its value - otherwise its value is masked. In an event-oriented system, results often arrive one at a time. If E2 were measured true first, both C1 and C2 would be suspects, as shown in the second “Examples of diagnosis” case earlier. Then when E1 is measured true, since C1 is the only possible cause of E1, C1 is diagnosed as true. That leaves only C2 as still suspected. Since we are working with static models, a diagnostic system might be designed so that the order of the observations does not affect the results. In that case, even if E1 were observed as true first, when E2 is then observed as true, C2 would be added as a suspect. That would result in the leftmost case above. An alternative is to say that E2 = true is already “explained” by prediction, so further processing is not needed. In that case, C2 would remain unknown, and not set as a suspect, as shown in the middle case above. This second strategy leads to fewer “suspects”. It is an abduction strategy emphasizing diagnosing the minimal number of root causes, but not strictly requiring a single fault assumption. The results then might depend on the order in which events are received, depending on the details of the algorithm implementation. A stronger use of a single fault assumption would result in diagnosing C1 = true, and declaring C2 to be false, as in the rightmost case.

A single fault assumption is appealing in forcing the “minimal” explanation, and many diagnostic techniques require this assumption. But it is not a good assumption for large systems. It is true that in a short time period, more than one new fault is unlikely. But the time to detect, diagnose, and repair faults is often so long that other faults occur while earlier ones are still being repaired, or are even just ignored until major maintenance work can be done.

One typical assumption in most studies of risk is that the probability of any root cause fault is independent of any other fault. Making a single fault assumption violates that -- it would mean that once a fault is found, it decreases the probability that a different fault will occur. It is best to avoid making these physically unlikely assumptions. In reality, when faults modeled as “independent” do interact in some unmodeled way, it’s usually to increase the other fault probabilities. For instance, one failure may lead to overheating, vibration, overloading, or stress, which increases the probability of another failure that is normally modeled as an independent root cause.

In large systems such as those in the operations of process industries and network management, multiple faults are common. Fortunately, in these cases there are also so many measurements that it is usually possible to use enough of them to diagnose any fault uniquely. It is a good practice to try to find a direct way to test any root cause whenever possible, even if that test happens to be expensive. But it often happens that “downstream” tests are cheaper than direct tests, so there is benefit in making full use of observations available to the fault model.

It should be obvious from the examples above that, even when making a single fault assumption, it is best to leave the other possible root causes as “unknown” (the middle case above). There is little justification for concluding the other root causes are actually good (false) when there is no evidence to suggest it. There is often no benefit in drawing these extra conclusions. The most important actions to take are associated with fixing or compensating for the diagnosed failures.

Loops (cycles of directed arcs in the causal diagram)

Loops arise in the physical world. Obvious examples in the process industries include feedback control loops, loops introduced by feed/product heat exchangers or material recycles, and so on. Another example is in an exothermic chemical reaction: increasing input temperature results in increasing reaction rate. But increasing reaction rate causes increasing heat generation, which in turn leads back to increased temperature. This leads to the well-known “multiple steady states” phenomenon for chemical reactors - even knowing the inlet conditions to the reactor, you cannot predict the outlet temperature. There are often three possibilities: everything cool, everything really hot, or an in-between “unstable” steady state, which is often where very careful control must be maintained. (The instability is made even worse by recycling heat from the output back to the input in some commercial installations.)

Causal models will unavoidably reflect these loops. The same problem of “multiple steady states” is possible with boolean values even for static models: once achieved, a loop of “true” values for OR gates is self-sustaining independent of root causes. The case of models with time delays is a discrete time dynamic system with no damping, so cycles can propagate not just single values, but entire repeated patterns of changes over time. As noted in the section on predictions, a value has to be observed in each loop to resolve the predicted values to anything besides unknown. Similarly, for diagnosis, ambiguities are resolved as long as there is an observed value in each cycle. Loops do not pose a major problem for the algorithms for prediction and diagnosis: the algorithms just need to have tests to ensure that propagation stops when the same node is encountered twice.
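The loop-termination test mentioned above amounts to keeping a visited set during propagation. A sketch, using an illustrative three-node cycle (the node names are invented):

```python
# Illustrative cycle of OR-gate nodes: A -> B -> C -> A.
CHILDREN = {"A": ["B"], "B": ["C"], "C": ["A"]}

def propagate_true(start, state):
    """Propagate a true value downstream, stopping when a node repeats."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # loop detected: do not propagate through it again
        visited.add(node)
        state[node] = True
        stack.extend(CHILDREN.get(node, []))
    return state

state = propagate_true("A", {"A": None, "B": None, "C": None})
assert state == {"A": True, "B": True, "C": True}
```

Without the visited-set check, propagation around the cycle would never terminate; with it, the loop of true values is established in one pass.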

A more complete example of fault diagnosis

The following is a slightly more complete example of fault diagnosis. In this case, a sequence of inputs is processed and a diagnosis is reached after 3 data inputs.

A diagnosis example with 3 successive entries of observed values

The example above starts with all diagnostic states unknown. The first observation is that H is true. Since there is only a single input to H, F can be inferred to be true. As a result, there is an ambiguity group consisting of C1, C2, and C3. At least one of these suspects must be true. Next, the node I is observed to be false. Hence, both G and C3 can be inferred to be false. Finally, a direct observation indicates that C2 is false. Since the inputs from C2 and C3 to node F are now believed to be false, the only remaining possible true input to F is C1. So, we infer that C1 must be true. Diagnosis is complete. And, we can now predict that D must be true as well.
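The three-step inference can be walked through in code. The digraph below is a reconstruction assumed from the figure (C1 → D; C1, C2, C3 → F; F → H; C3 → G → I); the exact structure in the figure may differ, but the reasoning steps are those of the text:

```python
# Assumed reconstruction of the example digraph (OR logic, None = unknown).
PARENTS = {"D": ["C1"], "F": ["C1", "C2", "C3"], "H": ["F"], "G": ["C3"], "I": ["G"]}

def upstream_false(node, state):
    """Contrapositive: a false node forces all its ancestors false."""
    for p in PARENTS.get(node, []):
        state[p] = False
        upstream_false(p, state)

state = {k: None for k in ["C1", "C2", "C3", "D", "F", "G", "H", "I"]}

# Step 1: H observed true; its single input F is inferred true,
# leaving the ambiguity group {C1, C2, C3}.
state["H"] = True
state["F"] = True
suspects = {p for p in PARENTS["F"] if state[p] is None}
assert suspects == {"C1", "C2", "C3"}

# Step 2: I observed false; upstream causes G and C3 are inferred false.
state["I"] = False
upstream_false("I", state)
assert state["G"] is False and state["C3"] is False

# Step 3: C2 observed false directly.
state["C2"] = False

# F is true under OR logic and only C1 remains possible: C1 is diagnosed true.
remaining = [p for p in PARENTS["F"] if state[p] is not False]
if state["F"] is True and len(remaining) == 1:
    state[remaining[0]] = True
assert state["C1"] is True

# Prediction: D, downstream of C1, must now be true.
if state["C1"] is True:
    state["D"] = True
assert state["D"] is True
```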

Conflicting data

The examples above do not demonstrate what happens when there is conflicting data that is inconsistent with the model. For instance, in the above example, what if D were observed to be true, but F was observed to be false? Conflicts could arise because of model errors, measurement errors, or timing errors if there are incorrectly modeled time delays. Several strategies are possible. The approach taken in SymCure was to always believe the most recent input, and only override existing values when needed to resolve the conflict. Other approaches perform numerical combinations of data to estimate probabilities or other metrics for each possible root cause. Most of those make the additional assumptions that there are no time delays, and that there is only a single fault. For instance, calculate “distances” between symptom vectors to pick the single fault with the closest match to the observed symptoms. When some symptom values are “unknown”, the calculations can just ignore that symptom.
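The symptom-vector matching mentioned above, skipping "unknown" symptoms, can be sketched as follows. The fault signatures and symptom names here are invented for illustration; they are not from any cited product:

```python
# Hypothetical single-fault signatures: expected symptom values per root cause.
SIGNATURES = {
    "pump_weak":    {"low_pressure": True,  "o2_high": True,  "power_loss": True},
    "exhaust_leak": {"low_pressure": False, "o2_high": True,  "power_loss": False},
}

def distance(observed, signature):
    """Count mismatches, skipping symptoms whose observed value is unknown."""
    return sum(
        1
        for name, expected in signature.items()
        if observed.get(name) is not None and observed[name] != expected
    )

def best_match(observed):
    """Pick the single fault whose signature is closest to the observations."""
    return min(SIGNATURES, key=lambda fault: distance(observed, SIGNATURES[fault]))

obs = {"low_pressure": None, "o2_high": True, "power_loss": True}
assert best_match(obs) == "pump_weak"
```

Note the assumptions baked in: no time delays, a single fault, and unknown symptoms carrying no weight either way.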

Using “a priori” estimates of the probability of a fault

With some techniques (such as Bayes rule), “a priori” probabilities of failure (prior estimates) may provide help for guessing the most likely root cause. For instance, consider the ambiguity group C1 and C2 in the “E2 observed true” case of the first picture of “Examples of diagnosis”. Suppose we can get no further data. We can still provide useful diagnostic information: If we know from historical failure data that one failure is twice as likely as another, and we have to pick one, we pick the one with the highest a priori probability. Or, we can report estimated probabilities.
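That tie-breaking rule can be sketched in a few lines; the prior probabilities below are invented for illustration:

```python
# Hypothetical a-priori failure probabilities for the ambiguity group,
# e.g. from historical failure data (C2 twice as likely as C1).
PRIORS = {"C1": 0.02, "C2": 0.04}

def rank_suspects(ambiguity_group, priors):
    """Order suspects by prior probability, most likely first."""
    return sorted(ambiguity_group, key=lambda c: priors[c], reverse=True)

# With no further data, C2 is reported as the most likely root cause.
assert rank_suspects({"C1", "C2"}, PRIORS) == ["C2", "C1"]
```

A fuller treatment would renormalize the priors over the ambiguity group to report estimated probabilities rather than just a ranking.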

Test planning

In all the examples above, we just inserted values into nodes and propagated values upstream and downstream. But unless the monitoring system is completely automated and completely passive, values will not just show up randomly or periodically based on scanning of data. They may also be the result of tests. Tests can be questions asked of end users, requiring manual input. They can also be data acquisition or the result of automated workflows that are only run when needed. Planning to decide which tests to ask for is an essential part of the engine for a diagnostic package. This is described in more detail in the section on tests.

Causality in reality implies time delays and lags

In physical systems, causality is in reality associated with some time delay or lag between cause and effect. This has to happen because mass or energy has to move, overcoming resistance from inertia, thermal inertia, inductance, or other physical phenomena. This is discussed on the page Causal Time Delays and Lags.

Commercial use of binary causal models

Binary qualitative cause/effect models are in widespread use. The SymCure CDG example was already cited. The SMARTS InCharge product, popular in network management applications, used binary causal models for application development, but compiled them into fault signature patterns (vectors of symptoms present for a given root cause fault) for pattern matching at run time. The popular fault tree and related models are a special case discussed next.

Fault trees and related models

Some binary-valued, directed graph cause/effect models have the additional restriction that they are in the form of a tree. That is, no cycles are allowed (even ignoring the directions on the arcs). One example is the Ishikawa “fishbone” diagram used in Statistical Quality Control. The “Root Cause Analysis” (RCA) popular in maintenance organizations is another example. Fault tree models are popular in safety analysis and nuclear plant monitoring. They generally have the additional feature that probabilities are calculated at each node. Fault trees are commonly used for offline analysis, and are discussed on the separate page Fault trees and related models.

Other variations of qualitative cause/effect models

There are further variations of qualitative cause/effect models. Some model just binary variables: the presence or absence of faults and the resulting symptoms, as in the case of the CDG (Causal Directed Graph) models. Others incorporate a sign: In the SDG (“Signed Directed Graph”) models, variables are either high, low, or normal. The propagation of faults from variable to variable then includes a sign to indicate whether a high input causes a high or a low effect. The AI community has developed techniques based around “qualitative physics”, which also are signed digraph models.

SDG models in principle could do a better job of fault isolation because more information is included than in the unsigned case. However, SDG techniques have more problems with loops in the model, and cancellation of effects, than the unsigned directed graph models. Process plants have a prevalence of material recycle loops; energy transfer in feed/product heat exchangers; feedback loops caused by exothermic reactions, temperature, and increasing catalyst activity with temperature; and most importantly, the presence of numerous feedback control loops. Probably partly because of that, SDG-based techniques have not been popular in practical applications, although there was quite a bit of research around them.

There has traditionally been an assumption that the theoretical increased fault isolation capability of SDG models (or Bayesian models) compared to simple binary models is always important. However, in application areas such as operations management in the process industries or network management, we have already noted that there is generally ample instrumentation for fault isolation. The major benefit of model-based reasoning is in providing a principle for organizing the diagnostic process automatically, given someone with the domain knowledge to build a model. For these industries, wider, reliable applicability without worrying about special cases is generally more important than the few extra measurements that might need to be automatically used.

There are further variations in how uncertainty or evidence combination is handled. For instance, the SymCure CDG example already cited supported propagation of fuzzy values, using simple minimum and maximum operations for “AND” and “OR”, respectively.

Quantitative causal models

Causal models can also be quantitative. Strictly speaking, models based on differential or difference equations are causal, as are the equivalent signal flow graphs. The causes are changes in the input variables, which then propagate over time through the equations, especially obvious when written as difference equations.

Algebraic models in the form g(x) = 0 (such as the constraints associated with data reconciliation) are definitely not causal. However, when these are rewritten in input/output form such as y = f(x), where x is a vector of inputs to some equipment or unit and y is the vector of outputs, causality is represented and can be used. The input/output form is really an approximation ignoring dynamics, but we know that in reality there are time delays and lags between input changes and output changes.

An example approach to pipeline monitoring using causal models and quantitative models