> I really want to spend some time here, but that code is pretty
> difficult to understand (as you noted)
Umm, before i forget, i guess i'd dump some quick explanation of what i
figured out about it.
First, we have symbols-or-regions, hereinafter "values" unless it causes
confusion. Liveness of some values may be determined regardless of any
other values: for example, a VarRegion of a local variable can be proven
live via live variables analysis, which is an auxiliary data flow
analysis that doesn't depend on our symbolic execution engine. However,
values may also be related to other values via a "keeps-alive relation".
There are two kind of keep-alive relations:
I. "first-class" keep-alive relations arise from and rely purely on the
SVal hierarchy (for example, SymbolRegionValue is kept alive by its
parent region, SymbolicRegion keeps its base symbol alive).
II. "second-class" keep-alive relations arise when symbols and regions
appear in certain parts of the program state (for example, any region
would keep alive its contents - all symbols stored inside it).
The purpose of removeDead() is to understand which values present in the
current program state are dead (and, as a useful by-product, which
values are live) by exploring the keep-alive relations between values;
second-class relations are explored with respect to the current program
state. Whenever a certain value X is determined to be live, any values
kept alive by X are also known to be live. The liveness information is
collected from scratch on every pass and does not depend on previous
removeDead() runs. Once the liveness information is collected, a special
action is taken on all dead values, which may involve removing them from
the state, but does not necessarily boil down to just that.
SymbolReaper is an object that contains the incomplete understanding of
live values on the mark phase, which eventually becomes complete and as
such can be used on the sweep phase. SymbolReaper's data consists of two
sets of values - the "live set" and the "dead set". The live set
contains values already known to be live. The dead set contains the
values found in the state, liveness of which is explicitly questioned
and was not yet determined; values may move from the dead set to the
live set (if they turn out to be live due to a previously unconsidered
second-class relation to a live value) but not in the opposite
direction. First-class relations are mostly hardcoded directly into the
SymbolReaper - see isLive(), isLiveRegion(). Originally, SymbolReaper
was simply a pair of sets, but later it was extended to take care of
first-class keep-alive relations.
The removeDead() procedure begins by creating a temporary SymbolReaper
object that lives for as long as the procedure runs. Then it proceeds
with querying different manager objects responsible for different
sections of the program state to explore their section and fill-in the
liveness information in the SymbolReaper. Each section needs to classify
each value it is responsible for as live or dead.
Program state sections are queried in the following order:
(1) GDM traits of particular checkers (through CheckerManager and
checkLiveSymbols() callback).
(2) Environment (through EnvironmentManager).
(3) Store (through StoreManager).
(4) Range constraints (through ConstraintManager).
Section (1) is implemented in each checker in a different manner. The
checker needs to be aware of the fact that if it doesn't markLive() a
value here, then all information related to it may be removed from other
sections of the state (eg. range constraints for symbols or bindings for
regions). And if the checker doesn't mark any value maybeDead(), it may
never receive the callback when this value actually dies.
/*
[Q1]: In http://reviews.llvm.org/D12726 you mentioned that checkers mark
liveness after the store, but now you say it's just the opposite, how so?
[A1]: I was wrong and stay corrected. It seems that checkers do not have
access to any liveness information obtained from the rest of the state
for during checkLiveSymbols().
*/
Sections (2), (3), (4) combine(!!) their sweep phase with the mark
phase. However, for checker traits, the sweep phase is delayed until the
checkDeadSymbols() checker callback, which goes last at the end of
removeDead() after (4).
/*
[Q2]: Does taint information for dead symbols ever get removed? Cause
it's in GDM, but it's neither checkers nor range constraints.
[A2]: Afaik, no; we don't even have State::removeTaint() yet. This is a
bug, which may probably cause performance problems with taint analysis,
though maybe not too much, because tainted symbols are rare.
*/
Section (2) scans active expressions, marks values of active expressions
as live. It also marks values of inactive expressions that are still
found in the state as dead and removes these expressions and their
values from the environment. It does not need to rely on any liveness
information for values to make such decision, which is why this step is
completely independent and isolated. Note, hovewer, that first-class
relations play an important role here - any values found inside values
that were marked dead or live are also explicitly marked dead or live,
which contributes to complexity of the code. The visitor used here to
conduct marking of sub-values is not re-used.
Section (3) is the most complex here. It takes care of the following
second-class keep-alive relation: for every binding in the store, the
binding key keeps alive the binding value. Both binding keys and binding
values are (at least, partially/sometimes) "values" (in the
symbols-or-regions sense), hence it may happen that a certain binding
value X is also a binding key for another binding value Y, and once
liveness of X is established, liveness of Y needs to be re-calculated.
Liveness of binding pairs cannot thus be computed in a single pass
through the store, and therefore a worklist-like algorithm is
implemented, that tries to reach a fixed point in which the liveness
information no longer gets updated. Once the worklist algorithm
finishes, the following happens: all bindings with dead keys are removed
from the store, all sub-values of binding keys and all binding values
and their sub-values are marked as maybe-dead. The visiting algorithm
for sub-values is not re-used from the Environment.
The method assumes that after the end of (3) no new live marks are
added. Which is why Store is capable of removing dead bindings without
requiring a separate pass.
Section (4) is trivial - the constraint manager does not mark any
symbols live. It marks all constraint keys that were not previously
proven live as maybe-dead and removes them from the constraint map.
/*
[Q3]: So what if we try to fix the issue mentioned in [Q1] by putting
(1) after (4)?
[A3]: It's more complicated than that. Checkers may introduce their own
second-class keep-alive relations between values. For example,
CStringChecker would keep string length alive for as long as the string
region is alive. There may be more such relations introduced by other
checkers. It may happen that a certain checker-specific second-class
keep-alive relation would cause a store binding key to become live.
Hence, there needs to be a single worklist algorithm run for both
store-related second-class keep-alive relations and checker-specific
second-class keep-alive relations, and (1) needs to be essentially
merged with (4). And that would make things yet even more complex.
*/