5
KTH/CSC, March 4, 2008. Erik Aurell, KTH Computational Biology
It is a fundamental and practically important problem... which I learnt about while working for the Swedish railways
E.A., J. Ekman, Capacity of single rail yards [in Swedish], Swedish Railway Authority Technical Reports (2002)

6
They have potential, under-used applications in systems biology
As an example I will describe consulting work we did for Global Genomics, a now-defunct Swedish biotech company. They claimed to have a new method to measure global gene expression.
Many of their ideas were in fact from S. Brenner and K. Livak, PNAS 86 (1989), 8902-8906, and K. Kato, Nucleic Acids Res. 23 (1995), 3685-3690.

7
The problem is that with only one Type IIS restriction enzyme there is not enough information in the data to determine which genes were expressed (many genes could have given rise to a given peak).
Kato (1995) tried using several enzymes of the same type sequentially. Problem: loss of accuracy, complicated.
Global Genomics AB’s invention was to use several enzymes in parallel.

8
The Global Genomics invention led to an optimal matching problem
A. Ameur, E.A., M. Carlsson, J. Orzechowski Westholm, “Global gene expression analysis by combinatorial optimization”, In Silico Biology 4 (0020) (2004)
Matching the observations to a gene database gives a bipartite graph, where a link between a gene g and an observation o represents the fact that o could be an observation of g.
The best matching can be represented as a subgraph of this bipartite graph, together with expression levels.

9
Testing using the FANTOM database of mouse cDNA (RIKEN)
For in silico testing we used the FANTOM database of full-length mouse cDNA, available at genome.gsc.riken.go.jp
We used an early 2003 version of 60 770 RIKEN full-length clones, partitioned into 33 409 groups representing different genes. This second list can be taken as a proxy for all genes in mouse.
Principle of in silico tests:
1. Select a fraction of genes
2. Generate random expression levels
3. Generate random peak and length perturbations
4. Run the algorithm
5. Compare

10
Both methods solve the optimization according to the given criteria when the perturbation parameters are small enough
The methods are comparable at low or moderate fraction of genes expressed
Local search is superior at high fraction of genes expressed
Ameur et al (2004)

13
How is this possible?
Following many others, we will look at a simple model

14
Random K-satisfiability problems
Let there be N Boolean variables, and 2N literals
Let there be M logical propositions (clauses)
Can all M clauses be satisfied simultaneously?
A clause expresses that one out of the 2^K possible configurations of K variables is forbidden.
Clauses are picked randomly (with replacement) from all possible K-tuples of variables.
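The random ensemble described on this slide can be sketched in a few lines of Python. This is only an illustration: the names `random_ksat` and `num_unsat`, and the signed-integer (DIMACS-style) encoding of literals, are assumptions made here and are not taken from the talk.

```python
import random

def random_ksat(n_vars, n_clauses, k=3, seed=None):
    """Draw n_clauses random k-clauses over variables 1..n_vars.

    A clause is a tuple of signed integers: +v is variable v, -v its negation.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(n_clauses):
        # k distinct variables per clause; whole clauses may repeat
        # (sampling with replacement at the level of clauses).
        variables = rng.sample(range(1, n_vars + 1), k)
        # The random signs select which one of the 2**k configurations
        # of these k variables the clause forbids.
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses

def num_unsat(clauses, assignment):
    """Count clauses with no true literal; assignment maps variable -> bool."""
    return sum(
        not any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

if __name__ == "__main__":
    formula = random_ksat(n_vars=20, n_clauses=60, k=3, seed=0)
    all_true = {v: True for v in range(1, 21)}
    print(len(formula), "clauses;", num_unsat(formula, all_true), "violated by all-True")
```

The number of violated clauses returned by `num_unsat` is what later slides call the energy.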

15
KSAT is characterized by the number of clauses per variable, α = M/N
Phase transition from almost surely SAT to almost surely UNSAT
Algorithms take the longest time (on average) close to the phase boundary
Mitchell, Selman, Levesque (AAAI-92)
Kirkpatrick, Selman, Science 264:1297 (1994)
Several simple algorithms take a.s. linear time for α small enough

16
A now about decade-old statistical physics prediction for 3SAT and other constraint satisfaction problems: a clustering transition
[Figure: phase diagram along α. SAT region: one state, then many states. UNSAT region: no solutions. Marked with 3SAT threshold values.]

17
The Mézard, Palassini and Rivoire 2005 prediction for 3COL
Obtained by the entropic cavity method, computing within a 1RSB scenario the number of states with a given number of solutions
[Figure labels: one green state; many green states, but most solutions in one or a few big states]

18
The latest clustering predictions for KSAT, K > 3, are in
F. Krzakała, A. Montanari, F. Ricci-Tersenghi, G. Semerjian, L. Zdeborová, “Gibbs states and the set of solutions of random constraint satisfaction problems”, PNAS 2007 Jun 19;104(25):10318-23.
[Figure labels: single cluster; many small clusters, but most solutions in a few of them; many clusters, with solutions found in a large set of clusters, all of about equal size]

19
The cluster condensation transition in F. Krzakała et al (2007)
[Figure labels: many clusters, with solutions found in a large set of clusters, all of about equal size; most clusters disappear, and again most solutions are found in a small number of them]

20
So does clustering in fact pose a problem to simple local search?
Are the known features of the static landscape relevant to dynamics?

21
A landscape that could be difficult for local search
[Figure, courtesy Sui Huang: landscape with a global minimum, local minima, and another local minimum]

22
Papadimitriou invented a stochastic local search algorithm for SAT problems in 1991, today often referred to as RandomWalksat:
Pick an unsatisfied clause
Pick a variable in that clause, flip it, loop
Not quite like an equilibrium physics process in detailed balance, because only variables in unsatisfied clauses are updated
Solves 3SAT in linear time on average up to α about 2.7
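The two-step loop above is short enough to write out directly. The sketch below is a naive illustration, not the implementation used for the quoted results: the function name, the signed-integer clause encoding and the `max_flips` cutoff are assumptions, and the unsatisfied clauses are rescanned at every step, so it does not itself run in linear time.

```python
import random

def random_walk_sat(clauses, n_vars, max_flips=100_000, seed=None):
    """Focused random walk: flip a random variable of a random unsatisfied clause.

    Clauses are tuples of signed integers (+v / -v). Returns a list of
    booleans for variables 1..n_vars, or None if no solution was found.
    """
    rng = random.Random(seed)
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]  # index 0 unused

    def satisfied(clause):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]  # naive O(M) rescan
        if not unsat:
            return assign[1:]
        clause = rng.choice(unsat)        # pick an unsatisfied clause
        var = abs(rng.choice(clause))     # pick a variable in that clause
        assign[var] = not assign[var]     # flip it, loop
    return assign[1:] if all(satisfied(c) for c in clauses) else None
```

A practical implementation would keep the set of unsatisfied clauses up to date incrementally instead of rescanning all M clauses per flip.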

23
A benchmark algorithm is Cohen-Kautz-Selman walksat
www.cs.washington.edu/homes/kautz/walksat
Pick an unsatisfied clause
Compute for each variable in the clause the breakclause
If any variable has breakclause zero, flip it, loop
With probability p, flip variable with least breakclause, loop
Else, with probability 1-p, flip random variable in clause, loop
breakclause is the number of other, presently satisfied, clauses that would be broken if the variable is flipped
Solves 3SAT in linear time on average up to α about 4.15, using default parameters from the public repository (Aurell, Gordon, Kirkpatrick (2004))
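To make the breakclause rule concrete, here is a minimal Python sketch that follows the steps exactly as listed on this slide. It is not the code from the public walksat repository: the function names, the clause encoding, the cutoff and the default `p=0.5` are placeholders chosen for illustration, and break counts are recomputed by brute force.

```python
import random

def is_sat(clause, assign):
    return any(assign[abs(lit)] == (lit > 0) for lit in clause)

def break_count(clauses, assign, var):
    """Number of presently satisfied clauses that flipping var would break."""
    count = 0
    for clause in clauses:
        true_lits = [lit for lit in clause if assign[abs(lit)] == (lit > 0)]
        # A clause breaks iff var carries its only true literal.
        if len(true_lits) == 1 and abs(true_lits[0]) == var:
            count += 1
    return count

def walksat(clauses, n_vars, p=0.5, max_flips=100_000, seed=None):
    """Slide's rule: free move if possible, else greedy w.p. p, random w.p. 1-p."""
    rng = random.Random(seed)
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]  # index 0 unused
    for _ in range(max_flips):
        unsat = [c for c in clauses if not is_sat(c, assign)]
        if not unsat:
            return assign[1:]
        variables = [abs(lit) for lit in rng.choice(unsat)]
        breaks = {v: break_count(clauses, assign, v) for v in variables}
        best = min(variables, key=breaks.get)
        if breaks[best] == 0 or rng.random() < p:
            var = best                      # zero-break or least-break variable
        else:
            var = rng.choice(variables)     # random variable in the clause
        assign[var] = not assign[var]
    return assign[1:] if all(is_sat(c, assign) for c in clauses) else None
```

For example, with `assign = [False, True, False, False]` and clauses `[(1, 2, 3), (1, -2), (-1, 3)]`, variable 1 is the only true literal of the first clause, so `break_count(..., 1)` is 1 while variable 2 has break count 0.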

24
We have worked with the Focused Metropolis Search (FMS) algorithm, and ASAT, an alternative version
ASAT:
If you have a solution, output and stop
Pick an unsatisfied clause
Pick randomly a variable in the clause
If flipping that variable decreases the energy, do so
If not, flip the variable with probability p
Loop
Also not in detailed balance (also tries only unsat clauses)
Parameter p has to be optimized. The optimal value depends on the problem class, e.g. about 0.2 for 3SAT
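The ASAT loop can be sketched as follows, taking the energy to be the number of unsatisfied clauses. This is an illustrative reading of the steps on this slide, not the authors' code: the names, the clause encoding and the `max_flips` cutoff are assumptions, and the energy is recomputed from scratch after every tentative flip for clarity rather than speed.

```python
import random

def energy(clauses, assign):
    """Number of unsatisfied clauses under assign (list indexed by variable)."""
    return sum(
        not any(assign[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def asat(clauses, n_vars, p=0.2, max_flips=100_000, seed=None):
    """Accept energy-decreasing flips always, all other flips with probability p."""
    rng = random.Random(seed)
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]  # index 0 unused
    e = energy(clauses, assign)
    for _ in range(max_flips):
        if e == 0:
            return assign[1:]               # have a solution: output and stop
        unsat = [c for c in clauses
                 if not any(assign[abs(lit)] == (lit > 0) for lit in c)]
        var = abs(rng.choice(rng.choice(unsat)))
        assign[var] = not assign[var]       # tentative flip
        e_new = energy(clauses, assign)
        if e_new < e or rng.random() < p:
            e = e_new                       # keep the flip
        else:
            assign[var] = not assign[var]   # undo the flip
    return assign[1:] if e == 0 else None
```

The default `p=0.2` is the value the slide quotes as roughly optimal for 3SAT; other problem classes would need their own tuning.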

25
We have a new algorithm ChainSAT which by design never goes up in energy
Algorithm 1. ChainSAT
  S = random assignment of values to the variables
  chaining = FALSE
  while S is not a solution do
    if not chaining then
      C = a clause not satisfied by S selected uniformly at random
      V = a variable in C selected uniformly at random
    end if
    ΔE = change in the number of unsatisfied clauses if V is flipped in S
    if ΔE = 0 then
      flip V in S
    else if ΔE < 0 then
      with probability p1
        flip V in S
      end with
    end if
    chaining = FALSE
    if ΔE > 0 then
      with probability 1 - p2
        C = a clause that is satisfied only by V selected uniformly at random
        X = a variable in C other than V selected uniformly at random
        V = X
        chaining = TRUE
      end with
    end if
  end while
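A direct Python transcription of the ChainSAT pseudocode might look like the sketch below. It is a naive illustration under stated assumptions: the function name, the signed-integer clause encoding, the `max_steps` cutoff and the default values of `p1` and `p2` are placeholders (the slide gives no parameter values), each variable is assumed to occur at most once per clause, and all clause sets are rescanned at every step.

```python
import random

def chainsat(clauses, n_vars, p1=0.5, p2=0.5, max_steps=200_000, seed=None):
    """ChainSAT sketch: flips are made only when the energy does not increase."""
    rng = random.Random(seed)
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]  # index 0 unused

    def true_vars(clause):
        # Variables whose literal in this clause is currently true.
        return [abs(lit) for lit in clause if assign[abs(lit)] == (lit > 0)]

    var, chaining = None, False
    for _ in range(max_steps):
        unsat = [c for c in clauses if not true_vars(c)]
        if not unsat:
            return assign[1:]
        if not chaining:
            var = abs(rng.choice(rng.choice(unsat)))
        # Flipping var fixes unsatisfied clauses containing it and breaks
        # clauses in which it carries the only true literal.
        made = [c for c in unsat if var in map(abs, c)]
        broken = [c for c in clauses if true_vars(c) == [var]]
        delta_e = len(broken) - len(made)
        if delta_e == 0:
            assign[var] = not assign[var]
        elif delta_e < 0 and rng.random() < p1:
            assign[var] = not assign[var]
        chaining = False
        if delta_e > 0 and rng.random() < 1 - p2:
            # No flip happened in this branch, so `broken` is still valid.
            others = [abs(lit) for lit in rng.choice(broken) if abs(lit) != var]
            if others:
                var = rng.choice(others)
                chaining = True
    return assign[1:] if all(true_vars(c) for c in clauses) else None
```

Because a flip is executed only when ΔE is zero or negative, the number of unsatisfied clauses never increases along a run of this sketch.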

26
Solution course of a good local search (ASAT at α = 4.2)

31
Do we know how local search fails on hard CSPs?
The first guess would be that local search fails if solutions have little slackness, which is expressed by Parisi whitening

33
Several proposed clustering transitions do not stop circumspect descent
Not even an algorithm which would be trapped in a potential well of any depth
The reason why local search eventually fails is unknown

34
Clustering has been rigorously proven for KSAT with K equal to 8 and greater
For K less than 8 there are cavity method predictions
How do numerics compare to these?

35
Solve a 3SAT instance L times with a stochastic local search (ASAT)
Compute the overlaps between these L solutions
See how that quantity changes with α
[Figure panels: average overlap; variance of the overlap]
Ardelius, E.A. and Krishnamurthy (2007)
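The overlap statistics for a set of L solutions can be computed as in the sketch below. The slide does not spell out the normalization, so the definition used here (the fraction of variables on which two solutions agree, between 0 and 1) is an assumption, as are the function names.

```python
from itertools import combinations
from statistics import mean, pvariance

def overlap(a, b):
    """Fraction of variables on which two assignments agree (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("assignments must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def overlap_stats(solutions):
    """Mean and population variance of the overlap over all pairs of solutions."""
    qs = [overlap(a, b) for a, b in combinations(solutions, 2)]
    return mean(qs), pvariance(qs)

if __name__ == "__main__":
    sols = [
        [True, True, False, False],
        [True, True, True, False],
        [False, False, True, True],
    ]
    print(overlap_stats(sols))
```

With the spin convention (variables taking values +1 and -1, overlap equal to the average product), the same quantity is 2q - 1 in terms of the agreement fraction q used here.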

36
The rank-ordered plots of the overlaps in a chain of instances with increasing number of clauses display a transition around α = 4.25
Ardelius, E.A. and Krishnamurthy (2007)
α ranges from 3.5 to 4.3
N is 2000
For α = 4.3, repeat until a solvable instance is found
For α ≤ 4.3, repeat until ASAT finds many solutions on the instance
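One way to build such a chain is to let each instance contain all clauses of the previous one plus new random clauses. The slide does not say exactly how the chains were constructed, so this nesting, along with the function name and clause encoding, is an assumption made for illustration.

```python
import random

def instance_chain(n_vars, alphas, k=3, seed=None):
    """Yield (alpha, clauses) pairs; each instance extends the previous one."""
    rng = random.Random(seed)
    clauses = []
    for alpha in sorted(alphas):
        target = round(alpha * n_vars)      # number of clauses at this alpha
        while len(clauses) < target:
            variables = rng.sample(range(1, n_vars + 1), k)
            clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
        yield alpha, list(clauses)          # copy, so earlier instances stay fixed

if __name__ == "__main__":
    for alpha, inst in instance_chain(2000, [3.5, 4.0, 4.25, 4.3], seed=0):
        print(alpha, len(inst))
```

Each instance in the chain would then be solved repeatedly with a local search and its solutions compared by overlap, as on the previous slide.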

37
Generate many chains of instances, check for the α at which all solutions found have an overlap of at least 80%
Ardelius, E.A. and Krishnamurthy (2007)
N is 100, 200, 400, 1000, 2000
Number of chains at each N is 110
If a chain does not reach the 80% threshold, repeat
Threshold is between 4.25 and 4.27, could in fact coincide with SAT/UNSAT for 3SAT
This is not in contradiction with the theoretical predictions of Krzakała et al (2007), who do not address 3SAT

41
As far as numerics can tell, if there are clusters beyond the clustering transitions in 4SAT, they are not separated by overlap

42
How does local search compare to more sophisticated (and specialized) methods that we will hear about at this school?
(here I have to go to PDF)

43
A question to the experts:
Which is (or are) the right metric(s) to compare runtimes? Wall-clock time? Some intrinsic count?

44
Conclusions
Local heuristics (walksat, Focused Metropolis Search, Focused Record-to-Record Travel, ASAT, ChainSAT) are effective on hard random 3SAT, 4SAT… problems
This is true even if the heuristic by design can never get out of a potential well, of any depth (ChainSAT). Traps in the landscape do not stop these algorithms.
There seems to be a “clustering condensation” transition in 3SAT very close to the SAT/UNSAT transition.
If there is a clustering transition in 4SAT, these clusters do not seem to be separated in overlap (in contrast to K equal to 8 and greater)

47
Is the search trapped in “potential wells” of metastable states?
N is 1000, α is 4.3
[Figure panels: energy as function of time; distance to target]
ASAT nonlinear regime, no barrier seen