It seems to me (at least from the books and papers on the subject that I have read) that the field of automated theorem proving is a kind of art, or an experimental, empirical engineering of combining various approaches, rather than a science that tries to explain WHY its methods work in various situations and to find classes of situations in which they work. Or maybe I am not aware of some results, or the problem is too hard.

Has it been clarified in a rigorous way why, for example, Robinson's resolution works better than Gilmore's saturation?
(The intuitive reason, of course, is that a most general unifier already contains all the information about infinitely many terms of the Herbrand universe, but that is only intuition, not a concrete reason, just as the observation that you cannot know when to stop searching for a proof of a formula A when both A and not(A) are unprovable is not a rigorous explanation of the undecidability of first-order logic.)
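To make the intuition about most general unifiers concrete, here is a minimal sketch of Robinson-style unification. This is my own illustration, not code from any particular prover; the term encoding and function names are invented for the example.

```python
# Terms are tuples: a variable is ("var", name); a function application is
# (fname, arg1, ..., argk); a constant is a 0-argument application.

def walk(t, subst):
    # Follow variable bindings until we reach a non-variable or an unbound var.
    while t[0] == "var" and t[1] in subst:
        t = subst[t[1]]
    return t

def occurs(v, t, subst):
    # Occurs check: does variable v appear inside term t (under subst)?
    t = walk(t, subst)
    if t[0] == "var":
        return t[1] == v
    return any(occurs(v, a, subst) for a in t[1:])

def unify(s, t, subst=None):
    # Returns a most general unifier as a (triangular) dict, or None on failure.
    if subst is None:
        subst = {}
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if s[0] == "var":
        if occurs(s[1], t, subst):
            return None
        return {**subst, s[1]: t}
    if t[0] == "var":
        return unify(t, s, subst)
    if s[0] != t[0] or len(s) != len(t):
        return None
    for a, b in zip(s[1:], t[1:]):
        subst = unify(a, b, subst)
        if subst is None:
            return None
    return subst

# Unifying p(X, f(X)) with p(a, Y) yields the mgu X -> a, Y -> f(a)
# (stored triangularly: Y maps to f(X) with X already bound to a).
# This single substitution stands in for every ground instance over the
# Herbrand universe that a saturation method would have to enumerate.
mgu = unify(("p", ("var", "X"), ("f", ("var", "X"))),
            ("p", ("a",), ("var", "Y")))
```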

I even saw a paper showing that there are cases in which the British Museum algorithm (essentially brute-force search) performs better than resolution!

Maybe somebody has found some large classes of formulas on which resolution works well?
Or maybe there is some probabilistic analysis of proving methods?

3 Answers

You may be interested in the wonderful little book "The Efficiency of Theorem Proving Strategies: A Comparative and Asymptotic Analysis" by David A. Plaisted and Yunshan Zhu. I have the 2nd edition, which is paperback and was quite cheap. I'll paste the (accurate) blurb:

"This book is unique in that it gives asymptotic bounds on the sizes of the search spaces generated by many common theorem-proving strategies. Thus it permits one to gain a theoretical understanding of the efficiencies of many different theorem-proving methods. This is a fundamental new tool in the comparative study of theorem proving strategies."

Now, from a critical perspective: there is no doubt that sophisticated asymptotic analyses such as these are very important (and to me, the ideas underlying them are beautiful and profound). But, from the perspective of a practitioner actually using automated theorem provers, these analyses are often too coarse to be of practical use. A related phenomenon occurs with decision procedures for real closed fields. Since the work of Davenport and Heintz, it has been known that general quantifier elimination over real closed fields is inherently doubly exponential w.r.t. the number of variables in an input Tarski formula. One full RCF quantifier-elimination method with this doubly-exponential complexity is Collins's CAD. But many (Renegar, Grigor'ev/Vorobjov, Canny, ...) have given singly exponential procedures for the purely existential fragment. Hoon Hong has performed an interesting analysis of this situation. The asymptotic complexities of the three decision procedures considered by Hong in "Comparison of Several Decision Algorithms for the Existential Theory of the Reals" are as follows:

(Let $n$ be the number of variables, $m$ the number of polynomials, $d$ their total degree, and $L$ the bit-width of the coefficients.)

CAD: $L^3(md)^{2^{O(n)}}$

Grigor'ev/Vorobjov: $L(md)^{n^2}$

Renegar: $L(\log L)(\log \log L)(md)^{O(n)}$

Thus, for purely existential formulae, one would expect the G/V and Renegar algorithms to vastly outperform CAD. But in practice this is not so. In the paper cited, Hong presents reasons why, the main point being that the asymptotic analyses ignore huge lurking constant factors which make the singly-exponential algorithms inapplicable in practice. In the examples he gives ($n=m=d=L=2$), CAD would decide an input sentence in a fraction of a second, whereas the singly-exponential procedures would take more than a million years. The moral seems to be a reminder that a complexity-theoretic speed-up w.r.t. sufficiently large input problems should not be confused with a speed-up w.r.t. practical input problems.
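A back-of-the-envelope computation (my own, not from Hong's paper) makes the point vivid. Plugging the example values $n=m=d=L=2$ into the dominant factors of the bounds above, and suppressing the constants hidden in the $O$-notation (taking the exponents as simply $2^n$ and $n^2$), gives

$$(md)^{2^{n}} = 4^{4} = 256, \qquad (md)^{n^2} = 4^{4} = 256.$$

At $n=2$ the doubly-exponential and singly-exponential exponents $2^n$ and $n^2$ coincide, so the asymptotic forms alone predict comparable running times on this input; the gap between "a fraction of a second" and "more than a million years" lives entirely in the constants that the $O$-notation suppresses.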

In any case, I think the situation with asymptotic analyses in automated theorem proving is similar. Such analyses are important theoretical advances, but often are too coarse to influence the day-to-day practitioner who is using automated theorem proving tools in practice.

(One should also mention Galen Huntington's beautiful 2008 PhD thesis at Berkeley, under Branden Fitelson, in which he shows that Canny's singly-exponential procedure can be made to work on the small examples considered by Hong in the above paper. This is significant progress. It still does not compare in practice to the doubly-exponential CAD, though.)

One way of looking at this question is through work on termination analysis of logic programs (googling "termination analysis of logic programs" is not a bad way to find some of the relevant research).

As an example, if the rules I am looking at look like this:

lt(zero,succ(Y)).
lt(succ(X),succ(Y)) <= lt(X,Y).

I know that resolution will succeed or finitely fail for any query ?lt(n,X) where n is a ground term, because every subgoal that arises will have a smaller first argument than the subgoal that led to its generation. On the other hand, if the rules I am looking at look like this:

path(X,Z) <= path(X,Y), path(Y,Z).

I know that, given $n$ path(·,·) facts, I can derive at most finitely many other path(·,·) facts (on the order of $n^2$, in fact), which ensures that saturation will terminate (assuming that what I think of as saturation is what you think of as Gilmore's saturation).
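Both termination arguments can be sketched concretely. The following is my own illustration (the term encoding and function names are invented): backward chaining for the lt program, where each recursive subgoal strips one succ from the first argument, and naive saturation (forward chaining) for the path rule, where the derivable facts are bounded by the square of the number of nodes.

```python
def nat(n):
    # Build the ground term succ(succ(...(zero)...)) for the integer n.
    t = "zero"
    for _ in range(n):
        t = ("succ", t)
    return t

def prove_lt(x, depth=0):
    # Query ?lt(x, Y).  Clause 1, lt(zero, succ(Y)), applies when x is zero.
    # Clause 2, lt(succ(X), succ(Y)) <= lt(X, Y), produces a subgoal whose
    # first argument is strictly smaller, so the recursion depth is bounded
    # by the size of the ground query term.
    if x == "zero":
        return True, depth
    if isinstance(x, tuple) and x[0] == "succ":
        return prove_lt(x[1], depth + 1)
    return False, depth          # finite failure on any other ground term

def saturate(facts):
    # facts: a set of (x, z) pairs standing for path(x, z).  The rule
    # path(X, Z) <= path(X, Y), path(Y, Z) only produces pairs over nodes
    # already mentioned, so at most |nodes|^2 facts exist and the loop
    # must reach a fixed point.
    derived = set(facts)
    while True:
        new = {(x, z2) for (x, z) in derived
                       for (z1, z2) in derived if z == z1}
        if new <= derived:
            return derived
        derived |= new

ok, steps = prove_lt(nat(3))     # succeeds after 3 recursive steps
closure = saturate({("a", "b"), ("b", "c"), ("c", "d")})
```

Here `saturate` computes the transitive closure of the three given facts, adding ("a","c"), ("b","d"), and ("a","d") before reaching its fixed point.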

(A deeper investigation of where the difference between these methods comes from can be found in Chaudhuri, Pfenning, and Price, "A Logical Characterization of Forward and Backward Chaining in the Inverse Method", Journal of Automated Reasoning, 40(2–3), pp. 133–177, 2008, but the response above seemed more relevant to the question.)

Thank you. I have checked out some papers on that subject. But still, when the authors say that one method terminates "more often" than another, they seem to rely on intuitive reasons and experiments. It seems that the problem of comparing (even probabilistically), for example, resolution and brute force is too hard for now.
– Sergei Tropanets, Jun 7 '10 at 12:10

Sorry, I cannot post a comment on your question for some reason (maybe I need more reputation), so I'll just ask here:

What exactly is Gilmore's saturation? I have never heard of it.

What paper was that you were talking about? (The one where brute force performs better than resolution.)

More on topic to the question:

I'm not sure your question (which I'll restrict here to "how well does resolution perform?") makes much sense. Resolution is just a formal method for deriving any 'most directly following' clause from a set of clauses. That is one resolution step.

It doesn't specify anything about how you then search for a proof via resolution. There are dozens of ways to do that. You can do brute force without any intelligence, or you can direct the search with some heuristic. In logic programming, when you ask whether a clause Q is true, you negate Q, add it to the set of other given clauses, and try to derive the empty clause. If you limit yourself to Horn clauses, you can use SLD-resolution, which is much more directed. Still, even with SLD-resolution, it is left open to the prover which clause and which literal to choose.
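The negate-the-query-and-derive-the-empty-clause procedure can be sketched for the propositional case. This is my own minimal illustration (the clause encoding is invented, and the search is deliberately the dumbest possible saturation, to underline that resolution itself fixes only the inference rule, not the search strategy):

```python
from itertools import combinations

# Literals are strings, with "~" for negation; clauses are frozensets of
# literals.  The empty frozenset is the empty clause (a contradiction).

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    # All resolvents of two clauses: cut one complementary literal pair.
    out = set()
    for lit in c1:
        if negate(lit) in c2:
            out.add(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return out

def refutes(clauses):
    # Brute-force saturation: True iff the empty clause is derivable,
    # i.e. the clause set is unsatisfiable.  Terminates because only
    # finitely many clauses exist over a finite set of literals.
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            new |= resolve(c1, c2)
        if frozenset() in new:
            return True
        if new <= clauses:
            return False
        clauses |= new

# Does {p, p -> q} entail q?  Clauses: {p}, {~p, q}, plus the negated
# query {~q}; resolution derives {q} and then the empty clause.
entailed = refutes([frozenset({"p"}), frozenset({"~p", "q"}),
                    frozenset({"~q"})])
```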

That is, in any case you are still doing some kind of search. Prolog, for example, uses depth-first search, while some other systems use iterative deepening. The order is most often determined by the order of the definitions (in your fact file), but you can also try to apply some heuristic here.

This heuristic, or in general the order in which you perform resolution steps, will determine the performance of your prover.

See also the paper of V. P. Orevkov, "Britain museum algorithm can be more effective than resolution" (in Russian). As for "Resolution is just a formal method to get any 'most-directly-following' clause from a set of clauses": of course, and I am asking for a rigorous (at least probabilistic) comparison of resolution with other methods.
– Sergei Tropanets, Jun 7 '10 at 12:34