Comments by Karl

The path-recognition automata are very small not thanks to automata minimization but thanks to cleverly exploiting the fact that the input documents conform to a schema. The trick is that Ahmed enlarges the underlying language, which allows him to make the automaton smaller while still being correct.

Cover Languages and Cover Automata

The following work on cover languages is related to our paper.
The idea is also to enlarge the regular language of interest
with harmless words while reducing the state complexity.

Abstract:
A cover-automaton A of a finite language L subset of or equal to Sigma* is a deterministic finite automaton (DFA) that accepts all words in L and possibly other words that are longer than any word in L. A minimal deterministic finite cover automaton (DFCA) of a finite language L usually has a smaller size than a minimal DFA that accepts L. Thus, cover automata can be used to reduce the size of the representations of finite languages in practice. In this paper, we describe an efficient algorithm that, for a given DFA accepting a finite language, constructs a minimal deterministic finite cover-automaton of the language. We also give algorithms for the Boolean operations on deterministic cover automata, i.e., on the finite languages they represent.

Polya again

The idea of cover languages may be viewed as an instance of Polya's
Inventor's Paradox, which has numerous other instances. See
Polya's Inventor's Paradox.
The problem we want to solve is: find a small automaton for language L.
The invention we have to make is to find a cover Cover(L,K) which
generalizes language L with harmless words outside some
set K called the universe or the effective set.
The invention consists of finding a suitable K as well as defining
Cover(L,K), abbreviated as [K, Cover(L,K)], so that Cover(L,K) intersect K = L.
The metric used is the size of the automaton.
We want StateComplexity(Cover(L,K)) << StateComplexity(L).

Should we worry about StateComplexity(Cover(L,K)) as well as
StateComplexity(Cover(L,K) intersect K) and StateComplexity(K)?
No: if we only check elements of K for membership in L,
then only StateComplexity(Cover(L,K)) is important.

The generalization then is:
Problem: Given languages L, K over the same alphabet Sigma.
Find a fast recognizer for L for words in K.
Solution: Choose any language X satisfying X intersect K = L
so that StateComplexity(X) << StateComplexity(L).

The solution is correct because, for words in K, a recognizer for
X is also a recognizer for L. Given L and K, we want the language X
so that StateComplexity(X) = MinStateComplexity(L,K).

By making a proper
invention [K, Cover(L,K)] for L, the automaton for Cover(L,K) is smaller.
In this sense, solving the more general problem is easier, because it needs fewer states. Defining a super language is clearly a form of generalization.
Mapping back from the general problem to the concrete problem
is easy: do the intersection with K or give the automaton only inputs in K.
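The cover idea can be sketched concretely. In this hypothetical example (the languages are made up for illustration), L = {a, aa, aaa} is finite, the universe K consists of all words over {a} of length at most 3, and the cover X = a+ satisfies X intersect K = L, while a DFA for X needs only 2 states versus 5 for a minimal DFA for L:

```python
# Sketch of the cover idea: X intersect K = L, with
# StateComplexity(X) << StateComplexity(L). All names are illustrative.

def in_L(w):
    # L = {a, aa, aaa}, a finite language; its minimal DFA needs 5 states
    return w in {"a", "aa", "aaa"}

def in_K(w):
    # K = the "universe" (effective set): words over {a} of length <= 3
    return set(w) <= {"a"} and len(w) <= 3

def in_X(w):
    # X = a+, a cover of L: a 2-state DFA suffices
    return len(w) >= 1 and set(w) <= {"a"}

# For every word in K, the cheap recognizer for X decides membership in L:
for w in ["", "a", "aa", "aaa"]:
    assert in_K(w)
    assert in_X(w) == in_L(w)
```

The point of the sketch is only the correctness condition: as long as inputs come from K, answering with X never disagrees with L, so the smaller automaton can be used in its place.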

Polya again and again

Each invention of a good abbreviated path expression
in the context of a program being developed for a given problem,
is a little invention in the sense of Polya.
The original problem becomes easier because now the program is
written in terms of the paths defined by the abbreviated path
expression
which are free of all the noise in the original problem.
The invention is an abstraction which hides low-level information.
WWG(A,C,\Sigma_C) = C \cap \{w \in \Sigma_C^* \mid f(w^*) \in A\}
defines the mapping from abstract to concrete.
Non-ambiguity ensures that a strategy graph path is mapped to at most
one concrete path.

Polya's Inventor's Paradox is used at two levels:
by the programmer (who invents abbreviated path expressions) and
by the implementation technology, which uses a generic
cover language.

For related work section: To be removed

All references to bibtex entries are given above.

Polya's Inventor Paradox (\cite{polya:solve-it}) finds
applications at two levels in our paper: at the programmer level
and at the implementation level.
First, finding a WYSIWYG abbreviated path expression is a little invention in the sense of Polya.
A proper invention leads to a simplification of the programming task by writing a Generic Program.
This applies to all three programming models: AP, XML and AOP.
The original problem becomes easier because now the program is
written in terms of the paths defined by the abbreviated path
expression
which are free of all the noise in the original problem.

The second application of Polya's Inventor's Paradox is hidden from the programmer:
it is used at the implementation level to construct efficient path recognizers.
The problem we want to solve is: find a small automaton for language L.
The invention we have to make is to find a cover Cover(L,K) which
generalizes language L with harmless words outside some
set K called the universe or the effective set.
The invention consists of finding a suitable K as well as defining
Cover(L,K) abbreviated as [K, Cover(L,K)].
The metric used is the state complexity of the automaton. By making a proper
invention [K, Cover(L,K)] for L, the automaton for Cover(L,K) is smaller.
In this sense, solving the more general problem is easier, because it needs fewer states. Defining a super language is clearly a form of generalization.
Mapping back from the general problem to the concrete problem
is easy: do the intersection with K or give the automaton only inputs in K.
The idea of a cover is a generalization of the cover idea in
\cite{1217611}.

Automata are widely used to evaluate XPath Queries on XML documents
\cite{1042051,958947}. Some use schemas to speed up the processing
\cite{653613} while others don't \cite{958947}.
Simple but very effective techniques, such as the Stream Index
\cite{1042051}, are used to skip over irrelevant document parts.
None of the papers in the XML literature use our idea of cover
automata to significantly simplify the deterministic automata.

Sound, Complete and Optimal

Input: S and C over Sigma.
Output: automaton N satisfying:
1. [sound and complete, L(N) in cover(S,C)] (L(N) intersect C) = S,
2. [optimal] For all x in L(N), x can be completed to a
word in S that satisfies C.

We need the concept of optimal cover:

Optimal Cover Problem: Given languages L, K over the same alphabet Sigma.
Find a fast, optimal recognizer for L for words in K.
Solution: Choose any language X satisfying X intersect K = L
so that StateComplexity(X) << StateComplexity(L).
X must be optimal: For all x in X, x can be completed
to a word in L that satisfies K.
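For finite languages the optimality condition can be checked directly. A minimal sketch, assuming finite set representations (the function name is made up for this note):

```python
def is_optimal_cover(X, L):
    """Optimality condition for finite sets: every x in X can be
    completed, by appending some suffix, to a word in L."""
    return all(any(w.startswith(x) for w in L) for x in X)

L = {"a", "aa", "aaa"}
assert is_optimal_cover({"a", "aa", "aaa"}, L)  # every element is a prefix of a word in L
assert not is_optimal_cover({"b"}, L)           # "b" extends to no word in L
```

Optimality rules out covers that accept dead-end prefixes: words the recognizer admits even though no continuation can ever reach L within K.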

The solution is correct because, for words in K, a recognizer for
X is also a recognizer for L. Given L and K, we want the language X
so that StateComplexity(X) = MinStateComplexity(L,K).

There is also a hidden agenda for the cover automaton: it must support programming well. See paper.

Optimal Cover

Karl felt that, in its current form, the paper gives the wrong impression. It says that we need to find
a cover language, while we actually need an optimal cover language and a deterministic automaton for it.

Ahmed pointed out that the cover is optimal because of the way it is
used in section 4: L is a sublanguage of C.
He does not like the word "optimal" because it could mean minimum state complexity.
Let's try the following definition:
language L satisfies the fruitfulness property for S and C,
Fruitfulness(L,S,C), if
(1) L, S, and C are all over the alphabet of C;
(2) for all l in L there is an extension of l to a word
in S that is also in C;
(3) L and S are sublanguages of C.
In other words: L is a subset of PrefixClosed(S intersect C).
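For finite sample sets, the reformulation L subset of PrefixClosed(S intersect C) can be checked directly. A minimal sketch, assuming finite set representations (function names are made up for this note):

```python
def prefix_closure(lang):
    """All prefixes (including "" and the full word) of words in lang."""
    return {w[:i] for w in lang for i in range(len(w) + 1)}

def fruitful(L, S, C):
    """Fruitfulness(L,S,C) for finite sets:
    L is a subset of PrefixClosed(S intersect C)."""
    return L <= prefix_closure(S & C)

S = {"ab"}
C = {"ab", "ba"}
assert fruitful({"a", "ab"}, S, C)  # "a" extends to "ab", which is in S intersect C
assert not fruitful({"b"}, S, C)    # "b" is a prefix of no word in S intersect C
```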

Ahmed views this concept as too clumsy to mention.
It is there implicitly.

In section 4 we have: for all regular languages C, L over Sigma_C where L subset of C:
Fruitfulness(cover(L,C),S,C).

In section 5 we have (lemma 5.1): RR(S,C) gets into a stuck state iff
there is no way to achieve a fruitful path.
Therefore we have:
For all DFAs S and regular sets C as described in section 5:
Fruitfulness(RR(S,C), S with WYSIWYG expansion, C).

Using the RR automaton

For code generation for AP, does RR have any benefits?
Could we just as well use the intersection?
No! Although the generated code is the same as the code generated using the
intersection, we get a faster code-generation algorithm out of RR.
We use RR to interpret the class graph while generating the code.
The disappointment is that RR does not lead to generated code that runs faster;
it is the same.

Another use of RR is in DJ-style tools, where the automaton
is interpreted at run-time. Although RR is smaller than the intersection,
there is a run-time cost: selections have to be made from larger sets
because of all the merged states in RR.
However, computing RR is very fast.
There are scenarios where using RR instead of the intersection
leads to faster programs.

From Ahmed:
There are only two approaches for code generation:
internal (in which methods are introduced into classes) and external.
In internal code generation, the input object is the *this* object; we know its fields beforehand, and therefore there is no need to use reflection.
In external code generation, the input is some Object received as an argument; there are no guarantees as to its fields, and therefore reflection has to be used.