Carl Cereke makes an interesting point:

> As part of my recently completed PhD, I analysed about 200,000
> incorrect Java programs written by novice programmers. Nearly all are
> correct lexically.
> ...
> Anyway, the point is that almost all errors involve a lexically valid
> token stream.

My vision of the problem was probably biased by viewing the errors
made by my colleagues, who are hardly novice programmers. Most of the
errors I have seen as we work on programs are "typographical" ones,
usually involving a single missed, transposed, or slightly modified
character (e.g. ' and " are just a shift-key away on my keyboard).
Note, we also do a lot of editing by cut-and-paste, so errors where
the region to be moved is missing a few characters on either end are
also common--again, that tends to exaggerate the number of
typographical style errors.

Now, perhaps helping professional programmers out is not as important,
since we are generally capable of finding and fixing our own errors.
However, the typographical errors (especially misplaced delimiters)
are ones that cause the most cascading errors. This probably also
accounts for the maxim that many programmers, including myself,
follow: fix the first error and ignore the rest. One can do a little
better than that, but quite quickly one learns when to punt on chasing
down subsequent error messages. About the only time I pursue all
errors is when I have modified a procedure's argument list, and then
the error instances are almost always the correct list of calls to
that procedure.

In contrast, and this adds weight to your hypothesis, while less
common for professional programmers, the most difficult to find errors
involve complicated and subtle uses of the language in question. The
errors that my colleagues and I have the hardest time finding involve
(in C++) subtle declarations where the exact syntax is not always
transparent. Any one of templates, macros, and a variety of
constructors can quickly result in a declaration which, when
misspelled, gives a totally uninformative error message. And the errors are often
of the semantic ilk, where the declaration was misinterpreted by the
compiler to be of a different semantic class than the one intended and
the errors are thus unrelated to what we were attempting to write.

I would be interested to hear if you came up with a strategy to handle
these semantic errors, where the user has submitted a set of valid
declarations and statements, but there is a mismatch between the
semantics of one or more of the entries--e.g. the user has declared a
function but meant to declare a variable, and the uses would all have
been correct if the variable had been declared, but were in error
because the compiler "knew" that the identifier represented a function
(from the syntactically correct, but unintended, function declaration)
and the uses were thus contextually illegal.

One thing the semantic problem (as I posed it above) has in common
with the syntactic problem is that in both cases the correct answer
can only be seen in a complete context, where "correct" fragments are
modified until the total errors are minimized.

In the lexical case, the parser assumes that the tokens the lexer has
handed to it are correct. However, when one delimiter is missed, the
tokens the lexer hands to the parser are not the correct ones--they
are gibberish (lexically correct gibberish). And if one looked at the
lexical problem in terms of finding a correctly parsable program with
the *minimum editing distance*, one would immediately see that the
correct program would have a different set of tokens. However, I've
never seen a lexer/parser that attempted to find the program with the
minimum editing distance from the original source.
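To make that idea concrete, here is a small Python sketch--entirely illustrative, and mine alone--of picking, from a set of candidate programs, the valid one at minimum editing distance from the broken source. The `parses` check below is a toy stand-in for a real lexer/parser (it only checks balanced quotes and parentheses), and the candidate list would in reality be generated, not hand-supplied:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def parses(src: str) -> bool:
    """Toy acceptability check: balanced quotes and parentheses."""
    return src.count('"') % 2 == 0 and src.count('(') == src.count(')')

def best_repair(broken: str, candidates: list[str]) -> str:
    """Among candidates that 'parse', pick the one closest to the input."""
    ok = [c for c in candidates if parses(c)]
    return min(ok, key=lambda c: edit_distance(broken, c))

broken = 'print("hello)'  # one missed delimiter: the closing quote
candidates = ['print("hello")', 'println("hello")', 'print("hello']
print(best_repair(broken, candidates))  # picks the valid program closest to the source
```

The point of the sketch is only the ranking criterion: the one-character repair beats the three-character one, which matches the "single missed character" errors I described above.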

Similarly, one could view certain semantic problems the same way.
Given a syntactically correct set of declarations, how can one find
the minimally modified set that generates no errors? In my
example, where a function was declared when a variable was intended,
in most languages it would take fewer changes to the source text to
change the declaration type of the identifier from function to
variable than it would to correct all (mis)uses of the identifier.
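Again purely as an illustration--the declarations, uses, and the crude edit counting below are all invented for this example--one can count the textual changes each repair plan would require and see that redeclaring wins:

```python
import difflib

def edits_needed(a: str, b: str) -> int:
    """Rough edit count: characters inserted plus characters deleted."""
    sm = difflib.SequenceMatcher(None, a, b)
    matched = sum(size for _, _, size in sm.get_matching_blocks())
    return (len(a) - matched) + (len(b) - matched)

decl_fn  = "int f();"  # what the programmer wrote (a function)
decl_var = "int f;"    # what was intended (a variable)
uses     = ["f = 3;", "f += 1;", "return f;"]  # uses that assume a variable

# Plan A: change the declaration from function to variable.
cost_fix_declaration = edits_needed(decl_fn, decl_var)

# Plan B: rewrite every use into call syntax to match the declaration.
cost_fix_uses = sum(edits_needed(u, u.replace("f", "f()")) for u in uses)

print(cost_fix_declaration, cost_fix_uses)  # plan A needs fewer edits
```

With three use sites the declaration fix is already cheaper, and the gap only widens as the number of uses grows.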

Note that a compiler that followed these principles would work (and
particularly report errors) in a completely different fashion than any
that I am aware of today.

The common error reporting scheme is to find oneself in an
"impossible" situation, where something cannot be correct, and to
report an error complaining about the current predicament. Note that
the error recovery scheme almost always assumes that what was
successfully processed before was correct and that the internal
database contains only true facts. Because the software (compiler) is
not robust, and we cannot assume that the program (compiler) will not
crash at some future point due to these inconsistencies exposing weak
points in the program (compiler) we are writing, we attempt to report
the error right away and perhaps make a graceful exit without doing
further damage to the program's (compiler's) internal data
structures.

An "error minimizing compiler" would have to be able to assume that it
was robust enough not to fail and then collect all the inconsistencies
and then try to modify things to resolve those inconsistencies. Once
the compiler knew which sets of modifications resolved all the
inconsistencies, it would then report the set(s) of modifications that
were minimal. By the way, I'm not confident enough that I could write
a sufficiently robust compiler to allow pursuing such an error
recovery strategy.
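Still, the search itself is easy to sketch in miniature. Below is a toy Python version--identifier kinds, the candidate modifications, and the brute-force search are all invented for illustration--that first collects every inconsistency and only then reports a minimal set of modifications resolving all of them:

```python
from itertools import combinations

decls = {"f": "function", "n": "variable"}            # declared kinds
uses  = [("f", "variable"), ("f", "variable"),        # each use and the
         ("n", "variable")]                           # kind it expects

def inconsistencies(decls, uses):
    """Every use whose expected kind disagrees with the declaration."""
    return [(name, kind) for name, kind in uses if decls[name] != kind]

def candidate_mods(decls, uses):
    """Single modifications: flip one declaration, or rewrite one use."""
    return ([("redeclare", name) for name in decls] +
            [("rewrite-use", i) for i in range(len(uses))])

def apply_mods(decls, uses, mods):
    decls, uses = dict(decls), list(uses)
    for kind, target in mods:
        if kind == "redeclare":
            decls[target] = ("variable" if decls[target] == "function"
                             else "function")
        else:  # rewrite the use to match its current declaration
            name, _ = uses[target]
            uses[target] = (name, decls[name])
    return decls, uses

def minimal_repair(decls, uses):
    """Smallest set of modifications leaving zero inconsistencies."""
    mods = candidate_mods(decls, uses)
    for size in range(len(mods) + 1):
        for combo in combinations(mods, size):
            d2, u2 = apply_mods(decls, uses, combo)
            if not inconsistencies(d2, u2):
                return list(combo)

print(minimal_repair(decls, uses))  # one redeclaration beats two use rewrites
```

The brute-force search is hopeless at scale, of course; the sketch is only meant to show the shape of the strategy--collect everything first, then minimize--rather than the report-and-bail-out scheme described above.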