Would you be ready to integrate such a feature? If so, which technical changes do you think would be necessary for the patch to be acceptable?

## Additional details

The patch was designed to be non-invasive, and only modifies the final error reporting function in typetexp.ml; it does not modify the semantics of the type-checker at all. I still needed to convey the current environment to the point of error reporting, so the Unbound_foo exceptions have changed to take an additional parameter.
Is it ok to change the exception definitions? (May user code be affected by such a change?). If this is unacceptable, it would be possible to change the patch to use an ugly global variable.

The last example (Lis.lengt) exhibits a slight downside of the feature: while there are two typos in the complete path, it will only report the first one. This is not due to the patch itself, but to the fact that it is grafted onto the error reporting function, after the "unbound name narrowing" of narrow_unbound_lid_error has taken place, rather than at the exception raising site. I think this is the right choice.

The heuristics of the spell-checking (when is another word considered a good suggestion?) can be fine-tuned. Currently, the patch uses the Damerau-Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance [^] : an edit distance taking swapping of adjacent letters into account ), with a low cutoff designed to detect only obvious typos: it does no correction for identifiers of size below 2, look at distance 1 at most for below 4, distance 2 for sizes below 6 and distance 3 for larger identifiers. It returns the set of words found in the environment with the minimal distance, if it is smaller than the distance limit.

I have done some performance testing. For the real-world cases I could try the suggestion is instantaneous. I have included in the patch a test generator used to creates code with 1000 and 500_000 identifiers unqualified in the environment. With 1000 identifiers (to put it in perspective, Pervasives exports around 150 identifiers and the modules of the OCaml compiler itself have at most around 500 identifiers in the global scope), spell-checking takes 0.12s with ocamlc and 0.015s with ocamlc.opt. For 500_000 identifiers it takes 5s with ocamlc and 0.8s with ocamlc.opt. Given that this computation is only done in the case of an Unbound_foo error, I think performance is not an issue -- if it was considered to be, I could add a flag to disable the feature. The spell-checking runs after the preliminary "Error: Unbound ..." has been flushed, so even in the case of a particularly long delay the user would have the choice to interrupt the computation (as there is nothing done afterwards); I have tested that this works in the toplevel as well (without exiting the toplevel).

The patch currently doesn't handle the "Unbound type variable" error, as it relies on a different form of environment and would therefore make the code a bit more complex. I'm ready to add this as a separate patch afterwards.

The patch also does not rely on any form of type information, only names bound in the environment. One could dream of suggestions refined by type-checking, but this is outside the scope of the current, modest implementation.

This has been proposed before and I think the consensus was that it's not a job for the compiler, but rather for some development environment like TypeRex. OTOH, it's hard for TypeRex to get the info it needs if the file doesn't compile (and hence doesn't generate a typed AST).

I would like to point out that the error-reporting mechanism of clang has been widely praised in every single presentation/report I've seen about clang; and that's one of the reasons why people are switching from gcc to clang. Moreover, the patch is really small and self-contained, I think this is a low-cost / big-win kind of bug.

I confirm that some are using Clang *exclusively* for error reporting.

Also, doing this kind of things in the type checker itself allows some clever things:
- type aware suggestion: if c is a record with a field foobar, and there is a variable foocar, when the user types c.foolar, you can suggest only foobar. Same for a module, or even more clever things.
- if there is only one suggestion, the type checking can keep going after replacing the identifier with the suggested one to report more errors in one pass.

This is probably not necessary, but there are simple ways to improve performance:

- Do not allocate a matrix: keeping the last two rows (and the current one) is enough. For repeated distance calculation with a fixed target, these rows can be allocated once and for all. (If the function returns a cutoff + 1 instead of None, one can arrange so that the lookup for suggestions does not allocate at all.)

- Drop the longest common suffix and the longest common prefix from the two strings.

I had tested that matrix allocation is not the performance bottleneck (by measuring the time of the allocation alone), so I intentionally neglected the first optimization. I have not experimented with the two others, and I agree this may be worth it (but I'm not sure how to test them reliably). I'm ready to experiment with this afterwards if the feature gets accepted.

I have slightly modified the patch to take into account:
- an error in the handling of cutoff reported by 'yoch'
- some protection against overflows if "max_int" is used as cutoff
- some style comments by Damien

I have also attached a second patch providing a testsuite for the edit_distance function.

Thanks Alain, I will be happy to make typos in the next version of OCaml...

Maybe you could also apply the patch that adds the test to the testsuite? It's not world-changing but could provide regression test against very imaginary future changes to the edit function (eg. for performance reasons).