I'm thinking of writing a parser to load files that my customers have created. I'm a
software requirements engineer; the data consists of the customers' thoughts in
response to the latest release of the requirements doc. In fact, the files will
probably be copies of the requirements doc itself, into which customers have entered
their notes and made changes. The original requirements doc will have a format that
can be parsed, probably something simple such as lines marked with codes like:
//customer={customer name goes here}
//requirement=
{requirement text goes here}
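To make the format concrete, here is a minimal sketch in Python of how I imagine reading it. The language choice, the regex, and the name parse_fields are all assumptions for illustration; nothing is decided yet. It treats any line of the form //name=value as the start of a field and attaches the unmarked lines that follow to that field.

```python
import re

# Hypothetical marker syntax: "//" + field name + "=" + optional inline value.
FIELD_RE = re.compile(r"^//(\w+)=(.*)$")


def parse_fields(text):
    """Return a list of (field_name, value) pairs in document order."""
    fields = []
    name, chunks = None, []
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            # A new marker closes the previous field, if there was one.
            if name is not None:
                fields.append((name, "\n".join(chunks).strip()))
            name, chunks = m.group(1), [m.group(2)]
        elif name is not None:
            # Unmarked lines belong to the most recent field.
            chunks.append(line)
    if name is not None:
        fields.append((name, "\n".join(chunks).strip()))
    return fields
```

The field name comes back as a plain string, so a misspelled marker such as //customor= still parses; it just carries a name nobody recognizes yet.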
When I parse the documents that come back from the customers, they are likely to
contain some errors. Field names may be mangled or misspelled. Customer names may be
entered in unrecognizable variants (e.g. someone named "Michael" is indicated as
"Mike") and so forth.
I was thinking that it might be useful to have a Google-like "do you mean this?"
feature. If the field name is //customer=, then the parser might recognize a huge
list of variants like //ustomer=, //customor=, etc. That is, it would recognize them well
enough to continue parsing and give a decent error message in context.
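As a rough illustration of the behavior I'm after, here is a sketch that uses difflib.get_close_matches from the Python standard library to pick the nearest known field name. The names KNOWN_FIELDS, suggest_field and check_line, the 0.6 cutoff, and the message wording are placeholders of mine, not a settled design.

```python
import difflib

# Hypothetical list of valid field names from the requirements doc.
KNOWN_FIELDS = ["customer", "requirement"]


def suggest_field(token, known=KNOWN_FIELDS, cutoff=0.6):
    """Return (field, exact): the best-matching known field, or (None, False)."""
    if token in known:
        return token, True
    matches = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
    return (matches[0], False) if matches else (None, False)


def check_line(line_no, token):
    """Return an error message for a bad field name, or None if it is fine."""
    field, exact = suggest_field(token)
    if field is None:
        return f"line {line_no}: unknown field '//{token}='"
    if not exact:
        return (f"line {line_no}: unknown field '//{token}=', "
                f"did you mean '//{field}='?")
    return None
```

With something like this the parser could keep going, treating //ustomer= as //customer= while still reporting the problem with its line number.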
Any ideas how to go about this?
I don't think I would create a parser language that includes every variant; instead,
the field names would be tokens that could be passed to another routine. The
variants could be generated ahead of time. I would limit the number of variants to
something manageable, like 10,000 for each field name.
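Here is a sketch of what generating the variants ahead of time might look like, assuming the variants are simply every string one edit away from a field name (one deleted, swapped, replaced, or inserted character). That definition, the lowercase-only alphabet, and the names edits1 and build_variant_table are my assumptions.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def build_variant_table(field_names):
    """Map every precomputed variant (and each exact name) to its field name."""
    table = {name: name for name in field_names}
    for name in field_names:
        for variant in edits1(name):
            # setdefault keeps exact names and earlier fields from being overwritten.
            table.setdefault(variant, name)
    return table


# Hypothetical field names; built once, then used as a plain dictionary lookup.
VARIANTS = build_variant_table(["customer", "requirement"])
```

Single edits give a few hundred variants per field name, well under the 10,000 I mentioned. Allowing two edits would probably go past that limit for names of this length.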
Thanks,
Mike