Generalized edit distance command line tool

The GenEditDist tool finds approximate matches between a search string and a list of strings in a dictionary. In addition to the regular edit distance (the Levenshtein distance), a set of weighted transformations can be used in the search.

The tool has two working modes:

Find all matches that are within the maximum edit distance limit;

Find the top N closest matches.
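The two modes can be sketched as follows. This is only an illustration in Python, not the tool's own code: it uses the plain Levenshtein distance as a stand-in for the generalized distance, and the function names `matches_within` and `top_n` as well as the word list are hypothetical.

```python
import heapq


def levenshtein(a, b):
    """Plain edit distance with unit costs (stand-in for the tool's distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute or match
        prev = cur
    return prev[-1]


def matches_within(query, dictionary, max_dist):
    """Mode 1: every entry whose distance does not exceed max_dist."""
    return [(w, d) for w in dictionary
            if (d := levenshtein(query, w)) <= max_dist]


def top_n(query, dictionary, n):
    """Mode 2: the n entries with the smallest distances, as (distance, entry)."""
    return heapq.nsmallest(n, ((levenshtein(query, w), w) for w in dictionary))


# Hypothetical word list, used only for illustration.
words = ["buk", "bok", "haus", "book"]
```

With this list, `matches_within("book", words, 1)` keeps only `bok` and `book`, while `top_n("book", words, 2)` returns the two closest entries regardless of how far away they are.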

So far, the tool has been developed and tested only on UNIX platforms, and it is not expected to work on other platforms.

Environment and input files

The tool expects all of its inputs to be UTF-8 encoded. Before using the tool, set up the environment so that the command line supports UTF-8.

One way to do this is with the bash command:

export LANG=et_EE.utf8

The input 'dictionary' in which the search is performed should be a text file with each entry (a match candidate) on its own line.

The transformations file should also be a text file, with each transformation on a separate line. A transformation from string A to string B with cost W should be written in the file as:

A:B:W

Either A or B (but not both) can be omitted: omitting A defines an addition, and omitting B defines a deletion.

The transformations file should contain only transformations in the given format. Adding empty lines or comments might result in unexpected search behaviour.
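To make the format concrete, here is a minimal sketch of a strict reader for such a file. It is not the tool's own parser: the function name `parse_transformations` is hypothetical, and it assumes that neither A nor B contains the `:` character.

```python
def parse_transformations(lines):
    """Parse lines of the form A:B:W into (A, B, W) tuples.

    Assumption: A and B never contain ':' themselves. Empty lines,
    comments and malformed lines are rejected instead of being skipped.
    """
    rules = []
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        parts = line.split(":")
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected A:B:W, got {line!r}")
        a, b, w = parts
        if not a and not b:
            raise ValueError(f"line {lineno}: A and B cannot both be omitted")
        rules.append((a, b, float(w)))  # float() raises ValueError on a bad cost
    return rules
```

For example, the lines `oo:u:0.5`, `:h:0.2` and `e::0.3` would be read as a replacement, an addition of `h` and a deletion of `e`. These three rules are made up for illustration.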

Using the program

2.1. Full vs partial match

When calculating the generalized edit distance, one can specify whether a full match or a partial match between the search string and a string in the dictionary is sought.

Four modes of matching exist:

Full extent match (flag -f, default):

The distance between the search string and a string in the dictionary is calculated over the full extent of both strings.
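A full extent generalized edit distance can be sketched with a standard dynamic programming table. The code below is an illustration under assumptions, not the tool's implementation: regular single character edits cost 1.0, each rule `(A, B, W)` rewrites A in the search string into B in the dictionary string at cost W, and the name `generalized_distance` is hypothetical.

```python
def generalized_distance(src, dst, rules):
    """Cheapest way to turn all of src into all of dst.

    rules is a list of (a, b, w) tuples: rewrite a into b at cost w.
    Regular edits (insert, delete, substitute) are assumed to cost 1.0.
    """
    n, m = len(src), len(dst)
    inf = float("inf")
    # dp[i][j] = minimal cost of turning src[:i] into dst[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = dp[i][j]
            if here == inf:
                continue
            if i < n:                      # delete src[i]
                dp[i + 1][j] = min(dp[i + 1][j], here + 1.0)
            if j < m:                      # insert dst[j]
                dp[i][j + 1] = min(dp[i][j + 1], here + 1.0)
            if i < n and j < m:            # match or substitute
                cost = 0.0 if src[i] == dst[j] else 1.0
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], here + cost)
            for a, b, w in rules:          # weighted string-to-string rules
                if src.startswith(a, i) and dst.startswith(b, j):
                    ti, tj = i + len(a), j + len(b)
                    dp[ti][tj] = min(dp[ti][tj], here + w)
    return dp[n][m]
```

With no rules this reduces to the Levenshtein distance, so `generalized_distance("book", "buk", [])` is 2.0. Adding a made-up rule `("oo", "u", 0.5)` lowers the distance to 0.5, because the whole change is covered by one cheap transformation.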

NB! If several edit distances are computed for a single match candidate (more than one of the flags -f, -p, -s, -i is set), the candidate is output if at least one of the distances is less than or equal to <max_edit_distance>.
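The rule in the note above amounts to a logical "or" over the computed distances. A minimal sketch, where the function name `should_output` and the idea of keying the distances by flag are assumptions made for illustration:

```python
def should_output(distances, max_edit_distance):
    """True if at least one computed distance is within the limit.

    distances maps a matching mode (here labelled by its flag) to the
    distance computed for one candidate in that mode.
    """
    return any(d <= max_edit_distance for d in distances.values())
```

For instance, a candidate with a full extent distance of 3.0 and a distance of 1.0 in another mode would still be output with a limit of 1.0.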

Usage examples (**):

Searching for string 'book' in the dictionary 'testdata/pidgin_words.txt' using the transformations from file 'testdata/transformations.txt':

<> will block both regular edit distance and generalized edit distance, so the substring region cannot be changed at all.

(( or << at the beginning of the search string block additions before the first letter of the string (otherwise, such additions are allowed). Analogously, additions after the last letter are blocked by )) or >>.

Currently, additions between two consecutive regions are also blocked, so, in case of search string <a><b>c, no addition is allowed between a and b.

Without using the blocked regions, the previous query (suffix matches with the word metre) gives 32 matches;

NB! If the "TOP N matches" search mode is used (flag "-b"), the current implementation does not guarantee that matches with changes in blocked regions will be left out. This is because a change in a blocked region has a concrete cost: 3000.0.
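The effect can be illustrated with a small sketch. It is not the tool's code: the candidate names and distances are invented, and the post-filter assumes that ordinary distances always stay far below 3000.0.

```python
BLOCKED_CHANGE_COST = 3000.0  # cost of a change in a blocked region, per the note above


def top_n(scored, n):
    """Return the n lowest-distance (candidate, distance) pairs."""
    return sorted(scored, key=lambda pair: pair[1])[:n]


def without_blocked_changes(results):
    """Drop results whose distance includes at least one blocked-region change.

    Assumption: distances without such changes are well below 3000.0.
    """
    return [(c, d) for c, d in results if d < BLOCKED_CHANGE_COST]


# Invented distances: two candidates avoid the blocked regions, two do not.
scored = [("alpha", 0.4), ("beta", 1.0), ("gamma", 3000.2), ("delta", 3001.0)]
```

Here `top_n(scored, 3)` still returns `gamma`, because a very high cost only pushes a candidate down the ranking and does not exclude it. Filtering the result with `without_blocked_changes` leaves just `alpha` and `beta`.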

For example, if we search for string <<h>u<t> in the dictionary 'testdata/pidgin_words.txt' using the transformations from the file 'testdata/transformations.txt', only 2 matches can be reached without changing blocked regions: hat and het.
So, if the user asks for more than 2 closest matches, the results will also include matches with changes in the blocked regions:

2.5. Showing transformations / alignments

If the plain full extent match mode is used for finding all matches within the given maximum distance, i.e. the combination of flags -m and -f is used, and none of the flags -p, -s, -i, -e is set, then the flag -a can be used to switch on the mode of showing the transformations / alignments between the search string and each found match.

Note that the alignments are displayed not character-wise, but transformation-wise, considering both the regular edit distance transformations (single character transformations) and generalized edit distance transformations (which can also be string to string transformations). The character : separates different positions in the alignment.

If the optional flag -y is used in the mode of showing the transformations (like in the previous example), then alignments are output in a pretty-printing mode: if two aligned positions contain different length strings, the shorter string is padded with spaces from the left side.
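The padding rule can be sketched as below. This is an illustration only: the function name `format_alignment` and the representation of an alignment as a list of string pairs are assumptions, the exact layout printed by the tool may differ, and string length is counted in code points rather than display columns.

```python
def format_alignment(pairs, pretty=False):
    """Render an alignment as two ':'-separated lines.

    pairs is a list of (search_part, match_part) strings, one per
    aligned position. With pretty=True the shorter string of each pair
    is padded with spaces on the left to the length of the longer one.
    """
    top, bottom = [], []
    for a, b in pairs:
        if pretty:
            width = max(len(a), len(b))
            a, b = a.rjust(width), b.rjust(width)
        top.append(a)
        bottom.append(b)
    return ":".join(top), ":".join(bottom)


# Hypothetical alignment in which "oo" is aligned with "u".
pairs = [("b", "b"), ("oo", "u"), ("k", "k")]
```

Without padding the two lines are `b:oo:k` and `b:u:k`. With `pretty=True` the second line becomes `b: u:k`, so the separators line up.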

There can be more than one best alignment between the two strings. In that case, a semicolon on a new line separates the different alignments:

The optional flag -w can be used to show the weights / costs of the transformations. The weights are displayed in-between the two aligned strings, and colons are used as separators of different weights:

NB! The current implementation of showing transformations does not support partial matches (flags -p, -s, -i), finding Top N matches (flag -b), or using blocked regions within the search string (flag -e).

Compiling the program

Compiling the tool requires the GNU C Compiler (gcc). If the GNU 'make' utility is available, the tool can be built automatically with the command:

> make all

NB! When compiling on a Solaris machine, one should make sure that GNU make is used instead of Sun's make. Usually, this can be done by calling GNU make with full path, for example /usr/sfw/bin/gmake .

(**) About the examples

The dictionary "testdata/pidgin_words.txt" contains words and phrases from Pidgin English, and the dictionary "testdata/english_words.txt" contains the corresponding English matches. The data is based on an online Pidgin/English Dictionary (Pidgin as spoken in Port Moresby, Papua New Guinea), which can be accessed at:
http://www.june29.com/hlp/lang/pidgin.html