The issue is that we really have a multiclass problem represented by clusters of data (real data cells, signal data cells, out-of-range cells) that, for the time being, we need to treat as a binary problem (data cells and out-of-range cells). Two of the clusters are mixed together (real data cells, signal data cells), and two of the clusters can be mistaken for each other (signal data cells, out-of-range cells).

In that particular example, what I meant by ‘signal data cells’ are things like column header rows. For instance, if the below were our spreadsheet, we would want to get ‘model name’ and ‘model number.’

Now, if we wanted to build a program to work out what our column header row was for that spreadsheet and then output it, we could probably do it fairly easily. That particular project also wouldn’t make a ton of sense, considering it’s already organized for us.

But what if we had a collection of spreadsheets, each organized just slightly differently?

And we wanted to combine them into a single data set. The solution becomes a bit more complex, but a second-year comp sci major could probably come up with a solution involving a boolean or array position given this set.

In reality, though, most of our data sets are more likely to look something like this, if we’re lucky.

It does not take long to realize that at scale this becomes quite the problem to solve. Ultimately, we can have an unknown number of spreadsheets with an unknown level of variation in terminology, and only a promise that the sheets have some sort of correlation to each other. Additionally, while the column headers exist as a cluster, the location in that cluster really provides no reliable context information (my example keeps the positions uniform, but there is no reason to expect that would be true in production).

You need to be able to draw these sets into a single unit by mapping the strings that are their column headers together, and looking at your data it feels like you should be able to do it. But how, and how to do it efficiently and accurately?

If you are Google or Twitter or Facebook, you have the benefit of a) sitting on mountains of data to work with that b) come with interpretable signals thanks to interactive behavior from your users. If you work on common data science example problems, say book or movie data sets, you will probably have the benefit of organized terminology dictionaries you can draw from.

If you are like me, however, you work in a field with a vocabulary unique to its domain, and very few resources out there to help you draw together your terms. In other words, you are starting at or near zero for a mapped library or dictionary to work off of.

What’s a developer to do?

#0 – Understand What Levenshtein Distance Is and Does

Levenshtein distance is a good solution to this specific problem – a) we are matching strings, b) the strings are short, and c) we have enough scale to make a high number of matches given a flexible enough matching threshold.

To explain how it works on a high level, excuse me while I cheat and borrow an example from Wikipedia:

For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

kitten → sitten (substitution of “s” for “k”)

sitten → sittin (substitution of “i” for “e”)

sittin → sitting (insertion of “g” at the end).

Essentially, the ‘distance’ number is the minimum number of changes required to change one word into the other. Some distance methods put different types of constraints on what those changes can be, but Levenshtein will allow just about anything – insertions, deletions, and substitutions – to get the job done.

And, in case you’re wondering, the distance number for ‘kitten’ against ‘kitten’ is zero.

#0b – Understand What Levenshtein Distance Doesn’t Do

Keep in mind, it’s an approximate matching method. So it isn’t perfect.

It also just returns a number. So you are going to have to do some work as part of the implementation to interpret the meaning of that number.

#1 – Create a Function to Work with your Term Lists

So if you haven’t gathered by now, you need to have collected your terms so that you can compare them against each other. My code assumes you have some means of getting your unknown terms together into one list. It also assumes you have a tiny bit of training data to start with. We actually use a nifty process to automatically produce training data out of statistically significantly terms, but that is outside the scope of this post.

You can easily build a training list manually using this design pattern.

C#

public TermLists()
{
    bigList = new List<List<string>>();

    word1List = new List<string>();
    bigList.Add(word1List);
    word1List.Add("word");
    word1List.Add("words");

    word2List = new List<string>();
    bigList.Add(word2List);
    word2List.Add("apple");
    word2List.Add("apples");

    ...

So once you have your training data put together and your terms collected, you need a function to handle everything.

I pass in two dictionaries: one is a string/int key/value pair, while the other maps a string to another List of strings. The string/int dictionary, discreteTermsDict, is your list of unknown terms, while listOfValues holds the organized terms.

Despite the nested foreach loops, the logic is fairly easy to follow. For each of your training data lists, and for each term in that list, Levenshtein Distance will be computed between it and each term in your unknown term list. If a pair passes certain parameters, the unknown term is added to another list, which is eventually added to the organized list of terms.

A couple of logic loops that you can get lost in, but nothing terribly difficult once you sit with it for a bit.
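The full function body isn’t reproduced here, so below is a minimal sketch of what a classifyTerms-style function could look like, following the description above. The parameter names discreteTermsDict and listOfValues come from the text; everything else (the class name, the duplicate checks, the compact Levenshtein helper) is my own illustration, not the original code.

```csharp
using System;
using System.Collections.Generic;

public static class TermClassifier
{
    // For each training list, and each known term in it, score every unknown
    // term with Levenshtein distance; matches that clear the thresholds are
    // collected and appended to that training list.
    public static Dictionary<string, List<string>> ClassifyTerms(
        Dictionary<string, int> discreteTermsDict,      // unknown terms
        Dictionary<string, List<string>> listOfValues)  // organized terms
    {
        foreach (var entry in listOfValues)
        {
            var theList = new List<string>();
            foreach (var b in entry.Value)
            {
                foreach (var c in discreteTermsDict)
                {
                    var d = Levenshtein(b, c.Key);
                    double e = Math.Max(b.Length, c.Key.Length);
                    double ms = (e - d) / e;
                    // same threshold logic as the snippet in step #2
                    if (((ms > .6 && e != 3) || ms > .7) && ms != 1
                        && !entry.Value.Contains(c.Key) && !theList.Contains(c.Key))
                        theList.Add(c.Key);
                }
            }
            entry.Value.AddRange(theList);
        }
        return listOfValues;
    }

    // Standard dynamic-programming Levenshtein; the post's own version
    // appears in step #3.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }
}
```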

#2 – Establish Acceptable Threshold Values

I want to draw attention to one particular section of that function block.

...

var d = levenshtein(b, c.Key);
double e = Math.Max(b.Length, c.Key.Length);
double Ms = (e - d) / e;

if (((Ms > .6 && e != 3) || Ms > .7) && Ms != 1 && !a.Contains(c.Key))
    theList.Add(c.Key);

...

As I said earlier, Levenshtein Distance only returns a number, so you have to establish some logic for what to do with that number once you have it. Taking the ratio of the distance against the length of the longer term, and then checking that ratio against a certain threshold, worked nearly perfectly for me.

You’ll notice that I set up a few parameters, the oddest of which may be Ms > .6 && e != 3. Generally, I found 60% to be low enough to capture likely correlations, but I raised the requirement to 70% for three-character strings; ‘cab’ and ‘cub’ would pass at a 60% level, and I found this dual comparison maximized matches while minimizing false positives, at least for my data sets. I point all of this out because you may need to play with these values in your own work.
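To make the arithmetic concrete, here is the ratio computed for the three-character case versus a longer pair. The Ratio helper name is mine, and the standard Levenshtein implementation below is a stand-in for the function in step #3.

```csharp
using System;

public static class ThresholdDemo
{
    // The Ms-style ratio from step #2: (longest length - distance) / longest length.
    public static double Ratio(string a, string b)
    {
        double e = Math.Max(a.Length, b.Length);
        return (e - Levenshtein(a, b)) / e;
    }

    // Standard dynamic-programming Levenshtein distance.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Ratio("cab", "cub")      -> 2/3 ≈ 0.667: clears 60%, but e == 3 demands 70%, so rejected.
    // Ratio("apple", "apples") -> 5/6 ≈ 0.833: clears either bar.
}
```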

As an aside, I also found that I got far more positive matches keeping strings that could be split apart together. For instance, rather than trying to do a complex array matching method for “pizza pie” and “pizza box” where I created two groups of two strings (“pizza” + “pie”), I just compared the full string. If your strings start getting too long you could run into problems, but for column header-style strings I have not had a problem in my work.
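As a quick check on that choice, the sketch below (again using a standard Levenshtein implementation as a stand-in for the step #3 function, with helper names of my own) compares the whole-string score against the best word-by-word score for the “pizza pie” / “pizza box” example.

```csharp
using System;

public static class SplitVsWhole
{
    // Ms-style ratio from step #2.
    public static double Ratio(string a, string b)
    {
        double e = Math.Max(a.Length, b.Length);
        return (e - Levenshtein(a, b)) / e;
    }

    // Best cross-word score if we had split the strings apart instead,
    // honoring the Ms != 1 rule that excludes exact duplicates.
    public static double BestTokenRatio(string a, string b)
    {
        double best = 0;
        foreach (var x in a.Split(' '))
            foreach (var y in b.Split(' '))
            {
                var r = Ratio(x, y);
                if (r < 1 && r > best) best = r;
            }
        return best;
    }

    // Standard dynamic-programming Levenshtein distance.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Ratio("pizza pie", "pizza box")          -> 6/9 ≈ 0.667: clears the 60% bar.
    // BestTokenRatio("pizza pie", "pizza box") -> 0.4 ("pie" vs "pizza"): well under it.
}
```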

#3 – Add the Levenshtein Distance Function

After all of that, you want to be sure you actually include the Distance function.

C#

public class TermClassifying
{
    ...

    private static Int32 levenshtein(String a, String b)
    {
        if (string.IsNullOrEmpty(a))
        {
            if (!string.IsNullOrEmpty(b))
                return b.Length;
            return 0;
        }

        if (string.IsNullOrEmpty(b))
        {
            if (!string.IsNullOrEmpty(a))
                return a.Length;
            return 0;
        }

        Int32 cost;
        Int32[,] d = new int[a.Length + 1, b.Length + 1];
        Int32 min1;
        Int32 min2;
        Int32 min3;

        for (Int32 i = 0; i <= d.GetUpperBound(0); i += 1)
            d[i, 0] = i;
        for (Int32 i = 0; i <= d.GetUpperBound(1); i += 1)
            d[0, i] = i;

        for (Int32 i = 1; i <= d.GetUpperBound(0); i += 1)
        {
            for (Int32 j = 1; j <= d.GetUpperBound(1); j += 1)
            {
                cost = Convert.ToInt32(!(a[i - 1] == b[j - 1]));
                min1 = d[i - 1, j] + 1;
                min2 = d[i, j - 1] + 1;
                min3 = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
}

#4 – Use Levenshtein Distance to Create Suggested Training Data

The last portion of the classifyTerms function collects your terms together inside of a list and then organizes those lists inside of a dictionary. Typically at this point, I treat those lists as suggested training data, which can then be used on a second or third iteration to further build up your training lists. From there, you now have a fairly well-mapped list of terminology that can be used to build your unified data set.
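The feedback step itself isn’t shown in the post, so here is one illustrative way it could look: promote each suggested list into the training set and remove its terms from the unknown pool before re-running. All of the names in this sketch are hypothetical.

```csharp
using System.Collections.Generic;

public static class SecondPass
{
    // Promote first-pass suggestions to training data: each suggested list
    // becomes a training list, and its terms leave the unknown pool, so the
    // next classification pass only works on what is still unmatched.
    public static void PromoteSuggestions(
        Dictionary<string, List<string>> suggestedLists,
        List<List<string>> trainingLists,
        Dictionary<string, int> unknownTerms)
    {
        foreach (var suggested in suggestedLists)
        {
            trainingLists.Add(new List<string>(suggested.Value));
            foreach (var term in suggested.Value)
                unknownTerms.Remove(term);
        }
    }
}
```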

There may still be some work left to be done – some of the term lists that were produced may actually need to be mapped to each other, for example – but what work remains should be far smaller in scale than what you would be looking at otherwise.

Using this method, we were able to cut a term list roughly 400 strings strong down to about 40 words that could not otherwise be matched, with roughly 20 lists organized. Once you have your project up and running, cutting through that sort of nonsense becomes relatively trivial. There are certainly more refined methods once you have data to work off of, but when you are starting at or near zero on domain knowledge for your program, the relative ease of implementing Levenshtein Distance is hard to beat.

Feel free to leave questions or comments or suggestions for improvements below – or send me something on Twitter. I’d be interested to know how people are using this and methods like it in their own projects.