This post summarizes a number of recent posts on this blog showing how to calculate similarity and distance measures in C# for documents and strings.

First, documents can be represented by a “Bag of Words” (a list of the unique words in a document) or a “Frequency Distribution” (a list of the unique words in a document together with the occurrence frequency).

The simplest similarity measure covered in this series is the Jaccard Similarity measure. It uses a bag of words and compares the number of words common to two documents with the total number of unique words across both documents. It takes into account neither the relative frequency nor the order of the words in the two documents.

Three measures using Frequency Distributions are described. In all these cases an n-dimensional space is created from the Frequency Distribution, with a dimension for each word in the documents being compared. They are:

Euclidean Distance – The straight-line (shortest) distance between two documents in the Frequency Distribution space.

Manhattan Distance – The sum of the side lengths, one per dimension, of the hyper-rectangle formed around two documents in the Frequency Distribution space. This is the sum of the absolute differences in word frequency.

Cosine Distance – The angle subtended at the origin between two documents in the Frequency Distribution space; the cosine of this angle gives the corresponding similarity.

All three measures take word frequency into account, but Cosine Distance depends only on relative frequency: it cannot distinguish between documents whose relative word frequencies are the same, even when their absolute frequencies differ. None of them takes into account the order of the words in the documents.

The final measure is the Levenshtein Minimum Edit Distance. This measure aligns two documents and calculates the minimum number of inserts, deletes or substitutions required to change the first document into the second, which need not be the same length. It takes into account the words, the frequency of words and the order of words in the document.

From Lesk[1] p.254 – “The Levenstein, or edit distance, defined between two strings of not necessarily equal length, is the minimum number of ‘edit operations’ required to change one string into the other. An edit operation is a deletion, insertion or alteration [substitution] of a single character in either sequence”. Thus the edit distance between two strings or documents takes into account not only the relative frequency of characters/words but the position as well. Strings can be aligned too. For example, here’s an alignment of two nucleotide sequences where ‘-’ represents an insertion:

ag-tcc
cgctca

For these two strings the edit distance is 3 (2 substitutions and 1 insertion/deletion). In the case above, substitutions and inserts/deletes (“indels”) have the same weight. Often, substitutions are given a weight of 2 and indels a weight of 1, resulting in an edit distance of 5 for these strings. A substitution is equivalent to a delete followed by an insert, hence the double weight.

The edit distance calculation uses Dynamic Programming. The algorithm is well described in Jurafsky & Martin[2] p.107 and summarized in these PowerPoint slides. This class implements the edit distance algorithm and text alignment in C#:
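The original class is not reproduced here. As a rough illustration only, the dynamic programming step can be sketched as below. The class name `EditDistance`, the method `Compute` and its parameter names are assumptions made for this sketch, and it calculates the distance only (no backtrack or alignment). It assumes a default substitution weight of 2 and an indel weight of 1, as discussed above.

```csharp
using System;

// Hypothetical helper for illustration; not the class from the original post.
public static class EditDistance
{
    // Weighted Levenshtein distance between two strings using dynamic programming.
    // substitutionCost and indelCost are assumed weights (2 and 1 by default).
    public static int Compute(string source, string target,
                              int substitutionCost = 2, int indelCost = 1)
    {
        int n = source.Length;
        int m = target.Length;

        // d[i, j] holds the cost of changing the first i characters of source
        // into the first j characters of target.
        var d = new int[n + 1, m + 1];

        // First column: delete every character of the source prefix.
        for (int i = 1; i <= n; i++) d[i, 0] = i * indelCost;

        // First row: insert every character of the target prefix.
        for (int j = 1; j <= m; j++) d[0, j] = j * indelCost;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                // Matching characters cost nothing; otherwise pay for a substitution.
                int substitute = d[i - 1, j - 1]
                    + (source[i - 1] == target[j - 1] ? 0 : substitutionCost);
                int delete = d[i - 1, j] + indelCost;
                int insert = d[i, j - 1] + indelCost;

                d[i, j] = Math.Min(substitute, Math.Min(delete, insert));
            }
        }

        return d[n, m];
    }
}
```

With these assumed weights, `EditDistance.Compute("INTENTION", "EXECUTION")` returns 8, and `EditDistance.Compute("agtcc", "cgctca", 1, 1)` returns 3 for the nucleotide example when substitutions and indels have equal weight.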

The output from this test reports the edit distance to be 8 and the alignment is:

*EXECUTION
INTE*NTION

The “D” matrix with backtrack information is also displayed:

Note that there may be several different possible alignments since backtracking allows multiple routes through the matrix. This web site http://odur.let.rug.nl/kleiweg/lev/ provides an online tool for calculating the edit distance.

“Machine learning today is usually self-managed and on premises, requiring the training and expertise of data scientists. However, data scientists are in short supply, commercial software licenses can be expensive and popular programming languages for statistical computing have a steep learning curve. Even if a business could overcome these hurdles, deploying new machine learning models in production systems often requires months of engineering investment. Scaling, managing and monitoring these production systems requires the capabilities of a very sophisticated engineering organization, which few enterprises have today.

Microsoft Azure Machine Learning, a fully-managed cloud service for building predictive analytics solutions, helps overcome the challenges most businesses have in deploying and using machine learning. How? By delivering a comprehensive machine learning service that has all the benefits of the cloud. In mere hours, with Azure ML, customers and partners can build data-driven applications to predict, forecast and change future outcomes – a process that previously took weeks and months.”

Euclidean, Manhattan and Cosine Distance Measures can be used for calculating document dissimilarity. Since similarity is the inverse of a dissimilarity measure, they can also be used to calculate document similarity. For document similarity the calculations are based on Frequency Distributions. See here for a comparison between Bag of Words and Frequency Distributions and here for using Jaccard Similarity with a Bag of Words.

The calculation starts with a frequency distribution for words in a number of documents. For example:

An n-dimensional space is then created, with a dimension for each of the words. In the above example a dimension will be created for “Cat”, “Mouse”, “Dog” and “Rat”, so it’s a four-dimensional space. Then, each document is plotted in this space. The following diagram shows the plot for just two of the four dimensions with the Euclidean Distance (the shortest distance between two points):

The Manhattan distance is the sum of the side lengths (one per dimension) of the rectangle formed by the two points:

Finally, the Cosine distance is the angle subtended at the origin between the two documents. An angle of 0 degrees represents documents with identical relative word frequencies, and 90 degrees represents documents with no words in common. Note that this distance is based on the relative frequency of words in a document: a document with, say, twice as many occurrences of every word as another document will be regarded as identical to it.
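To make the angle calculation concrete, here is a minimal sketch. It assumes each document's frequency distribution is held in a dictionary keyed by word; the class name `CosineMeasure` and its methods are hypothetical and are not taken from the original posts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class CosineMeasure
{
    // Cosine of the angle between two word-frequency vectors.
    public static double Similarity(IDictionary<string, int> a, IDictionary<string, int> b)
    {
        double dot = 0, normA = 0, normB = 0;

        foreach (var pair in a)
        {
            normA += (double)pair.Value * pair.Value;

            // Only words present in both documents contribute to the dot product.
            int other;
            if (b.TryGetValue(pair.Key, out other)) dot += (double)pair.Value * other;
        }

        foreach (var pair in b) normB += (double)pair.Value * pair.Value;

        if (normA == 0 || normB == 0)
            throw new ArgumentException("Both documents must contain at least one word.");

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // The angle in degrees: 0 for identical relative frequencies, 90 for no common words.
    public static double AngleDegrees(IDictionary<string, int> a, IDictionary<string, int> b)
    {
        // Clamp to guard against tiny floating-point overshoot before calling Acos.
        double cosine = Math.Max(-1.0, Math.Min(1.0, Similarity(a, b)));
        return Math.Acos(cosine) * 180.0 / Math.PI;
    }
}
```

For example, a document with frequencies Cat = 1, Mouse = 2 and another with Cat = 2, Mouse = 4 give an angle of (approximately) 0 degrees, while two documents with no words in common give 90 degrees.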

For a full description of these distance measures see [1], including details on their calculation.

The Euclidean and Manhattan distances are specific examples of a more general Lr-Norm distance measure. The ‘r’ refers to the power term: 1 for Manhattan and 2 for Euclidean. Therefore a single class can be used to implement both:
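The original class is not reproduced here. A minimal sketch of the idea, assuming frequency distributions are stored as dictionaries keyed by word, might look like this. The name `LrNorm` and the method signature are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class LrNorm
{
    // Lr-Norm distance between two word-frequency vectors:
    // r = 1 gives the Manhattan distance, r = 2 the Euclidean distance.
    public static double Distance(IDictionary<string, int> a,
                                  IDictionary<string, int> b,
                                  double r)
    {
        if (r < 1) throw new ArgumentOutOfRangeException(nameof(r), "r must be at least 1.");

        double sum = 0;

        // One dimension per word that appears in either document.
        foreach (string word in a.Keys.Union(b.Keys))
        {
            // TryGetValue leaves the count at 0 when the word is absent.
            int countA, countB;
            a.TryGetValue(word, out countA);
            b.TryGetValue(word, out countB);

            sum += Math.Pow(Math.Abs(countA - countB), r);
        }

        return Math.Pow(sum, 1.0 / r);
    }
}
```

For two documents with frequencies Cat = 4, Mouse = 1 and Cat = 1, Mouse = 5, the per-word differences are 3 and 4, so `Distance(a, b, 1)` gives a Manhattan distance of 7 and `Distance(a, b, 2)` gives a Euclidean distance of 5.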

The simplest way of representing a document is the “Bag of Words”. This is the list of unique words used in a document. It is therefore a simple present/not present indicator for all words in the vocabulary and does not take into account the occurrence frequency of these words nor the order of the words.

The Bag of Words is used by the Jaccard Similarity measure for document similarity. If two documents have the same set of words then they are deemed identical, and if they have no common words they are completely different. This similarity measure takes no account of the relative length of the two documents being compared.

In C# a Bag of Words can be represented by a generic List. The list type can be either a string (in which case it’s the actual word) or an integer (where the integer is a lookup into a dictionary). The latter is more efficient because each word is stored just once as a string, and the integer lookup (4 bytes) is likely to be shorter than the word itself. The C# in this blog post creates a dictionary and some Bags of Words, and calculates the Jaccard index for documents.
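The original listing is not reproduced here. As a minimal sketch under stated assumptions, the dictionary lookup, the Bag of Words and the Jaccard index could be put together as follows. The class names `Vocabulary` and `Jaccard`, splitting on spaces, and ignoring case are all assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical word-to-integer dictionary; each word is stored once as a string.
public sealed class Vocabulary
{
    private readonly Dictionary<string, int> ids =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    // Returns the existing id for a word, or assigns the next free id.
    public int GetId(string word)
    {
        int id;
        if (!ids.TryGetValue(word, out id))
        {
            id = ids.Count;
            ids.Add(word, id);
        }
        return id;
    }

    // Builds a Bag of Words: the unique word ids in a space-separated text.
    public List<int> ToBag(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(word => GetId(word))
                   .Distinct()
                   .ToList();
    }
}

public static class Jaccard
{
    // Jaccard index: words in common divided by unique words across both bags.
    public static double Index<T>(IEnumerable<T> a, IEnumerable<T> b)
    {
        var setA = new HashSet<T>(a);
        var setB = new HashSet<T>(b);

        // Assumption: two empty bags are treated as identical.
        if (setA.Count == 0 && setB.Count == 0) return 1.0;

        var common = new HashSet<T>(setA);
        common.IntersectWith(setB);

        int unionCount = setA.Count + setB.Count - common.Count;
        return (double)common.Count / unionCount;
    }
}
```

For “the cat sat on the mat” and “the dog sat on the log”, each bag holds 5 unique words, 3 of them shared (“the”, “sat”, “on”) out of 7 overall, so the index is 3/7.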

Frequency Distributions record not only the words in a document but also the frequency with which they occur. However, like Bags of Words, no account is taken of the order of words in the document. These frequency distributions can be compared and used to assess the similarity between two or more documents.

These techniques generally calculate the distance between two documents. A distance measure is the inverse of similarity. Common techniques are Euclidean, Manhattan and Cosine distances.

The following C# class manages Frequency Distributions. It’s a generic class, and so can use strings (the words themselves) or integers (for lookups into a dictionary).
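The original class is not reproduced here. A minimal sketch of such a generic class, with the name `FrequencyDistribution<T>` and its members chosen for illustration rather than taken from the original post, might look like this:

```csharp
using System.Collections.Generic;

// Hypothetical generic frequency distribution: T can be a string (the word itself)
// or an int (a lookup into a dictionary of words).
public sealed class FrequencyDistribution<T>
{
    private readonly Dictionary<T, int> counts = new Dictionary<T, int>();

    // Total number of items added, including repeats.
    public int Total { get; private set; }

    // Records one occurrence of an item.
    public void Add(T item)
    {
        int current;
        counts.TryGetValue(item, out current); // current stays 0 for a new item
        counts[item] = current + 1;
        Total++;
    }

    // Records one occurrence of each item in a sequence.
    public void AddRange(IEnumerable<T> items)
    {
        foreach (T item in items) Add(item);
    }

    // Frequency of an item; 0 if it has never been added.
    public int this[T item]
    {
        get
        {
            int count;
            return counts.TryGetValue(item, out count) ? count : 0;
        }
    }

    // The unique items seen so far (equivalent to a Bag of Words).
    public IEnumerable<T> Items
    {
        get { return counts.Keys; }
    }
}
```

After adding “cat”, “mouse” and “cat” to a `FrequencyDistribution<string>`, the indexer gives 2 for “cat”, 1 for “mouse” and 0 for any word not seen, and `Total` is 3.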

Creating virtual machines with Azure is a great way of standing up test servers, especially for SharePoint where the installation can be long.

You can create a SharePoint Server 2013 trial server from Azure by first selecting “From Gallery”:

And then selecting the “SharePoint Server 2013 Trial”:

The problem with this is that SharePoint is already installed for a farm installation and therefore cannot be installed as a standalone server. As the description provided by Microsoft states, you will need to create another virtual machine for SQL Server and possibly another for a domain controller with Active Directory.

To circumvent this issue you can:

Create the VM using the gallery in Azure as shown above

Install SQL Server Express 2012 on the newly created VM

Create a new farm using the New-SPConfigurationDatabase PowerShell command

Run the SharePoint Products Configuration Wizard and join the farm you’ve just created.