Document Similarity Analysis Using ElasticSearch and Python

Elasticsearch is an open-source search engine based on Lucene. It is used by market leaders such as Wikipedia, LinkedIn, and eBay, and it has an official Python client, elasticsearch-py.

You can download Elasticsearch from here. To install, just unzip the downloaded file and run bin/elasticsearch.bat on Windows, or bin/elasticsearch on Unix systems. Visit http://localhost:9200/ in your browser to check that it is installed properly.

Further information about the installation and setup of Elasticsearch can be found here.

The python client can be installed by running

pip install elasticsearch

The process of generating cosine similarity scores for documents using Elasticsearch involves the following steps:

Create an index

Index the individual documents

Search and get the matched documents and term vectors for a document

Calculate the cosine similarity score using the term vectors

Creating an index

An index is like a ‘database’ in a relational database system. It is a data organization mechanism that lets you partition data a certain way. Elasticsearch also uses the index to decide how to distribute data around the cluster.

Unlike the databases of an RDBMS, indices are lightweight, so you can create hundreds of them without running into any problems.

The following is the code to create an index

es = elasticsearch.Elasticsearch()

initializes the Elasticsearch client, and the es.indices.create call actually creates the index.

Here we pass the index name parameter and the body parameter, which contains the various settings and mappings that configure the index.

def create_or_clear_index(index_name):
    es = elasticsearch.Elasticsearch()

    # Delete the index if one already exists
    try:
        es.indices.delete(index=index_name)
    except elasticsearch.NotFoundError:
        pass

    # Set up a fresh index and mapping
    es.indices.create(index=index_name,
                      body={
                          "mappings": {
                              "page": {
                                  "_source": {"enabled": True},
                                  "properties": {
                                      "url": {"type": "string"},
                                      "page_text": {"type": "string",
                                                    "term_vector": "yes"},
                                      "title": {"type": "string",
                                                "term_vector": "yes"}
                                  }
                              }
                          }
                      })

Index individual documents

The index call adds a JSON document to a named index and hence makes it searchable.

Here we pass the index name, the type of the document, and the document itself.

def index_the_text(inp):
    page_title, text_data = inp
    try:
        es.index(index=idxname,
                 doc_type="page",
                 id=page_title,
                 body={
                     "title": page_title,
                     "page_text": text_data
                 })
        print("-" * 10)
    except Exception as e:
        print(e)

The above function takes a tuple containing the page title and text as input.

Just for the sake of this problem, we assume the title of the document is a unique identifier and index it as the id of the document.

Get the 5 most similar documents for every document

To get the most similar documents we use mlt, which stands for “more like this”. This API call returns the documents that are “like” the reference document we pass.

mlts = es.mlt(index=index_name, doc_type="page",
              id=doc_id, mlt_fields="page_text",
              search_size=5)

The mlt_fields parameter specifies the exact fields to perform the query against, and the search_size parameter specifies the number of documents to return. Apart from these, you can also specify stop words and many other parameters.
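The call returns a standard search response. As a quick sketch of pulling out the matched ids and scores, here is a hand-made sample dict in the shape of that response (the ids and scores are illustrative, not real output):

```python
# Hand-made sample shaped like an Elasticsearch search response;
# the ids and scores are illustrative, not real output.
mlts = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "Page A", "_score": 1.4},
            {"_id": "Page B", "_score": 0.9},
        ]
    }
}

# Pull out (id, score) pairs for the matched documents
matches = [(hit["_id"], hit["_score"]) for hit in mlts["hits"]["hits"]]
print(matches)  # [('Page A', 1.4), ('Page B', 0.9)]
```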

Get the term vectors

tvjson = es.termvector(index=index_name, doc_type="page",
                       id=doc_id)

The above call returns information and statistics about the various terms in the fields of a particular document. We need this data to calculate the cosine similarity score of the documents.

To get the term vectors from the statistics returned, we write a function:

def get_tv_dict(tvjson):
    # Map each term in the page_text field to its term frequency
    return {k: v['term_freq']
            for k, v in tvjson['term_vectors']['page_text']['terms'].items()}
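To make the extraction concrete, here is the helper run against a hand-made sample with the shape of a termvector response (the terms and counts are made up for illustration):

```python
def get_tv_dict(tvjson):
    # Map each term in the page_text field to its term frequency
    return {k: v['term_freq']
            for k, v in tvjson['term_vectors']['page_text']['terms'].items()}

# Hand-made sample shaped like a termvector response; values are illustrative
tvjson = {
    "term_vectors": {
        "page_text": {
            "terms": {
                "elastic": {"term_freq": 3},
                "search": {"term_freq": 2},
                "python": {"term_freq": 1},
            }
        }
    }
}

print(get_tv_dict(tvjson))  # {'elastic': 3, 'search': 2, 'python': 1}
```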

Once we get the term vectors for documents we can calculate the cosine similarity score

Given below is the function to calculate the cosine similarity score of documents given the term vectors

Calculate the cosine similarity score

def get_cosine(vec1, vec2):
    # Dot product over the terms the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the Euclidean norms of the two vectors
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator
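A small worked example shows the function in action; the two term-frequency vectors below are made up:

```python
import math

def get_cosine(vec1, vec2):
    # Dot product over the terms the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    # Product of the Euclidean norms of the two vectors
    denominator = (math.sqrt(sum(v ** 2 for v in vec1.values())) *
                   math.sqrt(sum(v ** 2 for v in vec2.values())))
    return float(numerator) / denominator if denominator else 0.0

# Made-up term-frequency vectors for two documents
doc1 = {"elastic": 2, "search": 1, "python": 1}
doc2 = {"elastic": 1, "search": 2}

# numerator = 2*1 + 1*2 = 4; denominator = sqrt(6) * sqrt(5)
score = get_cosine(doc1, doc2)
print(round(score, 4))  # 0.7303
```

A vector with no overlap (or an empty vector) scores 0.0, and identical vectors score 1.0.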

The entire code is given here. It takes a set of URLs from a file called “urls_file.txt”, crawls them, indexes them, and, for each document indexed, fetches the nearest documents and writes the cosine similarity scores to a file “output.csv” in the current directory. Please note that lxml has to be installed for this script to run.