Ashish Sharma's Tech Blog

This blog contains a series of posts explaining basic and advanced concepts in data structures, algorithms, parallel processing, and system design, along with fundamentals of Java and Unix, with sample code.

Tuesday, January 31, 2012

I was amazed at how Google does spelling correction so well. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with "Did you mean: spelling". Impressive!

In one of my past projects, I was working on a real-estate infrastructure where we had to support spelling corrections for city/state searches. At that point we used Solr's spelling corrector, but I wanted to dig into the algorithm and write something simple that works without Solr. I did some research online, went through a few published papers, and wrote an algorithm that matches what we were getting from Solr.

The code maintains a static dictionary of correct spellings in a HashMap, which we load at boot. Below is the algorithm in Java.
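
What follows is a minimal sketch of that idea, assuming a HashMap-backed dictionary and a plain Levenshtein edit distance; the class and method names are illustrative, not the exact production code:

import java.util.HashMap;
import java.util.Map;

public class SpellingCorrector {

    // Dictionary of known-good words, loaded once at boot.
    // The value could hold a word frequency to break ties; here it is just a marker.
    private final Map<String, Integer> dictionary = new HashMap<String, Integer>();

    public void addWord(String word) {
        dictionary.put(word.toLowerCase(), 1);
    }

    // Returns the input if it is already correct, otherwise the dictionary
    // word with the smallest edit distance (ties broken arbitrarily).
    public String correct(String input) {
        String word = input.toLowerCase();
        if (dictionary.containsKey(word)) {
            return word;
        }
        String best = word;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary.keySet()) {
            int d = editDistance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    // Classic dynamic-programming Levenshtein distance.
    private static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        SpellingCorrector corrector = new SpellingCorrector();
        corrector.addWord("spelling");
        corrector.addWord("seattle");
        corrector.addWord("portland");
        System.out.println(corrector.correct("speling")); // spelling
        System.out.println(corrector.correct("seatle"));  // seattle
    }
}

Scanning the whole dictionary per query is fine for a small city/state list; for a large dictionary, a candidate-generation approach (all edits within distance 1 or 2) would avoid the full scan.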

Monday, January 30, 2012

Problem Statement: The closest pair of points problem (or closest pair problem) is a problem of computational geometry: given n points in a metric space, find a pair of points with the smallest distance between them.

Basically, you are given N points in a plane and you have to find the two closest points.

This algorithm may be used, for example, to find the nearest location given your current geo location - which is how the question was once posed to me.

This algorithm is very similar to what we do in merge sort. The algorithm described here uses a divide-and-conquer approach to find the closest pair of points in a plane, taking an array of points of the type described above as input.

In some write-ups the combine step is described as O(n): only points in the two halves that lie within MinDistance of the dividing line have to be considered in order to find a closer pair, and each such point needs to be compared against only a constant number of neighbors.

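A sketch of the divide-and-conquer approach in Java (the Point class, method names, and sample coordinates below are illustrative):

import java.util.Arrays;
import java.util.Comparator;

public class ClosestPair {

    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double distance(Point a, Point b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Entry point: sort by x once, then recurse (like the split step in merge sort).
    public static double closestPair(Point[] points) {
        Point[] byX = points.clone();
        Arrays.sort(byX, new Comparator<Point>() {
            public int compare(Point a, Point b) { return Double.compare(a.x, b.x); }
        });
        return closest(byX, 0, byX.length - 1);
    }

    private static double closest(Point[] p, int lo, int hi) {
        if (hi - lo < 3) return bruteForce(p, lo, hi);  // small base case
        int mid = (lo + hi) / 2;
        double midX = p[mid].x;
        double d = Math.min(closest(p, lo, mid), closest(p, mid + 1, hi));

        // Combine step: only points within d of the dividing line can beat d.
        Point[] strip = new Point[hi - lo + 1];
        int n = 0;
        for (int i = lo; i <= hi; i++) {
            if (Math.abs(p[i].x - midX) < d) strip[n++] = p[i];
        }
        Arrays.sort(strip, 0, n, new Comparator<Point>() {
            public int compare(Point a, Point b) { return Double.compare(a.y, b.y); }
        });
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n && (strip[j].y - strip[i].y) < d; j++) {
                d = Math.min(d, distance(strip[i], strip[j]));
            }
        }
        return d;
    }

    private static double bruteForce(Point[] p, int lo, int hi) {
        double d = Double.MAX_VALUE;
        for (int i = lo; i <= hi; i++)
            for (int j = i + 1; j <= hi; j++)
                d = Math.min(d, distance(p[i], p[j]));
        return d;
    }

    public static void main(String[] args) {
        Point[] pts = { new Point(2, 3), new Point(12, 30), new Point(40, 50),
                        new Point(5, 1), new Point(12, 10), new Point(3, 4) };
        System.out.println(closestPair(pts)); // 1.414... for the pair (2,3) and (3,4)
    }
}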

Thursday, October 6, 2011

An inverted index is an index data structure storing a mapping from content, such as words, to its locations in a database file, a document, or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. It is the most popular data structure used in search engines.

Inverted Index Example

T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana".
We have the following inverted file index, where the integers in {} refer to the documents (T0, T1, etc.) in which the term is present:

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

A term search for the terms "what", "is" and "it" would give
-> {0,1} intersect {0,1,2} intersect {0,1,2} = {0,1}.

With the same texts, we get the following full inverted index, where the pairs are (document number, local word position). Like the document numbers, local word positions also begin with zero. So "banana": {(2, 3)} means the word "banana" is in the third document (T2), and it is the fourth word in that document (position 3):

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

If we run a phrase search for "what is it", we get hits for all of the words in both document 0 and document 1, but the terms occur consecutively only in document 1.
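
A rough sketch of that consecutive-position check, assuming the positional postings above are held in memory (the index layout and names are illustrative, not a production structure):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseSearch {

    // term -> (docId -> positions of the term in that doc)
    static Map<String, Map<Integer, List<Integer>>> index =
            new HashMap<String, Map<Integer, List<Integer>>>();

    static void add(String term, int doc, int pos) {
        Map<Integer, List<Integer>> postings = index.get(term);
        if (postings == null) {
            postings = new HashMap<Integer, List<Integer>>();
            index.put(term, postings);
        }
        List<Integer> positions = postings.get(doc);
        if (positions == null) {
            positions = new ArrayList<Integer>();
            postings.put(doc, positions);
        }
        positions.add(pos);
    }

    // A doc matches the phrase if some position p of the first word is followed
    // by p+1 for the second word, p+2 for the third, and so on.
    static boolean phraseMatches(String[] phrase, int doc) {
        Map<Integer, List<Integer>> first = index.get(phrase[0]);
        if (first == null || !first.containsKey(doc)) return false;
        for (int start : first.get(doc)) {
            boolean ok = true;
            for (int k = 1; k < phrase.length; k++) {
                Map<Integer, List<Integer>> postings = index.get(phrase[k]);
                if (postings == null || !postings.containsKey(doc)
                        || !postings.get(doc).contains(start + k)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = { "it is what it is", "what is it", "it is a banana" };
        for (int d = 0; d < docs.length; d++) {
            String[] words = docs[d].split(" ");
            for (int p = 0; p < words.length; p++) add(words[p], d, p);
        }
        // A real engine would first intersect the doc lists, as in the term search above;
        // here we simply check every document.
        String[] phrase = { "what", "is", "it" };
        for (int d = 0; d < docs.length; d++) {
            System.out.println("doc " + d + ": " + phraseMatches(phrase, d));
        }
        // doc 0: false, doc 1: true, doc 2: false
    }
}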

Inverted Index Applications

The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is then inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document as in the forward index, the inverted index lists the documents per word.

With the inverted index created, the word-to-document mapping can be stored in a HashMap, and a query can now be resolved by jumping directly to the word (via random access) in the inverted index.
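
As an illustration, a document-level inverted index and an AND term query can be sketched with a plain HashMap of posting sets, mirroring the T0/T1/T2 example above (class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SimpleInvertedIndex {

    // word -> set of document ids containing it
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            Set<Integer> docs = index.get(word);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                index.put(word, docs);
            }
            docs.add(docId);
        }
    }

    // AND query: intersect the posting sets of all terms.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = index.get(term.toLowerCase());
            if (docs == null) return new TreeSet<Integer>();
            if (result == null) result = new TreeSet<Integer>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        SimpleInvertedIndex idx = new SimpleInvertedIndex();
        idx.addDocument(0, "it is what it is");
        idx.addDocument(1, "what is it");
        idx.addDocument(2, "it is a banana");
        System.out.println(idx.search("what", "is", "it")); // [0, 1]
    }
}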

Thursday, September 29, 2011

Caching is an extremely useful concept for improving performance. There are many commercial and open-source caching packages available for Java. These packages require setting up the whole cache infrastructure in dev, QA, and prod environments and then hooking it into the code. This is fine, but sometimes we just want a quick and simple solution that addresses exactly what we need.

In one of my past projects, we ran a Gallery promotion every few days, and under peak traffic the site started to melt because of read timeouts. One option was to use a cache tool like Memcached. But did we really need that? We quickly wrote a custom LRU cache that fit all our needs. The solution was well accepted and lives in production today.

Here I'll illustrate the construction of a custom LRU cache. I used LinkedHashMap's accessOrder and removeEldestEntry capabilities to create the LRU behavior of the cache.

import java.util.LinkedHashMap;
import java.util.Map;

// Generics are used for illustration; a static field cannot use the class's type parameters.
// For real usage, make the cache (and therefore get and put) static.
// The methods are synchronized, since we delete old elements and reorder entries on every
// access; under multiple threads this would otherwise lead to race conditions.
public class LRUCache<K, V> {

    private final int CACHE_SIZE;
    private final int initialCapacity = 16;
    private final float loadFactor = 0.75F;

    public LRUCache(int size) {
        this.CACHE_SIZE = size;
    }

    // LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder)
    // accessOrder = true keeps entries ordered from least-recently to most-recently accessed;
    // invoking put or get counts as an access.
    public LinkedHashMap<K, V> cache = new LinkedHashMap<K, V>(initialCapacity, loadFactor, true) {
        private static final long serialVersionUID = 1L;

        // removeEldestEntry(Map.Entry) is a LinkedHashMap hook that can be overridden to impose
        // a policy for removing old mappings automatically when new mappings are added.
        // It is invoked by put and putAll after inserting a new entry; returning true tells
        // the map to remove its eldest entry.
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return this.size() > CACHE_SIZE;
        }
    };

    // Adds an entry to this cache. The new entry becomes the MRU (most recently used) entry.
    // If an entry with the specified key already exists in the cache, it is replaced by the
    // new entry. If the cache is full, the LRU (least recently used) entry is removed.
    public synchronized void put(K key, V value) {
        if (value == null) {
            return;
        }
        cache.put(key, value);
    }

    // Retrieves an entry from the cache. The retrieved entry becomes the MRU entry.
    public synchronized V get(K key) {
        return cache.get(key);
    }

    public synchronized void clear() {
        cache.clear();
    }

    // Test routine for the LRUCache class.
    public static void main(String[] args) {
        LRUCache<String, String> c = new LRUCache<String, String>(3);
        c.put("1", "one");   // 1
        c.put("2", "two");   // 2 1
        c.put("3", "three"); // 3 2 1
        c.put("4", "four");  // 4 3 2 (1 is evicted)
        c.get("3");          // 3 becomes MRU: 3 4 2
        for (Map.Entry<String, String> e : c.cache.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
        // Iteration order is LRU to MRU: 2, 4, 3
    }
}

Generics are used here for illustration; a static field cannot refer to the class's type parameters. For real usage, make the cache static, which makes get and put static as well.

Also, the methods have to be synchronized, since we delete old elements and reorder entries on every access. Under multiple threads, this would otherwise lead to race conditions.

Sunday, September 18, 2011

For every element to be processed, we keep going down until we reach a leaf node, and the elements are then processed while coming back up.
Thus, to process an element we recurse (or push elements onto an explicit stack) down to the height of that node, which is O(h).
Since each step down halves the remaining scope (left or right subtree), O(h) is O(log n) for a balanced tree.
And each level of recursion (or each stack push) allocates a new stack frame in memory, so the traversal also uses O(h) space.

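As a small sketch (assuming a plain binary tree node class; names and sample values are illustrative), the recursive in-order traversal below shows the descent to the leaves, the processing on the way back up, and the O(h) stack frames involved:

public class TreeTraversal {

    static class Node {
        int value;
        Node left, right;
        Node(int value) { this.value = value; }
    }

    // Recursive in-order traversal: we descend until we pass a leaf, then process
    // nodes on the way back up. At any moment roughly h stack frames are live,
    // where h is the height of the tree.
    static void inOrder(Node node) {
        if (node == null) return;           // fell below a leaf, start unwinding
        inOrder(node.left);                 // one new stack frame per level of descent
        System.out.println(node.value);     // processed while "coming up"
        inOrder(node.right);
    }

    public static void main(String[] args) {
        Node root = new Node(4);
        root.left = new Node(2);
        root.right = new Node(6);
        root.left.left = new Node(1);
        root.left.right = new Node(3);
        root.right.left = new Node(5);
        root.right.right = new Node(7);
        inOrder(root); // prints 1 2 3 4 5 6 7; this balanced tree needs only ~3 live frames
    }
}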

Over 8 years of experience developing highly scalable algorithms and applications running over terabytes of data and serving millions of reads/writes a second. I have worked across the entire stack of technologies using Java, Servlets, Filters, JSP, Tuckey, multi-threading, caching, regex, XML, JSON, MySQL, Solr, Lucene, jQuery, and JavaScript on Apache httpd and Tomcat servers, with big data on the backend living on HDFS and CDFS.