Archive for January, 2007

Still in the Jungle in early 2002, I came across the then-rising supernova of the search engines – Google. They had just posted details of their first programming contest. Now, for historical reasons that will for now remain undisclosed, I do hate programming contests, particularly those that limit the choice of programming languages to something lame like Pascal (yuck!); the rules of many of them are also too far removed from anything practically useful. In this respect the contest Google offered back then was really good: I was a bit annoyed that I could not write in Perl, my language of preference at the time, but other than that the idea was excellent – Google just gave you a bunch of crawled web pages and you had to do something interesting with the data. Sadly, Google’s later contests moved in the wrong direction, so badly that they are not worth talking about, but their first attempt was the best. As you will see, this contest played a pretty critical role in the decision to start working on DSearch.

Many failures can be turned into opportunities, and it is especially rewarding when the failure was not yours. So when all the lousy architectural decisions made by expensive fishy consultants became apparent, I decided to give some extra thought to how to make a better search engine, because the one we had implemented could basically be summed up in the following SQL query:

select * from Products where Keywords like '%KeyWord%'

It is hard (if possible at all) to do worse than that: this approach is particularly bad when one or more of the keywords is wrong, which forces the database to scan the whole table before finding nothing – a handful of pointless queries could hit the site very hard and effectively allow bad chaps to mount a DoS attack on an e-commerce site. The database in question was DB2, a poor but “free” replacement for the great Sybase DB. The box running the database had 12 CPUs and they were loaded at around 75-80% – way too high, so I took that as a chance to try out a newer approach I had come up with to make searching faster: basically, we needed to get away from table scanning and, ideally, decide quickly whether some keywords would never produce any matches, so the search could be aborted early.

Products were already referenced by a unique integer product ID, so it was only logical to turn keywords into numbers too: a simple Perl script took product IDs with their keywords, tokenised the keywords and converted them into unique WordIDs, thus creating a lexicon, or dictionary. The dictionary was kept as a separate data table with a unique index on the keyword, which allowed a very fast determination of either that the query contains keywords not present at all (made-up queries, probably designed to DoS us), or of the WordIDs we needed to search for. Later, when I started reading the relevant research papers, I found out that this approach is called an Inverted Index.
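The original script was Perl and the dictionary lived in a database table, but the idea can be sketched in a few lines of Java (names here are mine, purely for illustration): assign each keyword an integer WordID once, then reject any query containing an unknown word before touching the big products table at all.

```java
import java.util.*;

// A minimal sketch of the lexicon/early-abort idea described above --
// an illustration, not the original code.
public class Lexicon {
    private final Map<String, Integer> wordIds = new HashMap<>();

    // Tokenise a product's keyword string and assign WordIDs to new words.
    public void addKeywords(String keywords) {
        for (String word : keywords.toLowerCase().split("\\s+")) {
            wordIds.putIfAbsent(word, wordIds.size());
        }
    }

    // Return WordIDs for the query terms, or null if any term is unknown
    // (the early-abort case: such a query can never match any product).
    public int[] lookup(String query) {
        String[] terms = query.toLowerCase().split("\\s+");
        int[] ids = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            Integer id = wordIds.get(terms[i]);
            if (id == null) return null; // made-up keyword: abort the search
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        Lexicon lex = new Lexicon();
        lex.addKeywords("digital camera 5mp");
        lex.addKeywords("laptop bag");
        System.out.println(Arrays.toString(lex.lookup("camera bag")));
        System.out.println(lex.lookup("flux capacitor")); // unknown word -> null
    }
}
```

The unique index on the keyword column in the real system plays the role of the hash map here: one cheap lookup per query term instead of a scan of half a million products.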

The search itself was done in a table containing clustered WordIDs and ProductIDs – in the case of multiple words the results were union’ised using SQL, a fairly fast operation when the data is already pre-sorted, and in any case one that beats a big table scan. When the search went live, overall database load dropped to 20-25% – a very substantial decrease. As the prototype was done in Perl/T-SQL, which turned out to be an unofficial “bad” language combination at the time, almost everything had to be rewritten in Java and DB2 SQL, something that was done by my good colleagues James and Mark.
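Why is pre-sorted data the key? Because combining two words then becomes a linear merge rather than anything resembling a table scan. A sketch (my own illustration, not the original SQL) of unioning two sorted ProductID lists:

```java
import java.util.*;

// Sketch of why the clustered (WordID, ProductID) layout is fast: each
// word's ProductIDs come back already sorted, so a multi-word union is
// a single linear pass over both lists.
public class PostingMerge {
    // Union of two sorted ProductID arrays (OR semantics across words),
    // emitting each matching product once.
    public static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j])      out[k++] = a[i++];
            else if (a[i] > b[j]) out[k++] = b[j++];
            else { out[k++] = a[i]; i++; j++; } // same product matched both words
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return Arrays.copyOf(out, k);
    }

    public static void main(String[] args) {
        int[] camera = {3, 17, 42};
        int[] bag    = {5, 17, 99};
        System.out.println(Arrays.toString(union(camera, bag))); // [3, 5, 17, 42, 99]
    }
}
```

The merge touches each posting exactly once, so the cost is proportional to the number of matching products rather than the size of the whole catalogue.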

The most annoying part for me was that the powers that be in the company refused to show the sub-second search time on the search pages – much like Google was doing at the time to show how fast they were, we could have done the same (on an obviously much smaller scale), but that idea was overruled.

By this point the situation in the Jungle had become rather unbearable, and ultimately a good company was driven into the ground. A lot of the good work I did beyond the search engine perished, and this was actually a very valuable lesson to learn – later it influenced my decisions pretty heavily. At the time, though, I was just thinking that the worst was behind me as I joined a new dynamic company to work alongside two colleagues in its e-commerce department. Surely life was good, as I had a chance to re-implement all the good things I had done at Jungle.com – but little did I know that I would experience the worst possible daymare (like a nightmare, only during your working day) of my life…

Hashtables are very convenient data structures, but they are not very fast, and it is very easy to make them even slower by using a more logical, but ultimately much slower, approach. Consider the code below:
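(The original code block did not survive; the following is a reconstruction of the comparison the post describes, with names of my own choosing.)

```java
import java.util.*;

// Reconstruction of the two lookup styles discussed in the text: the
// "logical" version does two hash lookups where one is enough.
public class HashLookup {
    static Map<String, String> table = new HashMap<>();

    // Slower, more "logical" version: containsKey() then get() --
    // two full hash lookups per key.
    static String slowGet(String key) {
        if (table.containsKey(key)) {
            return table.get(key);
        }
        return null;
    }

    // Faster version: a single lookup, treating null as "not present".
    // Only valid when stored values can never legitimately be null.
    static String fastGet(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        table.put("hello", "world");
        System.out.println(slowGet("hello"));   // world
        System.out.println(fastGet("hello"));   // world
        System.out.println(fastGet("missing")); // null
    }
}
```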

The difference in performance is about 2x: the reason is that checking whether a hashtable contains an item costs the same (or almost the same) as retrieving the value – or getting null if it is not present. This trick would not work if the value stored for a key can itself be null, but in many cases the faster version will improve performance nicely, so long as you are not bottlenecked elsewhere!

My main day-and-night-time job is developing DSearch, an attempt to create a competitive Web-scale search engine based on distributed computing principles, with the help of a community of people all around the world. This post is about the long road that led me to this project.

The first search engine I wrote was in Perl, for my MBA database in late 1998, when I entertained a rather stupid idea of trying to get into an MBA programme at Harvard or Stanford. Naturally they rejected my application, which was fair enough – I am glad I did not pursue that path anyway. As part of researching what it takes, I wrote a small Perl database/search engine to let people submit their GMAT grades, in order to get some statistical information on who gets into those Unis and who does not. The search engine itself was a pathetic attempt that used the “table scanning” approach: it works okay on small data sets but is very wasteful as the number of documents increases.

My next and better shot at search engines came while working for Jungle.com – once a high-flying e-tailer in the UK of early 2000, which ultimately failed and was bought by an old-style company called Argos, whose blatant incompetence in business management led to the complete destruction of the company, with scores of good people fired. The degree of technical incompetence is apparent even now: jungle.com fails to resolve, while www.jungle.com redirects to a shadow of what Jungle.com used to be – it is now a mere word in a URL path on Argos, a pathetic and bitter ending that really deserves a dedicated post, which I will write at a later date.

Whilst at Jungle.com in late 2001/early 2002, we had to cope with the consequences of a failed project “implemented” by some clowns whose fishy name really puts the otherwise yummy salmon in a bad light… The search engine they implemented to search over 500k products was as bad as what I had initially in my first search engine – but this time it used J2EE, something that must have made it look more professional, and certainly a lot more expensive. Anyway, the search was still scanning the table, only this time the cost of doing so was huge because we had a lot of products – it pushed CPU usage on our 12-CPU Sun box pretty high, so a solution that did not suck was needed: in other words, the kind of solution you would want for yourself, not the kind of botched job that often gets done on fixed-fee IT contracts.

For some time I have thought of starting a blog as a means of recording some interesting finds, as well as venting some of the frustrations experienced in the process of building DSearch.

The current problem I am working on is automatically determining the best recrawl rate for pages that generate dynamic content that is technically different every time it is requested – whether due to personalisation, some hidden internals like client-side web analytics, or because the pages are simply designed to look updated so that search engines recrawl them more often than they really need. The solution requires an algorithm that is resistant to small changes on a page yet can determine whether a substantial part of it has changed; it should also be very fast, since we can’t spend too much time analysing a page, and it should take very little space… if that’s your cup of tea then stay tuned for updates!
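One well-known family of techniques for this kind of problem – an illustration of the requirements above, not necessarily what DSearch ends up using – is shingling: hash every run of a few consecutive words and compare the resulting sets. Small edits change only a handful of shingles, so the similarity stays high, while a real rewrite drops it sharply.

```java
import java.util.*;

// Sketch of shingle-based change detection: resistant to small edits,
// linear-time, and reducible to a small fingerprint if you keep only a
// sample of the shingle hashes. Purely illustrative code.
public class PageChange {
    // Hash every run of `size` consecutive words into a set of shingles.
    static Set<Integer> shingles(String text, int size) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)).hashCode());
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|; 1.0 means identical shingle sets.
    static double similarity(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> uni = new HashSet<>(a);
        uni.addAll(b);
        return uni.isEmpty() ? 1.0 : (double) inter.size() / uni.size();
    }

    public static void main(String[] args) {
        Set<Integer> v1 = shingles("breaking news the market rose sharply today amid optimism", 3);
        Set<Integer> v2 = shingles("breaking news the market rose sharply today amid caution", 3);
        Set<Integer> v3 = shingles("completely different page about gardening tips and tools", 3);
        System.out.printf("small edit: %.2f%n", similarity(v1, v2)); // high
        System.out.printf("rewrite:    %.2f%n", similarity(v1, v3)); // near zero
    }
}
```

To meet the space requirement, the full shingle set would be replaced by a fixed-size sketch of it (for example, keeping only the smallest k hashes), so each page costs a constant number of bytes regardless of its length.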