A few years ago I had picked Xapian after evaluating a number of solutions. More recently, the popularity surge of Lucene had me curious to learn about it. I needed to do a rip and replace of MySQL fulltext search due to scaling issues so I decided to check out clucene. I quickly found out the API was not as up to date as Lucene (a fast moving target) and that the mailing list had only had 4 posts in the last year or so. That led to a conclusion to move away from clucene. After that, I was told to check out Solr as an easy way to use Lucene without needing to implement Java. I replaced MySQL with Xapian but still had Solr in the back of my mind to check out.

We were using Solr before but it was constantly causing headaches in terms of scalability and complexity. I gave Xapian a go and so far I'm blown away by how awesome it is. Its incredibly lightweight, its scaled a 100 times better and everyone involved is happier.

I'm curious to hear what scaling and complexity problems they faced, but it's good to hear a strong endorsement of Xapian from a former Solr developer. That, and a quick check of the current users page listing del.icio.us with over 100 million documents, seems to indicate that Xapian remains a strong contender in the search space. That being said, I work with very scalable Lucene-based solutions as well, just in Java projects.

I recently looked at using various encodings for hashed UIDs, e.g. UIDs generated by a crytographic hash algorithm such as SHA-1 or MD5. These are often useful when the UID does not need to have human meaning but should exhibit some uniformity, such as character set and length. I considered Base64 and hexadecimal first because they are commonly used by crypto libraries and then decided on Base64 and Base32 where appropriate. Base36 is actually the most compact case insensitive encoding (using Arabic numbers and Roman letters) but is not an option for me at the moment because there's no Perl module for it that will take arbitrary text and binary input at the moment. Math::Base36 exists but only handles numbers.