Thursday, May 21, 2009

PostgreSQL has a full text search engine built into it. Teodor Sigaev and Oleg Bartunov, who started a text search engine called OpenFTS, merged their code base into PostgreSQL 7.4 as a separate contrib module called tsearch2.

Tsearch2 is a highly extensible and sophisticated code base for text search. It is flexible in the sense that you can write your own stemmer and parser, or use the default ones for a new language. Keeping up with the Postgres tradition of extensibility, tsearch2 also allows users to define their own ranking function or headline generation. Besides that, it offers two kinds of text index, GiST and GIN, each with different performance curves for initial indexing and index updates. Tsearch2 was merged into the core of Postgres last year, in the 8.3 release.
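
To make this concrete, here is a minimal sketch of an 8.3-style setup; the table and column names are purely illustrative, not from any real schema:

```sql
-- Hypothetical table for full text search.
CREATE TABLE docs (
    id    serial PRIMARY KEY,
    body  text,
    tsv   tsvector
);

-- Keep the tsvector column in sync with the text column
-- using the built-in trigger function.
CREATE TRIGGER docs_tsv_update
    BEFORE INSERT OR UPDATE ON docs
    FOR EACH ROW
    EXECUTE PROCEDURE
        tsvector_update_trigger(tsv, 'pg_catalog.english', body);

-- Either index type works: GIN is slower to build and update but
-- faster to search; GiST is the reverse.
CREATE INDEX docs_tsv_gin ON docs USING gin (tsv);
-- CREATE INDEX docs_tsv_gist ON docs USING gist (tsv);
```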

Postgres 8.4 is being released after more than a year of development and testing. A few patches that many people wanted, like a default replication scheme in the core and an SELinux integration for Postgres called SE-PostgreSQL, were punted to 8.5. The main reason was that these patches were too big for people to review in the last months of the cycle.

Postgres 8.4 brings a number of improvements in text search. Here is the list of new text search features that you may want to look out for and that may persuade you to upgrade:

1. Better selectivity estimation for text search (Jan Urbanski)

This was a Google Summer of Code (GSoC) project taken up by Jan Urbanski, and it is great that the project was successful. Though it took a long time for this patch to be accepted, it provides a quite accurate selectivity measure for text search. This will enable the Postgres planner to produce better plans for SQL queries in which the text search matching operator is combined with other equally complicated operators.
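
A sketch of the kind of query that benefits, assuming a hypothetical table docs with a tsvector column tsv and a numeric price column:

```sql
-- The planner can now estimate how many rows the @@ condition will
-- match, instead of falling back on a fixed default guess, so it can
-- choose sensibly between the text index and an index on price.
EXPLAIN
SELECT *
  FROM docs
 WHERE tsv @@ to_tsquery('database & search')
   AND price < 100;
```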

2. Fast prefix matching in text search

Earlier, prefix matching had to be done using the LIKE operator. For example, to search for all documents containing words that begin with "sush", the WHERE clause needed to contain LIKE 'sush%'. However, the LIKE operator does not use the text search index and is very slow. This patch introduces fast prefix matching in Postgres. Now the WHERE clause can be something like:

xt_tsvector @@ to_tsquery('sush:*')
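
In a full query, this might look like the following sketch against a hypothetical table docs:

```sql
-- 'sush:*' matches any lexeme beginning with "sush" and, unlike
-- LIKE 'sush%', can use the GIN/GiST index on the tsvector column.
SELECT id
  FROM docs
 WHERE xt_tsvector @@ to_tsquery('sush:*');
```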

3. Support multi-column GIN indexes (Teodor Sigaev)

Earlier, if you had to index two separate text columns, like "documents" and "comments", you could only create a separate index for each column. A query then had to be matched against each column (q @@ documents AND q @@ comments). With this patch, such queries can take advantage of a multi-column index if the developer has created one. Teodor observed a substantial performance improvement of the multi-column index over separate single-column indexes.
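
A sketch of what that looks like, with hypothetical column names:

```sql
-- One GIN index covering both tsvector columns.
CREATE INDEX docs_multi_gin
    ON docs USING gin (document_tsv, comments_tsv);

-- A query touching both columns can now use the single index.
SELECT id
  FROM docs
 WHERE document_tsv @@ to_tsquery('replication')
    OR comments_tsv @@ to_tsquery('replication');
```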

4. Multiple fragments in headline generation

This patch was contributed by me. Headline generation is the identification of the text fragments in a document where the query terms appear. The default headline generation function shows only one text fragment for a set of query terms. Further, the existing function did not always produce good headlines (a subjective judgment, as there is no objective way to identify a good fragment).

My patch allowed more than one non-overlapping fragment to be displayed for a set of query terms. Further, the fragments were chosen so that they contained the query terms in the most compact way. A lot of databases will be envious of production-quality headline generation in Postgres now.
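
The new behaviour is enabled through the MaxFragments option of ts_headline; here is a sketch, with illustrative table and column names:

```sql
-- MaxFragments > 0 switches ts_headline into fragment-based mode:
-- up to three non-overlapping fragments, each 5-12 words long,
-- with the query terms highlighted.
SELECT ts_headline('english', body,
                   to_tsquery('text & search'),
                   'MaxFragments=3, MinWords=5, MaxWords=12')
  FROM docs
 WHERE tsv @@ to_tsquery('text & search');
```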

5. Improve support for Nepali language and Devanagari alphabet (Teodor)

I do not know much about this, but looking at the CVS log, here is the bug that was fixed:

"Some languages have symbols with zero display's width or/and vowels/signs which are not an alphabetic character although they are not word-breakers too. So, treat them as part of word."

Tuesday, May 19, 2009

ISPs have been warning us for a long time that bandwidth on the Internet is being exhausted. The doomsday when we will not be able to use the Internet is, they say, pretty close. They have attributed this problem to a variety of causes: a small group of users consuming significantly more bandwidth than others, applications like BitTorrent consuming the entire bandwidth, websites like YouTube consuming the bandwidth, etc. Solutions were proposed for each of these problems respectively: cap user bandwidth and penalize any excess, use protocol shapers that slow down (read: discriminate against) certain protocols, or charge the website owner, like YouTube, for reaching the customer.

Cory Doctorow, in his Guardian article, argues that all of these mechanisms are designed to prohibit people from using the Internet. I have always been against middle boxes like Packeteer that restrict bandwidth usage based on protocol, because such discrimination is arbitrary and can be used to target and extort any application they want. Similarly, charging websites for reaching their customers is totally flawed, as this too can be extortionist. And as Cory points out, it can also be used to curb public protests and free speech. In his words:

"by allowing ISPs to silently block access to sites that displease them, we invite all the ills that accompany censorship – Telus, a Canadian telcom that blocked access to a site established by its striking workers where they were airing their grievances."

I used to think that the solution in which ISPs cap the total data transferred over a month is fair. This solution does not discriminate against any particular website or application. However, I did not realize that such a policy can significantly hinder the free usage of the Internet. As Cory says:

But the real problem of per-usage billing is that no one – not even the most experienced internet user – can determine in advance how much bandwidth they're about to consume before they consume it. Before you clicked on this article, you had no way of knowing how many bytes your computer would consume before clicking on it. And now that you've clicked on it, chances are that you still don't know how many bytes you've consumed. Imagine if a restaurant billed you by the number of air-molecules you displaced during your meal, or if your phone-bills varied on the total number of syllables you uttered at 2dB or higher.

Actually, in India, people who exceed the bandwidth cap are penalized significantly. So capping bandwidth usage with penalties is either going to scare people away from using their bandwidth, or make them too conservative to try any new website or application on the Internet.

If we do not want any restrictions, and ISPs keep claiming that their routers are clogging, what is the solution? If ISPs keep adding more bandwidth, we probably don't need to talk about it. But if they keep insisting, what type of plan might I agree to? The only one is a price model with a built-in incentive for using more bandwidth. It can be a fixed cost up to, say, 10GB a month, and then they can charge me at a reduced rate for the next 10GB. Such a graduated rate will give people more incentive to try new stuff on the Internet. After all, we want more people to freely use the Internet without worrying about details that they do not understand.