Thursday, July 24, 2008

Today's Linux/Unix bash shell script is for those of us who sometimes get lost for words. This happens to me at least a few times a day, as I seem to talk, and type, way too much. Every now and again, I'll find myself facing a sentence that is not only redundant, but also seems to repeat its central message more than once ;) Sometimes redundancy is a good thing, though. If you've ever listened to an instructional or motivational speaker, you've probably noticed that a lot of them like to hit on the "rule of 3's" (sometimes 4's and 5's, with the annoyance factor increasing commensurate with the amount of repetition :) I, personally, try never to repeat myself (although, in writing about a thesaurus, I'm almost certainly doomed to some sort of meta-paradox).

One of the times a good thesaurus can come in handy is when you're faced with having to use similar words within a restricted amount of space and the resulting text seems stilted because of it. For instance, the sentence:

As good an idea as it may seem, it's generally not good to repeat the same word within a sentence.

can, with a little thought (or a handy reference), be made much more palatable, so that the redundancy appears to have disappeared:

As good an idea as it may seem, it's generally not desirable to repeat the same word within a sentence.

The major differences between today's script and the equally helpful one posted on Gentoo.org are mainly rooted in the method. For instance, their script makes use of lynx and html2text, a program you may not have installed by default. Ours, while still relying on the online component, uses wget and sed. We went with wget over lynx since its partial-source dump option is a little more predictable than lynx's. That's not to say that there's anything wrong with lynx, just that it didn't suit our needs for this particular endeavour.
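To give a rough feel for the wget-plus-sed approach, here is a minimal sketch. It is not the actual script: the function names are made up for illustration, and the sed expressions are a generic tag-stripper, not the ones tuned for the thesaurus site's markup.

```bash
#!/bin/bash
# Hypothetical helper: dump the raw HTML of a page to stdout.
# -q silences wget's progress output, "-O -" writes the page to stdout.
fetch_page() {
    wget -q -O - "$1"
}

# Hypothetical helper: crude HTML-to-text filter built on sed alone.
# It removes anything that looks like a tag, then drops blank lines.
strip_tags() {
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
}

# Typical use (the URL is whatever page you want to scrape):
#   fetch_page "$url" | strip_tags
```

A real page will need more targeted sed expressions to pick out just the definitions and synonyms, but the pipeline shape stays the same, and nothing beyond wget and sed has to be installed.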

Another major difference between the two is that we decided to go ahead and throw in the "%20" space declaration so that you could submit multi-word queries to the script and get a response that you'd expect. Check out the picture below for a quick example of submitting a bad multi-word query, a good multi-word query, a bad single-word query and a good single-word query. If you can't see the picture, for whatever reason, the output is fairly simple. When you submit a bad query of any type (single or multi-word), you'll get back a "No results found" message and some suggestions. When you submit a query that matches something, you'll receive a varying number of definitions followed by a varying number of synonyms (and, yes, I'm not using the script while I write this ;)
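The "%20" handling can be as small as one sed substitution. The sketch below is an assumption about how you might do it, not a copy of the script; encode_query is a hypothetical name, and it only deals with spaces, not with other characters that would need escaping in a URL.

```bash
#!/bin/bash
# Hypothetical helper: glue all command-line words into one query
# string and replace every space with the "%20" URL escape.
encode_query() {
    # "$*" joins the arguments with single spaces (default IFS).
    printf '%s\n' "$*" | sed 's/ /%20/g'
}

encode_query lost for words    # prints lost%20for%20words
```

Because the words are joined before the substitution, it doesn't matter whether you quote a multi-word query on the command line or not.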


As of the writing of this post, I have yet to figure out the "&whatever" suffix to the URL that will make the online thesaurus return more than 10 results per page, so there's still some work to be done there. If you're so inclined, you can write a quick check-and-recheck into the script. The addition to the URL that will start you at definition number 11 (instead of the default, 1) would be "&start=11". So far, except with very general words like "good," I've found that this hasn't been necessary, but it would be a cool improvement.
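If you want to try the check-and-recheck idea, a sketch might look something like this. Several things here are assumptions: BASE_URL is a placeholder and not the real thesaurus address, the function names are invented, later pages are assumed to follow the same pattern as "&start=11" (so "&start=21" and so on), and a page past the last result is assumed to show the same "No results found" text that a bad query does.

```bash
#!/bin/bash
# Placeholder address; substitute the URL your own script queries.
BASE_URL="http://www.example.com/search?q="

# Hypothetical helper: build the URL for a given page of results,
# assuming 10 results per page. Page 1 needs no suffix (start
# defaults to 1), page 2 gets "&start=11".
page_url() {
    local query="$1" page="$2"
    local start=$(( (page - 1) * 10 + 1 ))
    if [ "$start" -le 1 ]; then
        printf '%s%s\n' "$BASE_URL" "$query"
    else
        printf '%s%s&start=%s\n' "$BASE_URL" "$query" "$start"
    fi
}

# Hypothetical helper: keep fetching pages until the site reports no
# results, wget fails, or a safety cap is hit.
fetch_all() {
    local query="$1" max_pages="${2:-5}" page=1 html
    while [ "$page" -le "$max_pages" ]; do
        html=$(wget -q -O - "$(page_url "$query" "$page")") || break
        # Assumption: running off the end yields "No results found".
        case "$html" in *"No results found"*) break ;; esac
        printf '%s\n' "$html"
        page=$((page + 1))
    done
}
```

The cap on the number of pages is there so that a wrong guess about the end-of-results message can't turn into an endless loop of requests.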