Archive

I have been using the Secure Shell Chrome extension for a while now when I’m in a non-Linux OS for one reason or another. It isn’t very well documented, and it doesn’t have a lot of directly configurable options, but it works fairly well without installing anything other than a Chrome extension, and it allows me to stay in the browser with tabs and all that, so it’s handy. One of the irritants is that you can’t easily sort the list of saved connections. You can set the order in the JavaScript console, but the methods I found mentioned on the Internet to do this are multi-step processes, so I wrote a little function that does this for me as needed. I can just paste this function into my JavaScript console and my connections are sorted for me. I wanted them sorted by the label/description field, but it’s just as easy to sort them by any other field in the profile.

This uses undocumented features of the extension, which might change at any time. The function then calls the set method to save the sorted list back to the settings. You should back up your preferences before doing this, and use it at your own risk. Please don’t blame me if you melt your preferences.

I periodically need to find the differences between two javascript files that have been run through something like uglifyjs. There are a variety of ways to do this, but I haven’t found a solution that really gets me what I am looking for succinctly.

For a while I used git’s diff, but I found this to be cumbersome and not always available.

Then for a while I used wdiff with colordiff. That would look like this:

wdiff /path/file1.js /path/file2.js | colordiff

The problem is that the output is often really long, because compressed js doesn’t have many line feeds. On top of that, wdiff doesn’t recognize non-word characters as word boundaries, and compressed js often doesn’t have whitespace to delineate word boundaries, so the strength of wdiff is somewhat thwarted.

What I really wanted was to split the files into their smaller pieces for diffing. My latest solution is a bash alias, which looks like this in my .bash_alias file:
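The splitting idea itself can also be sketched in JavaScript. To be clear, this is not the bash alias, just a rough illustration of breaking compressed js into smaller pieces before diffing. The function name splitMinified is made up for this example, and the approach is purely textual: it knows nothing about strings or regular expressions, so a delimiter inside a string literal gets split as well.

```javascript
// Naive splitter for compressed JavaScript (hypothetical helper).
// Adds a line break after every ';', '{', '}' and ',' so that a
// line-based diff tool has much smaller pieces to compare.
// Assumption: good enough for eyeballing differences, not for
// producing code that is guaranteed to still run.
function splitMinified(source) {
  return source.replace(/([;{},])/g, '$1\n');
}

// Example: one long line becomes several short ones.
var pieces = splitMinified('a=1;b={c:2,d:3}');
console.log(pieces);
```

Running both files through something like this and then diffing the two outputs keeps each reported change down to a single statement or property rather than one enormous line.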

In my last post I talked about enabling mongodb’s beta text search, which at least to me was a little less than intuitive to accomplish. That’s probably partly because of the beta nature of this feature.

The next challenge was figuring out how to interact with the text search functionality from node.js, since interacting with it from an application that needs to provide search is the whole point. I’m sure that at some point the node.js native driver will support syntax specifically for searching, but at the moment it’s not there yet. This post assumes that text searches are enabled and you’ve added an index.

Before I show how I am accessing the text search feature, it is helpful to know how my modules are put together in general. At the top of each module I set up the mongo connections. In this post I’m going to use “articles” as my example collection. The setup for the db object looks something like this:

This leaves me with a db.articles object that provides access to the collection’s methods, including find, update, save, and so on. I would add each collection needed for the module to the db object in the same way. Unfortunately, the collections object doesn’t have a method for text searches. For that, I need access to the cdb object included in the callback to mongo.connect. To do that, I add the cdb object to my db object, which puts it in scope for the rest of my module.

We want the results that this method returns to be an array, not the object with the extra stuff mongodb adds to it. The extra conditional in there is to prevent it from throwing errors if results is undefined or something. There is probably extra logic, ACL filtering and so on in the real thing; this is stripped down to just show the text search.

The results passed to the callback in this method will be an array of objects, each of which has two properties: score and obj. obj will have the full document for each match.
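A minimal sketch of that arrangement might look like the following. The names attachConnection and searchArticles are hypothetical, and I am assuming the cdb object exposes a command(selector, callback) method that accepts the 2.4 text command in the form { text: collectionName, search: query }. Treat it as an illustration of the shape of the thing rather than the exact code from my module.

```javascript
// Module-level holder for the collections and the raw connection.
var db = {};

// Hypothetical helper: call this from the callback passed to mongo.connect,
// handing it the cdb object that the callback receives.
function attachConnection(cdb) {
  db.cdb = cdb;                             // keep the connection in scope
  db.articles = cdb.collection('articles'); // usual collection methods
}

// Run a text search against the articles collection and pass a plain
// array of { score, obj } results to the callback.
function searchArticles(query, callback) {
  // Assumption: the beta text command is invoked as a generic db command.
  db.cdb.command({ text: 'articles', search: query }, function (err, res) {
    if (err) { return callback(err); }
    // Guard against a missing or malformed response so callers always
    // get an array back.
    var results = (res && Array.isArray(res.results)) ? res.results : [];
    callback(null, results);
  });
}
```

A caller would then do something like searchArticles('Lorem', function (err, results) { ... }) and read results[i].obj for each matching document.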

The extra steps shown here will go away when they add text searches to the driver, but for now this is a fairly functional approach. I hope it saves someone the extra time it took me to sort this out.

Full text search in NoSQL databases is far less common than one would think. Most apps I build can benefit from full text searches, even if they don’t need sophisticated search capabilities. There are external solutions for most databases, mostly tying in Lucene through Elasticsearch or Solr. Sometimes those external solutions are just the way you need to go, and I’ve used external Lucene integration with CouchDb before. But I was glad when I saw that text searches are included in MongoDb 2.4, at least at a beta stage.

The main catch I’ve had in my testing so far is I had a hard time figuring out how to enable this feature. Like many people (I expect), I’m using packages for Ubuntu, so I needed to figure out how to get this feature enabled in /etc/mongodb.conf. The documentation shows how to enable text searches in the command to start mongo, and mentions that you can put this in the config file, but it doesn’t tell you how to put it in the config file.

This doesn’t work:

textSearchEnabled=true

You end up with a response that says:

error command line: unknown option textSearchEnabled

This is the syntax to put in the config file instead:

setParameter=textSearchEnabled=true

Once I added that, the feature was enabled. In the mongodb console I was then able to add my initial index like this:

db.content.ensureIndex( { title:'text', body: 'text' });

and search it like this:

db.content.runCommand("text", {search:'Lorem'})

This returns an object that contains an array called results with the results in it, one result object for each document that matched the text in either the title or body field. Each result is in turn an object with a score and the matching document. It also returns a stats object that tells how many documents were found and how long it took.

Overall, this feature is very promising. While it doesn’t appear as strong in its search capabilities as the Lucene solutions, having it directly available in MongoDb itself is a big win for deploying solutions for customers quickly.

Next up: interacting with the text search functionality through the native node.js driver.

One of the things I largely underestimated when I first started working with CouchDb was the importance of a meaningful ID for documents. The default of letting the database set a UUID for you seems reasonable at first, but the UUID is largely useless for querying. On the other hand, a good ID scheme is like having a free index for which you don’t have to maintain a view. You can just query using the _all_docs view, which is built in. Well-thought-out IDs can save you tons of headaches and fiddling with views later. Particularly with large data sets, this can be a big deal, because views can take up a significant amount of storage space and processing. Unfortunately, IDs are hard to change after you get a lot of data in the database, so the scheme is worth thinking about before you get very far into the data.

There are a handful of primary considerations when designing your document IDs. In most data sets, there is a key piece of information that most of your searches are based on. Account or customer records are usually looked up by ID. Customer transactions are usually retrieved by account and date range. That’s not to say all queries are based on these data elements, but it tends to be a significant majority of queries. Since documents are automatically indexed by ID, a good candidate for the ID is the information that you most often search for in your queries.

Another important consideration when designing views and IDs in CouchDb is storage space and view size. A view that stores several pieces of information from every document can double the size of the overall data required, and even more space is needed for maintenance operations like compacting. If you need a particular view that includes several pieces of data from the document, consider designing your IDs to replace the need for the view. In fact, getting rid of as many views as you can is a worthwhile goal. Views are powerful and useful, but unnecessary views can consume huge amounts of extra space and processing to maintain.

A third consideration for IDs is ID sequence. Documents are sorted by ID. It is often worthwhile including a timestamp as a part of the ID. For example, for transaction documents the account plus a timestamp often makes a good ID. This automatically sorts the transactions within an account by time. In fact, sometimes the primary factor in looking up documents is time, and in those cases it might be a good practice to start the ID with a timestamp. How to format the time depends on your use. Do it in the way closest to how you will retrieve the data. That might be a standard javascript timestamp (1338051515556), or a date/time in a format like YYYYMMDD-HHMMSS (something like 20120526-165835). Remember the point here is a useful sort order, so any date and time format that will result in the desired left to right string sort order is what you want. When you’re using timestamps, remember that IDs have to be unique. Milliseconds are not a guaranteed unique identifier for a web application. More on that in a moment.
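As a small sketch of the account-plus-timestamp idea, the following builds an ID whose string sort order matches time order. The helper names (pad, sortableTimestamp, transactionId) and the account number are made up for illustration, and I am assuming UTC so that IDs generated on different machines still sort consistently.

```javascript
// Zero-pad a number to a fixed width so string comparison matches
// numeric comparison.
function pad(n, width) {
  var s = String(n);
  while (s.length < width) { s = '0' + s; }
  return s;
}

// Format a Date as YYYYMMDD-HHMMSS in UTC.
function sortableTimestamp(date) {
  return pad(date.getUTCFullYear(), 4) +
    pad(date.getUTCMonth() + 1, 2) +   // getUTCMonth() is zero-based
    pad(date.getUTCDate(), 2) + '-' +
    pad(date.getUTCHours(), 2) +
    pad(date.getUTCMinutes(), 2) +
    pad(date.getUTCSeconds(), 2);
}

// Hypothetical ID scheme: account number, a dash, then the timestamp.
function transactionId(account, date) {
  return account + '-' + sortableTimestamp(date);
}

console.log(transactionId('10042', new Date(1338051515556)));
// prints "10042-20120526-165835"
```

Because every field is fixed width and ordered from most to least significant, all transactions for one account end up next to each other and in chronological order.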

Sequence is also important if you have multiple types of documents that you need to retrieve together. For example, blog posts and comments are often retrieved at the same time. So it often makes sense to have the ID of comments start with the ID of the post they relate to. That way, you can easily retrieve the blog post, together with all of the documents that are sorted between that post and the next post.
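A range query on _all_docs is one way to get the post and its comments in a single request. The sketch below only builds the query string. The function name and the post ID are hypothetical, and it relies on the common trick of using the high Unicode character \ufff0 as the upper bound so that every ID starting with the post’s ID falls inside the range.

```javascript
// Build an _all_docs query that returns the post document plus every
// document whose ID starts with the post's ID (for example its comments).
function postWithCommentsQuery(postId) {
  var params = [
    // Keys are JSON values, so they need to be quoted and URL-encoded.
    'startkey=' + encodeURIComponent(JSON.stringify(postId)),
    // '\ufff0' sorts after ordinary characters, closing the range.
    'endkey=' + encodeURIComponent(JSON.stringify(postId + '\ufff0')),
    'include_docs=true'
  ];
  return '_all_docs?' + params.join('&');
}

console.log(postWithCommentsQuery('post-0042'));
```

Note that this is a plain prefix match, so post IDs should be fixed width or end in a separator. Otherwise a query for post-4 would also pick up post-42.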

A fourth consideration is to make the information in the IDs be enough data for at least some queries. An _all_docs lookup without an include_docs returns the ID and the revision. If the ID is enough information that it is all you need for a significant number of queries, you can reduce the data you need to move over the wire in at least some of your queries.

In CouchDb, people often store documents of multiple types together in the same database. I already mentioned blog posts and comments. In some cases, you almost always look for only one type of document at a time. In that case, it makes sense to start the ID with an indicator of the type. Alternatively, you might have documents for which there are several types that all relate to a parent or master document (again, the blog post comments are one example of this), and in those cases it might make sense for the secondary documents to start with the master document’s ID, followed by a type indicator, and then a unique identifier within that type. Usually one or two characters is enough for the type indicator.

IDs often end up being several pieces of information appended together. Maintain readability by adding a separator that will also be useful in queries. Often a dash is a good choice. For example, if you have an account number plus a timestamp for transaction IDs, I typically put a dash between the account and time. On the other hand, keeping the ID short can save on space, so don’t add a lot of extra stuff to the ID. The minimum of separators to make the result readable is a good goal. So, for example, if you’re using timestamps keep it to the meaningful digits rather than including colons and time zone information.

Remember that IDs have to be unique, so if there is any chance that you’ll end up with two documents with the same ID, change it so they are guaranteed to be unique. Milliseconds are not guaranteed to make unique IDs, particularly when you have more than one database being replicated or using BigCouch. If you have a few dozen inserts a second you’ll end up with conflicts at some point. If you can guarantee that your ID will be unique using unique information from the record itself, that’s great. However, that’s not always the case. In a distributed web application generating guaranteed sequences at the application level is not practical. So once I get all of the information I care about in the ID I will often append a few random characters to the end. Milliseconds plus four or five semi-random characters is much less likely to generate collisions. The other approach is to just figure there might be collisions, and have your application watch for document conflicts and PUT the document again with a slightly different ID if you get a collision. As a practical matter that’s not a good idea if it is very likely to happen much, but if collisions are possible but very unlikely it is often a good compromise.
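Here is a rough sketch of the random-suffix approach. The names, the alphabet, and the suffix length are all arbitrary choices for the example, and Math.random is only semi-random, so this makes collisions much less likely rather than impossible.

```javascript
// Characters used for the suffix (an arbitrary choice for this sketch).
var ALPHABET = 'abcdefghijklmnopqrstuvwxyz0123456789';

// Build a short semi-random string of the requested length.
function randomSuffix(length) {
  var out = '';
  for (var i = 0; i < length; i++) {
    out += ALPHABET.charAt(Math.floor(Math.random() * ALPHABET.length));
  }
  return out;
}

// Hypothetical ID scheme: account, millisecond timestamp, random suffix.
function transactionIdWithSuffix(account, millis) {
  return account + '-' + millis + '-' + randomSuffix(5);
}

console.log(transactionIdWithSuffix('10042', Date.now()));
```

With 36 possible characters in each of five positions there are about 60 million suffixes per millisecond, so two inserts would have to land in the same millisecond for the same account and also draw the same suffix before they collide.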

Of course, all of this has to be weighed against the size of the ID. The ID is stored with every record. Make it as useful as possible, but at the same time avoid having a lot of extra stuff in it that you won’t need. Also, avoid redundancy. If information is in the ID, consider whether you can eliminate a field or two from the document body. If you have the account number and time in the transaction document’s ID, maybe you can remove those fields from the document itself and just split the ID after you retrieve it.
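Splitting the ID back into its parts after retrieval is only a couple of lines. This sketch assumes a hypothetical ID of the form account, dash, millisecond timestamp, with nothing appended after the timestamp. It splits on the last dash, so the account part may itself contain dashes.

```javascript
// Recover the account and time from an ID like "10042-1338051515556".
// Assumption: the timestamp is the final dash-separated piece.
function parseTransactionId(id) {
  var cut = id.lastIndexOf('-');
  if (cut === -1) { return null; } // not in the expected format
  return {
    account: id.slice(0, cut),
    time: Number(id.slice(cut + 1))
  };
}

console.log(parseTransactionId('10042-1338051515556'));
```

If the ID scheme has more pieces, such as a random suffix at the end, the parsing has to change to match, which is one more reason to settle on the scheme early.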

The overall goal is to make the document IDs as useful as possible, take advantage of the fact that the ID is stored with every record anyway and is always indexed, and that _all_docs queries operate just like a view on a simple key. With a little thought put into your database before you start adding records in volume, you can reduce your storage and maintenance resource requirements, and optimize your ability to query the data.

I made the mistake of hitting the Upgrade button in Ubuntu’s update manager on my main development box the other day when it asked me if I really was going to go another day without Oneiric, and within a fairly short time had an unbootable Ubuntu box. Usually Ubuntu upgrades are fairly smooth, but this one was bad. For a little bit of context, I have been using Linux for a long time. I started with RedHat, wandered through Fedora when it appeared, Gentoo, SUSE, openSUSE, CentOS, ClearOS, Debian, and Ubuntu, but for the past while I’ve been using Ubuntu. For years I used and advocated KDE until version 4, at which point (about the same time I switched to Ubuntu) I moved to Gnome. I have been using XFCE for a few months, and I’m basically done with both KDE and Gnome for now. So far I haven’t lasted for more than a few minutes on Unity before I get completely disgusted and change to something else. Somebody told me this week I’m just one of those old grumpy Linux guys.

I didn’t spend a lot of time figuring out what went wrong with the Ubuntu upgrade. Instead, I downloaded a few updated versions and put them on a thumb drive, and tried out some variations on the setup I’ve been using for a while. I installed Mint 11, Mint’s Debian XFCE version, and Xubuntu. As a side note, why aren’t there any really good tools to make bootable live Linux installs on USB for Linux? Most of the directions on the web say to do it on Windows. Blech. I ended up using unetbootin, which works.

My brief reaction to each of the three installs? Mint 11 has all the advantages of Ubuntu, except that it’s currently a version behind and has mintier branding. The main reason I’d use it is if I wanted to stick with Gnome, which would be a possibility if it weren’t for the fact that I’m really liking XFCE. So, Mint 11 isn’t in my immediate future.

The Debian version of Mint is somewhat enticing. I like the idea of rolling updates. I’m not a huge fan of straight Debian, for no reason other than I have a knee-jerk ideological reaction to software that is too ideological. Software should be practical. Debian is from a planet I’ve only ever visited for short periods. Yes, I know Ubuntu is Debian based, but it is suitably commercialized. Odd position for a Linux fan to take, isn’t it? I am fairly sure I’m not alone. All that being said, I could see myself using and liking this distro, certainly over the vanilla Mint 11 Gnome version.

Xubuntu works reasonably well. The new Ubuntu Software Center stinks. What happened to options and the ability to configure stuff? It’s pretty, but gutted. That’s basically my reaction to the direction Ubuntu is going generally. Ah, for the good old days when all the configurations were in bash and lisp files.

My first step on all three installations, after changing them so the focus follows the mouse properly, was to try to compile CouchDb 1.1. It failed on all three. There seems to be a mismatch between compiler versions and what CouchDb’s configure is expecting. I haven’t taken the time yet to figure out what the problem is. At this point I mostly just need to get on with my coding. The CouchDb binary package available on these distros is out of date. For my purpose on this dev box, it doesn’t matter enough to spend time on it. However, I will need to sort this issue out at some point. By contrast, node.js compiled easily on all three.

For now, I’ll probably use Xubuntu. When I have more time on my hands, it is likely I’ll wander off into a search for a different distro, and move out of the Ubuntu family again. I’ll need to do something with my laptop (the machine I actually work on), which is a lightweight Acer currently running Ubuntu 11.04 with XFCE. I’m open to suggestions, but I guess I’m not in much of a hurry. None of the recent installs on my dev box were exciting enough to make me want to spend more time on it. And for someone who’s spent way too many hours over the past fifteen years or so distro hopping just for fun, that’s too bad.