
Month: June 2011


I am testing out the new replicator db features in CouchDB 1.1 (documented here), and I came across a quirk that took me a while to figure out, so I thought I’d write it up. It isn’t a bug, and it is totally consistent with the rules, but for some reason it was counter-intuitive to me.

The fundamental problem is that I am using slashes in database names. This is fine and supported, but when used in URLs the slashes have to be escaped.

The database I am replicating between machines is called vdsdata/d12/2007. Ordinarily in CouchDB, because it uses HTTP for everything, I'd have to escape that as
"vdsdata%2Fd12%2F2007". For example, if I want to get the status of the database, I'd write:
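A minimal sketch of that request, assuming a CouchDB instance on the default localhost:5984:

```shell
# the slashes in the database name must be escaped as %2F in the URL
curl http://localhost:5984/vdsdata%2Fd12%2F2007
```

CouchDB answers with the database's info document (doc count, disk size, and so on).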

So this habit of always escaping the slashes is ingrained in me, and I always call a URL escape routine in my programs to escape the database names. For example, in the R code I am working on I just call tolower(paste(components,collapse='%2F')).

However, this doesn’t work in the replicator database. As documented, the replicator database entries are of the format:
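The basic shape is a JSON document naming a source and a target; the hostname and id here are made up:

```json
{
  "_id": "my_replication",
  "source": "http://remote.example.com:5984/sourcedb",
  "target": "targetdb"
}
```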

What is going on is that internally CouchDB is not using HTTP to access its databases, and CouchDB knows that its databases are named with slashes or other funny characters. So when I escape the database name in the replicator document, CouchDB is happily doing what I asked and looking for a database with “%2F” in its name. Instead my entry into the replicator database must have the slashes for the local db, even though it still must have the escape for the remote db, since that remote database is accessed over HTTP. The correct entry looks something like:
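Something like the following, with the local name left unescaped and the remote URL escaped (the host and the push direction are my assumptions):

```json
{
  "_id": "push_vdsdata",
  "source": "vdsdata/d12/2007",
  "target": "http://remote.example.com:5984/vdsdata%2Fd12%2F2007"
}
```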

That said, I found and worked around two bugs in RJSONIO's toJSON today. First, my use case: I am saving one document per row of a data frame into CouchDB, so I need to convert each row with toJSON. But if you call it with

docs <- apply(data,1,toJSON)

it is going to break and write out everything as a string. That is, a number such as 0.123 will be written in the database as “0.123” with the quotes. Not so bad for basic numbers, but that gets irritating to handle with exponentials and so on. So instead I had to call it in a loop, once for each row.
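A sketch of that row-by-row loop, on made-up sample data:

```r
library(RJSONIO)  # assumption: toJSON comes from RJSONIO

data <- data.frame(speed   = c(0.123, 4.5),
                   station = c("d12", "d12"),
                   stringsAsFactors = FALSE)

# one toJSON call per row keeps numbers as numbers
docs <- character(nrow(data))
for (i in seq_len(nrow(data))) {
  docs[i] <- toJSON(as.list(data[i, ]))
}
```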

Second, I found out the hard way once I fixed the above bug that NA is coded as…NA, not as null, even though NA is not valid JSON. CouchDB complained by puking on my entire bulk upload, and it took a while to find the problem.

Regex worked well, but I also realized that I can just drop the NA values altogether.
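Both fixes are easy to sketch (the field names here are invented):

```r
# `row` stands for one record as a named list
row <- list(speed = 0.123, count = NA, station = "d12")

# option 1: just drop the NA fields before serializing
clean <- row[!vapply(row, function(v) all(is.na(v)), logical(1))]

# option 2: patch the generated JSON, turning bare NA tokens into null
# (crude: it would also rewrite a string field whose whole value is NA)
json <- '{"speed": 0.123, "count": NA, "station": "d12"}'
json <- gsub("\\bNA\\b", "null", json)
```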

Also, because I am pushing up lots and lots of records, using the default basic reader chewed up a lot of RAM. Instead I hacked the original to make a “null” reader that saves nothing at all.
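I can only sketch what that hack might look like; assuming the uploads go through RCurl, and with the helper name bulkSave and its exact options being my invention:

```r
library(RCurl)    # assumption: the HTTP calls go through RCurl
library(RJSONIO)

# a "null" reader: acknowledge the response bytes but keep nothing in memory
nullReader <- function(txt) nchar(txt, type = "bytes")

# hypothetical helper: POST a ready-made JSON body to the bulk-docs endpoint,
# discarding the server's reply instead of gathering it into a buffer
bulkSave <- function(uri, body) {
  curlPerform(url = uri,
              httpheader = c("Content-Type" = "application/json"),
              postfields = body,
              writefunction = nullReader)
}
```

A call would look like bulkSave(uri, body) with uri pointing at the database's _bulk_docs endpoint.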

where the variable uri is the usual CouchDB endpoint for bulk uploads.

Update

Idiot!

Perhaps this is why I blog. I don’t do pair programming or whatever, but here I am, sticking my code out on the internet with horrendously stupid errors!

Of course, I don’t need my toJSON loop! All I need to do is create a structure and dump it that way. I ignored my first rule of R programming: loops are bad.

So, to avoid the loop, I just let the much smarter person who programmed RJSONIO do the loop unrolling. Instead of the apply() silliness, and instead of the loop, all I needed to do was create a list with a single element "docs" equal to the data.frame I wanted to store. In short, I recoded the above loop as
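The recoded call, on sample data, was essentially:

```r
library(RJSONIO)  # assumed

data <- data.frame(speed   = c(0.123, 4.5),
                   station = c("d12", "d12"),
                   stringsAsFactors = FALSE)

# serialize everything in one shot
payload <- toJSON(list(docs = data))
```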

That is at least a 10x speed up over the loopy way, probably more but I can’t be bothered finding out exactly how bad that loop was.

Update

Idiot!

Wrong again! The problem with the above code is that I didn't inspect the generated JSON. What toJSON is doing is evaluating the data.frame as a list; that is, instead of row-wise processing, toJSON is processing each column in order. That is because a data.frame is really a list of column vectors, each of the same length. So although that approach is fast, it doesn't work.

Which led me to the reason toJSON seems to have a bug when applied using apply(…): apply coerces its argument into a matrix first, so a data frame with mixed character and numeric columns gets converted into an all-character matrix.
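The coercion is easy to see in plain R:

```r
df <- data.frame(speed   = c(0.123, 4.5),
                 station = c("d12", "d13"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)  # this is the coercion apply() performs internally
is.character(m)     # TRUE: the numeric column has been stringified
```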

I took a stab at using the plyr library, but that was slower. I then took a look at the foreach library, but that was slower still.

My new champ is once again my old, super slow loop!

But just because I made a mistake doesn't mean I shouldn't persist. "Loops are bad" is an important rule when R code is slow, and my code is taking way too long to write out data, so that loop has got to go.

Realizing what was behind the “bug” with apply(*,1,toJSON) finally gave me the solution. What I had to do was split the data into numeric and text columns, separately apply toJSON, and then recombine the result.
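On sample data, the split step looks something like this (my reconstruction of the approach):

```r
library(RJSONIO)  # assumed

data <- data.frame(speed   = c(0.123, 4.5),
                   station = c("d12", "d13"),
                   stringsAsFactors = FALSE)

num <- vapply(data, is.numeric, logical(1))

# each apply() now sees a single-type matrix, so nothing gets stringified
numjson  <- apply(data[,  num, drop = FALSE], 1, toJSON)
charjson <- apply(data[, !num, drop = FALSE], 1, toJSON)
```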

A few more problems presented themselves. First, each run of toJSON() produces an object, wrapped in curly braces. So the call to paste(), while it correctly combines the JSON strings row-wise, buries inside each "row" the invalid sequence "} {" where the numeric "object" ends and the text "object" begins. With a little regular expression glue, the correct paste line becomes:
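Reconstructing it, with numjson and charjson standing in for the per-row output of toJSON on the two halves:

```r
# invented sample fragments, as toJSON would produce per row
numjson  <- c('{"speed": 0.123}', '{"speed": 4.5}')
charjson <- c('{"station": "d12"}', '{"station": "d13"}')

# paste() glues the halves row-wise but leaves an invalid "} {" at the seam;
# the regex merges the two objects into one
docs <- gsub("\\}\\s*\\{", ", ", paste(numjson, charjson))
```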