CouchDB for call analysis data – a case study

At Nu Echo, we’ve been developing and refining our own VoiceXML application framework for years now. As part of our nth rewrite (and I’ll talk more about that rewrite and why we did it in another post), we decided to experiment with CouchDB. (For those new to CouchDB, it’s a schema-less, document-oriented database, one of the so-called NoSQL databases.)

The first area where we saw a fit for CouchDB was the storage of call analysis data. This data consists of various attributes associated with a call, information about each interaction (like recognition results) and each transaction (groups of interactions). It can also be augmented with the recordings saved by the ASR engine. Call analysis data is used by our call viewer tool to listen to calls, search for calls exhibiting some specific caller behaviors, produce reports, etc.

In the previous incarnation of our framework, call analysis data was stored on disk in a plain text file, and optionally in a SQL database. Due to the richness of our model, the SQL schema consisted of about 15 tables, and the representation of the same data in the text file was quite complex (tab-separated values, with some fields encoded in JSON format). At the end of each call, the data collected during the call was written to disk and optionally stored in the SQL database. We also had a script that could read all the files on disk and push the data into the SQL database at a later time.

Adding support for CouchDB

The very first step toward supporting CouchDB was to rewrite the serialization code to produce JSON-encoded call analysis data instead of our complicated text format. Now, the data for each call is written as a single JSON object on its own line, prefixed by the call ID. This greatly simplified the code that reads the data back into memory.
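To make the format concrete, here is a minimal Python sketch of writing and reading one call per line. The tab separator between the call ID and the JSON object, and all the field names, are assumptions made for illustration; they are not the framework's actual layout.

```python
import json


def serialize_call(call_id, call_data):
    """Return one line: the call ID, a tab, then the call as compact JSON."""
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return f"{call_id}\t{json.dumps(call_data, separators=(',', ':'))}"


def parse_call_line(line):
    """Split a line back into (call_id, call_data)."""
    call_id, _, payload = line.rstrip("\n").partition("\t")
    return call_id, json.loads(payload)


# Hypothetical call record; the attribute names are made up for this example.
call = {
    "attributes": {"ani": "5145550100"},
    "interactions": [{"name": "MainMenu", "result": "billing"}],
}
line = serialize_call("call-0001", call)
restored_id, restored_call = parse_call_line(line)
```

Reading a call back is a single split followed by a JSON parse, which is where most of the simplification comes from.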

The next step was to write a script to push the data to CouchDB. The script simply reads the call data, one call per line, and POSTs the calls to CouchDB in batches of 100 through the bulk document API (_bulk_docs) to improve performance.
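A loader along these lines could look roughly like the sketch below. It is not our actual script: it assumes each line holds a call ID, a tab, and a JSON object, it reuses the call ID as the document _id, and the database URL is a placeholder. CouchDB's bulk endpoint, _bulk_docs, takes a POST whose body is a JSON object with a "docs" array.

```python
import json
import urllib.request


def read_calls(lines):
    """Yield one CouchDB document per input line, using the call ID as _id."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        call_id, _, payload = line.partition("\t")
        doc = json.loads(payload)
        doc["_id"] = call_id
        yield doc


def batches(docs, size=100):
    """Group documents into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_request(db_url, batch):
    """Build the POST request for CouchDB's _bulk_docs endpoint."""
    body = json.dumps({"docs": batch}).encode("utf-8")
    return urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push_calls(db_url, lines, size=100):
    """Send every call to CouchDB, one bulk request per batch."""
    for batch in batches(read_calls(lines), size):
        with urllib.request.urlopen(bulk_request(db_url, batch)) as response:
            response.read()  # per-document statuses; ignored in this sketch
```

With a file of calls, usage would be something like push_calls("http://localhost:5984/calls", open("calls.txt")), where both the URL and the file name are placeholders. A real loader would also inspect the response, since _bulk_docs reports success or failure for each document individually.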

Finally, we had to rewrite the part of our call viewer tool that connects to a database to retrieve the calls matching given patterns. It relies on a few simple CouchDB views, but only lightly, so that the call viewer stays as independent as possible of the database layer (it can also retrieve calls from text files).
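The split between a coarse view query and finer filtering in the client can be sketched as follows. The design document name, the view name (a view keyed by call date) and the document fields are all hypothetical; only the URL layout and the JSON-encoded key parameters follow CouchDB's view API.

```python
import json
from urllib.parse import urlencode


def view_url(db_url, design, view, startkey, endkey):
    """Build a view query URL; CouchDB expects key parameters as JSON."""
    query = urlencode({
        "startkey": json.dumps(startkey),
        "endkey": json.dumps(endkey),
        "include_docs": "true",
    })
    return f"{db_url.rstrip('/')}/_design/{design}/_view/{view}?{query}"


def filter_calls(rows, predicate):
    """Apply a client-side filter to the documents of a view response."""
    return [row["doc"] for row in rows if predicate(row["doc"])]


def has_no_match(doc):
    """Hypothetical pattern: the call contains at least one no-match."""
    return any(i.get("result") == "no-match" for i in doc.get("interactions", []))


url = view_url("http://localhost:5984/calls", "calls", "by_date",
               "2010-01-01", "2010-01-31")

# Rows shaped like the "rows" array of a view response with include_docs=true.
rows = [
    {"id": "c1", "key": "2010-01-02", "value": None,
     "doc": {"_id": "c1", "interactions": [{"result": "no-match"}]}},
    {"id": "c2", "key": "2010-01-03", "value": None,
     "doc": {"_id": "c2", "interactions": [{"result": "billing"}]}},
]
matching = filter_calls(rows, has_no_match)
```

Because filter_calls only sees plain dictionaries, the same predicate works on calls read from text files, which is the kind of independence from the database layer described above.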

Benefits

We obtained several benefits by moving to CouchDB:

Performance – Loading call analysis data into CouchDB is much faster than putting the same data in a MySQL database. Our preliminary results show a speedup factor of about 100 (this does not take the loading of audio recordings into account, though). Admittedly, we are comparing apples and oranges: CouchDB does not update its view indexes until they are requested, while MySQL updates its indexes as rows are inserted, and only a single document is inserted in CouchDB, compared to many rows spread over about 15 tables in SQL. On the other hand, if insertions are done at application runtime (after the completion of each call), they had better be fast, especially if the IVR handles many hundreds (if not thousands) of ports.

Evolution – Making modifications to a complex schema is painful, especially when you have applications deployed in the field. As documents do not have to follow a rigid schema, it is much easier to adapt our code to handle multiple versions of the data.

Attachments – Even if audio recordings can be stored in a traditional SQL database as blobs, a custom application is still required to access them. With CouchDB, recordings are stored as attachments to the JSON document for the corresponding call. Moreover, these recordings are easily accessible to other tools, since CouchDB is itself a web server and all documents and attachments have a URL.
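To illustrate the last point, an attachment lives at the database name, then the document ID, then the attachment name. The helper below is only a sketch; the server address, database name, call ID and file name are made up.

```python
from urllib.parse import quote


def attachment_url(server_url, database, doc_id, attachment_name):
    """Build the URL of an attachment: /{db}/{doc_id}/{attachment_name}."""
    parts = (database, doc_id, attachment_name)
    # Percent-encode each path segment so spaces and slashes stay inside it.
    return server_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


recording = attachment_url(
    "http://localhost:5984", "calls", "call-0001", "utterance 1.wav"
)
```

Any HTTP client, including a browser or an audio player that accepts URLs, can then fetch the recording with a plain GET.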

Conclusion

Of course, there is no panacea and CouchDB is no exception. There are still some aspects of our system for which CouchDB does not provide a better solution than a SQL database. One of them is support for custom queries. In the call viewer tool, it was possible to write custom SQL queries to find calls matching very specific criteria. CouchDB does support temporary views to do something equivalent, but the main problem is the time taken to build the view. With hundreds of thousands or even millions of calls in the database, creating a temporary view can take a long time (several minutes). Not so good for an interactive tool.

But overall, we have been very pleased by the performance of CouchDB and the flexibility it gives us.

5 Comments

I’ve encountered the same issue with the occasional need to write custom views against CouchDB databases with a large number of documents. The time needed to build the index for the view can be annoying and, sometimes, a deal breaker.

Fascinating – we’ve been using CouchDB for virtually the same reasons for the last few years (of course, the fact that our entire PBX/infrastructure is Erlang-based made this a bit of a no-brainer too). A few issues that we ran into were:
– Index sizes, especially given the amount of JSON in a typical CDR. We resolved this by hacking up an Mnesia-based sequence for _id values (couch btrees grow *much* more slowly if you use a sequence)
– Query bloat, given that once people figure out that they have access to not just Big Data, but Rich Data, they want to query stuff all over the place. You quickly learn that you really need to *think* about the Views well before you write them (we frequently end up filtering in code)
– Performance. Simple reporting needs are easy, but if you need to do real-time access to stuff in the CDRs (“give me the last 20 calls that showed up from this DID”), and all 1000 incoming calls are trying to do the same thing at the same time, well, it doesn’t work (and no, BigCouch doesn’t quite get there either). We’ve built out our own caching layer on top of this, and it helps. Maybe “couchbase” will help…

The first issue you mention (index sizes) is not something we’ve experienced (yet!). We are at the very beginning of our experiments. But regarding issue #2, you are perfectly right. We are also using a combination of Couch views and client-side filtering, but we are still looking for the right balance.