Amit Rathore blogs about software development

Why Datomic?

Many of you know we’re using Datomic for all our storage needs for Zolodeck. It’s an extremely new database (not even version 1.0 yet), and is not open-source. So why would we want to base our startup on something like it, especially when we have to pay for it? I’ve been asked this question a number of times, so I figured I’d blog about my reasons:

This is a long list, but perhaps begins to explain why Datomic is such an amazing step forward. Ping me with questions if you have ‘em! And as far as the last point goes, I’ve talked about our technology choices and how they fit in with each other at the Strange Loop conference last year. Here’s a video of that talk.

4 thoughts on “Why Datomic?”

It does look very interesting. It's one of only three distributed and consistent databases I know of (the others are Google Spanner and HyperDex).

However, I have a few concerns:
– it’s on the JVM, that’s less than ideal
– queries appear to require fetching a lot of data, so frequent queries on fresh data are likely expensive
– writes appear to be a significant bottleneck
– there still doesn’t seem to be a good way to destroy data.

Also, my biggest concern is that it’s closed source. It’s hard to put up with that when there are so many good open source alternatives.

Lucian – addressing your concerns:
“it’s on the JVM, that’s less than ideal” – why is that less than ideal?

“queries appear to require fetching a lot of data, so frequent queries on fresh data are likely expensive” – on the first part of your statement: it depends on the data, but in many use cases the data would fit entirely in a peer’s cache. A peer takes a performance hit on a cold start, but after that the data is local to the peer. To get new data, a peer only needs to transfer the difference that is relevant to a query, or you can program a peer to subscribe to new data as it’s accumulated.

“writes appear to be a significant bottleneck” – mostly no. Because the writer (the Transactor) is dedicated to that one task, in most situations it isn’t actually possible to max out the system with writes. If you feel you need infinite write scaling, then Datomic isn’t your solution, but I would encourage you to evaluate whether the ongoing novelty of your data could really saturate I/O. The above post cites the ability to write hundreds of thousands of tuples per second. How many use cases require more throughput than that, outside of infrequent bulk imports of data?

“there still doesn’t seem to be a good way to destroy data” – other people could give better answers to this, but I can tell you what I plan to do for a project I’ve been working on. I’m using Datomic to store parsed syslog data, and each day’s worth of data will be stored in its own database. With Datalog you can run your queries across multiple databases. When I’m ready to expire data, I’ll just stop querying the older databases and then delete them.
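To make the plan above concrete, here is a rough Python sketch of the "one database per day" idea. It is a toy in-memory model only: `DailyStore` and its methods are hypothetical names made up for illustration, and none of it is Datomic's API. It just shows the shape of the approach, where queries span the databases inside a retention window and expiry drops whole databases.

```python
from datetime import date, timedelta


class DailyStore:
    """Toy stand-in for 'one database per day'. Hypothetical, not Datomic's API."""

    def __init__(self):
        # Maps a calendar day to that day's "database": a list of (host, message) facts.
        self._dbs = {}

    def write(self, day, host, message):
        """Append a parsed syslog fact to the database for the given day."""
        self._dbs.setdefault(day, []).append((host, message))

    def query(self, host, today, window_days):
        """Return messages for a host from every per-day database inside the window."""
        cutoff = today - timedelta(days=window_days - 1)
        return [
            msg
            for day in sorted(self._dbs)
            if cutoff <= day <= today
            for (h, msg) in self._dbs[day]
            if h == host
        ]

    def expire(self, today, window_days):
        """Delete whole per-day databases older than the window; return the days dropped."""
        cutoff = today - timedelta(days=window_days - 1)
        expired = sorted(day for day in self._dbs if day < cutoff)
        for day in expired:
            del self._dbs[day]
        return expired
```

The point of the design is that expiry never has to hunt for individual facts: dropping a day means dropping one whole database, and queries simply stop including it.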

The JVM has only ever annoyed me. I try to stay away from it if I can, especially since I primarily use Python. It’s also the reason I’ve used Clojure(Script) less.

I have several use cases where peers couldn’t possibly store all the data locally. Also, having to wait a long time on the first query for the cache to get populated is not acceptable; I want my queries to have predictable latency. Perhaps copying the db to peers as part of the bootstrap step would remove the initial latency, but I’m still not sure what to do about replicating a very large database to every single worker.

The transactor may be fast, but there are at least two network hops’ worth of latency (writer to transactor, transactor notification to reader). That is less than ideal, but perhaps not a problem in practice, as you say.

Being able to (rarely) destroy data entirely is necessary for legal reasons. Users might wish to have all of their data removed, which means it would have to be hunted down in all local caches, or something to that effect.

I’m evaluating Datomic as the primary storage layer in an application that I am writing. Are there any resources out there that you would recommend to give a good overview? I am intrigued that it has full-text search baked in via Lucene, since full text search in the storage layer itself is definitely something that would be useful in my app.