I've been doing some performance testing of the various ways that
attachments can be uploaded to CouchDB. I think that what I'm seeing
points to some pathological behavior inside Couch, but that's just a
guess (I don't really know anything about Couch internals). However,
if I'm understanding the implications correctly, it might be possible
to make replication much, much faster for large attachments (by
speeding up the multipart API).
To get the data yourself, run 'python makedata.py' once, and then
repeatedly run 'bash do-curls.sh' to get timing information (perhaps
while making performance tweaks, if you're a dev). Code is on github:
https://github.com/wickedgrey/couchdb-attachment-speed
It's a bit janky, but gets the job done. The main takeaway: the
multipart API is just as slow as base64 encoding everything. Going by
the larger sizes in the table below, expect to pay somewhere between
a 5x and 10x performance penalty for using either API vs. uploading
the attachment separately.
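For anyone not familiar with the three upload styles being compared, here is a rough sketch of what each request body looks like. This is not the code from the repository; the function names, the boundary string and the sample document are made up for illustration, and it only builds the bodies without talking to a server. The raw case is simply the attachment bytes sent with PUT to /db/docid/attachmentname?rev=..., so it needs no helper.

```python
import base64
import json


def inline_doc_body(doc, name, data, content_type):
    """JSON body for PUT /db/docid with the attachment base64-encoded inline."""
    body = dict(doc)
    body["_attachments"] = {
        name: {
            "content_type": content_type,
            "data": base64.b64encode(data).decode("ascii"),
        }
    }
    return json.dumps(body).encode("utf-8")


def multipart_doc_body(doc, name, data, content_type, boundary="example-boundary"):
    """multipart/related body for PUT /db/docid: JSON part first, then raw bytes.

    The request would also need the header
    Content-Type: multipart/related; boundary="example-boundary"
    """
    body = dict(doc)
    body["_attachments"] = {
        name: {"follows": True, "content_type": content_type, "length": len(data)}
    }
    sep = b"--" + boundary.encode("ascii")
    parts = [
        sep,
        b"Content-Type: application/json",
        b"",
        json.dumps(body).encode("utf-8"),
        sep,
        b"Content-Type: " + content_type.encode("ascii"),
        b"",
        data,  # attachment bytes go over the wire unencoded
        sep + b"--",
    ]
    return b"\r\n".join(parts)


# Hypothetical sample document and payload, just to compare body sizes.
payload = b"x" * 1000
doc = {"_id": "example-doc"}
inline = inline_doc_body(doc, "blob.bin", payload, "application/octet-stream")
multi = multipart_doc_body(doc, "blob.bin", payload, "application/octet-stream")
print(len(payload), len(inline), len(multi))
```

The multipart body carries the attachment bytes as-is, so on paper it should avoid the base64 size and encoding overhead, which is why its timings are surprising.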
All of the tests were run against a local CouchDB 1.1.1 recently
installed via brew, with delayed_commits set to false. Hardware was a
2010 MacBook Pro w/ 8GB of RAM, lightly loaded (browser and IDE
running but idle while the tests were run). The general shape of the
timing data didn't change over multiple runs. I haven't looked into
Couch memory or CPU usage while handling the uploads.
n  raw        base64     multipart  py b64 encode   py b64 decode
1  0m0.136s   0m0.014s   0m0.013s   0:00:00.000015  0:00:00.000009
2  0m0.014s   0m0.016s   0m0.015s   0:00:00.000012  0:00:00.000011
3  0m0.015s   0m0.017s   0m1.027s   0:00:00.000016  0:00:00.000021
4  0m0.015s   0m0.018s   0m2.020s   0:00:00.000057  0:00:00.000090
5  0m0.017s   0m0.035s   0m2.027s   0:00:00.000361  0:00:00.000801
6  0m0.054s   0m0.202s   0m1.133s   0:00:00.003541  0:00:00.005455
7  0m0.361s   0m1.859s   0m2.318s   0:00:00.043847  0:00:00.059307
8  0m3.531s   0m19.336s  0m15.820s  0:00:00.472431  0:00:00.822210
9  0m36.594s  3m24.152s  5m45.110s  ?               ?
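To put numbers on the slowdown, the wall-clock columns can be converted to seconds and divided by the raw time. The snippet below is just a convenience sketch over the three largest rows of the table; the helper names are mine and are not part of the repository.

```python
import re

# Wall-clock timings (raw, base64, multipart) copied from rows n = 7, 8, 9.
TIMINGS = {
    7: ("0m0.361s", "0m1.859s", "0m2.318s"),
    8: ("0m3.531s", "0m19.336s", "0m15.820s"),
    9: ("0m36.594s", "3m24.152s", "5m45.110s"),
}


def to_seconds(text):
    """Convert a bash `time` style value such as '3m24.152s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError("unrecognized time format: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


def slowdowns(raw, base64_time, multipart):
    """Return (base64 / raw, multipart / raw) as plain ratios."""
    raw_s = to_seconds(raw)
    return to_seconds(base64_time) / raw_s, to_seconds(multipart) / raw_s


for n, row in sorted(TIMINGS.items()):
    b64_x, mp_x = slowdowns(*row)
    print("n=%d  base64 %.1fx  multipart %.1fx" % (n, b64_x, mp_x))
```

The smaller rows are left out because their timings are dominated by fixed overhead rather than by attachment size.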
One interesting issue I ran into while constructing the data was
trying to run a gig of text data through the Python JSON parser. It
seemed that a couple of copies of the data were being made (I'd guess
the original data, then an escaped version, and then the final
string?), which slowed things down quite a bit.
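One way to check the extra-copies guess is to watch peak memory while a payload is encoded. The following is a small sketch, assuming Python 3 (for tracemalloc) and a base64-encoded inline attachment rather than the exact data makedata.py produces; the function name and the 4 MB payload size are arbitrary choices for illustration.

```python
import base64
import json
import tracemalloc


def peak_bytes_for_inline_encoding(payload):
    """Return (peak traced bytes, JSON length) for building an inline-attachment doc."""
    tracemalloc.start()
    try:
        # Copy 1: the base64 text of the payload.
        encoded = base64.b64encode(payload).decode("ascii")
        # Copy 2 (at least): the serialized JSON string containing that text.
        doc = json.dumps({"_attachments": {"blob": {"data": encoded}}})
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, len(doc)


# The payload itself is allocated before tracing starts, so it is not counted.
payload = b"x" * (4 * 1024 * 1024)
peak, doc_len = peak_bytes_for_inline_encoding(payload)
print("payload %d bytes, JSON %d bytes, peak %d bytes" % (len(payload), doc_len, peak))
```

If the peak comes out at several times the payload size, that would be consistent with multiple intermediate copies, though it says nothing about where the time goes.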
The current state of affairs is especially frustrating for me, since
my use case doesn't permit having documents in an attachment-less
(read: inconsistent) state. My ideal case would be to have the multipart
API:
- Sped up to be roughly the same speed as standalone attachments
- Extended/changed/supplemented to allow for multiple documents at
once, like the bulk API.
In any case, thanks for reading. I hope this helps make CouchDB even
better. :)
Cheers,
Eli