Dissecting SimpleDB BoxUsage

Billing for usage of a database server which is shared between many
customers is hard. You can't just measure the size of databases, since
a heavily used 1 GB database is far more resource-intensive than a
lightly used 100 GB database; you can't just count queries, since some
queries require far more CPU time -- or disk accesses -- than others;
and you can't even time how long queries take, since modern databases
can handle several queries in parallel, overlapping one query's CPU
time with another query's disk time. When Amazon launched their
SimpleDB service, it looked like they had found a solution in BoxUsage:
As the website states,

"Amazon SimpleDB measures the machine utilization of each request and
charges based on the amount of machine capacity used to complete the
particular request [...]"

and reports back a BoxUsage value in every response returned by
SimpleDB. Sadly, this "measurement" is fictitious: With the possible
exception of Query requests, BoxUsage values returned by SimpleDB are
entirely synthetic.

Take creating a domain, for example. Issue a CreateDomain request, and
SimpleDB will report back to you that it took 0.0055590278 machine hours
-- never 0.0055590277 or 0.0055590279 hours, always exactly
0.0055590278 machine hours. Deleting a domain? Exactly the same:
Whether the domain is empty or contains lots of items -- for that
matter, even if the domain doesn't exist -- the BoxUsage reported will
be exactly 0.0055590278 hours. Listing the domains you have? That
costs 0.0000071759 hours -- again, never even a tenth of a nano-hour
more or less.

So much for domains; what about storing, retrieving, and deleting data?
Issue a PutAttributes call with one attribute, and it will cost
0.0000219909 hours -- no matter if the item already exists or not, no
matter if the item name, attribute name, and value are one character
long or 100 characters long. Issue a PutAttributes call with two
attributes, and it will cost 0.0000219923 hours. Three attributes
costs 0.0000219961 hours. Four attributes costs 0.0000220035 hours.
See the pattern yet? If not, don't worry -- it took me a while to
figure this one out, mostly because it was so surprising: A PutAttributes
call with N attributes costs 0.0000219907 + 0.0000000002 N^3 hours.
Yes, that's right: The cost is cubic in the number of attributes
-- and I can't imagine any even remotely sane algorithm which would end
up with an O(N^3) cost.
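
The fit is easy to check by hand, but a few lines of Python make it explicit. This is only a sketch: the four values are the ones quoted above, the function names are made up for illustration, and everything is done in integer units of 1e-10 hours so that floating-point rounding cannot blur the comparison.

```python
from decimal import Decimal

# BoxUsage values for PutAttributes as quoted in the text, keyed by the
# number of attributes in the call. Kept as strings so no digits are lost.
OBSERVED_HOURS = {
    1: "0.0000219909",
    2: "0.0000219923",
    3: "0.0000219961",
    4: "0.0000220035",
}

UNITS_PER_HOUR = 10**10  # one unit = 1e-10 hours, the last reported digit


def to_units(hours: str) -> int:
    """Convert a decimal string of hours into integer units of 1e-10 hours."""
    return int(Decimal(hours) * UNITS_PER_HOUR)


def put_attributes_units(n: int) -> int:
    """Formula from the text: 0.0000219907 + 0.0000000002 * N^3 hours."""
    return 219907 + 2 * n**3


def formula_matches_observations() -> bool:
    """True if the cubic reproduces every quoted value exactly."""
    return all(
        to_units(hours) == put_attributes_units(n)
        for n, hours in OBSERVED_HOURS.items()
    )


if __name__ == "__main__":
    for n, hours in OBSERVED_HOURS.items():
        print(n, to_units(hours), put_attributes_units(n))
    print("formula matches:", formula_matches_observations())
```

Four points always determine some cubic; what is striking here is that the fitted cubic has no linear or quadratic term at all.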

Retrieving stored data is cheaper: A GetAttributes call which
returns N attribute-value pairs costs 0.0000093202 + 0.0000000020 N^2
hours (since the pricing depends on the number of values returned, not
the number of values in the item in question, there's good incentive to
specify which attributes you're interested in when you send a
GetAttributes request). Deleting stored data? Back to cubic again:
A DeleteAttributes call with N attributes specified costs 0.0000219907
+ 0.0000000002 N^3 hours -- exactly the same as a PutAttributes call
with the same number of attributes. Of course, DeleteAttributes has
the advantage that you can specify just the item name and not provide
any attribute names, in which case all of the attributes associated
with the item will be deleted -- and if you do this, the reported
BoxUsage is 0.0000219907 hours, just like the formula predicts with N = 0.
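
To see how much the quadratic term matters, here is a small sketch of the two formulas above. The function names and the 100-value item are hypothetical; the constants are the ones quoted in the text, expressed in integer units of 1e-10 hours.

```python
def get_attributes_units(n_returned: int) -> int:
    """GetAttributes: 0.0000093202 + 0.0000000020 * N^2 hours, N = values returned."""
    return 93202 + 20 * n_returned**2


def delete_attributes_units(n_named: int) -> int:
    """DeleteAttributes: 0.0000219907 + 0.0000000002 * N^3 hours, N = attributes named."""
    return 219907 + 2 * n_named**3


def as_hours(units: int) -> str:
    """Render integer units of 1e-10 hours as a ten-decimal string of hours."""
    # Valid for values below one hour, which covers every request discussed here.
    return f"0.{units:010d}"


if __name__ == "__main__":
    # Hypothetical item holding 100 values: fetch everything vs. only 5 values.
    print("get all 100:", as_hours(get_attributes_units(100)))
    print("get 5 only: ", as_hours(get_attributes_units(5)))
    # Deleting the whole item by name only corresponds to N = 0.
    print("delete item:", as_hours(delete_attributes_units(0)))
```

Under these formulas, asking for 5 values instead of all 100 cuts the reported BoxUsage from 0.0000293202 to 0.0000093702 hours.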

The last type of SimpleDB request is a Query: "Tell me the names of
items matching the following criteria". Here SimpleDB might
actually be measuring machine utilization -- but I doubt it. More
likely, the formula just happens to be sufficiently complicated that
I haven't been able to work it out. What I can say is that a Query
of the form [ 'foo' = 'bar' ] -- that is, "Tell me the names of the
items which have the value 'bar' associated with the attribute 'foo'"
-- costs 0.0000140000 + 0.0000000080 N hours, where N is the number
of matching items; and that even for the more complicated queries which
I tried, the cost was always a multiple of 0.0000000040 hours.

Now, there are a lot of odd-looking numbers here -- the variable costs
are all small multiples of a tenth of a nano-hour, and the overhead
cost of a Query is 14 micro-hours, but the others look rather strange.
Convert them to seconds and apply rational reconstruction, however, and
they make a bit more sense:

0.0055590278 hours = 4803 / 240 seconds.

0.0000071759 hours = (31/5) / 240 seconds.

0.0000219907 hours = 19 / 240 seconds.

0.0000093202 hours = (153/19) / 240 seconds.

Where the fifths and (shudder) nineteenths are coming from, I have no
idea; but the odds of these numbers all turning out to be small
rational multiples of 1/240 seconds by coincidence are astronomical.
Where does this unit come from? I can only speculate, but a typical
high-performance drive can do approximately 240 small random disk
accesses per second -- so if Amazon somehow decided that a
PutAttributes call involved 19 disk writes, this overhead cost would
make sense... although I can't imagine how a GetAttributes request
would require 153/19 disk accesses to service. (How a PutAttributes
call could involve 19 disk accesses is also an interesting question
-- but maybe, as with the BoxUsage being cubic in the number of
attributes being stored, Amazon's developers have discovered some very
innovatively bad algorithms.)
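
The rational reconstruction above can be reproduced with Python's standard fractions module. This is a sketch rather than the method actually used to find these values: the bound of 20 on the denominator is an arbitrary choice made for illustration, and the function names are invented.

```python
from fractions import Fraction

TICKS_PER_HOUR = 3600 * 240  # number of 1/240-second units in one hour

# Fixed BoxUsage values and overheads quoted in the text, as decimal strings.
REPORTED_HOURS = {
    "CreateDomain / DeleteDomain": "0.0055590278",
    "ListDomains": "0.0000071759",
    "PutAttributes / DeleteAttributes overhead": "0.0000219907",
    "GetAttributes overhead": "0.0000093202",
}


def reconstruct(hours: str, max_denominator: int = 20) -> Fraction:
    """Closest fraction, in 1/240-second units, with a small denominator.

    max_denominator=20 is an illustrative bound, not something from the text.
    """
    ticks = Fraction(hours) * TICKS_PER_HOUR
    return ticks.limit_denominator(max_denominator)


def rounds_back(ticks: Fraction, hours: str) -> bool:
    """True if the candidate, converted back to hours, rounds to the reported value."""
    return round(ticks / TICKS_PER_HOUR, 10) == Fraction(hours)


if __name__ == "__main__":
    for name, hours in REPORTED_HOURS.items():
        ticks = reconstruct(hours)
        print(f"{name}: {hours} hours ~ ({ticks}) / 240 seconds,",
              "rounds back:", rounds_back(ticks, hours))
```

The second check is the important one: each reconstructed fraction, converted back to hours and rounded to ten decimal places, reproduces the reported BoxUsage digit for digit.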

Putting this all together, here's what SimpleDB requests cost (at
least right now); μ$ means millionths of a dollar (or dollars per
million requests):
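
The per-request prices follow mechanically from the formulas above once a price per machine hour is fixed. The sketch below assumes $0.14 per machine hour; that figure is not stated in the text and is inferred from the 1.305 μ$ quoted for a one-value GetAttributes request. The names are hypothetical, and the Query line covers only the simple [ 'foo' = 'bar' ] form.

```python
from decimal import Decimal

# ASSUMPTION: $0.14 per machine hour. The text does not state this price; it is
# inferred from the quoted 1.305 micro-dollars per one-value GetAttributes call.
DOLLARS_PER_MACHINE_HOUR = Decimal("0.14")

# BoxUsage formulas from the text, in integer units of 1e-10 machine hours.
# n is the number of attributes, returned values, or matching items.
FORMULAS = {
    "CreateDomain / DeleteDomain": lambda n: 55590278,
    "ListDomains": lambda n: 71759,
    "PutAttributes": lambda n: 219907 + 2 * n**3,
    "DeleteAttributes": lambda n: 219907 + 2 * n**3,
    "GetAttributes": lambda n: 93202 + 20 * n**2,
    "Query ['foo' = 'bar']": lambda n: 140000 + 80 * n,
}


def micro_dollars(units: int) -> Decimal:
    """Convert integer units of 1e-10 machine hours into micro-dollars."""
    hours = Decimal(units) / Decimal(10**10)
    return hours * DOLLARS_PER_MACHINE_HOUR * 1_000_000


if __name__ == "__main__":
    n = 1  # one attribute, one returned value, or one matching item
    for name, formula in FORMULAS.items():
        print(f"{name:30s} {round(micro_dollars(formula(n)), 3)} micro-dollars")
```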

What can we conclude from this? First, if you want to Put 53 or
more attributes associated with a single item, it's cheaper to use
two or more requests due to the bizarre cubic cost formula. Second,
if you want to Get attributes and expect to have 97 or more values
returned, it's cheaper to make two requests, each of which asks for a
subset of the attributes. Third, if you have an item with only one
attribute, and your read:write ratio is more than 22:1, it's cheaper
to use S3 instead of SimpleDB -- even ignoring the storage cost --
since S3's 1 μ$ per GET is cheaper than SimpleDB's 1.305 μ$ per
GetAttributes request. Fourth, someone at Amazon was smoking something
interesting, since there's no way that a PutAttributes call should have
a cost which is cubic in the number of attributes being stored.
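
The break-even points for splitting a request can be checked by brute force. This sketch assumes a request is split into two halves that are as even as possible, which is the cheapest two-way split for a convex cost; the function names are invented, and the costs are in integer units of 1e-10 hours.

```python
from typing import Callable, Optional


def put_units(n: int) -> int:
    """PutAttributes BoxUsage for n attributes."""
    return 219907 + 2 * n**3


def get_units(n: int) -> int:
    """GetAttributes BoxUsage for n returned values."""
    return 93202 + 20 * n**2


def split_cost(cost: Callable[[int], int], n: int) -> int:
    """Cost of handling n attributes as two requests of near-equal size."""
    half = n // 2
    return cost(half) + cost(n - half)


def first_n_where_split_wins(cost: Callable[[int], int],
                             limit: int = 1000) -> Optional[int]:
    """Smallest n below limit for which two requests cost less than one."""
    for n in range(2, limit):
        if split_cost(cost, n) < cost(n):
            return n
    return None


if __name__ == "__main__":
    print("PutAttributes: split from N =", first_n_where_split_wins(put_units))
    print("GetAttributes: split from N =", first_n_where_split_wins(get_units))
```

Because the fixed overhead is paid once per request while the variable part grows as N^3 or N^2, there is always some N beyond which a second overhead is cheaper than the extra growth.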

And finally, given that all of these costs are repeatable down to a
fraction of a microsecond: Someone at Amazon may well have determined
that these formulas provide good estimates of the amount of machine
capacity needed to service requests; but these are most definitely
not measurements of anything.