AWS signature version 1 is insecure

The important bit first: If you are making Query (aka REST) requests
to Amazon SimpleDB, to Amazon Elastic Compute Cloud (EC2), or to Amazon
Simple Queue Service (SQS) over HTTP, and there is any way for an
attacker to provide you with data which you use to construct your
request, switch to HTTPS or start using AWS signature version 2 now.
For example, if you allow users to add arbitrary "tags" to documents,
and you use SimpleDB to store those tags, this means you. (Amazon
Flexible Payments Service (FPS) and Amazon Devpay also use the same
insecure signature method, but they already require the use of HTTPS.
Amazon S3 and other services use different signature methods.)

I've been sitting on this blog post since May 1st, when I was reading
documentation in preparation for writing the accounting code for my
tarsnap online backup service
and I first noticed that AWS signature version 1 was insecure;
but now that the cat is out of the bag thanks to Amazon announcing the
new signature version, it's time to publish the details of how their
signature version 1 is broken.

AWS signature version 1 signs an HTTP query string as follows:

1. Split the query string based on '&' and '=' characters into a series
   of key-value pairs.
2. Sort the pairs based on the keys, case-insensitively.
3. Concatenate the keys and values together, in order, with no
   delimiters between them.
4. Compute the HMAC-SHA1 of the resulting string using the AWS secret
   key, and send the result as the "Signature" parameter.
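
As a rough illustration of the scheme, here is a minimal Python sketch of
the version 1 signing process; the function name sign_v1 is mine, and
details such as the exact output encoding are simplified:

    import base64
    import hashlib
    import hmac

    def sign_v1(query, secret_key):
        # query: the unsigned query string; secret_key: AWS secret key (bytes).
        pairs = [kv.split('=', 1) for kv in query.split('&')]
        # Sort the key-value pairs by key, ignoring case.
        pairs.sort(key=lambda kv: kv[0].lower())
        # Concatenate keys and values with no delimiters whatsoever.
        to_sign = ''.join(k + v for k, v in pairs)
        # HMAC-SHA1 the resulting string and base64-encode the result.
        digest = hmac.new(secret_key, to_sign.encode('utf-8'), hashlib.sha1).digest()
        return base64.b64encode(digest)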

When Amazon invented this signature scheme, they forgot about one of
the foremost design principles relating to cryptographic signatures:
Collisions are BAD! In a well-designed signature system, it should be
computationally infeasible to construct two different messages which
have the same signature; this prevents substitution attacks where an
attacker convinces the key holder to sign a "harmless" message, and
then attaches that signature to a different message.
Looking at how AWS signature version 1 is computed, it's easy to see
how to construct collisions: Because there are no delimiters between
the keys and values, the signature for "foo=bar" is identical to the
signature for "foob=ar"; moreover, the signature for
"foo=bar&fooble=baz" is the same as the signature for
"foo=barfooblebaz".

To see how this could be exploited, let's return to my earlier example
of a website which allows users to add tags to documents. Suppose
that each document is identified as a single item in SimpleDB, and that
each document has attributes associated with it including an "owner"
and one or more "tags" (SimpleDB allows multiple values to be associated
with each parameter name). To add the tag X to a document, the website
would normally issue the SimpleDB request

    Action=PutAttributes
    &Attribute.0.Name=tags
    &Attribute.0.Value=X
    &...

(whitespace added for clarity and to avoid page-width problems; the
"..." stands for the remaining request parameters)
which would be signed with the HMAC of
ActionPutAttributesAttribute.0.NametagsAttribute.0.ValueX....
Now consider what happens if someone asks the website to add the tag
"fooAttribute.1.NameownerAttribute .1.ReplacetrueAttribute.1.ValueDr.Evil"
to a document. The website issues the SimpleDB request

which is signed with the HMAC of
ActionPutAttributesAttribute .0.NametagsAttribute.0.ValuefooAttribute .1.NameownerAttribute.1.ReplacetrueAttribute .1.ValueDr.Evil...
-- which would also be the signature for the request

    Action=PutAttributes
    &Attribute.0.Name=tags
    &Attribute.0.Value=foo
    &Attribute.1.Name=owner
    &Attribute.1.Replace=true
    &Attribute.1.Value=Dr.Evil
    &...

Providing that this request is sent over HTTP,
Dr. Evil just has to capture this request (via network sniffing or
ARP / IP / DNS / BGP attacks) and he can attach the signature from the
real request to his fake request, whereupon he sets himself as the
"owner" of the document in question. If the request is sent over HTTPS,
in contrast, Dr. Evil won't be able to see the signature (unless he has
an SSL certificate for sdb.amazonaws.com, which is unlikely)
and so he won't be able to apply this attack.
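
For concreteness, here is the same check applied to the two PutAttributes
requests above, reusing the sign_v1 sketch and made-up key from earlier
(the parameters elided as "..." are omitted, which doesn't affect the
result since they would be identical in both requests):

    legit = ("Action=PutAttributes&Attribute.0.Name=tags&Attribute.0.Value="
             "fooAttribute.1.NameownerAttribute.1.ReplacetrueAttribute.1.ValueDr.Evil")
    forged = ("Action=PutAttributes&Attribute.0.Name=tags&Attribute.0.Value=foo"
              "&Attribute.1.Name=owner&Attribute.1.Replace=true&Attribute.1.Value=Dr.Evil")
    print(sign_v1(legit, key) == sign_v1(forged, key))    # True: same signature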

I reported this issue to Amazon via an email to Jeff Barr, the "Lead Web
Services Evangelist" at Amazon on May 1st, and while it took a long time
-- 7.5 months -- for it to be fixed, I'm happy to say that Amazon took
this issue seriously at all times, and the lengthy timeline was simply
because of the large amount of work involved. Jeff forwarded my email
to someone working on SimpleDB (I've been asked not to mention names),
who confirmed that they agreed that this was a problem. As part of
their review of my findings, Amazon's security people realized that this
also affected EC2 and SQS -- in my initial investigation I had only
looked at SimpleDB -- and at the beginning of July they agreed to send
me their planned signature version 2 so that I could review it.

Aside from some minor clarifications to the documentation, I saw no
problems with the new signature method, and at that point Amazon started
the lengthy process of implementation, testing, and rolling out the new
signature method. In September, they allowed me to perform some basic
interoperability tests between my code (written based on the documentation)
and their back-end code; this proved very useful, as it uncovered an
ambiguity in the documentation. Amazon then returned to their processes
-- including updating their many client libraries in order to make sure
that everybody would be able to switch to signature version 2 as soon as
it was announced. Now, in mid-December, they've finished updating their
servers, documentation, and libraries, and the new signature is finally
being announced.

I must congratulate Amazon on a highly professional response to this
issue. Companies very frequently have difficulty handling
externally-discovered security problems, both because of a temptation
to downplay the significance of the issues, and also because of a
desire to keep potentially sensitive information out of the hands of
anyone outside of the company -- Intel's response to the Hyperthreading
information leakage problem is a good example of both of these.
The fact that Amazon not only accepted that there was a problem, but was
willing to keep me informed throughout the process of fixing it
-- even going to the extent of allowing me to review their intended
solution, which is more than the FreeBSD Security Team usually does --
is quite exceptional.

People inevitably make mistakes from time to time. Security problems
happen. But when they do happen, Amazon's response is a good example
to follow.

How Tarsnap uses Amazon Web Services

Regular readers of these Daemonic Dispatches will no doubt have noticed
that I have
mentioned Amazon Web Services on many occasions,
and it's no secret that my
tarsnap online backup service is
built on top of Amazon Web Services. Over the month since tarsnap
reached
public beta, a number
of people have asked me questions about AWS and how tarsnap uses it, so
I think now is a good time to provide some insight into how the tarsnap
service works behind the scenes.

The tarsnap server provides a transactional key -> blob store to
tarsnap clients. The keys are a fixed 33 bytes (a one-character type
plus a 256-bit unique ID generated using SHA256), while blobs are an
average of about 30 kB but can be as large as 256 kB. In order to
create a new archive, the tarsnap client sends a "write transaction
start" request, many "write data" requests, and a "commit transaction"
request to the tarsnap server; deleting an archive is similar (except
with a "delete transaction start" and "delete data" requests).

The tarsnap server has no concept of separate tarsnap archives or of
eliminating data which is duplicated between archives; instead, it is
up to the tarsnap client to recognize duplicate data and avoid storing
it again, and to only delete data once no remaining archives require it
(this is important for both performance and security). As a result, it
is essential that the tarsnap server provide a storage system which is
both transactional and strongly consistent: Without this guarantee, it
would be possible for partial archives to be stored but "orphaned" in
the event that the tarsnap client crashed in the middle of writing an
archive (in which case an unlucky user would be stuck paying to
store inaccessible and useless bits), or -- even worse -- for the
tarsnap client to delete some bits which were still required by
an archive, rendering that archive unreadable.

Of course, providing strong consistency comes at a price: Tarsnap
sacrifices availability. If, at any point, the tarsnap server finds
that it is unable to service a request, it simply drops the connection
to the tarsnap client; the tarsnap client will then re-connect and
retry. In practice, tarsnap client re-connections far more often result
from the client-server connection falling victim to network outages or
packet loss than from the server dropping it; but the tarsnap client
handles a server failure and a
network outage identically (in fact, the client code isn't even aware
of the distinction).
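
A minimal sketch of that client-side behaviour (connect_to_server and the
exception handling are illustrative assumptions, not tarsnap's code):

    def perform(request):
        # The client cannot tell -- and does not care -- whether the server
        # dropped the connection deliberately or the network failed; either
        # way, it reconnects and retries until the request is acknowledged.
        while True:
            try:
                conn = connect_to_server()    # hypothetical helper
                return conn.send(request)
            except (ConnectionError, OSError):
                continue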

Once data reaches the tarsnap server, it is stored to
Amazon S3; and the tarsnap
server only acknowledges the client request once the S3 PUT has
successfully completed -- that is, once the data has been stored on
disks in multiple geographically diverse datacenters. However, S3 by
itself doesn't provide either the consistency guarantees or the
transactionality required by tarsnap. To provide these, the tarsnap
server implements a log-structured filesystem on S3, but holds all of
the relevant metadata on an
EC2 instance. Because log
entry numbers are strictly increasing, each object which the tarsnap
server stores on S3 only has one possible value (if it exists at all); and
because metadata is stored on EC2 (which makes it easy to provide
strong consistency guarantees), this makes it possible to recognize
if S3 provides "stale" data (since the only possible stale response
is a 404 error).
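
A highly simplified sketch of this idea (the s3_put/s3_get wrappers and
the naming scheme are assumptions for illustration, not tarsnap's actual
code):

    def append_log_entry(state, data):
        # Log entry numbers increase strictly, so each S3 object name is
        # used exactly once and its contents can never change.
        name = "log/%016d" % state.next_entry
        s3_put(name, data)                                    # returns once the PUT succeeds
        state.metadata[state.next_entry] = summarize(data)    # kept on the EC2 instance
        state.next_entry += 1

    def read_log_entry(state, entry_number):
        # We only request entries we know were written, so the only possible
        # "stale" answer from S3 is a 404; treat that as "not visible yet"
        # and retry instead of treating it as an error.
        while True:
            data = s3_get("log/%016d" % entry_number)    # None on 404
            if data is not None:
                return data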

Now, holding (meta)data on an EC2 instance means that we must accept
the possibility of the EC2 instance dying; but the use of a
log-structured filesystem makes this an easy problem to solve: All
of the metadata is implicit in the individual log entries, so to
regenerate the metadata one must merely read the log entries back
from S3. In fact, this provides a very good "safety net" to protect
against any unforeseen glitches in the tarsnap service: If all else
fails, I can do a complete reboot of the service by throwing away
everything except the data stored on S3 and then reconstructing all
of the transient state.
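
In the same spirit, a sketch of that "safety net": rebuilding the metadata
is just a replay of the log from S3 (in practice, detecting the true end
of the log under eventual consistency needs more care than this):

    def rebuild_metadata(state):
        # All metadata is implicit in the log entries, so a fresh EC2
        # instance can reconstruct it by reading the log back in order.
        entry_number = 0
        while True:
            data = s3_get("log/%016d" % entry_number)
            if data is None:      # no such entry: assume we've hit the end
                break
            state.metadata[entry_number] = summarize(data)
            entry_number += 1
        state.next_entry = entry_number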

Naturally, this all comes at a cost, and this is part of why tarsnap's
price for storage ($0.30 / GB / month) and bandwidth ($0.30 / GB) is
higher than S3's prices ($0.15 / GB / month for storage, and $0.10 --
$0.17 / GB for bandwidth). However, the price difference isn't as
large as it seems: In addition to the price of storage and bandwidth,
S3 charges a per-request fee of $0.00001 (for PUTs) or $0.000001 (for
GETs). While this seems small, it adds up: If the tarsnap client
wrote data directly to S3 (ignoring, for the purpose of argument, the
fact that S3 doesn't provide consistency and transactionality) the
added cost of S3 PUTs would make it more expensive than writing data
via the tarsnap server. Because the tarsnap server services requests
from many clients at once, it is able to "bundle" multiple writes
together, with the result that the PUT fees I end up paying
S3 are considerably lower on a per-GB basis.
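
To put rough numbers on that, using the ~30 kB average blob size mentioned
earlier (the exact figure obviously depends on the real mix of blob sizes):

    PUT_FEE = 0.00001            # S3 fee per PUT request, in dollars
    AVG_BLOB = 30 * 1000         # average blob size, ~30 kB
    GB = 10**9

    puts_per_gb = GB / AVG_BLOB                  # ~33,000 PUTs per GB uploaded
    put_cost_per_gb = puts_per_gb * PUT_FEE
    print("$%.2f" % put_cost_per_gb)             # ~$0.33/GB in PUT fees alone --
                                                 # already more than tarsnap's
                                                 # $0.30/GB bandwidth price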

That said, I do make a few cents of profit out of the $0.30; but
unlike Jungle
Disk, I don't charge an up-front fee for the tarsnap client code
(the Jungle Disk software costs $20) or a monthly fee for the service
(Jungle Disk Plus costs a flat $1/month beyond the S3 costs), and
given that I spent two years working full-time on writing all the code
for tarsnap, I don't think it's unreasonable for me to add a small
markup to the service in order to pay for my time. :-)

Speaking of money, I would be remiss if I failed to mention
SimpleDB. While I wrote
last year that one should not
try to use [SimpleDB] to
store any sort of accounting information, this is in fact exactly
what I'm doing: Tarsnap users' current and historical account balances,
along with how much storage and bandwidth they have used, are all stored
in SimpleDB. The fact that SimpleDB lacks any useful consistency
guarantee means that it is theoretically impossible to do this without
a risk that a user's usage -- or, worse, a payment -- will be
"lost" and not reflected in that user's current balance; but I've written
the code in such a way that the tarsnap accounting code will never lose
anything providing that SimpleDB always reaches consistency within 24
hours. Given that I've been told that SimpleDB usually reaches
consistency within a few seconds, the danger of losing some accounting
data is low enough that I'm not particularly concerned. I would never
be so cavalier about someone's data, but accounting data... well, it's
only money, after all.

There are still improvements which I'd like to see made to Amazon Web
Services: Removing "eventual consistency" or at least replacing it
with eventually known consistency (the distinction being that
with eventually known consistency you can query an API to ask if
updates have propagated yet) and adding support for running FreeBSD on EC2
(I've signed an NDA, so I can't say much here except that we're working
on it) are at the top of my list, closely followed by expanding the
Flexible Payments Service to
Canada. For all its quirks and limitations, though, Amazon Web Services
is a great platform which I'd recommend to anyone interested in building
an online service.