Brendan Hay

I wanted to share an overview of a new library named amazonka-s3-encryption,
which was created to supplement amazonka-s3 with client-side encryption.
Client-side encryption allows transmission and storage of sensitive
information (Data in Motion), whilst ensuring that Amazon never receives any of
your unencrypted data. Previously amazonka-s3 only supported server-side encryption
(Data at Rest), which requires transmission of unencrypted data to S3. The cryptographic
techniques used within the library are modeled as closely as possible upon the
official AWS SDKs, specifically the Java AWS SDK. Haddock documentation is available
here.

Chunked Encoding

The version 4 signing algorithm supports two modes for signing requests when communicating
with S3. The first requires a SHA256 hash of the payload to calculate
the request signature and the second allows incremental signature calculation for
fixed or variable chunks of the payload. Up until now, amazonka (and all other SDKs excepting Java)
only supported the first method.

This poses a problem for encryption, where the need to calculate the SHA256 hash
of the encrypted contents requires the use of a temporary file or another buffering
mechanism. For example, the aws-sdk-ruby library performs the following procedure
to send an encrypted PutObject request:

Copy and encrypt the payload to a temporary file.

Obtain the SHA256 hash and file size of the encrypted file.

Stream the file contents to the socket during transmission.

This means whatever the payload size is, you have to stream/encrypt a complete copy
of the payload contents to a temporary file before sending.

To avoid this same pitfall, amazonka-s3 now uses streaming signature calculation
when sending requests. This removes the need for the pre-calculated SHA256 hash
and allows the encryption and signing to be performed incrementally as the request
is sent.

Unfortunately, despite the documentation claiming that Transfer-Encoding: chunked
is supported, it appears that you need to estimate the encrypted Content-Length
(including metadata) and send the request without the Transfer-Encoding header;
otherwise the signature calculation simply fails with the usual obtuse S3 403 response.

The smart constructors emitted by the generation step for all amazonka-* operations
now take into account streaming signature support and you’re likely to encounter
the following parameters for operations utilising a streaming request body:

ToHashedBody and ToBody type classes are provided to make it easy to convert
values such as JSON, ByteString, etc into the appropriate request body. amazonka
itself exports functions such as hashedFile, chunkedFile and others to assist
in constructing streaming request bodies.

All regular S3 PutObject and UploadPart operations now take advantage of
streaming signature calculation, with the default chunk size set to 128 KB. This seems
to be a decent trade-off between streaming and the expense of incrementally performing
signature calculations, but I'd recommend profiling for your particular use case
if performance and allocations are a concern.
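The fixed-size chunking underlying this can be sketched in a few lines. This is a minimal, self-contained illustration of splitting a payload into 128 KB pieces, as the streaming signing does; chunksOf here is a hypothetical helper, not a function from amazonka:

```haskell
import qualified Data.ByteString as BS

-- The default chunk size mentioned above.
defaultChunkSize :: Int
defaultChunkSize = 128 * 1024

-- Split a payload into fixed-size chunks; the final chunk may be shorter.
chunksOf :: Int -> BS.ByteString -> [BS.ByteString]
chunksOf n bs
  | BS.null bs = []
  | otherwise  = let (x, rest) = BS.splitAt n bs
                 in x : chunksOf n rest
```

During transmission, each chunk would then be signed incrementally, with the previous chunk's signature feeding into the next signature calculation.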

The above information is available in a more context-sensitive format within the
documentation.

Encryption and Decryption

Client-side encryption of S3 objects is used to securely and safely store sensitive
data in S3. When using client-side encryption, the data is encrypted before it
is sent to S3, meaning Amazon does not receive your unencrypted object data. Unfortunately
the object metadata (headers) still leak, so any sensitive information should be
stored within the payload itself.

The procedure for encryption is as follows:

A one-time-use symmetric key a.k.a. a data encryption key (or data key) and
initialisation vector (IV) are generated locally. This data key and IV are used
to encrypt the data of a single S3 object using an AES256 cipher in CBC mode,
with PKCS7 padding. (For each object sent, a completely separate data key and IV are generated.)

The generated data encryption key used above is encrypted using a symmetric
AES256 cipher in ECB mode, asymmetric RSA, or KMS facilities, depending on the
client-side master key you provided.

The encrypted data is uploaded and the encrypted data key and material description
are attached as object metadata (either headers or a separate instruction file).
If KMS is used, the material description helps determine which client-side master
key is later used for decryption, otherwise the configured client-side key at
time of decryption is used.
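The envelope pattern in the steps above can be sketched with a toy model. To keep this self-contained, XOR stands in for both the AES-CBC payload cipher and the master-key wrapping, and every name here (Envelope, encryptObject, and so on) is invented for illustration; none of it is amazonka-s3-encryption's actual API:

```haskell
import qualified Data.ByteString as BS
import Data.Bits (xor)

-- Repeating-key XOR as a stand-in "cipher" (illustration only; the real
-- library uses AES256-CBC for the payload and AES/RSA/KMS for the data key).
xorCipher :: BS.ByteString -> BS.ByteString -> BS.ByteString
xorCipher key = BS.pack . zipWith xor (cycle (BS.unpack key)) . BS.unpack

-- The envelope stored alongside the object (as metadata or an instruction file).
data Envelope = Envelope
  { encryptedDataKey :: BS.ByteString  -- data key wrapped with the master key
  , ciphertext       :: BS.ByteString  -- payload encrypted with the data key
  }

-- Encrypt the payload with a one-time data key, then wrap that data key
-- with the master key and attach it to the envelope.
encryptObject :: BS.ByteString -> BS.ByteString -> BS.ByteString -> Envelope
encryptObject masterKey dataKey payload =
  Envelope (xorCipher masterKey dataKey) (xorCipher dataKey payload)

-- Unwrap the data key with the master key, then decrypt the payload with it.
decryptObject :: BS.ByteString -> Envelope -> BS.ByteString
decryptObject masterKey (Envelope ek ct) =
  let dataKey = xorCipher masterKey ek
  in xorCipher dataKey ct
```

The important structural point is that the master key never touches the payload directly: it only wraps and unwraps the per-object data key.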

For decryption:

The encrypted object is downloaded from Amazon S3 along with any metadata.
If KMS was used to encrypt the data then the master key id is taken from the
metadata material description, otherwise the client-side master key in the
current environment is used to decrypt the data key, which in turn is used
to decrypt the object data.

If you’re unsure about which key mechanism to use, I’d recommend using KMS initially
to avoid having to store and manage your own master keys.

Instruction Files

By default, the metadata (known as an envelope) required for encryption
(except for the master key itself) is stored as S3 object metadata on the encrypted
object. Due to user-defined S3 metadata
being limited to 8KB when sending a PUT request, if you are utilising object
metadata for another purpose which exceeds this limit, an alternative method
of storing the encryption envelope in an adjacent S3 object is provided. This
method removes the metadata overhead at the expense of an additional HTTP request
to perform encryption/decryption. By default the library will store and retrieve
a <your-object-key>.instruction object if the related *Instruction suffixed
functions are used.

Compatibility and Status

Metadata and instruction envelopes are designed to be compatible with the
official Java AWS SDK (both V1 and V2 formats), but only a limited set of the possible
encryption options are supported. Therefore assuming defaults, objects stored
with this library should be retrievable by any of the other official SDKs, and
vice versa. Support for other cryptographic configurations will be added in future,
as needed.

amazonka-s3-encryption can currently be considered an initial preview release.
Despite this, it’s tied to the greater release process for the other amazonka-*
libraries and therefore life will start somewhere after version 1.3.1.
It is separated from amazonka-s3 proper because it pulls in extra dependencies
not desirable within the main S3 package, such as amazonka-kms and
conduit-combinators. This way those using unencrypted S3 operations do not
inadvertently end up with an amazonka-kms dependency.

The library is currently being used in a limited capacity and the release to
Hackage will be delayed until I’m confident of correctness, robustness and
compatibility aspects. If you’re brave enough to experiment, it’s contained within
the greater amazonka project on GitHub.
Please open an issue with any problems/suggestions
or drop into the Amazonka Gitter chat if you have questions.

After 4 months, nearly 900 commits and an inordinate number of ExitFailure (-9)
build errors, version 1.0 of the Haskell Amazonka
AWS SDK has been released.

Some of the features include significant changes to the underlying
generation mechanisms, along with changes to the external surface APIs which are
outlined below.

Looking back at the initial commits for Amazonka shows that it's taken 2 years
and nearly 3,300 commits to reach this milestone. The entire suite now consists of
55 libraries over 200K LOC and is in use by a diverse set of individuals and
companies.

I’d like to thank everybody who contributed to the release. If you have feedback
or encounter any problems, please open a GitHub issue,
reach out via the maintainer email located in the cabal files, or join the freshly
minted Gitter chat.

A whirlwind summary of some of the changes you can find in 1.0 follows, in
no particular order.

Errors

Previously the individual services either had a service-specific error type such as EC2Error,
a generated type, or shared one of the RESTError or XMLError types.

In place of these, there is now a single unified Error type containing HTTP,
serialisation and service specific errors.

In addition to this change to the underlying errors, changes have also been made
to the exposed interfaces in amazonka, which commonly had signatures such as
Either Error (Rs a) and in turn the AWST transformer lifted this result into
an internal ExceptT.

Since the previous approach was not amenable to composition, due to the concrete
Error type, functional dependencies, and the MonadError/MonadReader instances, the
library still passes around Either Error a internally, but externally it
exposes a MonadThrow constraint. I recommend using Control.Exception.Lens
and the various Prisms available from AsError
to catch/handle specific errors.

Which can be used in the same fashion as the previous example. Check out the individual
library’s main service interface Network.AWS.<ServiceName> to see what error
matchers are available.
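The underlying throw/catch pattern can be shown with base alone. This is a sketch using Control.Exception directly; DemoServiceError and callService are stand-ins invented here, not amazonka's Error type or its prism-based matchers:

```haskell
import Control.Exception (Exception, catch, throwIO)

-- A stand-in for a service-specific error (not amazonka's actual Error).
newtype DemoServiceError = DemoServiceError String
  deriving Show

instance Exception DemoServiceError

-- A pretend service call that always fails.
callService :: IO ()
callService = throwIO (DemoServiceError "NoSuchBucket")

-- Catch the specific exception type and recover.
handled :: IO String
handled =
  (callService >> pure "ok")
    `catch` \(DemoServiceError code) -> pure ("caught: " ++ code)
```

With AsError, the same shape applies, except the matcher is a Prism selecting a specific service error rather than a concrete exception type.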

Free Monad

The core logic of sending requests, retrieving EC2 metadata and presigning are
now provided by interpretations for a free monad. This works by the regular functions
exposed from Network.AWS and Control.Monad.Trans.AWS constructing layers of
a FreeT Command AST which will be interpreted by using runAWS or runAWST.

This allows for mocking AWS logic in your program by replacing any runAWS or
runAWST call with a custom interpretation of the FreeT Command AST.
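A hand-rolled miniature makes the idea concrete. The types below are a simplified model of a free monad over a command functor, written from scratch so the example is self-contained; Command, send, and runMock are invented here and are not amazonka's actual Command AST or interpreters:

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- A minimal free monad: programs are trees of 'f'-shaped instructions.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- One constructor per effect; 'Send' models issuing a request and
-- receiving a response.
data Command next = Send String (String -> next)
  deriving Functor

send :: String -> Free Command String
send rq = Free (Send rq Pure)

-- A pure interpreter suitable for tests: answer every request from a stub
-- function instead of the network.
runMock :: (String -> String) -> Free Command a -> a
runMock _    (Pure a)           = a
runMock stub (Free (Send rq k)) = runMock stub (k (stub rq))
```

Swapping runMock for a network-backed interpreter changes how the program runs without changing the program itself, which is exactly what replacing runAWS/runAWST with a custom interpretation buys you.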

Network.AWS vs Control.Monad.Trans.AWS

Due to the previously mentioned changes to Error and ExceptT usage, the surface
API for the main modules offered by the amazonka library has changed somewhat.

Firstly, you’ll now need to manually call runResourceT to unwrap any ResourceT
actions, whereas previously it was internalised into the AWST stack.

Secondly, errors now need to be explicitly caught and handled via the aforementioned
error/exception mechanisms.

The primary draw of Network.AWS is that, since AWS is
simply AWST specialised to IO, a MonadAWS type class can be provided to automatically
lift the functions from Network.AWS without having to write lift . lift ...
through an encompassing application monad stack.

That said, Network.AWS is simply built upon Control.Monad.Trans.AWS, which in
turn is built upon Network.AWS.Free. All of these modules are exposed, and most
of the functions compose with respect to MonadFree Command m constraints.

Authentication

The mechanisms for supplying AuthN/AuthZ information have received minor changes to
make the library consistent with the official AWS SDKs.

For example, when retrieving credentials from the environment the following
variables are used:

Multiple [profile] sections can co-exist and the selected profile is determined
by arguments to getAuth, with [default] being used for Discover.

You can read more information about the standard AWS credential mechanisms on
the AWS security blog.

Configuring Requests

Service
configuration such as endpoints or timeouts can be overridden per request via the
*With-suffixed functions.
For example, changing the timeout to 10 seconds for a particular request:

sendWith (svcTimeout ?~ 10) (getObject "bucket-name" "object-key")

In fact, since modifying timeouts and retry logic is so common, functions are provided
to do this for one or more actions in the form of:

once :: m a -> m a

timeout :: Seconds -> m a -> m a

within :: Region -> m a -> m a
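Combinators like these can be thought of as locally overriding fields of a reader environment for the duration of an action. Here is a base-and-transformers-only sketch of that idea; the Env and Region types are toys invented for the example, not amazonka's actual environment:

```haskell
import Control.Monad.Trans.Reader (ReaderT, asks, local, runReaderT)

data Region = Ireland | Frankfurt deriving (Eq, Show)

-- A toy environment; the real library's Env carries far more.
data Env = Env { envRegion :: Region, envTimeout :: Int }

-- Run an action with the region overridden, restoring it afterwards.
within :: Monad m => Region -> ReaderT Env m a -> ReaderT Env m a
within r = local (\e -> e { envRegion = r })

-- Run an action with a different timeout (in seconds).
timeout :: Monad m => Int -> ReaderT Env m a -> ReaderT Env m a
timeout s = local (\e -> e { envTimeout = s })

currentRegion :: Monad m => ReaderT Env m Region
currentRegion = asks envRegion
```

Because local restores the environment when the inner action finishes, overrides scope to exactly the actions they wrap.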

Field Naming

The way lens prefixes are generated has been completely re-implemented. This is for a number
of reasons, such as stability of ordering, stability of a historically selected
prefix with regard to newly introduced fields, and a desire to reduce the number of suffixed
ordinals needed to disambiguate fields.

Additionally, casing mechanisms now universally normalise an acronym such as Vpc
into the form VPC. This is pervasive and consistent throughout the naming of operations,
types, module namespaces, etc.

Both of these are breaking changes, but are considerably more future proof than
the previous implementation.

Generator

The previous generator predominantly used textual template rendering to emit
Haskell declarations and a fair amount of logic was tied up in templating code.
The new(er) generator now constructs a Haskell AST and then pretty prints code
declarations. Actual layout, spacing and comments are still done by templates.

This results in less code, including less templating logic, and defers formatting
to tools like hindent and stylish-haskell.

As an artifact of these changes, it is now considerably slower. :)

Additional Services

Since the initial public release of Amazonka, an additional 12 libraries have
been added to the suite, consisting of:

Query protocol services that submit POST requests now serialise the entirety of their
contents as application/x-www-form-urlencoded to avoid URL length issues.

Placeholder fixtures and tests are now generated for every request and response.

Per project examples have been removed in favour of a single amazonka-examples project.

All modules are now exported from amazonka-core but the interface is only considered
stable with regard to other amazonka-* libraries. Any use of amazonka-core should be treated
as if every module was .Internal.

Supported GHC Versions

The currently supported GHC versions are 7.8.4 and 7.10, built against
stackage lts-2.* and nightly-* respectively. The libraries will probably
work on 7.6.3 as well, but active testing is not done for reasons of personal scale.

Cabal vs Stack

In place of cabal sandbox, stack is now used for all development due to the
multi-lib nature of the project. This has been a huge improvement to my
development workflow, but because of this testing with cabal-install has become
somewhat limited. For now, if you’re trying to build the project from git, I suggest
sticking to stack and using the supplied stack-*.yml configurations.

In my day job as a glorified System Administrator I have the opportunity to write infrastructure, services, and tooling in Haskell, where traditionally someone in my position might reach for the hammers labeled Perl, Python, or Ruby et al.

While the advantages are many and those can be left to another blog post - a recurring pain point where Haskell falls down is in what I would categorise as mundane and commercial library availability:

Mundane: offers little intellectual reward to the library author. For myself this is anything that includes vast swathes of (mostly) repetitious serialisation code that cannot be nicely abstracted using something like GHC.Generics.

Commercial: Company X offers compelling service Y that you wish to utilise, of which there are officially supported client libraries in Java, .NET, Python, and Ruby.

Haskell offers plenty of mechanisms for limiting boilerplate, and these generally work well in the face of uniformity (see: pagerduty), but faced with supporting an inconsistent API of sufficient scope, I hereby postulate that both of the above categories will be satisfied and many shall wring their hands and despair.

Status Quo

As a concrete example, in early 2013 we decided to exclusively use Amazon Web Services for our entire infrastructure. Coupled with the fact that all of our backend/infrastructure related code is written in Haskell, the lack of comprehensive and consistent AWS libraries proved to be a problem.

Looking at the AWS category on Hackage, the collectively supported services are:

Cloud Watch

Elastic Compute Cloud

Elastic Load Balancing

Elastic Transcoder

Identity and Access Management

Kinesis

Relational Database Service

Route53

Simple Database Service

Simple Email Service

Simple Notification Service

Simple Storage Service

In some of these implementations the supported feature set is incomplete and approximately 30 services from Amazon’s total offering are not available at all.

This results in a subpar experience relative to Python, Ruby, Java, or .NET, for which there are official SDKs.

A Comprehensive Haskell AWS Client

After coming to the realisation in late 2012/early 2013 that there were no Haskell libraries supporting the services we wished to use, I went down the route of providing a stopgap solution so we could begin building our infrastructure without having to compromise our language choice. This yielded a code generation Frankenstein which crawled the AWS documentation HTML, available SOAP definitions, and XSDs to provide AutoScaling, EC2, IAM, S3, CloudWatch, Route53, and ELB bindings.

While this was immediately useful, the obvious inconsistencies arising from HTML brittleness, along with public XSDs in particular being an apparently legacy artifact for most services, meant inertia set in and I was unable to continue utilising the above approach for expanding the library offerings.

Going back to the drawing board in mid 2013, I started working on implementing a more future proof and sustainable approach to providing a truly comprehensive AWS SDK I could use for all my projects, both personal and professional.

The key enabler for this next approach was the discovery of the Amazon Service models, which are typically vendored with each of the official SDKs and provide a reasonably well typed representation of each of the services, warts and all.

Aside: the format of the service definitions has changed a couple of times and I’ve been forced to rewrite pieces of the generation code more than once due to oversight.

The end result is called amazonka, consisting of 43 different libraries covering all currently available non-preview AWS services.

In the following topics I’ll briefly highlight some of the features and potentially contentious design decisions, and the reasoning behind them.

Note: This is a preview release designed to gather feedback, and I’ve not used all of the services (for example Kinesis, or SNS) personally, which will no doubt result in issues regarding the de/serialisation of requests, responses, errors, and possibly tears.

I’m relying on the brave to offer up constructive feedback via GitHub Issues since the scope is too much for me to test in practice, alone.

Lipstick on a Pig

Since the definitions appear to be generated from Java-style services, the corresponding AST and type information follows similar Object Oriented naming conventions and class level nesting.

This isn’t particularly nice to work with in a language like Haskell, as it results in a lot of extraneous types. Libraries in various other languages provide the proverbial lipstick on a pig and alter the types in such a way as to make them more consistent with the host language’s semantics.

Despite these points, I feel the advantages of providing types which strictly implement the naming and structure of the AWS types makes it easier to follow along with the Amazon API reference, and the use of lenses in this case mitigates some of the annoyances relating to access and traversal.

The intent is to provide a more low-level interface which corresponds 1:1 with the actual API, and let people supply their own lipstick.

Lenses and Roles

Amazon utilises a number of different de/serialisation mechanisms ranging from the venerable XML and JSON, to more esoteric querystring serialisation of datatypes, and I inevitably ran up against the prototypical newtype explosion when avoiding orphan instances due to the heavy usage of type classes.

The solution for this was divorcing the internal structure from the representation observed and manipulated by the user. This approach allows extensive use of newtype wrappers internally, to define non-orphaned instances for types such as NonEmpty, Natural, HashMap, or Bool, but exposes the underlying type to the user, and the wrapper is never needed outside the core library.

Isos are paired with lenses to hide the (un)wrapping of newtypes from the user.

Roles are used to avoid the need to traverse structures such as NonEmpty or HashMap when converting between the internal and external representations.

Here is the List and Map newtype wrappers from amazonka-core:

-- | List is used to define specialised JSON, XML, and Query instances for
-- serialisation and deserialisation.
--
-- The e :: Symbol over which List is parameterised is used as the
-- enclosing element name when serialising XML or Query instances.
newtype List (e :: Symbol) a = List { list :: [a] }
    deriving (Eq, Ord, Show, Semigroup, Monoid)

-- Requires the RoleAnnotations GHC extension.
type role List phantom representational

_List :: (Coercible a b, Coercible b a) => Iso' (List e a) [b]
_List = iso (coerce . list) (List . coerce)

-- | Map is used similarly to define specialised de/serialisation instances
-- and to allow coercion of the values of the HashMap, but not the Key.
newtype Map k v = Map { fromMap :: HashMap k v }
    deriving (Eq, Show, Monoid, Semigroup)

type role Map nominal representational

_Map :: (Coercible a b, Coercible b a) => Iso' (Map k a) (HashMap k b)
_Map = iso (coerce . fromMap) (Map . coerce)

This hopefully illustrates the usefulness of the approach to converting between the two representations. A lens such as srItems can be used to manipulate a field with the more friendly [HashMap Text AttributeValue] representation, and you retain all of the benefits of wrapping newtypes at arbitrary depths internally.
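The coercion trick can also be shown in isolation, with base alone. The names below (List, Nat, toInts) are invented for the example; the point is that a representational role lets coerce convert a whole wrapped structure in O(1), without traversing it:

```haskell
import Data.Coerce (coerce)

-- A wrapper over lists; 'a' has a representational role by default,
-- so the element type may itself be coerced.
newtype List a = List { list :: [a] }

-- A stand-in for an internal wrapper such as Natural.
newtype Nat = Nat Int

-- Unwrap both the List and every element's newtype in a single coercion;
-- no map or traversal is performed at runtime.
toInts :: List Nat -> [Int]
toInts = coerce
```

If the type parameter's role were nominal instead, as with the keys of Map above, GHC would reject this coercion, which is exactly the safety property the role annotations buy.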

The following links provide detailed explanations of Roles and their implementation:

Smart Constructors

Providing the minimum number of parameters to satisfy construction of a valid request is desirable for succinctness, as opposed to comprehensively specifying every field of the underlying record.

This simply involves defaulting any Maybe a or Monoid field types to their respective Nothing or mempty, and supplying a smart constructor which delineates only the required parameters.

For example the operation CreateAutoScalingGroup contains 15 fields, most of which are optional, and can be constructed with the fewest parameters required to create a valid Auto Scaling Group, or modified using lenses to specify any additional values for the optional fields before sending.
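The defaulting pattern can be sketched on a toy record. The field names below are illustrative, not the real CreateAutoScalingGroup shape: required fields become parameters of the smart constructor, while every Maybe or Monoid field defaults:

```haskell
-- A toy request record (illustrative field names only).
data CreateGroup = CreateGroup
  { cgName     :: String              -- required
  , cgMinSize  :: Int                 -- required
  , cgMaxSize  :: Int                 -- required
  , cgVPCZones :: Maybe String        -- optional, defaults to Nothing
  , cgTags     :: [(String, String)]  -- optional, defaults to mempty
  } deriving (Eq, Show)

-- The smart constructor: only the required parameters are delineated.
createGroup :: String -> Int -> Int -> CreateGroup
createGroup name minSize maxSize = CreateGroup
  { cgName     = name
  , cgMinSize  = minSize
  , cgMaxSize  = maxSize
  , cgVPCZones = Nothing
  , cgTags     = mempty
  }
```

Optional fields are then filled in afterwards, here with record update syntax, and in the real library with lenses: (createGroup "web" 1 4) { cgVPCZones = Just "subnet-1" }.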

-- | Ciphertext that contains the wrapped key. You must store the blob
-- and encryption context so that the ciphertext can be decrypted.
-- You must provide both the ciphertext blob and the encryption context.
gdkrCiphertextBlob :: Lens' GenerateDataKeyResponse (Maybe Base64)

Currently links and other markup are stripped, but in future I hope to convert it directly to Haddock and retain all of the supplied documentation in a fashion similar to the official SDKs.

One Library per Service

To illustrate the large nature of the codebase, everybody’s favourite productivity measurer cloc shows:

Language    files    blank    comment      code
Haskell      1258    34462      78158    145314

Since you generally do not depend on every service simultaneously, forcing users to compile 140,000+ lines of code they are probably not interested in is pointless.

Despite the maintenance overheads, cabal versioning, and potential discovery problems, encapsulating the code along service boundaries results in a much better user experience.

Conclusion

While generating code may not yield the same user friendliness as hand written code in every case, it seems to scale very well for this particular class of problem.

During the recent 2014 AWS re:Invent, over 8 new services were announced, with Key Management Service, Lambda, Config, and CodeDeploy being available effective immediately. I was able to support these services not long after announcement by running amazonka-gen:

make clean
make

Which was a nice validation of the approach.

Overall I’m happy with the current status and direction, despite there still being a large amount of work ahead to place Haskell on an equal footing with other languages with regard to building Cloud services and infrastructure.

Some items that I’ve identified for the immediate roadmap are:

Some responses lack required field information, resulting in Maybe a for always-present fields. Overrides need to be manually annotated.

Comprehensive testing and usage of all services.

Improved documentation parsing (retaining links as Haddock markup).

Additional hand written documentation about usage.

Implement waiters and retries according to the service specifications.