Use-Case

In my previous article, I was using DynamoDB as my Parameter Store and KMS to encrypt all the information I did not want to be accessible in plain text. The only issue with that approach is that a single mistake in the Lambda code could break the parameter store. In the code available on GitHub for that previous version of the parameter encryption, there wasn't much capture of return values to diagnose what went wrong, nor much protection of the existing values.

With AWS SSM, that problem is sorted out. AWS SSM manages all of those parameters nicely for me, probably in much the same way I did with DynamoDB and Lambda, but now it is their job to maintain it, and they provide a very nice API for it.

Given that SSM does that for me, let's integrate it into my CloudFormation templates!

The problem

Some might call me paranoid, but I believe that a system that involves humans in getting access to secrets isn't a good thing. Of course, somebody smart with elevated access can work around security and get to the information; however, that's not a reason not to try.

So, as I read the CloudFormation documentation for SSM::Parameter, I found out (as of this writing) that SecureString isn't a supported parameter type. Which means: no KMS encryption of my parameter's value.
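Worth noting: the limitation is only on the CloudFormation side; the SSM API itself happily accepts SecureStrings. A minimal boto3 sketch (the parameter name is illustrative; the kwargs builder is split out so it can be tested without AWS access):

```python
def build_put_parameter_args(name, value, kms_key_id=None):
    """Build the kwargs for ssm.put_parameter() as a SecureString."""
    args = {
        "Name": name,
        "Value": value,
        "Type": "SecureString",
        "Overwrite": False,  # protect existing values from accidental rewrites
    }
    if kms_key_id is not None:
        args["KeyId"] = kms_key_id  # defaults to the account's aws/ssm key if omitted
    return args

def store_secure_parameter(name, value, kms_key_id=None):
    import boto3  # imported lazily so the helper above stays testable offline
    ssm = boto3.client("ssm")
    return ssm.put_parameter(**build_put_parameter_args(name, value, kms_key_id))
```

This is exactly the call a custom-resource Lambda can make on CloudFormation's behalf.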

Lambda my good friend

As usual, a limitation of AWS CloudFormation is a call for a little bit of DIY with AWS Lambda.

Now let's think again about the use-case. Most of the time, what I am going to have to generate is a username and password for a service, typically a database; one rarely goes without the other.
So the code I have written can be fully customized for a more specific use-case, but it is deliberately simple.
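As a sketch of that generation step (the prefix and character policy here are my illustration, using Python 3's secrets module, not the article's exact rules):

```python
import secrets
import string

def generate_credentials(user_prefix="dbuser", pw_length=16):
    """Generate a random username / password pair (illustrative policy)."""
    # A short random hex suffix keeps usernames unique across stacks.
    username = "%s%s" % (user_prefix, secrets.token_hex(4))
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(pw_length))
    return username, password
```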

What you need to prepare to make it work

A lot of things can be automated, so I have also created here the CloudFormation template for the Lambda function role that you will need for your function to work (thanks to all previous readers of the different articles for those suggestions towards improving the blog content).

Also, you will need to allow that role to use the KMS key (IAM -> Encryption Keys -> select the right region -> select the key -> add the role to the key users).

The policy statement to add, if you edit the policy directly, should look like:

{
  "Sid": "Allow use of the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": ["arn:aws:iam::234354856264:role/lambdaSSMregister"]
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}

Where to use it?

This Lambda function returns two values, which are stored into SSM as SecureStrings, and only those with access to those strings can Get the value.

Here, instead of writing two different Lambda functions as in the past (one to generate the user and password, another to retrieve them), I decided that a successful return from the SSM call to add the value was enough to be sure the parameter had been successfully created and stored.

So, the Lambda function returns those values in cleartext to the CloudFormation stack so they can be injected directly where they are needed (most likely, your RDS DB).
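For reference, the shape of the payload a custom resource sends back to CloudFormation (so the stack can read the returned values) is roughly the following; this helper is a hypothetical sketch, not the article's exact code:

```python
import json

def build_cfn_response(event, status, data, physical_id=None):
    """Build the JSON body a CloudFormation custom resource must PUT
    back to the pre-signed event['ResponseURL'] (sketch)."""
    return json.dumps({
        "Status": status,                       # "SUCCESS" or "FAILED"
        "Reason": "See CloudWatch Logs",
        "PhysicalResourceId": physical_id or event["LogicalResourceId"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        # Everything under Data becomes readable via Fn::GetAtt in the stack.
        "Data": data,                           # e.g. {"Username": ..., "Password": ...}
    })
```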

Now, those values stored in your Parameter Store can also simply be retrieved by automation scripts. For example, within the CF stack where you generated the user / password, you also defined the names of those parameters. Given that you know the parameter names, if you create an instance or an AutoScaling group, you can assign an IAM profile to either and add an inline policy that grants GetParameter on those very specific parameters in the store.
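A retrieval sketch for such a script (parameter name illustrative; the kwargs builder is split out for offline testing):

```python
def build_get_parameter_args(name):
    """Kwargs for ssm.get_parameter(); WithDecryption asks SSM to
    decrypt the SecureString via KMS on our behalf."""
    return {"Name": name, "WithDecryption": True}

def fetch_parameter(name):
    import boto3  # lazy import keeps the helper above testable offline
    ssm = boto3.client("ssm")
    return ssm.get_parameter(**build_get_parameter_args(name))["Parameter"]["Value"]
```

The IAM inline policy mentioned above only needs to allow ssm:GetParameter on the specific parameter ARNs, plus kms:Decrypt on the key.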

Small demo

This isn't exactly a demo, just a small walkthrough of what your stack creation process and output would look like.

The use-case

A group of friends asked me for help as they were in need of a SysAdmin for their startup. Their web application, mostly API based, integrates a video editing tool which allows users to upload videos, sounds, images, etc., which they can then cut, edit and render. The major requirement for efficient production was to leverage a GPU. Historically they were using OVH as their provider, but OVH's offering for GPU servers starts at 2,000 euros per month. Not really in their budget.

So, of course, I told them to go for AWS instead, where they could have GPU instances and pay only for when they need them. After a few weeks of work, they had the necessary automation in place to have "workers" running on GPU instances get created when the SQS queue was growing. With CloudWatch and scaling policies, AWS was starting on-demand GPU instances. From 600 USD per month for a g2.2xlarge down to just a few USD a day, the savings were already significant. But as I was working on that with them, I wanted to go even further and use Spot Instances. For GPU instances, that is a potential 75% saving on the compute-hour. For production as well as development, it is a significant saving.

The problem

With the CloudFormation templates all ready, we simply duplicated the AutoScaling group and Launch Configuration for the GPU instances. We now had two ASGs with the same set of metrics and alarms. But how could we distinguish which ASG should scale up first when the queue grows in messages and we need a GPU to compute results faster? I could not find an easy answer with any integrated dispatch within the AWS AutoScaling or CloudWatch services.

Possible solutions

Online we found SpotInst, a company that manages AutoScaling groups and, whenever a scaling operation is necessary, decides for you "Spot or On-Demand?" (at least, that's what I understand from the documentation). Of course, SpotInst proposes a lot more services integrated with that, but I personally found it a bit of an overkill for our use-case.

That is where the integration of CloudWatch and SNS, paired with Lambda as a consumer of our SNS topic, comes in and does it all for us, with what I have called the "SpotManager".

The idea

As you probably already guessed, the SpotManager is my Lambda function which will decide which of the AutoScaling groups should trigger a scale-up activity. Here is an overview of the workflow:

How to?

For this solution, we will need:

Our 2 Autoscaling Groups

Identical Launch Configuration apart from the SpotPrice

Scale up policy configured with no alarm

Scale down policy configured with CW Alarm

SNS Topic to get messages / notifications from CloudWatch

CloudWatch alarms on the Queue

Alarm to raise "Jobs to do" signal

Alarm to raise "No jobs anymore" signal

"SpotManager" Lambda Function

In terms of pricing, the EC2 side of things is purely compute-hour math, so refer to the EC2 Spot pricing and EC2 On-Demand pricing.
Good news: SNS delivery to Lambda costs $0 as referenced here, and for CloudWatch we count about $0.40 per month. The Lambda pricing depends on how long the function runs. Here, it might take up to 1 second per invoke, so per month you probably won't go over the free tier.

Total cost: less than $1 per month per pair of ASGs (compute pricing excluded).

1 - The AutoScalingGroups

A. The Launch configurations

To simplify all the steps, I have published here a CloudFormation template that will create two AutoScaling groups, as explained earlier, with identical Launch Configurations except that one has the SpotPrice property set.

B. Scale up policy

Here, also in the CloudFormation template provided, we create a scale-up policy: when triggered, it will add 1 instance to the AutoScaling group by raising the value of the "Desired Capacity". With the on-demand ASG, nothing fancy happens when you trigger it: the EC2 service kicks off a new instance according to the ASG properties and the Launch Configuration. Now, if you do the same for testing with the ASG configured for Spot Instances, you will notice, in the "Spot Requests" section of the dashboard, that the request is first being evaluated: a Spot request is sent with the maximum bid you are willing to pay for that instance type. If the current Spot market allows it, the request is granted and an instance is created in your account.
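What the SpotManager ends up doing once it has picked a group can be sketched like this with boto3 (the ASG and policy names are whatever your stack created; the kwargs builder is split out for offline testing):

```python
def build_execute_policy_args(asg_name, policy_name, honor_cooldown=True):
    """Kwargs for autoscaling.execute_policy(): fire one ASG's scale-up policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": policy_name,
        "HonorCooldown": honor_cooldown,  # avoid thrashing between evaluations
    }

def scale_up(asg_name, policy_name):
    import boto3  # lazy, so the builder above stays testable offline
    asg = boto3.client("autoscaling")
    asg.execute_policy(**build_execute_policy_args(asg_name, policy_name))
```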

C. Scale down policy

Just as we need machines when jobs come in, we can also tell when we don't need any compute resources anymore. Depending on how you analyze the queue length, you should be able to determine pretty easily when there are no more messages for your workers to consume. Therefore, I have linked the scale-down policy to the SQS alarm. The good thing about an alarm is that the same alarm can trigger multiple actions. So here, as we have two ASGs we want to treat the same way regarding scale-down, we instruct the alarm to trigger both ASGs' scale-down policies.
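That "one alarm, two actions" wiring can be sketched as follows (the thresholds, periods, and names are illustrative, not the article's exact values):

```python
def build_scale_down_alarm(policy_arns, queue_name):
    """Kwargs for cloudwatch.put_metric_alarm(): one alarm whose actions
    trigger both ASGs' scale-down policies."""
    return {
        "AlarmName": "%s-no-more-jobs" % queue_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 300,            # SQS metrics only refresh every 5 minutes
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": list(policy_arns),  # both scale-down policy ARNs
    }

def create_scale_down_alarm(policy_arns, queue_name):
    import boto3  # lazy, so the builder above stays testable offline
    cw = boto3.client("cloudwatch")
    cw.put_metric_alarm(**build_scale_down_alarm(policy_arns, queue_name))
```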

Note

The alarm can trigger multiple actions, but remember that you need to configure a scaling policy on each individual ASG for it to work.

Note

For the rest of the blog, I am going to work with two ASGs. If you haven't already, create those with a minimum of 0, a maximum of 1 and a desired capacity of 0. No need to pay before we get to the real thing.

2 - The SNS Topic

SNS is probably one of the oldest services in AWS and doesn't stop growing in features. As a key component of the AWS ecosystem, it is extremely easy to integrate other services as consumers of the different topics we can have. Here we go in our AWS Console:

Via the CLI?

aws sns create-topic --name mediaworkerqueuealarmsstatus

That was easy, right? Let's continue.

3 - Cloudwatch and Alarms

For our demo, I have already created a queue called "demoqueue". From here, we have different metrics to work with. Here, I am going to use ApproximateNumberOfMessagesVisible. This number will stack up as long as messages reside in the queue without being consumed.

Warning

Remember that the metrics for the SQS service are updated only every 5 minutes. If for any reason you have to get the jobs started faster than CloudWatch can notify you, you will have to find a different way to trigger that alarm.

3A - Alarm "There are jobs to process!"

The recently released CloudWatch dashboard makes it even easier to browse the metrics and create a new alarm.

Identify the metrics

On the CloudWatch dashboard, click on Alarms. There, click on the top button "Create alarm". The different metrics available appear by category. Here, we want to configure the SQS metrics.

Configure the threshold

Check the alarm summary

3B - Alarm "Chill, no more jobs"

For that alarm, we are going to follow the same steps as for the previous alarm, but, we are going to use a different metric and configure a different action. Both our ASG have a scale-down policy. So, let's create that alarm.

Identify the metric

Configure the threshold

Set the alarm actions

4 - The SpotManager function

As explained earlier, I create about everything via CloudFormation, which allows me to leverage tags to identify my resources quickly and easily. That said, the function I share with you today is made to work in any region; the only thing you might have to adapt to suit your use-case is how to identify the ASG.

The code

As usual, the code for the Lambda function can be found here, on my GitHub account. Be aware that this function is zipped with several files, because I separated each "core" function to be reusable in different cases.

However, I have tried to get the best rating from pylint (^^) and to document each function's params/returns, each of those named with, I hope, self-explanatory names.

Warning

The code shared here is really specific to working with CloudFormation templates and my use-case. I use SQS, where you might simply use EC2 metrics, or any other kind of metric. Adapt the code to figure out the action to trigger.

spotworth.py

This is the Python file that, for each AZ where you have a subnet, is going to retrieve the average spot price over the past hour for the instance type you want to run.

Warning

You could have 3 subnets spread over only 2 AZs within a five-AZ region, meaning you can actually run instances within those 2 AZs only; hence why the script takes a VPC ID as parameter.
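The logic described above could be sketched like this (a simplified take on what spotworth.py does, not its exact code; the averaging helper is pure so it can be tested offline):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def average_spot_price_per_az(history):
    """Average the spot price per AZ from describe_spot_price_history records."""
    buckets = defaultdict(list)
    for record in history:
        buckets[record["AvailabilityZone"]].append(float(record["SpotPrice"]))
    return {az: sum(prices) / len(prices) for az, prices in buckets.items()}

def spot_prices_for_vpc(ec2, vpc_id, instance_type):
    """Fetch the last hour of spot prices, restricted to the AZs where the
    VPC actually has subnets (hence the VPC ID parameter)."""
    subnets = ec2.describe_subnets(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])["Subnets"]
    azs = {s["AvailabilityZone"] for s in subnets}
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=1),
    )["SpotPriceHistory"]
    return average_spot_price_per_az(
        r for r in history if r["AvailabilityZone"] in azs)
```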

getAutoScalinGroup.py

Here is the CloudFormation parser that will read all the information from the stack name you created the two ASGs with. Those functions are mostly wrappers around existing boto3 ones, to make it easier to get straight to the information we are looking for. In our case, we are going to:

Assume the stack we are looking for our ASGs in could be nested. Therefore, we look at the given stack name and find our ASGs by the Logical ID expressed in the template (ie: asgGPU)

Once the ASG physical IDs are found (ie: mystack-asgGPU-546540984), we can retrieve any sort of information

Instances in the group ?

Scaling Policies ?

Any, but that's all we are looking for here ;)

Of course, we could have looked up the ScalingPolicy physical IDs right away from the CloudFormation template, but just in case you misconfigured / mislinked the ASG and the ScalingPolicy (the policy is not there?), this helps us verify that our ScalingPolicy is linked to the right ASG.
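The nested-stack walk described above could look roughly like this (a sketch, not the repository's exact code; the CloudFormation client is injected so the function can be exercised with a stub):

```python
def find_asg_physical_ids(cfn, stack_name, logical_ids):
    """Walk a (possibly nested) stack and map ASG logical IDs
    (e.g. 'asgGPU') to their physical names."""
    found = {}
    stacks = [stack_name]
    while stacks:
        name = stacks.pop()
        for res in cfn.describe_stack_resources(StackName=name)["StackResources"]:
            if res["ResourceType"] == "AWS::CloudFormation::Stack":
                stacks.append(res["PhysicalResourceId"])  # recurse into nested stack
            elif res["LogicalResourceId"] in logical_ids:
                found[res["LogicalResourceId"]] = res["PhysicalResourceId"]
    return found
```

Usage would be `find_asg_physical_ids(boto3.client("cloudformation"), "mystack", {"asgGPU", "asgOnDemand"})`.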

spotmanager.py

This is the central script from which all the others are executed. Originally, this function was called directly by an Invoke API call providing most of the variables to the function. In the repository, you will find a file named spotmanager_sns.py, which is the adaptation of the code to our use-case. The main difference is that we assume the topic name is a combination of the stack name (AWS::StackName) and other variables. That way we know which stack runs it and we can find out the rest.
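That naming convention can be sketched as follows (the separator and the `<stackname><sep><suffix>` layout are my assumption for illustration, not necessarily the repository's exact convention):

```python
def stack_name_from_topic_arn(topic_arn, separator="-"):
    """Recover the stack name from an SNS topic ARN, assuming the
    hypothetical naming convention '<stackname><sep><suffix>'."""
    topic_name = topic_arn.split(":")[-1]   # ARN's last field is the topic name
    return topic_name.split(separator)[0]

def handler(event, context):
    """Minimal SNS-triggered entry point: SNS wraps the message in Records,
    each carrying the TopicArn that tells us which stack is concerned."""
    topic_arn = event["Records"][0]["Sns"]["TopicArn"]
    return stack_name_from_topic_arn(topic_arn)
```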

So here is the algorithm.

Note

Any Pull Request to make it a better function is welcome :)

The IAM Role

As for every Lambda function, I create an IAM role to control in detail every access of each individual function to the resources. Here are the different statements I have set in my policy.
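The original statements are not reproduced here, so below is a plausible reconstruction based on the calls the function makes (describe the stack and ASGs, read subnets and spot prices, execute the scaling policies); treat the exact action list as an assumption:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeResources",
      "Effect": "Allow",
      "Action": [
        "cloudformation:DescribeStacks",
        "cloudformation:DescribeStackResources",
        "ec2:DescribeSubnets",
        "ec2:DescribeSpotPriceHistory",
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribePolicies"
      ],
      "Resource": "*"
    },
    {
      "Sid": "TriggerScaling",
      "Effect": "Allow",
      "Action": [
        "autoscaling:ExecutePolicy",
        "autoscaling:SetDesiredCapacity"
      ],
      "Resource": "*"
    }
  ]
}
```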

Note

Do not forget the AWS Managed policies AWSLambdaBasicExecutionRole and AWSLambdaExecute so you will have the logs in CloudWatch logs.

How could we make it more secure ?

I built this function to work for all my stacks, hence why the resources are "all" (*). But if there is a risk that the function could go rogue or be exploited, we could do something very simple in our CloudFormation stack:

Create the policies and the role as described earlier specifying the resources as we created them in the CF Template.

Create the Lambda function with the stack (this requires having bundled the function as a zip file)

A bit of extra work for extra security. Just keep in mind that Lambda charges a little for code storage; but that is probably negligible compared to the financial risk of letting the function go rogue.

Conclusion

Through the code, I hope you will see different ways to approach finding out which ASG to trigger; that is the tricky part. This really helped my friends drop their GPU costs and, should the business be successful, it will let them extend the GPU footprint from one machine to many with the same cost-management control.

The use-case

CloudFormation is probably my favorite AWS service. It helps hundreds of people today with the deployment of all the architecture resources required for their applications to run on AWS: instances, databases, etc.

I use CloudFormation all the time: as soon as there is a piece of architecture that I can use in multiple places, in different environments, it becomes my default deployment method (even just to deploy a couple of VMs).

Some of those resources require particular care: some of the parameters or values have to be kept secret and, ideally, as little human-readable as possible.

Today I want to share with you a thought, and the process I have decided to go with, for those delicate resources that we might want to secure as much as possible, taking the human factor out of the process. As part of the use-case, I want a fully-automated solution which I can reuse anytime and which guarantees that those values are never the same from one stack to another, as well as secured, both in terms of recoverability and confidentiality.

Different approaches

Here, I am going to work with a very simple use-case: for my application, I need an RDS DB. That resource requires a password to get created and for consumers to connect to it afterwards.

1 - Generate from the CLI and hide the values

In our CloudFormation templates, we have the ability to set "NoEcho" on some of the parameters, so after creation we can't read the value. I find this really useful when I have settings I can access as an elevated administrator of the cloud but don't want others to be aware of (ie: the zone ID of a Route53 managed domain). Generally speaking, those are values I could have a default for in the templates (assuming access to the templates is as restricted as the default value you want to keep from others), and once we describe the stack, they won't be displayed in clear text.

Cons:

- Values are known by authorized users and can show up in clear text without an additional level of restriction

- If you have set a default value, you could forget to change it

- The values exist only in the templates, so stack updates might not affect the resources we wanted to update.

2 - Leverage Lambda, KMS and DynamoDB

With CloudFormation, we can create CustomResource(s) and point them to a Lambda function. This Lambda function will execute some code and be treated like any other resource in the stack. Now, in our use-case where we have to set a master password for our RDS instance, we want that password to be different for every stack we create, and to store it somewhere so we can retrieve it; but as we store it, it has to be protected so it won't be human-readable.

That is true for a password, but you could extend that function to encrypt and store any sort of machine-generated information. You might just want the randomness, or the encryption, or both. It's up to you.

Pros:

- Every stack resource that needs a random value will get a new value every time

- Each random value will be encrypted with an AWS-managed encryption key (using AWS KMS), and DynamoDB will store it per region.

- We can back up our DynamoDB table to an S3 bucket (most likely encrypted as well, with a different key) for recovery (and leverage S3 replication to back it up globally).

Cons:

- Can look like overkill for not much (I honestly had that thought at first)

- Requires a good understanding of how CloudFormation and Lambda custom resources work together.

At the end of the day, the cost of that solution is probably around 1 USD per month for the KMS key + the Lambda function + DynamoDB storage. So it is a neutral argument, unless you end up with a bazillion stacks and stored resources. If you think that would be your case, see the At Scale section.

3 - Use S3 bucket

Update on 2016-11-07, in response to Harold L. Spencer. Thanks Harold for that proposal ;)

Here, instead of going with DynamoDB to back up the passwords, Harold asked whether it would be better to use S3 to store them: as the stack is created, we would create the same kind of record in a file, which we would encrypt and store in an S3 bucket (itself encrypted). So, here is what I see as pros and cons:

Pros:

No need for DynamoDB, so it potentially removes the read and write capacity units, which are more expensive than S3 GET/PUT

S3 has a multi-region replication mechanism, so we can save our data as we need

S3 has a versioning system, so we could version each new configuration if need be.

Cons:

No query capabilities in S3, so to find the file you are looking for, its key needs to be unique and you already need to know what that key is.

Depending on the file layout in the S3 bucket or the payload file format, the parsing required could make it more difficult to update / delete the file

Even with versioning, you might not be able to determine what went wrong if you corrupted the file (or at least it would be as complicated as with DynamoDB)

At this point, I would agree that using S3 for storage could be a viable and even cheaper solution. However, as said in the At Scale section, here is why I think the two come out about the same:

For both DynamoDB and S3, you have to make a KMS call to encrypt and decrypt the payload that is going to be stored. Regardless of the scale, you call KMS the same way in both cases.

In this very particular use-case, the chances that the DynamoDB table's read requirements exceed the free tier (25 units) are extremely low.

With the right combination of automation, you can back up a DynamoDB table to one (or more) S3 bucket(s) as easily as you would replicate an S3 bucket.

How to ?

So, at this point, I have decided to use that second method for all the RDS resources I will create with CloudFormation. Here is what we need to do:

Create a DynamoDB table (per region) we are going to use to store our different stacks' passwords

Create a KMS key ($0.53 per month, so..)

Create the Lambda functions

Create a Lambda function to generate, encrypt and store the password in DynamoDB

Create a Lambda function to decrypt the key for both CloudFormation and any Invoke capable resource

Create the cloudformation resources in our stack to generate all of the above

Here is a very simple diagram of the workflow our CloudFormation stack is going to go through to create our RDS resources.

1 - The DynamoDB table

Why DynamoDB? Well, because it is very simple to use and very cheap for our use-case; you won't even go over the free tier. At its core, DynamoDB is a NoSQL service that you can use directly via API calls, as long as the consumer has permissions to write/read from it.
Very simply: we are going to create a table with a primary key and a sort key (to ensure we aren't doing anything stupid). The DynamoDB table structure is debatable; please comment if you have suggestions :)

Create the table - Dashboard

In your Dashboard, go to the DynamoDB service. There, start to create a new table.

With CloudFormation, I extensively use the "Env" tag to be able to identify all other resources via mappings, etc. To create my table, I decided to pair the stack name (which is unique in the region, granted) with this Env value. That way, it sort of ensures that I am not overwriting a key by mistake; otherwise, instead of creating a new item, the function would update the fields and you could possibly lose the information.

There, we are going to use only very little of the writes and reads. Therefore, there is no need to go with the default of 5 capacity units for reads and writes; 1 of each is plenty.
Wait for the table to be created (it should really only take a minute). Make sure all settings look good.
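The same table can be created programmatically; here is a sketch with boto3 (the table and attribute names are illustrative, matching the stack-name + Env pairing described above):

```python
def build_create_table_args(table_name):
    """Kwargs for dynamodb.create_table(): primary + sort key, 1 RCU / 1 WCU."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "StackName", "AttributeType": "S"},
            {"AttributeName": "Env", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "StackName", "KeyType": "HASH"},   # primary key
            {"AttributeName": "Env", "KeyType": "RANGE"},        # sort key
        ],
        "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    }

def create_table(table_name):
    import boto3  # lazy, so the kwargs builder stays testable offline
    return boto3.client("dynamodb").create_table(**build_create_table_args(table_name))
```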

2 - The KMS Key

DynamoDB doesn't come with a native encryption solution and, to a certain extent, all data is potentially cleartext. So, prior to storing our password, we are going to leverage KMS to encrypt it.
The good thing is: KMS is probably less likely to lose your key than you are to lose your USB key or tape, for the old-school among us.

Create the KMS key - Dashboard

In IAM, select the right region and create a new key.

I have decided to name my key in a way that gives me a very simple way to identify what each key does. Maybe something to exploit to mislead the enemy? ^^ Make sure you let KMS manage the key material for you.

The administrators of the key are the users / roles who can revoke (delete) the key or change its configuration. Choose those users very carefully. Here, I select my user as the only administrator of the key.

Now, just as for the admins, I select which users can use the key to encrypt / decrypt data. For now, I only select my user. Later on, we will grant those usage rights to our IAM role for the Lambda functions.

Final validation of the IAM policy that applies to the key itself. This is a key policy, so check it twice!

Click on Finish to complete the key creation.

There! Your key has been created and we can start using it. Note the KeyId somewhere, or remember how to come back here; we will need that key later.

3 - The Lambda functionS

AWS Lambda. What an awesome service, right? Write some code, store it, call it when you need it, with no additional pain. This is where you discover that I am a Python developer and, as such, all my Lambda functions are written in Python. A bit of history: I started with one of the first versions of boto 2, and it was nice; but once I tasted boto3 and its documentation, that is where the sweetness came in ;) boto3 really makes it super easy for us to talk to AWS.

As for the code, you will be able to find it on gists / github.com via links as we go through the script.

3A - Generate, encrypt and store the password

The lambda function's role

Remember: in AWS as in general, the fewer privileges you give to a function, the lower the risk if someone gains access to it.

Start by going into IAM again, into Roles this time. From here, click on "Create new role".

I usually prefix the role name with the role type. Here, "lambda", as this role will be used by a Lambda function; then "cfEncrypt" tells me this is the role we will use for the encryption function.

In the AWS service roles, select AWS Lambda. This is what's called the trust policy: it simply states that, for this role, IAM will allow the Lambda service to assume it and make API calls.

We are going to select two AWS managed policies, as AWS has preconfigured those for general purposes. Those policies are necessary for Lambda to create the log files and other reports. If you find them too permissive, feel free to change them, but beware that you then have to know all the details around CloudWatch and Lambda function logging.

Here, final step. We are good to create the role :)

Now that we have the baseline for our Lambda function to have the appropriate powers, we still have to create a policy so it will be allowed to write to (or read from, for the decrypt function) our DynamoDB table. The policy should be as follows:

So, via the Dashboard again, here is a simple run-through of how to create the policy properly.
Go to the IAM service, then to the Policies section, then click on the "Create policy" button. On the next screen, select "Policy Generator".

The policy generator is a very simple and efficient tool to help you build the JSON policy if you aren't used to writing and reading JSON IAM policies.

In the service dropdown, select "Amazon DynamoDB", then select the "PutItem" action. In the Resource ARN field, use the ARN of your DynamoDB table. This is the best way to be sure that the policy won't allow any other action against any other table.

Once you've clicked, you will see that a first statement has been created for the policy. For now, we don't need any other statement, so click on "Next step"

The last step before creation is to review the JSON and the policy name / description. Once you have named your policy and description, click on "Create policy"

At this point, we simply have to attach the policy to our existing role.
So, back in the Roles section of the IAM dashboard, select the "lambdaCfEncrypt" role. In the role description page, select "Attach policy". You are taken to a new page where you can select the policy to attach:

Create the function

So, first of all, we want to generate a password that complies with our security policy and works for our backend. In my use-case, it is a MySQL DB, so as it stands I go for letters (lower and upper case), numbers, and a special character.
Once the password is generated (0.05ms later..), we call KMS and use our key to encrypt the password text (add another 20ms). Then, we write the base64 of the whole thing to our DynamoDB table, with all the attributes necessary to make sure it is unique.
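The generate / encrypt / store pipeline could be sketched like this (a hedged illustration, not the gist's exact code: the special-character set, attribute names, and 16-character length are my assumptions; the AWS clients are injected to keep the sketch testable):

```python
import base64
import secrets
import string

SPECIALS = "-_"  # illustrative MySQL-safe special characters

def generate_password(length=16):
    """Letters (upper and lower), digits, plus one special character."""
    alphabet = string.ascii_letters + string.digits
    body = "".join(secrets.choice(alphabet) for _ in range(length - 1))
    return body + secrets.choice(SPECIALS)

def encrypt_and_store(kms, dynamodb, key_id, table, stack_name, env, password):
    """Encrypt with KMS, then store the base64 blob in DynamoDB."""
    blob = kms.encrypt(KeyId=key_id, Plaintext=password.encode())["CiphertextBlob"]
    dynamodb.put_item(
        TableName=table,
        Item={
            "StackName": {"S": stack_name},   # primary key
            "Env": {"S": env},                # sort key
            "Password": {"S": base64.b64encode(blob).decode()},
        },
    )
```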

In this part, I am going to do it only via the Dashboard so it stays user friendly.

In the Lambda dashboard, go to create function. Skip the blueprint selection by clicking on the next step right away in the top left corner.

As Lambda is event-triggered, you can define triggers to execute the Lambda function. Here, we don't need to configure a specific trigger, as we are only going to call our Lambda function when CloudFormation does.

"Oh oh.. lots of settings here" - don't panic! We have already prepared everything necessary for this step. Here, we give our function a lovely name, a meaningful description, and the code. To make the tutorial user-friendly, I selected "Upload a Zip file" just so everything fits within the page, but you can take the code here and copy-paste it inline.

Here we are, ready to create the lambda function :)

3B - Decrypt the password

The other lambda function's role

As you guessed, here we are going to create a role just like the previous one; but this time we grant read-only access to the DynamoDB table and decrypt rights on the KMS key.

To create the lambdaCfDecrypt role, follow exactly the same steps as described for the "Encrypt" function.


Create the function

To create the function, follow the exact same steps as for the cfRdsPasswordGenerate function.
The code is in this gist, so you can put that inline.

Note

Do not forget to change the role of the function to the lambdaCfDecrypt role.

4 - Put it all together with CloudFormation

Now that we have created our Lambda functions and tested them, it is time to get our CloudFormation running. As for the Lambda functions, you can find the full CloudFormation template on my GitHub account, here.

So, you might know all the AWS::EC2::Instance resource attributes, but do you know the custom resource? This is the special resource with which we are going to call our Lambda function, passing parameters, and which is going to capture the values we want. Here is a very simple snippet of the two resources that call our Lambda functions (these go in the Resources section of your template, with your inputs under Properties).
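The snippet itself is not reproduced here, so below is a hedged reconstruction of what those two custom resources could look like; `lambdaGetDBPassword` is the logical ID mentioned in this article, while the generator's logical ID, the `Custom::` type names and the ServiceToken ARNs are illustrative:

```json
{
  "lambdaGenDBPassword": {
    "Type": "Custom::GeneratePassword",
    "Properties": {
      "ServiceToken": "arn:aws:lambda:eu-west-1:123456789012:function:cfRdsPasswordGenerate",
      "StackName": {"Ref": "AWS::StackName"},
      "Env": "dev"
    }
  },
  "lambdaGetDBPassword": {
    "Type": "Custom::GetPassword",
    "Properties": {
      "ServiceToken": "arn:aws:lambda:eu-west-1:123456789012:function:cfRdsPasswordDecrypt",
      "StackName": {"Ref": "AWS::StackName"},
      "Env": "dev"
    }
  }
}
```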

If the resource creation succeeded, how can we get the password out of the lambdaGetDBPassword resource?

Below is a very small snippet of an RDS instance resource, for which I deliberately kept only the DB password attribute. Here, we first ensure that the Lambda custom resource worked and was created successfully, using the "DependsOn" attribute. Then, for the password, we simply have to get it out of the custom resource, using the "Fn::GetAtt" function.
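A hedged reconstruction of that snippet (the RDS logical ID and the "Password" attribute key are my assumptions; every other required DBInstance property is omitted for brevity):

```json
{
  "rdsDatabase": {
    "Type": "AWS::RDS::DBInstance",
    "DependsOn": "lambdaGetDBPassword",
    "Properties": {
      "MasterUserPassword": {"Fn::GetAtt": ["lambdaGetDBPassword", "Password"]}
    }
  }
}
```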

The attribute name is the one we previously set in the "Data" object of the response, in the code.

I had this question: why have two separate functions used by CloudFormation, instead of having the generator function return the cleartext directly?

Well, using only one function might save you around 10 seconds in the stack creation, but I find that using a function to decrypt is a very nice way to be sure, at creation time, that the "reverse" process of getting and decrypting the password works as expected. That way, you know that you can reuse that function in different places again and again, and the logic stays very simple.

Conclusion

This is a very simple example of all the possibilities Lambda and CloudFormation offer us. I hope this will help you in your journey to AWS and automation.

At scale

When we created the DynamoDB table, as you saw, we set the read and write capacity units to 1, because this table will potentially be used only when we create a new stack for dev/test and our resources need a password. But if tomorrow you find yourself in a position where you have 100 RDS DBs, and each individual DB has tens of consumers that call our Lambda function when they initialize, Lambda won't be our limitation; DynamoDB might be. In that case, you might want to look at the table metrics and maybe raise the read capacity, so more consumers can potentially read all at the same time without being throttled.

Also, it is worth mentioning that KMS has a cost per call. So again, depending on the kind of resources that need to decrypt the information with the key, you might have to make sure that your resources ask for a decrypt only when it is necessary (the free tier ends at 20,000 requests globally, then it goes at $0.03 per 10,000 requests).
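To get a feel for the order of magnitude, here is a quick back-of-the-envelope calculation of the monthly KMS bill for a given number of requests, using the pricing above (which may of course have changed since this was written) :

```shell
# Rough KMS cost estimate : first 20,000 requests are free,
# then $0.03 per 10,000 requests
requests=1000000
awk -v n="$requests" 'BEGIN {
  b = n - 20000; if (b < 0) b = 0;
  printf "%.2f\n", b / 10000 * 0.03
}'
# → 2.94
```

So a million decrypt calls a month is still under $3 : the real concern is throttling and latency, not the bill, until you get to serious scale.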

Edited on 2016-11-07

The 3rd option, using an S3 bucket, could give us an alternative, but at your own risk : if the number of calls you have to make to KMS to get and decrypt the payload becomes a struggle for your bill, and if you are super confident in your ability to write S3 bucket policies and in your VPC network configuration, you could keep the payload non-encrypted in the bucket, leverage a VPC Endpoint to S3, and have the instances / resources that need the information fetch it in clear-text.

At your own risk, then, if you happen to store the unprotected root password of your black-box there.

Hi everyone. I have decided to move away from Wordpress.com, whom I thank for hosting my previous blog for the past 2 years. But, as I am moving forward with my journey to AWS and cloud in general, I wanted to start using more of the AWS'someness.

A while ago, I started a blog using Nikola, which I am using again today to generate all the future blog articles and guides/how-to's and host it directly in S3.

Where are the old posts ?

Some of the very old articles I have are still in my Github account, therefore I will simply republish them, probably in the archives part.

For the most recent AWS or Eucalyptus articles, they will simply be re-published here very soon :)

It has been a while since I last wrote a blog post, but today I wanted to share some recent experience with my public cloud of heart and their GPU instances offering. I know that many people probably did it, and did it way better than I did, especially on Ubuntu. But as I am not Ubuntu's #1 fan and much prefer CentOS, I wanted to share with you my steps and results using FFMPEG and the NVIDIA codecs to leverage the GPU for video encoding and manipulation.

In the near future, I will take a look at the transcoder service in AWS, but as it doesn't meet the requirements for the entire pipeline of my videos' lifecycle, I am yet to determine how to leverage that service.

So, historically I wanted to use the Amazon Linux image with the drivers installed (the one with the nice NVIDIA icon). But I faced many more problems with it than I thought I would. Therefore, I decided to go for a basic minimal install of CentOS 7 and take it from there. And here we go !

Prerequisites

As I just mentioned, I use for this tutorial the official CentOS 7 image available on the AWS Marketplace. This is a minimal install, so at first I recommend installing your favorite packages as well as some of the packages coming from the Base group.

yum upgrade -y; yum install wget git -y; reboot

Also, to install the NVIDIA drivers, you will have to remove the "nouveau" driver / module from the machine.

Install CUDA

CUDA is not required to do the encode and use the latest h264_nvenc encoder available in FFMPEG. However, CUDA seems required to leverage the NVIDIA "NVResize" API, which resizes videos using the GPU. Though, as I am not an expert, there might already be an option to do so with the nvenc encoder in the latest FFMPEG version (research to be continued).
IT IS IMPORTANT THAT YOU START WITH CUDA BEFORE INSTALLING THE NVIDIA DRIVERS !

From your AWS EC2 instance, downloading the installer won't take long. Grab a tea :)
Once we have it, execute the run file as root :

sudo chmod +x cuda_7.5.18_linux.run
sudo ./cuda_7.5.18_linux.run

Accept the terms, then the options I took were :

Install CUDA
Install the Libraries
Do not install the samples
The CUDA installer will install some NVIDIA drivers. At the end of the process (if successful), you should be able to enable the nvidia kernel modules :

sudo modprobe nvidia
sudo modprobe nvidia_uvm

So far so good, with lsmod you should see those enabled.

The CUDA utils

Thanks to this guide, I realized there would be a potential step to install a CUDA utility library helping FFMPEG communicate with CUDA. A pretty straightforward step.

The Right NVIDIA drivers

To save you lots of trouble, let's just say that after 24h of different annoying, non-verbose errors, I figured something was wrong with the drivers delivered by the CUDA installer (v352.79). So now, let's get the latest NVIDIA drivers. You can always get the latest from the NVIDIA website : http://www.nvidia.com/download/driverResults.aspx/106780/en-us Those are the latest (as of Aug. 2016, v367.44). Make sure you download the Linux x64 version for the GRID K520.
On your instance :

Accept the terms, and acknowledge that this installer will install new drivers and uninstall the old ones.
I did not pick yes for the 32-bit compatibility drivers, and went for the DKMS option. Once the driver install is finished, I strongly suggest rebooting, and then, as before, checking the kernel modules to verify those are enabled and working (use lsmod).

The nvEncodeAPI.h

You will need this header file in your library path to compile FFMPEG with --enable-nvenc and use the encoder. To get this one, you will need to subscribe to the NVIDIA developer program and get the Video_Codec_SDK_7.0.1. I could have made this available for all, but I will leave you to accept the terms and conditions and get your hands on it yourself.
Once you have it, upload it (via SFTP most likely) to your instance, unzip the file, and locate the nvEncodeAPI.h file. Keep it in your back pocket, we will need it soon.

Compile FFMPEG

Now arriving at the final step. As the guide referenced earlier mentions, a few steps are required prior to compiling FFMPEG :
get the right FFMPEG version
get the patch to enable nvresize
For my own FFMPEG, I needed some additional plugins. Here is the script I used to install those. Note the exit 1 just before FFMPEG : this was a fail-safe to avoid forgetting some of the little details that follow. Use the script for the packages that are not in the repos (ie: for x264).
Prepare your compilation folders

I personally like to put extra, self-compiled packages in /opt as they are easy to find. But feel free to do as you prefer. For the following steps, I will be doing all the work as root (I know, I know ..) in /opt (if you used my script so far, skip the folder creation).
mkdir -p /opt/ffmpeg_sources
mkdir -p /opt/ffmpeg_build

Now, we can go ahead with building all the dependencies. The shell script I have done will cover those parts.

From this point, you can't use ./configure with --enable-nvresize and --enable-nvenc: we are missing the libraries.

nvEncodeAPI.h

Simply copy the header file into /opt/ffmpeg_build/include
cudautils

Go back to your cuda utils folder. I did the quick and dirty, yet working, cp * /opt/ffmpeg_build/include and cp * /opt/ffmpeg_build/lib. Ideally, you would just put the .a files in the lib folder and the .h files in the include folder.
Configure
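A configure line along these lines should do it. Treat it as a sketch : the paths match the /opt layout used above, --enable-nvenc and --enable-nonfree are the standard flags for the nvenc encoder in FFMPEG of that era, and --enable-nvresize comes from the NVIDIA patch mentioned earlier, so it only exists if you applied that patch :

```shell
# Build FFMPEG against the libraries staged in /opt/ffmpeg_build
cd /opt/ffmpeg_sources/ffmpeg
PKG_CONFIG_PATH=/opt/ffmpeg_build/lib/pkgconfig ./configure \
    --prefix=/opt/ffmpeg_build \
    --extra-cflags=-I/opt/ffmpeg_build/include \
    --extra-ldflags=-L/opt/ffmpeg_build/lib \
    --enable-gpl --enable-libx264 \
    --enable-nonfree --enable-nvenc --enable-nvresize
make && make install
```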

If FFMPEG doesn't run at this point, something went wrong in one of the previous steps.
Test FFMPEG with NVENC and NVResize

Well, for this part I have followed the basic demo and test commands that this PDF guide suggested. To see the CPU and GPU usage, I had htop and nvidia-smi (watch -d -n1 nvidia-smi) running side-by-side in my TMUX session.
Now, in a third pane of my TMUX, I ran the different commands, such as :
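The exact command lines from the demo guide aren't reproduced here, but a typical nvenc test looks like the following. The file names are placeholders, and you should check `ffmpeg -encoders` to confirm h264_nvenc is in your build (use the full path to your freshly built binary if it is not on your PATH) :

```shell
# Transcode with the GPU encoder ; audio is copied untouched
ffmpeg -i input.mp4 -c:v h264_nvenc -preset hq -b:v 5M -c:a copy output_nvenc.mp4
```

While this runs, htop should stay relatively quiet and nvidia-smi should show the encoder being busy.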

VPC

In 2014, VPC became the default networking mode in AWS, letting the EC2 Classic networking mode go. VPC is a great way to manage and control the network environment in which the AWS resources will run. It also gives full control in the case of a hybrid cloud, or at least of an IT extension, with a lot of ways to interconnect the two.

A lot of new features came out of this but, most importantly, VPC gives everyone the ability to have backend applications running privately. No public traffic, no access to and from the internet unless wanted. A keystone for AWS to promote the public cloud as a safe place.

Midokura

Midokura is SDN software which is used to manage routing between instances, routing to the internet, security groups, etc. The super cool thing about Midokura is its capacity to be highly available and to scale over time. Of course, being originally a networking guy, I also find it super cool to have BGP capability.

Requirements

Here is what my architecture looks like :

VLAN 1001 is here for Eucalyptus components communication. VLAN 1010 is for Midonet components communication, including Cassandra and Zookeeper. VLAN 1002 is our BGP zone. VLAN 0 is basically the interfaces' default network, which is not relevant here. We will use it only for package downloads, etc.

From here, we need :

UFS (CentOS 6)

CLC (CentOS 6)

CCSC (CentOS 6)

NC(s) (CentOS 6)

Cassandra/Zookeeper (CentOS 7)

BGP Server / Router

Of course, you could put everything on the same L2/L3 network. But you are not a bad person, are you ? ;) The Eucalyptus components will be on CentOS 6.6, and CentOS 7 for the others. Some of the Eucalyptus components will have a midonet component, depending on the service they run. Today we will do a single-cluster deployment, but nothing would change in that regard.

But before going any further, we should sit down and understand how Midokura and Eucalyptus work together. Midokura is here as the SDN provider. Eucalyptus is here to understand VPC / EC2 API calls and pilot the other components to provide the resources.
Now, what is VPC ? VPC stands for Virtual Private Cloud. Technically, what does that mean ? You will be able to create a delimited zone (define the perimeter) specifying the different networks (subnets) in which instances will run. By default, those instances will have a private IP address only, and no internet access unless you specifically grant it.

In a classical environment, that would correspond to having different routers (L3) connected to different switches responsible for traffic isolation (L2). Here, this is exactly what Midokura will do for us. Midokura will create virtual routers and switches which will be used by Eucalyptus to place resources.

How Eucalyptus and Midokura work together - Logical view

I am on my Eucalyptus cloud. No VPC has been created, and at this time I have no resources. Eucalyptus will have created 1 router in midokura, called "eucart". This router is the "top-upstream" router, to which new routers will be connected. For our explanation, we will call this router EUCART.

So now, when I create a new VPC, I do it with

euca-create-vpc CIDR
euca-create-vpc 192.168.0.0/24

This will create a new router in the system, which we will call RouterA. RouterA will be responsible for the communication of instances between subnets. But at this time, I have no subnets. So, let's create two :
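I don't have the exact commands at hand anymore, so take these as a sketch : euca2ools mirrors the EC2 API here, but double-check the option names with `euca-create-subnet --help` before copy-pasting (the VPC id is obviously illustrative) :

```shell
# Two /25 subnets inside the 192.168.0.0/24 VPC (vpc id is a placeholder)
euca-create-subnet -c vpc-12345678 -i 192.168.0.0/25
euca-create-subnet -c vpc-12345678 -i 192.168.0.128/25
```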

Now I have two different subnets. If I had to represent the network we have just created we would have :

Of course, if I had multiple VPCs, then we would simply have duplicated the VPCA group, and more switches if we had more subnets.

As in AWS, you can have an "internet gateway", which is simply a router which allows instances to have public internet access as soon as they have an Elastic IP (EIP) associated with them.

That's it for the logical mechanisms. Let's attack the technical install. Brace yourself, it is going to take a while.

Eucalyptus & Midokura - Install

First we are going to begin with Eucalyptus. I won't dig too much into the steps you can find here for the packages install.
As said, in this deployment we are going to have : 1 CLC, 1 UFS, 1 CC/SC and 1 NC. But before initializing the cloud, we are going to create a few VLANs to separate our traffic.

While it is initializing, we are going to add a new VLAN on our CLC and NC, as well as on the Cassandra and Zookeeper machines. For those, I will use VLAN 1001.

vconfig add em2 1001
# here I am putting my machines in a different subnet than for VLAN 1000 | change for each machine
ifconfig em2.1001 192.168.1.65/28 up

Cassandra - Zookeeper

Alright, let's move onto the Cassandra / Zookeeper machine. In this deployment I will have only 1 machine to host the cluster but, of course, for production the recommended minimum is 3, for better scale and redundancy.

For all components, don't forget to change the VNET_MODE value to "VPCMIDO" instead of "MANAGED-NOVLAN" which will indicate to the components that their configuration must fit VPCMIDO requirements.

So, from here, the Cassandra and Zookeeper machine will allow us to have midonet-api and midolman installed.
Midonet-API is the endpoint against which the Eucalyptus components will make API calls to create new routers and switches, as well as configure security groups (L4 filtering). Midolman connects the different systems together and makes the networking possible. You MUST have midolman on each NC and on the CLC. Midonet-API is only to be installed on the CLC.

To have the API working in Eucalyptus 4.1.0, we sadly have to install it on the CLC, and the CLC only. Here, our CLC will act as the "Midonet Gateway", this EUCART router I was talking about previously.

Let's do the install : (of course, here you will also need the midonet.repo we used before).

yum install tomcat midolman midonet-api python-midonetclient

Tomcat will act as the server for the API (basically). Unfortunately, the port Eucalyptus uses to talk to the API has been hardcoded :'( to 8080. So before going any further, we need to move one of Eucalyptus' own ports to a different one :

$> euca-modify-property -p www.http_port=8081
# 8081 was 8080

If you don't make this change, your API will never be available.

Now that the packages are installed and the port 8080/TCP is free, we must configure Tomcat itself to serve the midonet API. Add a new file named "midonet-api.xml" into /etc/tomcat/Catalina/localhost :
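The file itself is a one-element Tomcat context. The docBase below is where the midonet-api package lands on CentOS as I remember it, so verify the path on your own system :

```xml
<Context path="/midonet-api"
         docBase="/usr/share/midonet-api"
         antiResourceLocking="false"
         privileged="true" />
```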

Alright, now we can start tomcat, which will enable the midonet-api. To verify, you can simply do a curl call on the entry point :

curl <CLC VLAN 1001 IP>:8080/midonet-api/

Midolman

We can now configure midolman. The good thing about the midolman configuration is that you can use the same configuration across all nodes. Once more, we simply have to change a few parameters to use our Cassandra / Zookeeper server. Edit /etc/midolman/
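For reference, the two sections to touch in the midolman configuration file look like this (the IP is my Cassandra / Zookeeper machine on VLAN 1001 and is purely illustrative ; adjust to yours) :

```
[zookeeper]
zookeeper_hosts = 192.168.1.66:2181

[cassandra]
servers = 192.168.1.66
replication_factor = 1
cluster = midonet
```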

Once you have installed and configured midolman on every component, we need to configure midonet to know all our cloud components; here we will simply call them "hosts" (the terminology is very important).

Back on our CLC, let's add a midonetrc file so we don't have to specify the IP address every time.
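A minimal ~/.midonetrc looks like the following. The URL reuses the CLC's VLAN 1001 IP ; since authentication is effectively disabled here, the credentials are just placeholders :

```
[cli]
api_url = http://<CLC VLAN 1001 IP>:8080/midonet-api
username = admin
password = admin
project_id = admin
```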

Here, the credentials are not important and won't work. So, anytime you want to get onto midonet-cli, use the "-A" option.

Before we get any further, there are 2 new packages which MUST be installed on the CLC : eucanetd and nginx. Explanations later.

yum install eucanetd nginx -y

We are halfway there. I know, it sounds like quite a lot, but in fact that is not that much. We now need to write the Eucalyptus network configuration. This, as for EDGE, is done using a JSON template. Pay attention : a mistake here will cause you headaches for a long time.

PublicIps : list of ranges and / or subnets of public IPs which can be used by the instances

Mido : this is the most important object !!

EucanetdHost : String() which points to the server which runs the eucanetd binary and the midonet API

GatewayHost : String() which points to the server which runs the midonet GW. As said, for now the GW and EucanetdHost must be the same machine.

GatewayIP : String() which indicates which IP will be used by the router EUCART. Here, you must use an IP address which DOES NOT exist yet !

GatewayInterface : the iface which is used for the GatewayIP. Here, I had created a dedicated VLAN for it, VLAN 1002.

PublicNetworkCidr : String() which is the network / subnet for all your public IPs. In my example, I am using a /16 and defined only a /24 for my cloud's public IPs, because I can have multiple clouds in this /16, each using a different range of IPs.

PublicGatewayIP : String() which points to our BGP router.
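Putting those fields together, the JSON template ends up looking roughly like this (hostnames, IPs and ranges are from my lab and purely illustrative ; the key names are the part that matters) :

```json
{
  "Mode": "VPCMIDO",
  "InstanceDnsServers": ["192.168.1.65"],
  "PublicIps": ["172.16.0.10-172.16.0.250"],
  "Mido": {
    "EucanetdHost": "clc.mycloud.lan",
    "GatewayHost": "clc.mycloud.lan",
    "GatewayIP": "172.16.255.1",
    "GatewayInterface": "em1.1002",
    "PublicNetworkCidr": "172.16.0.0/16",
    "PublicGatewayIP": "172.16.255.254"
  }
}
```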

Warning

Don't forget that the GatewayInterface must be an interface WITHOUT an IP address set

For now, in 4.1.0, as VPC is a tech preview, many configurations and topologies are not yet supported. So for now, you must keep the Midonet GW on the CLC, and have both GatewayHost and EucanetdHost pointing to the CLC's DNS name. And this MUST be a DNS name, otherwise the network namespaces won't be created correctly.

Also, while we speak of DNS : you MUST add an entry in your local hosts file so that your hostname resolves to the VLAN 1001 IP.

Alright, at this point we can have instances created in subnets, but they won't be able to connect to external networks. We need to set up BGP and configure midonet for that.

Get onto the BGP server. Here, we are only going to create 1 VLAN, which we will use for the public addresses of instances. We are going to use 172.16.0.0/16, and our BGP router will use 172.16.255.254, as we indicated in the JSON previously.

vconfig add em1 1002
ifconfig em1.1002 172.16.255.254 up

Having it work is very easy (originally, I followed this tutorial) :

Here we can see that the server will get BGP information from 2 "neighbors" with unique IDs. We will later be able to have 1 peer per midonet GW, which will be used by the system to reach our networks.

To simplify : the BGP server is waiting for information coming from other BGP servers. Those BGP servers will be our Midonet GWs. Our Midonet GWs will then announce themselves to the server, saying "hi, I am server ID XXXX, and I know the route to YY networks". Once the announcement is done on the root BGP router, all traffic going through it to reach our instances' EIPs will be sent to our Midonet GWs.

On your CLC, you should see new interfaces being created, called "mbgp_X". This is a good sign : it means that your BGP processes are running and broadcasting information. Let's check on the upstream BGP server that we have learned those routes.

Today I did my first install of CEPH, which I used as the backend for Elastic Block Storage (EBS) in Eucalyptus.
The advantage of CEPH is that it is a distributed system which provides replication (persistence for data), redundancy (we are using a pool of resources, not a single target) and scalability : the easiest way to add capacity is to add nodes which will be used for storage, and those nodes can be simple machines.

I won't go too deep into the CEPH installation. The official documentation is quite good for anyone to get up to speed quickly. I personally had no trouble using CentOS 7 (el7). Also, I won't go too deep into the Eucalyptus installation ; I will simply share with you the CEPH config files and my NC configuration files, which have some values not indicated in the Eucalyptus docs.

I will simply spend some time configuring a non-admin user in CEPH which I will use for my Eucalyptus cloud. Back on your ceph admin node, simply run :

Create the pools

In CEPH, there is a default pool called 'rbd' (pool 0). I don't like to use the default values and settings when I deploy components I can tune / adapt to my use-case. So here, I am going to create 2 pools : 1 to store the EBS volumes and 1 to store the EBS snapshots

ceph osd pool create eucavols 64
ceph osd pool create eucasnaps 64

Here, that's about all we had to do to create the pools.

Warning

The number you set after the pool name (64 in my example) is the number of placement groups, and depends on how many OSDs you have and the replication factor you want. If you have a dev / test cluster, small numbers will do. For larger deployments, refer to the CEPH placement group planning docs. A power of 2 is always best (save yourself CPU cycles ;) )

Create a CEPH user for our pools

When you installed your CEPH cluster, you did all the basic activities using the ceph administrator keys and credentials. Just like you tell people not to dev as root, you don't leave software (here, Eucalyptus) with too much power over your cluster. So, without any further wait, we are going to create a CEPH user called "eucalyptus", which will have only read access on the monitors, and full control over the eucavols and eucasnaps pools.
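Hedging a little : the exact capability string can vary with your CEPH version, but on the versions of that era the command looks like this :

```shell
# Read-only on the monitors, full access restricted to our two pools ;
# the keyring lands in the current directory
ceph auth get-or-create client.eucalyptus \
    mon 'allow r' \
    osd 'allow rwx pool=eucavols, allow rwx pool=eucasnaps' \
    -o eucalyptus.keyring
```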

Running that command on the monitor as the CEPH admin will create the eucalyptus user and generate the eucalyptus.keyring file.
Copy that keyring file to all your NCs and to the SC (by default to /etc/ceph/ ; otherwise, adjust as follows below).

The ceph.conf on the SC and NC

When Eucalyptus implemented CEPH storage, it was at first against a fairly "old" version of CEPH, and it expects some non-default parameters that the CEPH installation scripts do not generate. Here is what the ceph.conf file has to look like on the NC and the SC.

[global]
fsid=ef66f3c8-2cbe-4195-8fbc-bc2b14ba6d69
public_network=192.168.100.0/24
cluster_network=192.168.101.0/24
mon_initial_members=nc-0
mon_host=192.168.100.3
## mon addr is not in the configuration by default. Do not forget to add it as follows:
# if you have multiple monitors, simply list them with a ',' as separator
mon addr=192.168.100.3:6789
#
auth_cluster_required=cephx
auth_service_required=cephx
auth_client_required=cephx

Eucalyptus NC configuration

In /etc/eucalyptus/eucalyptus.conf, edit the values accordingly. Make sure the eucalyptus Linux user / group has read access on the keyring file and the ceph.conf file.
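From memory, the CEPH-related keys in the NC's eucalyptus.conf are the following ; double-check the exact variable names against the comments in your version's eucalyptus.conf before relying on them :

```
CEPH_USER_NAME="eucalyptus"
CEPH_KEYRING_PATH="/etc/ceph/eucalyptus.keyring"
CEPH_CONFIG_PATH="/etc/ceph/ceph.conf"
```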

On Eucalyptus, and also on Amazon Web Services (AWS), there are two types of backed instances : instance-store and EBS. Each one has its advantages and drawbacks, and what we are going to see here is what they are, and how we can use them most efficiently.

Instance-store is usually the first type of VM you will add to your cloud. Very easy to use thanks to eustore (http://www.eucalyptus.com/docs/eucalyptus/3.1/ug/eustore-browse-install.html), instance-store EMIs are stored on Walrus (S3 on AWS). These EMIs are very light, with a minimal setup and packages. This is why you can find images of 500 MB for a complete CentOS system. In an instance-store EMI, there are 3 components :

Kernel Image

Ramdisk Image

System Image

Obviously, the kernel and the ramdisk are here to fit your hypervisor (XEN / KVM) and work in the same way on both. Then we have the system image. This is your OS image. Compressed, it will be expanded to a static size once it runs in your cloud. And as this EMI is packaged in S3, it is very easy for you to have it across all your regions.

As you can see in the picture, bfEBS instances can be run on any node like an instance-store one, but they boot not from the hard drive of the Node Controller, but from an EBS volume on the Storage Controller. This makes our VM dependent on IOPS limitations. It is also a useful feature : this way, you will be able to create instances with disks using different IOPS according to the VM usage, low IOPS for front-end VMs and high IOPS for databases or file servers.

The logs on a system are the most useful files a developer or a sysadmin will use to know if everything is running properly, or to find a bug in an application. For one or two servers, that's not very difficult to manage, but when you've got more than ten servers, going onto each server can take really long. That's why rsyslog was created. Thanks to this program, you will be able to send all your logs (depending on your criteria) to a remote server which will catch them and allow you to collect the logs from all your servers.

There is a very simple design for rsyslog :

As you can imagine, it would be useful to have this on your cloud, where you can have VMs created from scratch or by an auto-scaling group. To avoid searching through every log to find which server it came from, we are also going to create templates that separate the files into different directories according to the server's name. In this part we will store everything as files, but you can also store all the logs in databases and use webapps to get a user-friendly view of them.

So, first go on the machine you want to use as the log server. Here I'm using Debian 7, and the config is the same on RedHat. We're going to modify /etc/rsyslog.conf, which is the main conf file of rsyslog.

So first, since the syslog server is also a syslog client, we can comment out part of the configuration and, at the same time, activate the listening system.
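On a Debian 7 rsyslog, the relevant fragment of /etc/rsyslog.conf looks like this (legacy directive syntax ; port 514 and the directory layout are the usual defaults, adapt them as you like) :

```
# Enable log reception over UDP and TCP
$ModLoad imudp
$UDPServerRun 514
$ModLoad imtcp
$InputTCPServerRun 514

# One directory per client host, one file per program
$template RemoteLogs,"/var/log/remote/%HOSTNAME%/%PROGRAMNAME%.log"
*.* ?RemoteLogs
# stop processing remote messages once written
& ~
```

The %HOSTNAME% and %PROGRAMNAME% properties are what give us the per-server directories mentioned above.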

Forwarding is also one of the very interesting capabilities of BIND. Imagine we have somewhere.net hosted by the primary DNS ns1, and we have a really big zone, "tutorials", which is held by a secondary DNS, ns2. We'd like that, when we query "test.tutorials.somewhere.net" on the master server (which does not have the zone in its configuration files), the ns1 server asks ns2.somewhere.net. There, ns1 will act as a forwarder.

To do so, what we're going to do is tell the ns1 server that "tutorials" is held by ns2. So in our db.somewhere.net we have :

; db for somewhere.net
;
$TTL 86400
@ IN SOA ns1.somewhere.net. root.somewhere.net. (
1 ; Serial
604800 ; Refresh
86400 ; Retry
2419200 ; Expire
86400 ) ; Negative Cache TTL
;
; Here we define the nameservers of the domain.
@ IN NS ns1.somewhere.net.
@ IN NS ns2.somewhere.net.
;
;Here we set the MX records for our domain
@ IN MX 10 smtp.somewhere.net.
;
; Now we set the IP of the nameservers - Use yours
ns1 IN A 192.168.1.1
ns2 IN A 172.16.1.1
;
; Now, we set some records
www IN A 192.168.1.10
smtp IN A 192.168.1.2
;
; And we delegate the tutorials sub-zone to ns2
tutorials IN NS ns2.somewhere.net.

Great. We also have to add this zone as a forwarded one in our named.conf.local
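The named.conf.local entry is a plain forward zone ; the syntax below is standard BIND, with ns2's IP taken from the zone file above :

```
zone "tutorials.somewhere.net" {
    type forward;
    forward only;
    forwarders { 172.16.1.1; };
};
```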