Friday, November 30, 2012

Successfully delivered to Jeff Barr. Notice my face: I usually don't look so silly... I was nervous! :)

Jeff Barr, AWS

Carlos Conde was very difficult to locate at the event: he's an important man. But "the creator" deserves a t-shirt, and a special version at that.

Carlos Conde, AWS

It took some courage to give my present to Adrian Cockcroft. He's like a star! :)

Adrian Cockcroft, Netflix

Bringing ideas and finding out about future plans: Success!

Anil Hinduja, CloudFront

Tom Rizzo, EC2 AWS

AWS Training Team

I had a good chat with the Training Team and there is VERY interesting news about Certification. I'm pretty sure we will have an official announcement in the coming weeks. We'll wait for that.

News:

Zadara Storage: A surprising and interesting approach to providing high-end storage for EC2 Instances. They've managed to get space at AWS Data Centers to install their SAN disk arrays, and they connect them to your EC2 Instances using Direct Connect. This connection method is normally used to connect your office or your on-premises infrastructure to your VPC, but in this case they connect storage through iSCSI or NFS. The service is priced on a per-hour basis and you get full access to the admin tool to define your volumes and parameters like the RAID configuration. With a solution like that, there is no limit to the kind of application you can run on EC2, even the most I/O-demanding ones. We are talking here about non-virtualized storage: the old-fashioned SAN array. Currently it is only available in the US-East Region, but with plans to expand to other regions.
Besides technical and commercial considerations, this product/service says a lot about how open AWS is when it comes to giving tools to their customers. It's hard for me to imagine other companies letting a competitor into their buildings. Well done!

New EC2 Instance Types: A "Cluster High Memory" instance with 240 GB RAM and two 120 GB SSD disks. A "High Storage" instance with 117 GB RAM and 24 hard drives (48 TB total). I can only say: Awesome! According to the EC2 Team, this internal storage will be managed like any other kind of Instance Storage and is therefore ephemeral. In their words: "It will be amazing to see how you (the customers) create new ways to use this storage". I couldn't agree more.

AWS Marketplace is not just a place to sell AMIs. Thanks to Craig Carl's talk I got a wider perspective on AWS Marketplace. We should see it as a tool to sell anything you are able to create in the Amazon Web Services cloud. Not just an AMI with an application stack in it, but a dynamic configuration set: a configuration that adapts to the consumer's needs by gathering information automatically or by interacting with the user.
And a new concept of product just emerged: a Marketplace application could be something other than an application. I'll try to explain it with an example: you could create an application to access some information. The information is what the customer wants (not the application itself). As long as the application is running, the customer is accessing the information and is therefore billed (and you get your cut). When the contract expires, the application shuts down and the deal ceases. Commercial or infrastructure costs on your side (the provider) = zero. Awesome.
In my opinion, a new job role has just been created: "Marketplace application stack developer".

An EC2 Spot Instance can be automatically terminated at any given minute. We knew that they can be terminated without previous warning when an On-Demand user needs the resources you're using, but we didn't know exactly when it could happen.

Saturday, November 17, 2012

- I would like a handshake with Jeff Barr, AWS Evangelist and lead author of the official AWS blog. I think he's doing an excellent job and I admire how he manages to find time to accomplish all his tasks.

- I would like a handshake with Carlos Conde, AWS Europe Solutions Architect. I had the opportunity of helping him at the last Navigate the Cloud Barcelona/Madrid, and there I discovered that he is the author of the awesome design used in all the official AWS Architecture Diagrams. He is an excellent communicator and, as it turns out, a brilliant graphic designer. I have no words to express my admiration.

- I would like to have some beers with my friends at Celingest.com. They are going to be there and I have a present for them (and for the people mentioned above). What is it? You will see ;)

- I would like to know if there is an AWS Architect Certification on the road map and, if so, details about it. There is now an official architecting training course, but I hope there is more coming on this topic.

- I would like to know if there is any plan to adopt BGP routing for Disaster Recovery solutions. AWS is making an effort to become the perfect choice when it comes to DR, and I think it is. The option of having a "sleeping infrastructure" waiting for a disaster to happen and booting up when it does is... priceless. And the cherry on the cake would be the option of routing customer Public IP traffic (only for customers with their own Autonomous System, of course).

- I would like to suggest to the EC2 Team the idea of not auto-terminating EC2 Instances living in an Auto Scaling Group until their "paying hour" has been spent. When in an Auto Scaling Group, EC2 instances are automatically launched and terminated. That's the way it should be. But if the application load decreases, it could happen that an instance that was brought to life 30 minutes ago is terminated (no longer needed) and you waste the other remaining 30 minutes. It would be nice to have an option to tell Auto Scaling not to terminate an instance until the whole hour has passed.

Thursday, November 15, 2012

Thanks to a friend I had the opportunity to test the Newvem Beta tool connected to his AWS Customer account, and I'd like to share some conclusions.
With the fast growth of the Cloud market, some tools are emerging to help us manage those "invisible" and fast-growing architectures. Some of them are trying to help us answer the question: "How can I pay less each month?". I have to say in advance that there is no magic answer. What is good for me might not be good for you. But there are some common scenarios where a bit of help can be useful.

Security:

The first thing that caught my eye was the security recommendations. I wasn't expecting this here, but I have to admit that they're convenient. With a constantly growing infrastructure and a group of Admins taking care of it, there is no such thing as an unnecessary security recommendation.

Tell me about the money:

With the Spend Efficiency chart, Newvem points out some areas to pay attention to. The tool has no way to distinguish what is normal for us from what is not. For example, in this evaluation, a bunch of instances were manually stopped after a special event, and this was detected as an abnormal situation and an alert was generated (Monthly cost changed by -34.00%). So those warnings should be considered just suggestions coming from someone who can't read your mind. The "better safe than sorry" approach.

Reserved Instances Recommendation:

Well, this is not rocket science. An Instance that has been up 100% of the time during the last 2 months should be a Reserved Instance. And among Light, Medium and Heavy Reserved Instances, it should be Heavy. That's the recommendation.
This RI Calculator also gives us some numbers showing how much money we would have to pay in advance (upfront) if we decided to purchase RIs for all those Instance-types, in 1-Year and 3-Year scenarios.
What I really appreciate here is that this simple table is a good starting point for understanding the concepts behind EC2 Reserved Instances. This is a confusing topic for beginners, no matter which area of the company they're in. Thanks to this table, 3 key concepts are explained using our current AWS infrastructure: RI Instance-Type, RI Availability Zone and RI Hourly Price.

RI Instance-Type:
A Reserved Instance purchase applies to an EC2 Instance-Type, not to an instance hostname or Instance ID, present or future. An RI gives you a better price for an Instance-Type, regardless of its usage or which of your EC2 Instances ends up using it.

RI Availability Zone:
A Reserved Instance applies to an Availability Zone. If you run two different Instances in two different AZs within a Region, you will have to purchase two RIs, one for each AZ.

RI Hourly Price:
The yearly savings shown in the table above are the product of the better price/hour you get when buying an RI and the number of hours in a year. What it tells us is the potential benefit we would get if our machine were up and running 100% of the time: the benefit of the RI model compared to On-Demand. But this doesn't mean that we have to (or will) keep our instance always up. We will do whatever we need, starting and stopping it, but with a better hourly price.

And again, when it comes to recommendations, they are not flawless and we need a human in the process. For example here:

For this m1.small in us-east-1d we have an RI Light recommendation, but the historic chart shows me that this Instance Type is no longer used in that particular AZ and probably won't be in the future. Obviously, this is something I know and the RI Calculator doesn't. The human touch.

S3

Newvem also gives us information about our Simple Storage Service, but with my current scenario there is little to say. This website stores its static content in S3 and, with "only" 12 GBytes of total space used, no recommendations are needed.

In conclusion,

I think this kind of tool is useful now, but it will be much more so in the near future. There is no limit to what the software could learn and predict, and all those third-party products will advance faster than the cloud provider (AWS in this case) when it comes to "high level" management. I'm not saying that we will never see a button on our Cloud Console labeled "How to pay less". Just saying that someone else will always be faster at putting that to work.

There are areas not covered where help is needed to handle important cost sources, like Internet and CloudFront traffic. This is a burden for heavy-traffic sites, and currently AWS doesn't give you a report to understand where your traffic spending is going. You need third-party software to collect and process logs, so here... room for improvement.

The application covered here is in Beta stage and free. I'm looking forward to knowing its final price... This will be the key to concluding whether it is useful for my customers or not.

Monday, November 12, 2012

Our Goal: easy access to our Instances by name, instead of locating them through the EC2 Console after an IP change caused by a stop/start action.

It is quite tedious to have to open the AWS Console to find an instance's Public IP after a stop/start action, or if we forget what it previously was. Here I show you a tool that consists of a script executed inside the instance that updates its DNS records in Route53 using the instance Tag "Name". This is an optional Tag we can use to store the "Host Name" when launching a new instance, or edit any time afterwards. If this optional tag is not present, the script will use the Instance ID to update (or create) the corresponding DNS A Record. This way we will always have the instance accessible through its FQDN, and it will be stable (it won't change over time).
Example: My-Instance-Host-Name.ec2.My-Domain.com

Instance Tag Name
Configure your EC2 instance with a Tag Name using the Console. Usually the Instance Launch Wizard will ask you for it, but if it is empty you can update it any time you want. In this example the Tag Name will be "webserver1".

Reading /var/log/messages you should see something like this output. First the script gathers the Instance ID and the Public IP by reading the Instance Metadata, then the current IP ($CURRENTDNSIP) configured in the DNS (if any) using dig, and the Instance Tag Name using the ec2-describe-instances command. The first change to happen is the Host Name: if the Instance Tag Name is present it becomes the machine Host Name, and if not, the Instance ID plays this role. One way or the other, we have a stable way to identify our servers; the Instance ID is unique and won't change over time. Then we call the Route53 API using dnscurl.pl four times. There is no API call to "overwrite" an existing DNS record, so we need to Delete it first and Create it afterwards. The Delete call has to include the exact values the current entry has (quite silly if you ask me...), which is why the script needs the current Public IP configured in DNS. We Delete using the old values and Create using the new ones: one pair of dnscurl executions for the Instance ID (which always exists) and another for the Instance Tag Name (if present).
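The gathering logic described above could be sketched like this (a simplified, hypothetical version: the domain, Hosted Zone ID and paths are placeholders, the awk parsing of ec2-describe-instances output is an assumption about its TAG line format, and the Route53 XML change batch for dnscurl.pl is omitted for brevity):

```shell
#!/bin/sh
# start-up-names.sh -- simplified sketch of the update logic (not the
# original script). Placeholders: DOMAIN, ZONEID, dnscurl.pl key name.

DOMAIN="ec2.My-Domain.com"
ZONEID="ZXXXXXXXXXXXXX"   # your Route53 Hosted Zone ID (placeholder)

# 1. Instance ID and Public IP from the instance metadata service
INSTANCEID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
PUBLICIP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)

# 2. The IP currently published in DNS (needed for the Delete call)
CURRENTDNSIP=$(dig +short "$INSTANCEID.$DOMAIN" A)

# 3. The instance Tag "Name", read with the EC2 API tools
#    (assumes a "TAG instance <id> Name <value>" line in the output)
TAGNAME=$(ec2-describe-instances "$INSTANCEID" \
          | awk '/^TAG/ && $4 == "Name" {print $5}')

# 4. The Tag Name becomes the hostname; fall back to the Instance ID
HOSTNAME_TO_SET=${TAGNAME:-$INSTANCEID}
hostname "$HOSTNAME_TO_SET"

# 5. Delete the old A record and create the new one via the Route53 API;
#    dnscurl.pl is Amazon's request-signing helper, and each update is a
#    DELETE (old value) + CREATE (new value) change batch, once for the
#    Instance ID record and once for the Tag Name record.
# dnscurl.pl --keyname my-aws-account -- -X POST \
#     "https://route53.amazonaws.com/2012-02-29/hostedzone/$ZONEID/rrset" ...
echo "Would update $HOSTNAME_TO_SET.$DOMAIN: $CURRENTDNSIP -> $PUBLICIP"
```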

Two entries should have been automatically created in your Hosted Zone and be present in the Route53 console for our Instance:

Those entries are ready to use, and now you can forget about the Instance ID or the volatile Public IP and just ping or ssh to the name. Example: webserver1.ec2.donatecpu.com.

Auto Start
The main purpose is to keep our servers' IPs automatically updated in our DNS, so we need the main script to be executed every time the machine starts. Once we've verified that it works fine, it's time to edit /etc/rc.local and add the full path of start-up-names.sh to it:

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local
/root/bin/start-up-names.sh

And that's it. I suggest you manually stop and start your instance and verify that its newly assigned Public IP is updated in the DNS. All AMIs you generate from this Instance will include the configuration described here, and therefore they will dynamically maintain their IPs. Cool!

Note: When playing with changes in DNS Records, their TTL value matters. In this exercise we've used a value of 600 seconds, so a change could take up to 10 minutes to become visible in your local area network if your DNS server has cached the record.

It is important to be sure that all the ingredients are working as expected. Auto Scaling can be difficult to debug, and nasty situations may occur, like a group of instances starting while you are away, or a new instance starting and stopping every 20 seconds with bad billing consequences (AWS will charge you a full hour for any started instance, even if it has been running for only one minute).
I strongly suggest manually testing your components before creating an Auto Scaling configuration.

- Create your Key Pair (In my example "juankeys").

- Deploy an ELB (in my example it is named "elb-prueba") in your default AZ ("a"). Configure the ELB to use your custom /ping.html page as the Instance Health Monitor. You should see something like this:

- Create a Security Group for your Web Server instances (in my example "wed-servers"). Add the ELB Security Group to this Security Group for port 80. It should look like the capture below. In this example this SG allows Ping and TCP access from my home to the Instances AND allows access to port 80 for connections originated by my Load Balancers (amazon-elb-sg). The Web Server's port 80 is not open to the Internet; it is only open to the ELB.

- Deploy an EC2 Instance using the previously created Key Pair and Security Group. Install an Apache HTTP server and make sure it is configured to start automatically. Create a test page called /ping.html at the web server root folder. This page can print out any text you like; its only mission is to be present. An HTTP 200 is OK and anything else is KO.
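On an Amazon Linux style AMI, that step could look like this (a sketch: package and service names assume yum/chkconfig distributions and may differ on yours):

```shell
# Install Apache and make it start automatically on boot
yum -y install httpd
chkconfig httpd on

# The health-check page: the content doesn't matter, only the HTTP 200
echo "pong" > /var/www/html/ping.html

service httpd start

# Quick local check: the first line should read "HTTP/1.1 200 OK"
curl -sI http://localhost/ping.html | head -1
```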

- In this exercise we will add to our custom Linux AMI a script and a crontab configuration to create a Custom CloudWatch Metric. We will use what we learned in this previous post.
Once you have the Apache HTTP server installed and mod_status configured following that previous post's instructions, copy this new script version:

It is similar to the one used before, but now we collect just one metric (instead of two) and we store it under a common CloudWatch Name Space. All instances involved in this Auto Scaling exercise will store their BusyWorkers values under the same Name Space and Metric Name. In my example the Name Space will be "AS:grupoprueba" and the Metric Name "httpd-busyworkers".

- Create a crontab configuration to execute this script every 5 minutes.
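The collector script and its crontab entry could be sketched as follows (assumptions: mod_status is reachable at /server-status?auto, the old CloudWatch command line tools are installed with credentials configured, and the script path is hypothetical). Note that no per-instance dimension is added, so all cluster members feed the same single metric:

```shell
#!/bin/sh
# cw-busyworkers.sh -- push Apache BusyWorkers to the shared custom metric.
# Sketch only: mon-put-data needs the CloudWatch tools configured, and the
# Name Space / Metric Name reuse this post's examples.

# The machine-readable status page prints a line like "BusyWorkers: 5"
BUSY=$(curl -s "http://localhost/server-status?auto" \
       | awk -F': ' '/^BusyWorkers/ {print $2}')

mon-put-data --metric-name httpd-busyworkers \
             --namespace   "AS:grupoprueba" \
             --value       "$BUSY"

# crontab entry (crontab -e), every 5 minutes; path is a placeholder:
# */5 * * * * /root/bin/cw-busyworkers.sh
```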

- Create your Custom AMI from the previously created temporary instance, and terminate that instance when finished.

- Deploy a new instance using the recently created AMI (in my example "ami-0e5ee467") to test the Apache server and the script. Check that the HTTP Server starts automatically.

- Manually add the recently created instance under the ELB. Verify that the Load Balancer check works and gives you the status "In Service" for this instance. Verify that the /ping.html page can be accessed from the Internet using a browser and the ELB public DNS name ("http://(your-ELB-DNS-name)/ping.html").

- Verify that the script executes every 5 minutes (following the previous instructions) and that CloudWatch is storing the new metric. You can check that either using the CloudWatch console or the command line:
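For example, something along these lines (syntax from the 2012-era CloudWatch command line tools, and the GNU date expression is an assumption; verify with mon-get-stats --help):

```shell
# Last hour of 5-minute averages for the custom metric
mon-get-stats httpd-busyworkers \
    --namespace  "AS:grupoprueba" \
    --statistics "Average" \
    --period     300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)"
```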

With as-create-launch-config we define the Instance configuration we will be using in our Auto Scaling Group: Launch Config name, AMI ID, Instance Type, Advanced Monitoring (1-minute monitoring) disabled, Security Group and Key Pair to use.
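As a sketch, reusing this post's example names (flag names are from the old Auto Scaling command line tools; verify with as-create-launch-config --help):

```shell
# Launch config: which AMI, instance type, SG and key pair AS should use
as-create-launch-config config-prueba \
    --image-id      ami-0e5ee467 \
    --instance-type m1.small \
    --monitoring-disabled \
    --group         wed-servers \
    --key           juankeys
```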

With as-create-auto-scaling-group we define the group itself: Group Name, Launch Config to use, AZs to deploy in, the minimum number of running instances that our application needs, the maximum number of instances we want to scale up to, ELB name, the Health Check type set to ELB (by default it is the EC2 System Status) and the grace period (in seconds) granted to an instance after launch before it is health-checked.
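Again as a sketch with this post's example names (same caveat about the old command line tools):

```shell
# The group: start empty (min 0), allow growth up to 3 instances,
# register members in the ELB and use its ping.html check as health check
as-create-auto-scaling-group grupo-prueba \
    --launch-configuration config-prueba \
    --availability-zones   us-east-1a \
    --min-size 0 \
    --max-size 3 \
    --load-balancers    elb-prueba \
    --health-check-type ELB \
    --grace-period      300
```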

With as-put-scaling-policy we create a Policy called "scale-up-prueba" for the previously created AS Group. When triggered, it will increase the AS Group by one unit (one instance). No other AS activities for this Group are allowed until 300 seconds pass. After this successful API call an ARN identifier is returned. Save it, because we will need it for the Alarm definition.
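A sketch of that call (old Auto Scaling command line tools syntax; the command prints the policy ARN on success):

```shell
# Add one instance when triggered; 300 s cooldown between AS activities
as-put-scaling-policy scale-up-prueba \
    --auto-scaling-group grupo-prueba \
    --adjustment=1 \
    --type     ChangeInCapacity \
    --cooldown 300
```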

With mon-put-metric-alarm we create a new CloudWatch alarm called "scale-up-alarm" that will be triggered when the average over the last 10 minutes of all the values of "httpd-busyworkers" is bigger than 10. Then the scale-up policy will be executed through the ARN identifier. In this example, each Apache server with no external load has an average of 5 BusyWorkers, so a good way to test it is to define a threshold of 10 to increase our cluster capacity. In a real-world configuration those values will be very different and you have to tune them to match your application.
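For example (a sketch: the 10-minute window is expressed here as two 5-minute periods, and the policy ARN is a placeholder for the one returned by as-put-scaling-policy):

```shell
# Alarm when the 10-minute average of httpd-busyworkers exceeds 10
mon-put-metric-alarm scale-up-alarm \
    --metric-name         httpd-busyworkers \
    --namespace           "AS:grupoprueba" \
    --statistic           Average \
    --period              300 \
    --evaluation-periods  2 \
    --threshold           10 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions       "arn:aws:autoscaling:...:policyName/scale-up-prueba"
```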

The same way we did with the scale-up alarm, we create a new one to trigger the scale-down process. The configuration is the same, but now the threshold is fewer than 9 Apache BusyWorkers over 10 or more minutes.
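The scale-down twin could look like this (same sketch caveats; the policy name matches the "scale-down-prueba" seen later in the scaling activity logs):

```shell
# Policy: remove one instance when triggered
as-put-scaling-policy scale-down-prueba \
    --auto-scaling-group grupo-prueba \
    "--adjustment=-1" \
    --type     ChangeInCapacity \
    --cooldown 300

# Alarm: 10-minute average of BusyWorkers below 9
mon-put-metric-alarm scale-down-alarm \
    --metric-name         httpd-busyworkers \
    --namespace           "AS:grupoprueba" \
    --statistic           Average \
    --period              300 \
    --evaluation-periods  2 \
    --threshold           9 \
    --comparison-operator LessThanThreshold \
    --alarm-actions       "arn:aws:autoscaling:...:policyName/scale-down-prueba"
```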

Note: By default all the API calls are sent to the us-east-1 Region (N. Virginia).

Describe:

We use "as-describe-" commands to read the result of our last configuration. Special attention to as-describe-auto-scaling-instances:

# as-describe-auto-scaling-instances --headers
No instances found

This command gives us a quick look at the running instances within our AS Groups. This is very useful when dealing with AS to find out how many instances are running and their state. Right now the result is "No instances found", and this is correct: our current configuration says that zero is the minimum number of healthy instances our application needs to work.

Under normal circumstances the "scale-down-alarm" will be in the "Alarm" state, and this is normal.
Using the CloudWatch Console you can add to these alarms an action that sends an Email notification, to get better visibility during the test.

Bring it to Production:

Now the cluster is idle, with no instances running. So now we will tell AS that our application requires a minimum of 1 healthy instance to run:
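That is a one-liner with the same tool used later in this post to stop the experiment:

```shell
# Raise the floor to one instance; AS launches one to match the new minimum
as-update-auto-scaling-group grupo-prueba --min-size 1
```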

Notice that Minimum is now 1 in the AS configuration and there is a new instance under our AS Group ("i-9d022be1" in this example). This instance has been automatically deployed by AS to match the desired number of healthy instances for our application. Notice the "Pending" status, which means it is still in the initialization process. We can follow this process with as-describe-auto-scaling-instances:

Now the recently launched instance is in service. That means its Health Check (the ELB ping.html test page) verifies OK. If you open the AWS Console and read the current ELB "Instances" tab, the new instance ID should be there, automatically added to the Load Balancer, with your application up and running.

Common problem scenarios:
- If you observe that new instances are constantly Launched and Terminated by AS, this probably means that the /ping.html page fails. Stop the experiment with "as-update-auto-scaling-group grupo-prueba --min-size 0" and verify your components.
- If your web server and test page verify OK but AS is still deploying and terminating the instances without a chance to reach the Healthy status, then you should increase the value of "--grace-period" in the AS Group definition to give your AMI more time to start and initialize its services.
- If the instances start but fail to be automatically added to the ELB, then the Instances are probably deployed in an incorrect Availability Zone. Either correct your AS Launch Configuration or expand the ELB to the rest of the AZs in your Region.

Force to Scale UP:

To test the AS Policy we can lie to CloudWatch and tell it that we have much more load than we really have. We will inject a fake amount of BusyWorkers into the CW Metric:
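A sketch of the injection (same mon-put-data call the collector script uses, just with an exaggerated value; repeat it a few times so the 10-minute average crosses the threshold of 10):

```shell
# Push a fake, high BusyWorkers sample into the shared metric
mon-put-data --metric-name httpd-busyworkers \
             --namespace   "AS:grupoprueba" \
             --value       100
```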

And after a while, the average BusyWorkers value rises, and this triggers the scale-up Alarm and then its AS Policy:

# as-describe-scaling-activities --headers --show-long
ACTIVITY,135c95fa-8d67-4664-85e4-5d78dfb73353,2012-11-05T16:25:13Z,grupo-prueba,Successful,(nil),"At 2012-11-05T16:24:14Z a monitor alarm scale-up-alarm in state ALARM triggered policy scale-up-prueba changing the desired capacity from 1 to 2. At 2012-11-05T16:24:27Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2.",100,Launching a new EC2 instance: i-ebeac397,(nil),2012-11-05T16:24:27.687Z

If we keep feeding CloudWatch with fake values and we keep the average high, soon a third instance will be launched:

# as-describe-scaling-activities --headers --show-long
ACTIVITY,ACTIVITY-ID,END-TIME,GROUP-NAME,CODE,MESSAGE,CAUSE,PROGRESS,DESCRIPTION,UPDATE-TIME,START-TIME
ACTIVITY,ef187965-9a79-463f-8a2d-b6f413cc9226,2012-11-05T16:31:11Z,grupo-prueba,Successful,(nil),"At 2012-11-05T16:30:14Z a monitor alarm scale-up-alarm in state ALARM triggered policy scale-up-prueba changing the desired capacity from 2 to 3. At 2012-11-05T16:30:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 2 to 3.",100,Launching a new EC2 instance: i-99e4cde5,(nil),2012-11-05T16:30:30.795Z

# as-describe-scaling-activities --headers --show-long
ACTIVITY,ACTIVITY-ID,END-TIME,GROUP-NAME,CODE,MESSAGE,CAUSE,PROGRESS,DESCRIPTION,UPDATE-TIME,START-TIME
ACTIVITY,7095a10e-d7b7-4e68-a1c9-cb350e8b0d45,2012-11-05T16:45:03Z,grupo-prueba,Successful,(nil),"At 2012-11-05T16:43:48Z a monitor alarm scale-down-alarm in state ALARM triggered policy scale-down-prueba changing the desired capacity from 3 to 2. At 2012-11-05T16:44:04Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 3 to 2. At 2012-11-05T16:44:04Z instance i-9d022be1 was selected for termination.",100,Terminating EC2 instance: i-9d022be1,(nil),2012-11-05T16:44:04.106Z

# as-describe-auto-scaling-instances
INSTANCE i-99e4cde5 grupo-prueba us-east-1a InService HEALTHY config-prueba
INSTANCE i-ebeac397 grupo-prueba us-east-1a InService HEALTHY config-prueba

# as-describe-scaling-activities --headers --show-long
ACTIVITY,ACTIVITY-ID,END-TIME,GROUP-NAME,CODE,MESSAGE,CAUSE,PROGRESS,DESCRIPTION,UPDATE-TIME,START-TIME
ACTIVITY,31e8673e-7255-410e-b8a7-51ee677f2bb8,(nil),grupo-prueba,InProgress,(nil),"At 2012-11-05T16:50:23Z a monitor alarm scale-down-alarm in state ALARM triggered policy scale-down-prueba changing the desired capacity from 2 to 1. At 2012-11-05T16:50:35Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 2 to 1. At 2012-11-05T16:50:35Z instance i-ebeac397 was selected for termination.",50,Terminating EC2 instance: i-ebeac397,(nil),2012-11-05T16:50:35.538Z

We have learned something here: an instance in an AS environment is volatile. It could disappear at any time because it is Terminated, and its EBS volumes go with it. You have to take that into account when designing your infrastructure. If your web server needs to store information that you may need later, you should save it elsewhere: CloudWatch, an external log server, a database, etc.

Also notice that the surviving instance is i-99e4cde5, the last one that was deployed, while the first one to be terminated during the shrinking process was the first member of the group. Auto Scaling uses that logic to help you get more value for your money: EC2 bills you the full hour, so leaving the last launched instance alive gives you a chance to use what you've already paid for.

Average of what?

The Policy used in this example is not a perfect method, and this Average Metric is a bit confusing. First we have to know that the Average CPU used in the official documentation for Auto Scaling is a native CloudWatch metric. It is automatically created when you define your AS Group: EC2 takes the CPU usage of all Instances in your AS Group and stores the Average value there (it does the same with other EC2 metrics: CW Console -> All Metrics pull-down menu -> "EC2: Aggregated by Auto Scaling Group"). An elegant method would be to do the same kind of aggregation with our custom metric, but I don't know how to do that. So what we have is a single metric name receiving all those different values from our cluster members. It is therefore important that all those members send their values in a timely fashion, so as not to distort the average calculation. I think that a "crontab */5 * * * *" is a good solution, but I'm quite open to other suggestions.

The ELB role:

By default the Load Balancer will send an equal amount of connections to the web cluster members, and therefore the number of Apache BusyWorkers will remain "balanced" across the cluster. The configuration described here is not useful when using "sticky sessions": if one web server increases its connections above the other cluster members, it could trigger an unnecessary scale-up action.

Cleaning:

You don't want an AS Group doing things while you sleep, so I suggest you delete all your AS configurations after your test is done.