All things DevOps

Category: How To’s

These days the pace of innovation in DevOps can leave you feeling like you’re jogging on a treadmill programmed to run faster than Usain Bolt. Mastery requires hours of practice and the last decade in DevOps has not allowed for it. Before gaining 10 years of experience running virtual machines using VmWare in private data-centers, private cloud software like Openstack and Cloudstack came along, and just when you and your team painfully achieved a stable install you were told running virtual machines in public clouds like AWS, GCP, and Azure is the way forward. By the time you got there it was time to switch to containers, and before you can fully appreciate those, server-less functions are on the horizon, but I digress. If you want to know more about server-less functions, see my previous article on AWS Lambda. Instead, this article will focus on running Docker containers inside of a Kubernetes cluster on Google’s Cloud Platform.

Linux Containers, which were recently popularized by Docker need something to help manage them and while there are many choices, Kubernetes the open-sourced container management system from Google is the undisputed king at this time. Given that Kubernetes was started by Google, it should be expected that the easiest way to install it is using Google’s Cloud Platform (GCP). However, Openshift from Redhat also provides a nice batteries included abstraction if you need to get up and running quickly as well as kops.

Pre-Requisites

In addition, you need some form of a computer with Internet connectivity, some typing skills, a brain that can read, and a determination to finish…For now I will give you the benefit of the doubt and assume you have all of these. It is also nice to have your beverage of choice while you do this, a fine tea, ice cold beer, or glass of wine will work, but for Cancer’s sake please skip the sugar.

Here is where I would normally insert a link to facts on sugar and Cancer’s link, but I literally just learned I would be spreading rumors… Fine drink your Kool-Aid, but don’t blame me for your calories.

The Build Out of our Self Healing IRC Server Hosting Containers

I lied dude, IRC is so 1995 and unfortunately, ICQ’s been dead and Slack won’t let me host their sexy chat application with game like spirit and better jokes than Kevin Heart. So…sorry to excite you… but I guess I will fallback to the docs here and install Nginx like us newb’s are supposed to.

After you are done following these inspirational leaders in the community go to youtube and watch every Kelsey Hightower video you can find. Kelsey Hightower is perhaps the tech communities best presenter and no one has done more to educate and bring Kubernetes to the mainstream than Kelsey. So a quick shout out and thank you to Kelsey for his contributions to the community. In his honor here are two of my favorite videos from Kelsey. [ one ] [ two ].

AWS Lambda functions are a great way to run some code on a trigger/schedule without needing a whole server dedicated to it. They can be cost effective, but be careful depending on how long they run, and the number of executions per hour, they can be quite costly as well.

For my use case, I wanted to create snapshot backups of EBS volumes for a Mongo Database every day. I originally implemented this using only CloudWatch, which is a monitoring service, but because it’s focused on scheduling, AWS also uses it for other things that require scheduling/cron like features. Unfortunately, the CloudWatch implementation of snapshot backups was very limited. I could not ‘tag’ the backups, which was certainly something I needed for easy finding and cleanups later (past a retention period).

Anyway, there were a couple pitfalls I ran into when creating this function.

Pitfalls

Make sure you security group allows you to communicate to the Internet for any AWS API’s you need to talk to.

Make sure your time-out is set to 1 minute or greater depending on your use case. The default is seconds, and is likely not high enough.

“The Lambda function execution role must have permissions to create, describe and delete ENIs. AWS Lambda provides a permissions policy, AWSLambdaVPCAccessExecutionRole, with permissions for the necessary EC2 actions (ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, and ec2:DeleteNetworkInterface) that you can use when creating a role”

Personally, I did inline permissions and included the specific actions.

Upload your zip file and make sure your handler section is configured with the exact file_name.method_in_your_code_for_the_handler

Also this one is more of an FYI, Lambda Function have a maximum TTL of 5 minutes ( 300 seconds).

I think that was it, after that everything worked fine. To finish this short article off, screenshots and the code!

Spinnaker is a tool created by Netflix (of whom I have always been a big fan) that succeeded Asgard a tool I used in my past at PayPal. Not to digress but my favorite companies when it comes to DevOps tools are Hashicorp and Netflix. Obviously, that’s a bit of apples and oranges there, but they both make solid DevOps tools…Moving on…

Here is a quick overview on Spinnaker’s GUI

If you are familiar with Jenkins, then Spinnaker will make a lot of sense to you. Spinnaker is all about configuring a pipeline with stages to automate/orchestrate a number of steps with regards to ‘continously’ deploying your application code to an environment. Spinnaker puts the CD in CI/CD 😉 Corny, but had to say it…

Moving on…

To use Spinnaker effectively, you need to use Jenkins with it. Jenkins is responsible for your git / bake phase, where code is downloaded and then launched on a VM in your environment i.e. AWS. At the end of that launch assuming everything works out, a snapshot is taken and an AMI is created to be use in subsequent steps for install. A typical pipeline looks like this..

Grab the latest checkin from Git repo

Build a package rpm/deb of your application/code (+ dependencies)

Install the package above, aka Bake the code on a VM in AWS, then take a snapshot ( create AMI)

Subsequently deploy your code to compute / includes LB setup if there is any.

Should also mention Spinnaker automatically sets up ASG (auto-scaling groups) which is a nice feature ( if a machine dies its re-created based on the capacity you set/require)

And finally if you need a cleanup task such as “destroy the boxes running the old code” that runs.

Visually pipeline configuration, it looks like this…

Notice the shrink cluster step. This is one of Spinnaker’s built in stages that you can use. However, there is a better way to handle this vs. manually creating the ‘cleanup’ phases after a deploy. You can instead employ what are called strategies…

For example, if you click Deploy in the pipeline to go to the configuration for that…. and then click on edit under a Server Group you have created under Deploy Configuration…

You will get a new window popup that looks like this… .

As you can see here I have clicked into Strategy, and am currently using Highlander which states ‘Destroys all previous server groups in the cluster as soon as new server group passes health checks’.

The highlander strategy is extremely useful for rolling out new builds. Essentially your old code will run, until your new code is healthy at which point the old server groups is destroyed. Assuming the new build is healthy (i.e. your health checks and all previous build tests etc have good coverage) you should be good to go to the shortest amount of time possible. I tried all sorts of customizations to my pipeline to emulate the above behavior without using the strategy and found that the highlander strategy is the fastest.

Anyway, your use cases may be different depending on the type of pipeline you are configuring. So take some time to familiarize yourself with the availability strategies depending on your specific use case.

Now I kinda jumped ahead a few chapters, but I did that to make the point quick before you lose interest, that Spinnaker will deploy your code, it will make sure your nodes are always running, and it can do this all automatically once configured…

What I mean is if we go backwards now to the original configuration step let’s look at your Configuration step in the pipeline…

Under the section called “Automated Triggers” I have added some configuration to listen to a Jenkins server for changes on a Job that I have defined in Jenkins. I am not going to go into much detail on Jenkins because this article would be too longer, but just to show this job in Jenkins very quickly for a full understanding.

This is a screenshot of a job that polls Git every minute for changes. So with these configured, Spinnaker will Listen for changes to our specified repo via Jenkins. Subsequently, if changes are detected Spinnaker will proceed with the rest of the pipeline configuration. The next step is Build.

The first step ‘Configuration’ detects changes to our git repo. The second step Build, turns those changes into an installable package again utilizing Jenkins to do this. Here we tell spinnaker to run our build job in Jenkins…let’s take a look at that in Jenkins…first we do our git configuration, and nothing under build triggers.

Then if necessary we can inject files during build time for packaging..

To make use of the injected files we must copy them from a variable to the host using a build step to execute a shell…

We then run our actual packaging script that turns it into an rpm/deb etc. Mine is called ‘package-collector.sh’ and looks like htis…

And finally there is magic… The last line of our executed shell build step uploads our package to S3….

Next Spinnaker goes to Bake, this is the step where the package gets installed to a VM “baked” and then snapshotted and turned to an AMI for the Deploy step.

The Bake step in Spinnaker is the most straight forward…

Essentially, as long as your package name ‘ciscollector’ as shown here matches the base name of the package you are creating, Spinnaker (is already configured to look in our S3 bucket) will find it and install it on a VM, at the end it will snapshot that VM and created an AMI to use to install in the subsequent Deploy stage of the pipeline.

Finally, we are back to the Deploy step. The deploy step depends on the bake step, just like the bake step depends on build and build depends on Configuration. This is how we create the workflow of our pipeline using ‘depends on’ ( sorry for not explaining that sooner ). So when Build completes successfully we start our Deploy step, which as shown before, has a server group created. Of course you will not, so you must create a new server group, which is how Spinnaker manages/groups servers & load balancers for deployments.

The deploy steps requires that you fill out the following sections with your specific configuration…

As discussed previously under Basic Settings it is best to pick a strategy as part of the Deploy stage rather then trying to do many custom/complicated things yourself. So rather then screenshot through this… it’s pretty self-explanatory to configured Load Balancers, Security Groups, Instance Type etc… The one thing I would highlight is the capacity section and also tagging under Advanced Settings.

The capacity section is important for 2 reasons. First whatever you set for number of instances, turns into an auto-scaling group requirement, thus whatever you set there, if a node is killed or lost for any reason, your auto-scaling group in AWS will make sure you always have that number of nodes running. This is helpful if nodes are dying for whatever reason, although not super helpful if they continually die quickly due to misconfiguration, etc. The second part about consider deployments successful when % of instances are healthy is very important for the speed of your pipeline. Let’s say you have a pool of 16 servers… if all those servers run the same image, the chances that you need to wait until 100% here are slim, for example lets say you run 4 servers, and you only need 3 servers to service 100% of capacity for request. In this case it would be acceptable to move on from this Deploy step at 75% capacity, because waiting for the last 25% isn’t really necessary to service request and you are 99% sure that last host is going to come up. So feel free to tweak.

Last as mentioned, make sure you take the opportunity to tag things in the Advanced Settings sections.

Another highlight, is this is where you would inject userdata to Ec2 instances (under advanced settings). This is helpful when you want to past at-boot-time configuration to a system. For example, you might want to override a configuration file at boot time and say something like if I get this value in user data, override the config, if I don’t use the config. Just remember to pass your userdata as base64.

I want to leave you with some final comments. Spinnaker is a great deployment tool. It is helpful to use the pipeline/stage workflow for reproducing deployments, however, there are many limitations that Spinnaker has, and it will always lag on feature set parity with providers like AWS. As an example, the ALB support is quite limited at the time of this writing. So for that I had to add a custom Jenkins step that runs a script on the Jenkins server to manually add nodes to target groups for an ALB. If you are interested or need more details on that solution email me at tuxninja@tuxlabs.com

I hope you found this brief overview on Spinnaker useful. It’s a great tool that can be used to easily reproduce application deployments.

I have been a Hashicorp fan boy for a couple of years now. I am impressed, and happy with pretty much everything they have done from Vagrant to Consul and more. In short they make the DevOps world a better place. That being said this article is about the aptly named Terraform product. Here is how Hashicorp describes Terraform in their own words…

“Terraform enables you to safely and predictably create, change, and improve production infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.”

Interestingly, enough it doesn’t point out, but in a way implies it by omitting anything about providers that Terraform is multi-cloud (or cloud agnostic). Terraform works with AWS, GCP, Azure and Openstack. In this article we will be covering how to use Terraform with AWS.

Terraform Configuration Files Overview

When you execute the terraform commands, you should be within a directory containing terraform configuration files. Those files ending in a ‘.tf’ extension will be loaded by Terraform in alphabetical order.

Most users of Terraform choose to make their configs modular resulting in multiple .tf files with names like main.tf, variables.tf and data.tf… This is not required…you can choose to put everything in one big Terraform file, but I caution you modularity/compartmentalization is always a better approach than one monolithic file. Let’s take a look at our main.tf

Main.tf

Typically if you only see one terraform config file, it is called main.tf, most commonly there will be at least one other file called variables.tf used specifically for providing values for variables used in other TF files such as main.tf. Let’s take a look at our main.tf file section by section.

Provider

The provider keyword is used to identify the platform (cloud) you will be talking to, whether it is AWS, or another cloud. In our case it is AWS, and define three configuration items, an access key, secret key, and a region all of which we are inserting variables for, which will later be looked up / translated into real values in variables.tf

1

2

3

4

5

provider"aws"{

access_key="${var.aws_access_key_id}"

secret_key="${var.aws_secret_access_key}"

region="${var.aws_region}"

}

Resource aws_instance

This section defines a resource, which is an “aws_instance” that we are calling “jump_box”. We define all configuration requirements for that instance. Substituting variable names where necessary, and in some cases we just hard code the value. Notice we are attaching two security groups to the instance to allow for ICMP & SSH. We are also tagging our instance, which is critical an AWS environment so your administrators/teammates have some idea about the machine that is spun up and what it is used for.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

resource"aws_instance""jump_box"{

ami="${lookup(var.ami_base, "centos-7")}"

instance_type="t2.medium"

key_name="${var.key_name}"

vpc_security_group_ids=[

"${lookup(var.security-groups, "allow-icmp-from-home")}",

"${lookup(var.security-groups, "allow-ssh-from-home")}"

]

subnet_id="${element(var.subnets_private, 0)}"

root_block_device{

volume_size=8

volume_type="standard"

}

user_data=<<-EOF

#!/bin/bash

yum-yupdate

EOF

tags={

Name="${var.instance_name_prefix}-jump"

ApplicationName="jump-box"

ApplicationRole="ops"

Cluster="${var.tags["Cluster"]}"

Environment="${var.tags["Environment"]}"

Project="${var.tags["Project"]}"

BusinessUnit="${var.tags["BusinessUnit"]}"

OwnerEmail="${var.tags["OwnerEmail"]}"

SupportEmail="${var.tags["SupportEmail"]}"

}

}

Btw, the this resource type is provided by an AWS module found here https://www.terraform.io/docs/providers/aws/r/instance.html

You have to download the module using terraform get (which can be done once you write some config files and type terraform get 🙂 ).

Also, note the usage of ‘user_data’ here to update the machines packages at boot time. This is an AWS feature that is exposed through the AWS module in terraform.

Resource aws_security_group

Next we define a new security group (vs attaching an existing one in the above section). We are creating this new security group for other VM’s in the environment to attach later, such that it can be used to allow SSH from the Jump host to the VM’s in the environment.

Also notice under cidr_blocks we define a single IP address a /32 of our jump host…but more important is to notice how we determine that jump hosts IP address. Using .private_ip to access the attribute of the “jump_box” aws_instance we are creating/just created in AWS. That is pretty cool.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

resource"aws_security_group""jump_box_sg"{

name="${var.instance_name_prefix}-allow-ssh-from-jumphost"

description="Allow SSH from the jump host"

vpc_id="${var.vpc_id}"

ingress{

from_port=22

to_port=22

protocol="tcp"

cidr_blocks=["${aws_instance.jump_box.private_ip}/32"]

}

tags="${var.tags_infra_default}"

}

Resource aws_route53_record

The last entry in our main.tf creates a DNS entry for our jump host in Route53. Again notice we are specifying a name of jump. has a prefix to an entry, but the remainder of the FQDN is figured out by the lookup command. The lookup command is used to lookup values inside of a map. In this case the map is defined in our variables.tf that we will review next.

1

2

3

4

5

6

7

resource"aws_route53_record""jump_box_dns"{

zone_id="${lookup(var.route53_zone, "id")}"

type="A"

ttl="300"

name="jump.${lookup(var.route53_zone, "name")}"

records=["${aws_instance.jump_box.private_ip}"]

}

Variables.tf

I will attempt to match the section structure I used above for main.tf when explaining the variables in variables.tf though it is not really in as clear of a layout using sections.

Provider variables

When terraform is run it compiles all .tf files, and replaces any key that equals a variable, with the value it finds listed in the variables.tf file (in our case) with the variable keyword as a prefix. Notice that the first two variables are empty, they have no value defined. Why ? Terraform supports taking input at runtime, by leaving these values blank, Terraform will prompt us for the values. Region is pretty straight forward, default is the value returned and description in this case is really an unused value except as a comment.

1

2

3

4

5

6

7

variable"aws_access_key_id"{}

variable"aws_secret_access_key"{}

variable"aws_region"{

description="AWS region to create resources in"

default="us-east-1"

}

I would like to demonstrate the behavior of Terraform as described above, when the variables are left empty

1

2

3

4

5

6

7

8

9

10

11

12

➜jump terraform plan

var.aws_access_key_id

Enteravalue:aaaa

var.aws_secret_access_key

Enteravalue:bbbb

Refreshing Terraform state in-memory prior toplan...

The refreshed state will be used tocalculate thisplan,but

will notbe persisted tolocal orremote state storage.

...

At this phase you would enter your AWS key info, and terraform would ‘plan’ out your deployment. Meaning it would run through your configs, and print to your screen it’s plan, but not actually change any state in AWS. That is the difference between Terraform plan & Terraform apply.

Resource aws_instance variables

Here we define the values of our AMI, SSH Key, Instance prefix for the name, Tags, security groups, and subnets. Again this should be pretty straight forward, no magic here, just the use of string variables & maps where necessary.

It’s important to note the variables above are also used in other sections as needed, such as the aws_security_group section in main.tf …

Resource aws_route53_record variables

Here we define the ID & Name that are used in the ‘lookup’ functionality from our main.tf Route53 section above.

1

2

3

4

5

6

7

variable"route53_zone"{

description="Route53 zone used for DNS records"

default={

id="Z1ME2RCUVBYEW2"

name="tuxlabs.com"

}

}

It’s important to note Terraform or TF files do not care when or where things are loaded. All files are loaded and variables require no specific order consistent with any other part of the configuration. All that is required is that for each variable you try to insert a value for, it has a value listed via the variable keyword in a TF file somewhere.

Output.tf

Again, I want to remind folks you can put these terraform syntax in one file if you wanted to, but I choose to split things up for readability and simplicity. So we have an output.tf file specifically for the output command, there is only one command, which lists the results of our terraform configurations upon success.

Congratulations, you now have a jump box in AWS using Terraform. Remember to attach the required security group to each machine you want to grant access to, and start locking down your jump box / bastion and VM’s.

Outro

Remember if you take the above config and try to run it, swapping out only the variables it will error something about a required module. Downloading the required modules is as simple as typing ‘terraform get‘ , which I believe the error message even tells you 🙂

So again this was a brief intro to Terraform it does a lot & is extremely powerful. One of the thing I did when setting up a Mongo cluster using Terraform, was to take advantage of a map to change node count per region. So if you wanted to deploy a different number of instances in different regions, your config might look something like…

main.tf

1

count="${var.region_instance_count[var.region_name]}"

variables.tf

1

2

3

4

5

6

7

8

9

variable"region_instance_count"{

type="map"

default={

us-east-1=2

us-west-2=1

eu-central-1=1

eu-west-1=1

}

}

It also supports split if you want to multi-value a string variable.

Another couple things before I forget, Terraform apply, doesn’t just set up new infrastructure, it also can be used to modify existing infrastructure, which is a very powerful feature. So if you have something deployed and want to make a change, terraform apply is your friend.

And finally, when you are done with the infrastructure you spun up or it’s time to bring her down… ‘terraform destroy’

My team and I are building a CMDB for AWS, which provides us with everything happening in our AWS environment + OS level metadata + change history. There will be a separate article on the CMDB journey, but today I want to focus on a specific service in AWS called S3, which is their object store. S3 is a bit of a special snowflake when it comes to AWS services and because of that I ran into challenges structuring my code, because up until S3 (which was the last service I wrote code for) everything had been very similar, and easily modularized. We will get to more detail, but let’s start this article by covering how to use the Go SDK for AWS.

Dependencies

This article assumes you already program in Go and have Go installed on your machine. To get started you will need a couple additional items.

This enables you to interact with AWS API’s on the fly so you can understand the output of commands as you are searching for what you are trying to accomplish.

The Collector Structure

The collector is the component in my CMDB architecture that does all the work of collecting the metadata that we shove into our CMDB. The collector is heavily threaded using go routines for performance. The basic structure looks like this.

Call a go routine for each service you want to collect

//pass in all accounts, regions (from config-file) and pre-established awsSessions to each account you are collecting

Store the response (I loop through mine and store it in a map using the resource-id as the key)

Finally, kick off another go routine to write to our API and store the data.

Ok, so hopefully that seems straight forward as a basic structure…let’s get to why S3 through me for a loop.

S3 Challenges

It will be best if I show you what I tried first, basically I tried to marry my existing pattern to S3 and that certainly was a bad idea from the start. Here was the structure of the S3 part of the code.

The S3 go routine gets called from main.go

//all accounts, regions and AWS Sessions are past into the next go routine

Inside of the S3 go routine, loop over accounts & regions

Launch a go routine for each account & region

Inside of those go routines List S3 Buckets

For each S3 buckets returned

Call additional API’s such as GetBucketTagging()

Ok so what happened ? I got a lot of errors that’s what 🙂 Ones like this….

Ah ok, the BucketList being returned on an AWS Session established with a specific account and region, ignores the region. Because S3 Buckets are global to an account, thus all buckets under an account are returned in the ListBuckets() call. I knew S3 buckets were global per account, but failed to expect a matching behavior/output when a specific region is passed into the SDK/API.

Ok so how then can I distinguish where a bucket actually lives?

As spfink says above, I needed to run GetBucketLocation() per bucket. Thus my code structure started to look like this…

For each account, region

ListBuckets

For each bucket returned in that account, region

GetBucketLocation

If a LocationConstraint (region) is returned, set the new region (otherwise if response is null, do nothing)

Get tags for the bucket in account, region

With this code I was still getting errors about region, but why ?

Well I made the mistake of thinking a ‘null’ response from the API for LocationConstraint had no meaning (or meant query it from any region), wrong (null actually means us-east-1 see from my google below) thus the IF condition evaluated false and the existing region from the outer loop was used because GetBucketLocation() returned null and this resulted in many errors.

Forbuckets located inUS Standard,the location constraint will be null.ForS3,here isthe list of region names with their corresponding regions:http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. Notice that the location constraint is none for US Standard.

The Working Code

collector/main.go fires off a bunch of go routines per service we are collecting for.

It passes in accounts, and regions from a config file.

For the S3 service/file under the ‘services’ package the entry point is a function called StoreS3Resources.

Everything in the code should be self explanatory from that point on. You will note a function call to ‘writeToCis’… CIS is the name of our internal CMDB project/service. Again, I will later be blogging about the entire system in detail once we open source the code. Please keep in mind this code is MVP, it will be changed a lot (optimization, modularized, bug fixes, etc) before & after we open source it, but for now he is the quick and dirty, but hopefully functional code 🙂 Use at your own risk !