Come and have a go, if you think you're nerd enough

I’ve recently had some issues where I’ve had to investigate the AWS API usage on one of our accounts. Enabling CloudTrail is a start, but all it does is shove a load of gzipped JSON files into an S3 bucket, which is no use if you actually want to make use of the data.

Enter Traildash, a self-contained ELK stack in a Docker image which will pull that bucket’s contents into Elasticsearch and display it usefully in a Kibana frontend.

“Docker?”, I thought to myself, “Doesn’t AWS have something that could do that?”. Of course it does. ECS, the EC2 Container Service, lets you run your own Docker cluster. So here’s how to set up Traildash on ECS.

First of all, follow the instructions in the Traildash readme to set up Traildash in AWS. This gets CloudTrail up and running and connected to SQS/SNS for pushing to Traildash. The one difference is the IAM role, which will be the ECS instance and service roles created as part of the cluster build below. If you already have running ECS instances, add the “SQS Full Access” and “S3 Read Only Access” managed policies to your ECS instance and ECS service IAM roles. If not, wait until your ECS cluster instances are built and add the policies afterwards.
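Attaching those two managed policies can also be scripted. A minimal boto3 sketch, assuming default role names (`ecsInstanceRole` and `ecsServiceRole` are placeholders; substitute whatever your cluster build created):

```python
# AWS managed policy ARNs for the two policies mentioned above.
POLICIES = [
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
]

# Hypothetical role names -- substitute the ones your ECS cluster build created.
ROLES = ["ecsInstanceRole", "ecsServiceRole"]

def attach_policies():
    """Attach each managed policy to each ECS role (needs IAM credentials)."""
    import boto3  # requires boto3 installed and valid credentials

    iam = boto3.client("iam")
    for role in ROLES:
        for arn in POLICIES:
            iam.attach_role_policy(RoleName=role, PolicyArn=arn)
```

Doing it in code means the same policies get attached consistently if you rebuild the cluster later.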

Next, create a task and a cluster in ECS.

If you’ve not used ECS before, follow the default setup wizard to the point where you have running cluster instances; otherwise you should be able to use your existing ECS configuration.

Create a new volume (the name doesn’t matter) and give it the source path /var/lib/elasticsearch/appliedtrust/traildash

Then add a container with the following settings:

Image: appliedtrust/traildash

Port mappings: Host: 7000 Container: 7000

Environment Variables:

AWS_SQS_URL <your SQS URL>

AWS_REGION <your AWS Region>

AWS_ACCESS_KEY_ID <your AWS key>

AWS_SECRET_ACCESS_KEY <your AWS secret>

Mount point:

Source volume: <your previously created volume>

Container path: /home/traildash
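Put together, the settings above come out roughly like this as an ECS task definition in JSON (the family, container and volume names, and the memory figure, are assumptions; the placeholder values stay as placeholders):

```json
{
  "family": "traildash",
  "volumes": [
    {
      "name": "traildash-data",
      "host": { "sourcePath": "/var/lib/elasticsearch/appliedtrust/traildash" }
    }
  ],
  "containerDefinitions": [
    {
      "name": "traildash",
      "image": "appliedtrust/traildash",
      "memory": 1024,
      "portMappings": [
        { "hostPort": 7000, "containerPort": 7000 }
      ],
      "environment": [
        { "name": "AWS_SQS_URL", "value": "<your SQS URL>" },
        { "name": "AWS_REGION", "value": "<your AWS Region>" },
        { "name": "AWS_ACCESS_KEY_ID", "value": "<your AWS key>" },
        { "name": "AWS_SECRET_ACCESS_KEY", "value": "<your AWS secret>" }
      ],
      "mountPoints": [
        { "sourceVolume": "traildash-data", "containerPath": "/home/traildash" }
      ]
    }
  ]
}
```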

Create the service with your task definition, a single task, and an ELB if you want one. (You may need to edit the instance/ELB security group to allow access on port 7000.)

Run the task and enjoy your CloudTrail-y goodness at <your instance or ELB URL>:7000/#/dashboard/file/default.json

It may take 10-15 minutes for your data to start to appear.

If CloudTrail has already been running in your account and the data has been building up for a while, Traildash provides a backfill script to get it into your dashboard. To use the backfill script, I changed it to use my AWS credentials profile name.
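The modified script itself isn’t reproduced here, but the gist of the change is selecting a named profile from ~/.aws/credentials instead of the default credential chain. In boto3 terms, a sketch (the profile name is a placeholder):

```python
def make_s3_client(profile_name="my-profile"):
    """Build an S3 client from a named profile in ~/.aws/credentials
    rather than relying on environment variables or the default chain."""
    import boto3  # requires boto3 installed

    session = boto3.Session(profile_name=profile_name)
    return session.client("s3")
```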

Transitioning from engineering to management/leadership is tough unless you understand the sacrifices you have to make. Rarely can you keep your hands in the guts of the engineering you used to know intimately, and you’ll have to do more paperwork. That said, it can be just as rewarding to see your team grow in expertise and experience because of your leadership.

Here are a few simple rules to start with:

If there’s an exciting fun task and a messy unpleasant task, assign the fun task to someone else and do the unpleasant task yourself.

If someone on your team wants to ask you a question, always make yourself available and absolutely pretend that you don’t mind being interrupted. But if you need to ask someone on your team a question, always ask first if it is a good time for them to talk and offer to come back later if they are in the middle of something.

If someone wants to try an approach that you think is wrong, say: “I’m not sure that’s the right approach, because of X, Y, and Z. However, I’ve been wrong before, so I might be wrong about this, too. How long will it take you to research this approach and see if it works out?” If you’re working on a tight schedule, this may not be practical, but if you want to develop good engineers in the long run, this can be beneficial for everyone.

Expect to do less actual engineering, but still keep on top of one or two components (DB, CM, CD, etc.) for up to a third of your time. This helps you keep an ear to the ground on ongoing work and communicate intelligently with the technical team.

Loftesness’ 90-day framework has three distinct stages: Own Your Education (Days 1-30), Find Your Rhythm (Days 31-60) and Assess Yourself (Days 61-90). It also helps with the decision to enter management in the first place.

Documentation tends to be a polarising, all-or-nothing topic in the Operations teams I have been a part of. Everyone agrees on its fundamental importance, but no one seems to want to spend the time or effort producing it, especially if they see no immediate benefit in it for themselves.

“My code is self documenting”

“It’s all in the readme”

“JFGI”

“That guy knows – he built it”

This, of course, becomes a real issue when bringing new staff into the team. Onboarding takes far longer when the majority of the information new hires need about the systems they will be administering has to be acquired piecemeal and anecdotally, from imperfect recollection, without the benefit of an index or search facility.

Storage, input, display and availability are also all contentious subtopics. People have their own favourite wiki flavours, editors and formatting techniques. Should the content be made available outside the team? Or to third parties e.g. vendors? Or even outside the company?

My current role had just such a problem. Documentation was in an awful state: exported in plain text from an old Trac wiki and partially converted to GitHub Markdown in a private repo, a disorganised, incomplete process started by an engineer who had since left the company.

Our Issues

Numerous problems were apparent when reviewing the old version of the docs:

Trac formatting, although similar to Markdown, was incompatible, so the content was mostly unreadable

Inter-document links were broken and unformatted

Images were not displaying properly

Much of the content was outdated, obsolete or plain wrong

How Much Doco Is Too Much?

Legend states that minimum viable documentation for a relatively complex software application should consist of:

How to install

How to create and ship a change

A Project roadmap

A Changelog

A Glossary – if necessary

How to troubleshoot or Where to get help

And for Open Source projects:

How to contribute

Much of this was lacking for nearly all the projects in our document repository.

So where do you start?

Start by being systematic. Some considerations:

Get some consensus on where the docs should live

Local file server? GitHub? Dedicated wiki instance?

Public or private?

Private to the team/group/department/company? Public to the world? Or by invitation e.g. username/pass?

Authentication

Will you need authentication to edit? Or to view?

Collaboration

Do you need group collaboration on documents? Or will a page be locked to one user?

The Winning

Fortunately the total amount of documentation we had to manage wasn’t too great, maybe a few hundred pages. So we settled on keeping the docs in GitHub and using a wiki front end for reading and searching.

First stop was MDwiki, a nice, simple single-file wiki front end. It is ideal for small amounts of documentation or single-project wiki sites, such as when the documentation is included in a GitHub repo. However, we had a significant number of pages in nested directories, and MDwiki has no search facility.

A few weeks ago I started looking at SaltStack, my current config management package of choice, as the central component of an open source, componentised monitoring package.

This is now up and running in rudimentary fashion. I have a scheduler state, applied to several machines in my estate, which sends system monitoring data both to a MySQL instance for storage and reuse and to a Graphite endpoint for display.
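As a sketch, a scheduled job of that shape in the minion config (or pillar) might look like this. The job name and interval are assumptions, and Salt’s `carbon` and `mysql` returners each need their own connection settings configured on the minion:

```yaml
schedule:
  system_stats:
    function: status.all_status   # collect load, memory, disk, etc.
    minutes: 5
    returner:
      - carbon                    # ship the results to a Graphite/Carbon endpoint
      - mysql                     # and store them in MySQL for reuse/trending
```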

Ahh, Elasticsearch: the cause of, and solution to, all of life’s problems.

I run a Logstash/Elasticsearch/Kibana cluster on EC2 as an application/system log aggregator for the web service I’m supporting. And it’s not been plain sailing. I have a limited AWS budget, so I am somewhat restricted in the instances I can fire up. No cc2.8xlarges for me. So I was stuck with two m1.larges. And they struggled.

It was processing around 3 million documents per day, for a total index size of around 4 GB per day. Sometimes it coped and sometimes it didn’t. I often found myself restarting the Logstash and Elasticsearch services once or twice a week, sometimes losing 7-9 hours of processed logs.

And the most frustrating thing? I had no idea what I was doing wrong. Had I misconfigured something? Were the instances just too small?

So I’ve upped my game a bit. Not without some trial and error. “Fake it til you make it” as those of us without an extensive background in Lucene indices and grid clustering are fond of saying.

But I think I’ve cracked it. And this may be a good lesson for people starting out with a set up like this.

I’ve now got two c3.xlarges which, with 10 more compute units apiece to play with, make a big difference to throughput.

I’ve tweaked the Logstash command line to give me eight filter workers instead of the default one, which helps a lot when the document volume increases.

And the most important thing? I’ve done my homework and put some effort into making my Elasticsearch config right.

Port specification to prevent mismatch

transport.tcp.port: 9301

EC2 discovery plugin with filtering, to ensure the instances see each other, and an increased ping timeout to account for network irregularities.
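For the pre-2.x Elasticsearch of the time with the cloud-aws plugin installed, the relevant elasticsearch.yml section looks something like this (the security group name and timeout value here are assumptions, not my exact settings):

```yaml
# Pin the transport port so both nodes talk on a known port.
transport.tcp.port: 9301

# EC2 discovery via the cloud-aws plugin, filtered by security group
# so the nodes only try to cluster with each other.
discovery.type: ec2
discovery.ec2.groups: elasticsearch-nodes

# Longer ping timeout to ride out EC2 network irregularities.
discovery.zen.ping.timeout: 30s
```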

I spoke recently at the February London DevOps meetup about my adventures with solo DevOps and the SaltStack config management system ( Slides )

One of the other talks at the meetup, Stop Using Nagios by Andy Sykes from Forward3D (@supersheep) got me thinking about using Salt as the core component of a distributed monitoring system. I believe it fits the mould very well:

It has an established, secure and, most importantly, fast master-minion setup

It has built-in scheduling capability, both for the master and separately for the minions.

Its Returners already provide built-in support for piping whatever comes back from your minion status checks into Graphite for graphing, and into MySQL/Postgres/SQLite or Cassandra/Mongo/Redis/Couch for storage, trending etc.

It can act on events on the minions using its Event and Reactor systems

And there are plenty of Graphite Dashboards that could be co-opted and built upon to provide other views of check data, not to mention Salt’s experimental Halite which may have possibilities as another UI facet.
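For the Reactor piece, a minimal sketch: the master maps an event tag to an SLS file that decides what to do (the tag glob and file path here are assumptions for illustration):

```yaml
# /etc/salt/master
reactor:
  - 'salt/minion/*/start':          # fires whenever a minion (re)connects
      - /srv/reactor/minion_up.sls  # e.g. re-register it with the monitoring views
```

The same mechanism could react to custom events fired by failing checks, which is where the monitoring-system idea gets interesting.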

I’ve started doing some testing of my own on this, but I’d be very interested in feedback.

For some time I’ve been using CloudWatch as a supplement to my other graphing and monitoring packages, but I finally got tired of the poor UI and lack of customisation. I had seen and tested some other GitHub releases that looked like they might let me run my own CloudWatch graphs, but none had the features I required. So I wrote my own.

Based on aws-cloudchart, I have built a set of tools that give a much better at-a-glance insight into your AWS EC2, RDS and ELB instances.

The code should be fairly easy to follow. It uses the older AWS PHP SDK v1.62 as its interface to the Amazon CloudWatch API, and all it needs is a set of IAM creds. Feel free to fork and improve.
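The tools themselves are PHP, but the underlying CloudWatch query is much the same in any SDK. A boto3 sketch for comparison, fetching an hour of average CPU for one instance (the instance ID and region are placeholders):

```python
import datetime

# Build the request: the last hour of CPUUtilization for one EC2 instance,
# averaged over 5-minute periods. The instance ID is a placeholder.
now = datetime.datetime.utcnow()
params = {
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "StartTime": now - datetime.timedelta(hours=1),
    "EndTime": now,
    "Period": 300,
    "Statistics": ["Average"],
}

# With boto3 installed and IAM credentials configured:
# import boto3
# cw = boto3.client("cloudwatch", region_name="eu-west-1")
# datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```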