What I want is a way of monitoring different parts of my Alexa skill, especially:

Execution time

RSS feed download time

RSS feed parsing time

Custom errors

Which intents are being used the most

Sending custom metrics to CloudWatch via Metric Filters

After looking through the various NPM packages, and the CloudWatch API. Thankfully I found a much easier way to push custom metrics to CloudWatch. These are called Metric Filters. They allow you to set up parsing rules for CloudWatch log lines, when a log line matches the filter it can be pushed as a metric as well (and then graphed). This means you can graph metrics, just by writing a log line in a consistent format!

Log Format

The log format I settled on is pretty simple, I have two main metric types:

Timers – These log a value (duration) against a specific metric name

Counters – These just log out a metric name and can be used to count the number of times something happened (perfect for errors).

I wrapped this format in a simple logger object so I can just call logger.error(), logger.timer() or logger.increment() in my code and pass the name (and value if it’s a timer).

This results in a set of log lines being automatically written to the CloudWatch log file. The example below shows a counter for the metric PlayLatestSermon, a timer for FeedFetchTime (taking 1,001ms) and a timer for FeedProcessingTime (taking 3,278ms).

Setting up the Metric Filters

Setting up the filters is pretty simple. Log into CloudWatch in the AWS portal, and click on ‘Logs’. Then click on the log group for your AWS Lambda Function and then click the blue ‘Create Metric Filter’ button.

You’ll then be prompted to enter the filter pattern. This is the pattern you want to match for in each log line. You can see from the log examples above that my log structure is in the format

DATETIME REQUESTID LOG_DATA

Since I’m logging two different types of metrics (timers and counters) my filters are pretty much one of two.

The first line looks for a log line starting with ‘TIMER_METRIC’ it then takes the next value as the metric name (for this example it’s the FeedFetchTime metric), then the value I write log. The second example is almost identical but the type of log is different, the metricName is different, and since it’s a counter I don’t emit a value.

You can test the filter pattern against sample log data but be warned, the drop down is in reverse order, so you can only test it against some of your oldest log data (not great if you’ve added new data that’s in your newer log files). To work around this you can copy/paste log data and test against that. Once you’re happy it’s matching log lines correctly you can configure the metric itself.

Here you can set a namespace to group your metrics together. The metric name (I copied the name that I emit to the logs to keep it simple). If you want to log the value from the logs then you can click the different quick-links below the value field (these will match the names you used in your filter pattern). For my example I used ‘metricValue’ but for counters I just leave this field blank since I don’t care about a specific value. When that’s done you can save your filter.

Graphing your new metrics

The metrics will normally take a few minutes to come through the CloudWatch metrics side. After a few minutes you should be able to click on the name of the namespace you configured in your metric filter and then find the metric you want. Once it’s graphed you can change the statistic type too. Some useful ones are

avg – The average value for the specified interval. Be careful with this stat, it’s nice to have but it can give you a false sense of security

sum – This calculates the sum of all of the values emitted in the time window

min – The lowest value emitted in the time window

max – The largest value emitted in the time window

sample count – The number of times the metric was emitted (useful for error counts etc, but make sure you don’t log error=0 or something similar since that will have a sample count of 1)

p99 / p90 / p50 etc – These are immensely valuable, these are percentile metrics. P99 shows you the values at the 99th percentile. In other words, they show the top 1% of values seen. Likewise P90 is the top 10% etc. These are great for timer metrics. The P99 execution time, shows you what the slowest (larger values = longer time) 1% of users experience. The P90 would show what the slowest 10% of users experience. This really show where users might be getting a much worse experience than you think if you just focus on the average values.

I suggest duplicating the same metric on a single graph and changing the statistic type to show avg, P90 and P99 on a single graph so you can get a feel for the average experience, but also what the top 1% and 10% of users experience too.

Creating a Dashboard

Once you’ve got data coming through to CloudWatch metrics I suggest creating a custom dashboard to show the data you care about in a single screen. Click on dashboard and pick a name and then click ‘add widget’ to add a new graph. You can expand graphs to take up different numbers of columns / rows but remember to click “save dashboard” whenever you modify something!