Monitoring at QuizUp: Datadog

One of the services we use a lot for monitoring at QuizUp is Datadog.

We use several software-as-a-service products, but this one is one of my
absolute favorites. Datadog is an aggregator for StatsD metrics, and StatsD
is, as I have mentioned before, one of my favorite technologies. We use
Datadog through their large set of excellent integrations, which retrieve
a host of standard metrics from every third-party service in our stack and
also monitor all of our hosts for standard operating system metrics like
load average, disk usage, memory usage broken down by application, CPU
load, etc. Where an integration did not already exist, we’ve written our
own.

Additionally, we instrument our applications with StatsD calls which feed
important higher-level metrics into Datadog, where all of these can be
combined in powerful dashboards and used by metric alerts to trigger
e-mails, pings to Slack or PagerDuty, etc. These metric alerts are our
first line of defence, allowing us to detect and fix problems before they
affect users, which in effect protects our uptime. Whether it’s a sudden
increase in the number of non-2xx API responses, increased latencies, disks
filling up, elevated service health transitions, or cluster size changes,
it’s all fed into Datadog and we’re alerted from there. This is the kind of
service that would frustrate me very quickly if it proved flaky. It’s never
been flaky. No pressure ;-)
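
To make "StatsD calls" a bit more concrete, here is a minimal sketch of what such instrumentation boils down to on the wire. It is not our production code: the metric names (api.responses, api.latency, matchmaking.queue_size) and the helper functions are made up for illustration, and it assumes a StatsD-compatible agent listening on the conventional UDP port 8125 on localhost. The "|#tag" suffix is the Datadog tag extension to the plain StatsD format.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one StatsD datagram: <name>:<value>|<type>, plus optional Datadog-style tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Datadog's extension: a comma-separated tag list after "|#".
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; the sender does not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical metrics: a counter, a timer in milliseconds, and a gauge.
    send_metric(format_metric("api.responses", 1, "c", tags=["status:500"]))
    send_metric(format_metric("api.latency", 184, "ms"))
    send_metric(format_metric("matchmaking.queue_size", 42, "g"))
```

In practice you would more likely use an existing StatsD client library than hand-rolled sockets, but the datagrams it sends look like the ones above, and a counter such as the first one is the kind of thing a metric alert on non-2xx responses can be built on.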

The purpose of this post, however, is to point out a fantastic recent
feature which I learned about this week and which might just change the way
we work when examining incidents. Until now I’ve been heavily utilizing
the Screenshot+Dropbox+Copybuffer integration offered by Dropbox to
screengrab a portion of a graph or a dashboard, then paste the link into the
relevant Slack channel and ping people who I believe should investigate.

But there’s an easier way to do this: click the snapshot icon and tag the
corresponding people, or use a hashtag to add custom tags which can be used
with event timeline searches. Also, if you have Slack integrated, you can tag
the relevant Slack channels directly from the comment. This way the
discussion of any issue can happen in context, and the comments and thoughts
of the engineers working on the issue will be seen by anyone who later views
that particular graph for that time period.

Commenting on an anomaly

This week I had the pleasure of meeting Alexis (@alq) from Datadog again, and he showed
me this feature amongst others, and also signed me up for notifications when
they start beta testing new features.

If you’re running an app and would like to give Datadog a try, but feel the
per-host fee is expensive, it’s possible to set up a single Datadog StatsD
collection node and make your other nodes send their metrics to that node.
This would not get you their nice host-level monitoring, but it’s a cheap
way to try it out. At QuizUp we monitor all our hosts, though: when running
larger instances the per-host fee is trivial for us, so I highly recommend
doing the same if cost is not an issue.
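
For the single-collection-node setup, the application side could look roughly like the sketch below. The class name, the STATSD_HOST and STATSD_PORT environment variables, and the metric names are all assumptions for illustration, not something Datadog prescribes. It also assumes the agent on the collection node has been configured to accept StatsD traffic from other hosts, which you should check against Datadog's documentation.

```python
import os
import socket


class CollectorClient:
    """Tiny StatsD sender that targets one shared collection node."""

    def __init__(self, host=None, port=None):
        # STATSD_HOST / STATSD_PORT are hypothetical variable names;
        # set them to the address of the collection node on every app host.
        host = host or os.environ.get("STATSD_HOST", "127.0.0.1")
        port = int(port or os.environ.get("STATSD_PORT", "8125"))
        self.address = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            self.sock.sendto(line.encode("utf-8"), self.address)
        except OSError:
            # Metrics are best-effort: never let them take the app down.
            pass

    def incr(self, name, count=1):
        self._send(f"{name}:{count}|c")

    def gauge(self, name, value):
        self._send(f"{name}:{value}|g")


if __name__ == "__main__":
    client = CollectorClient()
    client.incr("signups")             # hypothetical counter
    client.gauge("workers.active", 8)  # hypothetical gauge
```

Because everything arrives from the app rather than from an agent on each machine, you only get the metrics you send yourself, which is exactly the trade-off described above.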