Velocity Conference Takeaways

7digital software developer Mia Filisch attended the Velocity conference in Amsterdam on October 28th, and was kind enough to share her core takeaways with us here. The recurring theme of security was enough to inspire some internal knowledge-sharing sessions she has already started scheming on, and the diversity of insights made for a productive and informative conference. Her notes are below.

Be aware it's pretty long (at Velocity the session took three hours, and that was with the speaker actually skipping all the exercises), but it really does cover a lot.

Using Docker Safely (Adrian Mouat)

This talk discussed the various attack vectors against containers, as well as a good number of practical steps and strategies for applying common security paradigms (defence in depth and least privilege) to Docker and containers generally.

Modern Secret Management (Schoof)

As an industry, we don't currently tend to manage secrets very well (even bearing in mind that security is always about trade-offs).

Secret management should be considered tier 0 / core infrastructure (should be highly available, have monitoring, alerting and access control)

In light of this, Schoof proposed the following core principles of modern secret management:

The set of actors who can do something should be as small as possible

Secrets need to expire; set up efficient, easy ways to do secret rotation, which shouldn't require a deploy. (This also implies that secrets shouldn't be in version control.)

It should be easier to handle secrets in secure ways than insecure ways

Security of a system is only as strong as its weakest access link

Secrets must be highly available (apps will stop functioning without them)

The talk went on to discuss the various aspects of building a secret management system, which was quite interesting; I'll leave you to follow along via the slides.

Existing services discussed and recommended in the talk were Vault, Keywhiz and CredStash. All of these solutions are still pretty new, so with any of them there will probably be quite a bit of tweaking required to get a management system in place that works well.

Seeing the Invisible: Discovering Operations Expertise (John Allspaw)

John Allspaw revealed what he gets up to in his free time: pursuing an MA in “Human Factors and Systems Safety” at Lund University, Sweden (obviously).

His own research explores human factors in web engineering, both with respect to understanding catastrophic failures and with respect to understanding the human factors involved in avoiding catastrophic failures in the face of things potentially going wrong literally all the time. Human Factors and Ergonomics (HFE) research has a long history in areas like aviation, surgery and mining, but our industry is still relatively under-researched.

TL;DR: The language we use and views we hold when talking about failure shape the outcome of that discussion, and what we learn for the future.

Both “why” and “how” questions tend to limit the scope of our inquiry into incidents; “what” questions are a much better device for building empathy, and also help focus the analysis on foresight rather than its less constructive counterpart, hindsight, which more easily falls prey to various cognitive biases and to blameful thinking.

Always assume local rationality: “people make what they consider to be the best decision given the information available to them at the time.” - there isn't really a just culture that doesn't revolve around this premise.

Alert Overload: Adopting A Microservices Architecture Without Being Overwhelmed With Noise (Sarah Wells)

No huge surprises, but a good summary of how to set up useful alerts. Some key points discussed are below.

Focus on business functionality:

Look at architecture and decide which parts or relationships are crucial to your core functionalities

Decide what it is that you care about for each - speed? errors? throughput? ...

Focus on end-to-end checks: ideally you only want an alert where you actually need to take action

Make alerts useful, build with support in mind!

readability! (e.g. use spaces rather than camel case, etc.)

add links to more information or useful lookups

provide helpful messages

If most people filter out most of the email alerts they are getting, you should probably fix your alert system.

The Definition Of Normal: An Intro and guide to anomaly detection (Alois Reitbauer)

Anomaly detection has a nice role to play in spotting issues early (ideally before anything really bad happens), so I was really excited about this talk. It quickly turned out, however, that unless you come from a relatively strong maths/stochastics background (which I don't), you probably need to rely on other people for anomaly detection magic. So the following is a more high-level view.

Anomalies are defined as events or observations that don’t conform to an expected pattern.

events are checked against your hypotheses, applying a likeliness judgement

how the event performs against this likeliness judgement translates into whether it is an anomaly or not

How do you set the baselines that define your normal model? One thing to bear in mind is that some of them (such as the mean/average or median) don't learn very well. The presenter recommended using exponential smoothing instead, since it is both easy to calculate and learns very well.
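To make this concrete, here's a minimal sketch (in Ruby; the alpha and threshold values are illustrative choices of my own, not from the talk) of how an exponentially smoothed baseline can be used to flag anomalies:

```ruby
# Exponentially smoothed baseline: each new point is blended into the
# running estimate, so recent observations carry more weight.
# alpha close to 1 adapts quickly; alpha close to 0 smooths heavily.
def exponential_smoothing(series, alpha)
  smoothed = [series.first.to_f]
  series.drop(1).each do |x|
    smoothed << (alpha * x + (1 - alpha) * smoothed.last)
  end
  smoothed
end

# Flag a point as anomalous when it deviates from the baseline built
# from the points before it by more than `threshold`.
def anomalies(series, alpha: 0.3, threshold: 10.0)
  baseline = exponential_smoothing(series, alpha)
  (1...series.size).select { |i| (series[i] - baseline[i - 1]).abs > threshold }
end
```

Unlike a plain mean over all history, the smoothed baseline "forgets" old observations at a rate controlled by alpha, which is why it adapts (learns) so much better.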

Tech is only half the work: identify all stakeholders and their goals; involve Legal/Finance early (especially if you might have to battle early terminations of legacy infrastructure contracts); and work on awareness and knowledge transfer across teams

WebPageTest.org now offers a few “Real Mobile Networks” test locations. Only a handful for the time being, but if they extend this, it could be pretty interesting for us when testing client web apps from different locations!

Over the last month we've started using ServiceStack for a couple of our API endpoints. We're hosting these projects on a Debian Squeeze VM using nginx and Mono. We ran into various problems along the way; here's a breakdown of what we found and how we solved them. Hopefully you'll find this useful. (We'll cover deployment/infrastructure details in a second post.)

Overriding the defaults

Some of ServiceStack's defaults are, in my opinion, not well suited to writing an API. This is probably down to the framework's desire to be a complete web framework. Here's our current default implementation of AppHost:

For me, the biggest annoyance was tracking down the DefaultContentType setting. Some of the settings are unintuitive to find, but it's not like you have to do it very often!

Timing requests with StatsD

As you can see, we've added a StatsD feature, which was very easy to do. It times how long each request takes and logs it to StatsD. Here's how we did it:

It would have been nicer if we could wrap the request handler, but that kind of pipeline is foreign to the framework, so instead you need to subscribe to the begin and end messages. There's probably a better way of recording the time spent, but hey ho, it works for us.
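For readers unfamiliar with StatsD itself: a timing metric is just a small plain-text UDP datagram, so "logging the time" amounts to measuring the gap between the begin and end messages and firing off one packet. A rough Ruby sketch of the idea (bucket name, host and port are assumptions for illustration; our real implementation is C# inside ServiceStack):

```ruby
require 'socket'

# StatsD timing metrics are plain-text UDP datagrams of the form
# "<bucket>:<milliseconds>|ms".
def timing_datagram(bucket, milliseconds)
  "#{bucket}:#{milliseconds.round}|ms"
end

# Given the timestamps captured at the begin and end of a request,
# send the elapsed time to a StatsD daemon (fire-and-forget UDP).
def record_timing(bucket, started_at, finished_at, host: 'localhost', port: 8125)
  elapsed_ms = (finished_at - started_at) * 1000.0
  UDPSocket.new.send(timing_datagram(bucket, elapsed_ms), 0, host, port)
end
```

Because it's UDP, recording a timing never blocks or fails the request being measured, which is what makes this kind of instrumentation cheap to sprinkle everywhere.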

At 7digital we use Ajax to update our basket without needing to refresh the page. This provides a smoother experience for the user, but makes it a little more effort to automate our acceptance tests with [Watir](http://wtr.rubyforge.org/). Using timeouts is one way to wait for the basket to render, but it has two issues: if the timeout is too high, it forces all your tests to run slowly even when the underlying callback responds quickly; if it is too low, you risk intermittent failures any time the callback responds slowly. To avoid this you can use the [Watir `wait_until` method](http://wtr.rubyforge.org/rdoc/classes/Watir/Waiter.html#M000343) to poll for a situation where you know the callback has succeeded. This is more in line with how a real user behaves.

### Example
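As a rough illustration of what `wait_until` does under the hood, here's a simplified polling helper along the same lines (the timeout and interval defaults are my own, not Watir's):

```ruby
require 'timeout' # for Timeout::Error

# Simplified version of the polling Watir's wait_until performs:
# evaluate the block repeatedly until it returns a truthy value,
# giving up with an error once the timeout expires.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Time.now + timeout
  loop do
    return true if yield
    raise Timeout::Error, "condition not met within #{timeout}s" if Time.now > deadline
    sleep interval
  end
end
```

With real Watir you'd poll the page itself, e.g. `Watir::Waiter.wait_until { browser.div(:id, 'basket').text.include?('item') }` (the element id and expected text here are assumptions about the markup). The test then proceeds as soon as the callback lands, rather than after a worst-case sleep.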

At 7digital we use [Cucumber](http://cukes.info/) and [Watir](http://wtr.rubyforge.org/) for running acceptance tests on some of our websites. These tests can help greatly in spotting problems with configuration, databases, load balancing, etc. that unit testing misses. But because the tests exercise the whole system, from the browser all the way through to the database, they tend to be flakier than unit tests. They can fail one minute and work the next, which can make debugging them a nightmare. So, to make the task of spotting the cause of failing acceptance tests easier, how about we set up Cucumber to take a screenshot of the desktop (and therefore the browser) any time a scenario fails?

## Install Screenshot Software

The first thing we need to do is install something that can take screenshots. The simplest solution I found is a tiny little Windows app called [SnapIt](http://90kts.com/blog/2008/capturing-screenshots-in-watir/). It takes a single screenshot of the primary screen and saves it to a location of your choice. No more, no less.

* [Download SnapIt](http://90kts.com/blog/wp-content/uploads/2008/06/snapit.exe) and save it to a known location (e.g.
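The hook we're building towards might look something like this sketch (the SnapIt path and screenshots directory are assumptions; adjust them to wherever you saved things):

```ruby
# Assumed install location for snapit.exe.
SNAPIT = 'c:/tools/snapit.exe'

# Turn a scenario name into a safe screenshot file name.
def screenshot_path(scenario_name, dir = 'screenshots')
  File.join(dir, scenario_name.gsub(/[^\w-]+/, '_') + '.jpg')
end

# The hook only makes sense inside a Cucumber support file
# (e.g. features/support/hooks.rb), where `After` is defined.
if respond_to?(:After, true)
  After do |scenario|
    system(SNAPIT, screenshot_path(scenario.name)) if scenario.failed?
  end
end
```

Because the `After` hook runs before the browser is torn down, the screenshot captures the page exactly as the failing step saw it.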

[TeamCity](http://www.jetbrains.com/teamcity/) is a great continuous integration server, and has brilliant built-in support for running [NUnit](http://www.nunit.org/) tests. The web interface updates automatically as each test is run, and gives immediate feedback on which tests have failed without waiting for the entire suite to finish. It also keeps track of tests over multiple builds, showing you exactly when each test first failed, how often it fails, etc. If, like me, you are using [Cucumber](http://cukes.info/) to run your acceptance tests, wouldn't it be great to get the same level of TeamCity integration for every Cucumber test? Well now you can, using the `TeamCity::Cucumber::Formatter` from the TeamCity 5.0 EAP release. JetBrains, the makers of TeamCity, released a [blog post demonstrating the Cucumber test integration](http://blogs.jetbrains.com/ruby/2009/08/testing-rubymine-with-cucumber/), but without any details on how to set it up yourself. So I'll take you through it here.
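In broad strokes, once the formatter sources shipped with the TeamCity 5.0 EAP are on your Ruby load path, the wiring can be a `cucumber.yml` profile along these lines (the profile name and extra flags here are assumptions, not official guidance):

```yaml
# cucumber.yml: run with `cucumber -p teamcity` on the build agent;
# assumes the TeamCity formatter sources are on RUBYLIB
teamcity: --expand --format TeamCity::Cucumber::Formatter
```

Locally you'd keep your usual `default` profile, so only the CI server pays for the extra service-message output.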