Web Operations, Culture, Security & Startups.

Good & Bad Patterns in Development and Operations

As part of my role at a new company I’ve been asked to provide feedback
about structuring Dev & Ops as well as what sorts of things work and don’t. I
certainly don’t claim to have all the answers, but I’ve seen some very
functional and some very dysfunctional organizations. I’ve spent a fair amount
of time thinking about what works & why.

Below is a cleaned up version of a message I sent to our CEO who asked for my
thoughts on what does and doesn’t work. This was intended as scaffolding for
further discussion so I didn’t go into deep details. If you want more details
on any particular area just throw some comments out there.

I realize not all these issues are black & white to many folks – there are
gray areas. My goal with this message was to drive conversation.

I figure this is probably review to many folks, but maybe it’ll help someone.

First, there are some very simple goals that all these bullets drive toward & they’re somewhat exclusive to SaaS companies:

Customers should continuously receive value from Developers as code is incrementally pushed out

Developers should get early feedback from customers on changes by enabling features for customers to test

We can address problems for customers very quickly – often in a matter of hours

We can inspect and understand customer behavior very deeply, gathering exceptional detail about how they use the service.

We can swap out components & substantially change the underlying software without the customer knowing (if we do it right)

We can measure how happy customers are with changes as we make them based on behavior & feedback

The lists below are what I feel makes that possible (Good) and what inhibits it (Bad).

Bad:

Developers don’t look at metrics unless something is brought to their attention

Code doesn’t expose metrics until someone else asks for it

And here is the long version of all of that…

#1 Culture & Communication

Above all else I consider these most important. I think most problems in other
areas of the business can be overcome if you do well in these areas. Rally has
been, by far, the best example of a very successful model that I’ve seen in
this area. They aren’t unique – there are other companies with similar models
& similar successes.

Main points

Stand ups. By far the most effective tool for keeping everyone in touch. As teams grow you have to break them apart, so you have a 2nd standup where teams can bring cross-team items to share.

Projects are tackled by relatively small, typically self-formed teams. Get individuals who are interested in working in an area together & they feed on each other’s passion.

Perform retrospectives. This gives individuals & small groups the ability to voice concerns in a way that fosters resolution. There’s an art to facilitating this but it works well when done right. It also allows recognition of things that are done well.

Use open information radiators – it should be easy to see what’s going on by looking at status somewhere vs. having to ask for status, go to meetings, etc. Kanban boards are great for this.

Leaders exist to facilitate and help drive consensus but decisions are largely made by teams, not leaders. This makes being a leader harder, but it makes the teams more empowered.

Accept that things may not work & that the team and company will adjust when they do not. This makes it easy to try new things & easy for people to speak up when they think something isn’t working. If it’s hard to change process then people are more resistant to trying new things. This goes back to retrospectives for keeping things in check. Also important here are “spikes”, or time-boxed efforts explicitly designed to explore possibilities.

Give developers time to pursue their own projects for the company. Many awesome features have come out of Hackathons where developers spent their own time to build something they were passionate about.

Hire for personality fit first. I have seen many awesome people find a special niche in a company because they grew into a role that you couldn’t hire for – but what made that possible was that they worked well with the team as an individual. Hiring for technical skill also means you lose that skill when that person leaves; I would prefer to have cross-functional teams.

Data driven decisions. This helps keep emotion and “I think xyz” out of the discussion & focuses on the data we do and do not have. If we don’t have data we either get more or acknowledge we may not be making the right decision but we’re going to move forward.

Make the right thing the easiest thing. I’ve seen too many companies put process out there that makes the “right thing” really difficult, so it gets bypassed. The right thing should be an express train to done – very little resistance and very easy to do. It’s when you start wanting to do things differently that it should become harder, more painful.

Also, everyone owns the quality of the service. This includes
availability, performance, user experience, cost to deliver, etc. At my last
company, there was exceptional collaboration between Operations, Engineering
and Product (and across engineering teams) on all aspects of the service and
there was a strong culture of shared ownership & very little finger pointing.

If you want more details on this specific to Rally I wrote a blog post with some more info:
Blog Post

#2 Obsessively eliminate manual process – let computers do what they are good at.

This is so much easier to do up front. There should be as little manual
process as possible standing between a developer adding value for customers
(writing code) and that code getting into production. There may be business
process that controls when that feature is enabled for customers – but the act
of deploying & testing that code should not be blocked by manual process. I
refer to this as separating “Deploy” from “Release” – those are two very
different things.
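
To make the “Deploy” vs. “Release” distinction concrete, here is a minimal sketch of a feature flag. Everything in it is hypothetical (the `FeatureFlags` class, the `new_dashboard` feature name, the customer ids), and a real system would keep flags in a database or config service rather than in memory. The point is only that both code paths are already deployed, and turning the feature on for a customer is a separate step.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    enabled_for: dict[str, set[str]] = field(default_factory=dict)

    def release(self, feature: str, customer_id: str) -> None:
        """Release: a business decision to turn on an already-deployed feature."""
        self.enabled_for.setdefault(feature, set()).add(customer_id)

    def is_enabled(self, feature: str, customer_id: str) -> bool:
        return customer_id in self.enabled_for.get(feature, set())


def render_dashboard(flags: FeatureFlags, customer_id: str) -> str:
    # Both code paths are in production; the flag decides which one runs.
    if flags.is_enabled("new_dashboard", customer_id):
        return "new dashboard"
    return "classic dashboard"


flags = FeatureFlags()
flags.release("new_dashboard", "acme")
print(render_dashboard(flags, "acme"))    # new dashboard
print(render_dashboard(flags, "globex"))  # classic dashboard
```

With something like this in place, deployment can happen on every commit while the release schedule stays under whatever business process you need.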

Testing should only be manual to invalidate assumptions; validating
assumptions should be automatic. When we assume that if x is true then y will
occur, there should be a test to validate that this is true. Testers should
not manually validate these sorts of things unless there is just no way to
automate them (rare). Testers are valuable to invalidate assumptions.
Testers should be looking at the assumptions made by Developers and helping
identify those assumptions that may not always be correct.
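
As a rough illustration of “if x is true then y will occur” captured as an automated test, here is a small sketch. The shipping rule, the threshold and the function names are all made up for the example. The second test shows the kind of edge case a tester might raise; once it is written down it never needs to be checked by hand again.

```python
from decimal import Decimal

# Hypothetical business rule used only for this example.
FREE_SHIPPING_THRESHOLD = Decimal("50.00")
FLAT_SHIPPING_FEE = Decimal("5.00")


def shipping_fee(order_total: Decimal) -> Decimal:
    """Assumption under test: orders at or above the threshold ship free."""
    if order_total >= FREE_SHIPPING_THRESHOLD:
        return Decimal("0.00")
    return FLAT_SHIPPING_FEE


def test_orders_over_threshold_ship_free():
    # Validates the developer's stated assumption on every run.
    assert shipping_fee(Decimal("75.00")) == Decimal("0.00")


def test_order_exactly_at_threshold_ships_free():
    # An edge case a tester might question: what happens at the boundary?
    assert shipping_fee(Decimal("50.00")) == Decimal("0.00")


def test_small_orders_pay_flat_fee():
    assert shipping_fee(Decimal("49.99")) == FLAT_SHIPPING_FEE


if __name__ == "__main__":
    for test in (
        test_orders_over_threshold_ship_free,
        test_order_exactly_at_threshold_ships_free,
        test_small_orders_pay_flat_fee,
    ):
        test()
    print("all assumptions hold")
```

The functions follow the usual `test_` naming convention, so a runner such as pytest would also pick them up without the manual loop at the bottom.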

Too many organizations rely on manual testing because it’s “easier”, but it
has some serious drawbacks:

You can only change your system as fast as your team can manually test it – which is very slow.

Your testing is done by humans who make mistakes and don’t behave predictably so you get inaccurate results.

The # of tests will only grow over time, requiring either more humans or more time, or both. It doesn’t scale.

Over time the software quality gets lower, takes longer to test, and the test
results become less reliable. This is a death spiral for many companies who
eventually find it very hard to make changes due to fear & low confidence in
testing.

Avoiding this requires that developers spend more time up front writing automated
tests. This means developers might spend 60-70% of their time developing tests
vs. writing code – this is the cost of doing business if you want to produce
high quality software.

That may seem excessive, but the payoff is significant:

Much higher code quality which stays high: those tests are always run, so re-introduced bugs (regressions) get caught.

Faster developer onboarding: the tests describe how the code should behave and act as documentation.

Refactoring code becomes easier because you know the tests describe what it should do.

Each commit to the codebase is fully tested, allowing nearly immediate deployment to production if done right.

Problems that make it into production feed back into more tests & continually improve code quality.

Much of the time developing tests is spent thinking about how to solve the
problem, but you are also writing code with the intent of making it testable.
Code is often written differently when the developer knows tests need to pass
vs. someone manually testing it. It’s much harder to come along later and
write tests for existing code.

You will hear me talk about Continuous Deployment & Continuous Integration – I
feel these practices are extremely important to driving the above “good”
behaviors. If you strive for Continuous Deployment then everything else falls
into place without much disagreement because it has to be that way. This has a
lot of benefits beyond what’s listed above:

Value can be delivered to customers in days or hours instead of weeks or months

Developers can get immediate feedback on their change in production

New features can be tuned & tweaked while they are fresh in a developer’s mind

You can focus on making it fast to resolve defects, no matter how predictable they are, rather than trying to predict all the ways things might go wrong.

Most of the tools and behaviors that enable Continuous Deployment scale to very large teams & very frequent deployments. Amazon is a prime example of this, deploying something, somewhere, about every 11 seconds. Many companies in the 30-100 engineer range talk about deploying tens of times per day.

This also impacts how you hire QA/Testers. This is a longer discussion, but you want to hire folks who can help during the test planning phase & can help Developers write better tests. Ideally your testers are also developers & work in a way that’s similar to Operations, helping your Developers to be better at their jobs.

#3 If it moves, measure it

As I mentioned above, two big advantages a SaaS organization has are the amount
it can learn about how customers use the product & the ability to change
things rapidly. Both of these require obsessive measurement of everything that
is going on so that you know if things are better or worse. Some of these
metrics are about user behavior & experience to understand how the service is
being used. Other metrics are about system performance & behavior.

The ability to expose some % of your customer base to a new feature & measure
their response to it is huge. Plenty of companies have perfected the art of
A/B testing but at the heart of it is the ability to measure behavior. Similar
to testing, the software has to be built in a way which allows this behavior
to be measured.
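
One common way to expose a fixed % of customers to a feature and measure the result is to hash each customer into a stable bucket and count events per variant. The sketch below assumes that approach; the function names, the `new_checkout` feature and the event names are all invented for illustration, and a real setup would send events to an analytics pipeline instead of an in-memory counter.

```python
import hashlib
from collections import Counter


def in_rollout(feature: str, customer_id: str, percent: int) -> bool:
    """Place a customer in one of 100 stable buckets for this feature.

    Hashing feature + customer id means a customer always lands in the same
    bucket for a given feature, and buckets differ between features.
    """
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


# Hypothetical in-memory event store: (variant, event name) -> count.
events: Counter = Counter()


def record(feature: str, customer_id: str, percent: int, event: str) -> str:
    """Tag each behavioral event with the variant the customer saw."""
    variant = "new" if in_rollout(feature, customer_id, percent) else "control"
    events[(variant, event)] += 1
    return variant


for customer in ("acme", "globex", "initech", "umbrella"):
    record("new_checkout", customer, 25, "checkout_started")

print(dict(events))
```

Comparing counts such as checkouts started vs. completed across the "new" and "control" variants is the raw material for an A/B comparison.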

System performance similarly requires a lot of instrumentation to identify
changes in trends, to identify problem areas in the application & to verify
when changes actually improve the situation.
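
As a sketch of what that instrumentation can look like inside the code itself, here is a small timing decorator. The `timed` decorator, the `timings` store and the `search.query` metric name are assumptions made for the example; in practice the samples would be shipped to whatever metrics system you use rather than kept in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store: metric name -> list of durations in seconds.
timings = defaultdict(list)


def timed(metric_name):
    """Record how long each call to the wrapped function takes."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are measured too.
                timings[metric_name].append(time.perf_counter() - start)

        return wrapper

    return decorator


@timed("search.query")
def search(term):
    time.sleep(0.01)  # stand-in for real work
    return [term.upper()]


search("widgets")
print(len(timings["search.query"]), "sample(s) recorded")
```

Because the measurement lives next to the code, the metric exists from the first deploy instead of being added after someone asks for it.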

I’ve been at too many companies where they simply had no idea how the system
was performing today compared to last week to understand if things were better
or worse. At my last company I saw a much more mature approach to this
measurement which worked pretty well, but it required investment. They had two
people fully dedicated to performance analysis & customer behavior analysis.
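
Answering “is it better or worse than last week?” can start out very simply. The sketch below compares median and 95th-percentile latency between two sets of samples; the function names, the 10% tolerance and the sample numbers are all made up for illustration, and real analysis would need far more data and care than this.

```python
from statistics import median, quantiles


def summarize(latencies_ms):
    """Median and 95th percentile of a list of latency samples."""
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    return {"p50": median(latencies_ms), "p95": quantiles(latencies_ms, n=20)[18]}


def compare(last_week, this_week, tolerance=0.10):
    """Label each statistic better, worse or unchanged relative to last week."""
    before, after = summarize(last_week), summarize(this_week)
    report = {}
    for name in before:
        change = (after[name] - before[name]) / before[name]
        if change > tolerance:
            verdict = "worse"  # latency went up
        elif change < -tolerance:
            verdict = "better"
        else:
            verdict = "unchanged"
        report[name] = (round(change, 3), verdict)
    return report


# Made-up samples, in milliseconds.
last_week = [110.0, 120.0, 118.0, 125.0, 130.0, 115.0, 122.0, 400.0]
this_week = [150.0, 160.0, 158.0, 149.0, 170.0, 155.0, 162.0, 900.0]
print(compare(last_week, this_week))
```

Looking at a high percentile as well as the median matters because a change can leave the typical request untouched while making the slowest requests much worse.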