Lessons Learned Running a Production Service

DON'T MIX DAY-TO-DAY OPS, PROJECTS, AND INTERNAL IS/IT WITH PRODUCTION IN
SAME PERSON

Once upon a time I was working for a company facing tough economic times. We
had to lay some people off and needed to run lean until revenue caught up with
expenses.
We didn't want to radically shift people's jobs or create a multi-class
workplace (e.g. the people doing the day-to-day work versus the "special
people" who got to work on projects), so we tried to use people's skills for
both internal IS/IT and running the production service that customers paid
for. To make use of everyone's skills, people typically had projects in
addition to their day-to-day jobs.

What resulted
was little work on IS/IT, since production was always more important. Most of
the team had trouble finishing the long-term projects because they let the
day-to-day consume them. This set up a situation where people didn't feel bad
when they missed long-term goals because they were busy keeping the existing
world running.

We needed to split the day-to-day work from the longer-term projects, and hold
the long-term projects accountable for delivering. We could have fought the
multi-class workplace by periodically rotating people between the two groups,
and really stressing to the long-term people that their work would be judged
by how successful the day-to-day folks were. The day-to-day folks would be
encouraged to speak up for themselves and to understand that they hold in
their heads lots of things the project-oriented people need to know... but
also to realize they can be so deep in the details that they might miss how to
make things better.

PRUNE WORK UPFRONT

Too often operations groups commit to a workload that isn't
possible. As a result, ops teams often feel that they are behind, even when
they are working very hard and very well. It never lets up. This can quickly
lead to a team feeling under siege and defeated. Rather than accept
commitments that can't be kept, it's better to realistically communicate what
can and can't be done up-front. That way the people who depend on the ops team
to deliver have a realistic understanding of what may or may not get finished.
They can change plans or work with the ops team to get additional resources.
What you don't want is a project to run for several months only to have the
ops group say at the 11th hour, "We don't have time to deploy this." The other
thing that often happens is that the larger projects end up being "so
important" (a.k.a. too big to fail) that everything is sacrificed to make them
a success. Often this means that small things, some of which would only take a
minute or two and would really unblock someone, are rejected because they
weren't "in the plan". If a group is failing to meet its commitments, needs
should be assessed and re-negotiated.

YOU HAVE TO OWN YOUR OWN TOOLS

Several times I have worked at places where we planned to have a tools team
but didn't have the headcount to do tools alone, so we made a combined
tools / service-ops group. This was a mistake. Day-to-day needs made it very
difficult to get longer-term projects done. Also, the people who were becoming
our "tools" people weren't the strongest engineers and were not getting enough
input to turn them into good engineers. We handed our tools folks over to
engineering, where we were promised that they would have additional help, more
active and higher-quality mentoring / management, and would be combined with
others so more people were working on tools. This worked for a short time, but
then two things happened. First, tools were written to solve the engineers'
problems, not necessarily the operations team's problems. Second, as soon as
the engineers' problems were solved, they moved on to other issues, leaving
the netops team without good tools support.

There are two solutions to this. The first is for an ops team to have
at least one person who is given time to do real tools development without
day-to-day interrupts. The second is to distribute tools work throughout the
team (people make small tools for themselves), and when larger things are
needed, to in essence contract out the development to some sort of dedicated
resource (internal or external) with very well-developed specifications.

NEED TO ACTIVELY MANAGE ENGINEER/OPERATIONS INTERFACE AND INSIST ON HIGH
STANDARDS

Often operations groups are willing to take things from engineering that
aren't ready and throw people at the flaws. This is often referred to as
products being "thrown over the wall". Often the promised fixes never arrive
because other issues take priority. Operations groups should establish a clear
set of deployment requirements. Accepting incomplete work sends the wrong
message... people notice that the deployment was successful and fail to see
the operations expense and the opportunity cost... so there will be no
pressure on engineering to finish the job. If something isn't shipping, that
puts a lot more pressure on the engineers to finish the job.

For this to work, the operations group really needs to work with engineering
on good requirements. The operations group needs to actively engage with
larger projects so they aren't surprised as the project changes. As deadlines
loom, "less critical features" will get cut. Customer-facing features are
typically championed by product managers; the ops team needs to make sure its
"features" don't get cut. Interfaces between organizations need people to
actively manage them.

100% SOLUTIONS WILL PAY FOR THEMSELVES

Oftentimes we have been faced with more projects than we could possibly
complete. The first thing we would do was ask whether, if we scaled the work
back to 80% solutions (roughly "good enough"), we could get enough of the
important projects finished. There were two problems with this. First, the
cost of managing the remaining 20% by hand was almost always higher than we
expected. Second, finishing the project (which we were unlikely to do) hung
over people's heads.

Our team is addicted to 80% solutions. But if we don't fully finish things,
then we need smart people to use the tools, follow the process, etc. There are
only so many clueful people. Staffing will be hard, and it will be very hard
for new people to get up to speed.

Furthermore, issues related to the last 20% will often pop up at the worst
possible time, which makes scheduling even more difficult.

WRITTEN ARCHITECTURE CRITICAL

It's critically important to document the overall architecture and core design
principles. Failure to do this has a number of consequences. First, new people
(and even veterans) have trouble coming up to speed. Second, without the
architecture clearly in most people's heads, people are likely to pull in
different directions. Finally, without clearly documented and agreed-upon
architecture / design principles, any time we wanted to do something new we
had to get lots of people in a room to hash through things, because you never
knew what bits of information were in people's heads (each person might know a
constraint that mattered) and each person might be pulling in a different
direction. Writing the documents would have driven us to a clearly articulated
conclusion, allowing one person to operate consistently with the group
direction without having to involve the whole team. It would also make it much
easier for team members to come up to speed quickly, and facilitate cross-team
communication.

GET THE WRONG PEOPLE OFF THE BUS

In general, I have worked at places where we were very successful at getting
the right people on the bus (i.e. good hiring). But we typically didn't take
effective action in the few cases when we brought in people who weren't
getting the job done, or when we moved people into positions where they
weren't succeeding. It's easy to think that the wrong person is costing you
half an FTE... i.e. that the right person would be twice as effective. This is
not the case. My experience is that the wrong person can cost you 2-4 FTE
worth of labor. Why? People get frustrated and unmotivated. The wrong things
get done, forcing everyone to stop what they're doing while the problems are
fixed. People end up having to engineer around the problem person. As soon as
you see a problem, fix the problem.

FAILING TO STAFF WILL RESULT IN NEEDING TO OVERSTAFF

Oftentimes, especially when finances are tight, there is a tendency to defer
hiring. This can work with a static workload. In a situation where a company
is growing its customer base / workload, this will, long term, be a disaster.
The longer staffing is put off, the more likely there will be a crisis. Once
the crisis hits there will be a tendency to hire in desperation, which
increases the odds that the wrong people will get on the bus. Also, there are
many problems that a few people with more time can solve better than lots of
people thrown in near the end. See The Mythical Man-Month for more insight.

NEED FOR COORDINATING ROLES

In large ops groups, most teams function as service bureaus or technology
centers, without any group holding the role of coordination (other than
project management, which is more narrowly focused). This isn't good. When an
ops group is running a complex infrastructure, a set of people needs to "own"
the overall service. These teams can have a variety of names: "Service
Engineering", "Service Integration", "Systems Group", etc. What's important is
that they have the explicit responsibility to coordinate changes to the
production service. In our case, this group's job was to take the various
technologies being supplied to operations, make sure they really were ready to
be deployed, and then work with the NOC to do the deployments. They provided
an interface with the platform team (advocating for features needed to
practically run the platform, and helping answer engineers' questions). They
would sometimes write tools that pulled things together or made it easier for
all of netops (before we had a NOC), or later the NOC, to run the service.