At the beginning of the year I contributed to an article Appdynamics put together called ‘The State of DevOps at the Start of 2017‘. I joined a motley collection of contributors with a broad range of backgrounds who all answered an interesting array of questions. If you’re unlucky enough to work in one of those organisations that’s still debating the relative merits of DevOps then I think this little article gives a nice concise summary of the current state of the industry.

Standardisation

Since I started shamelessly self-promoting my DevOps prowess I’ve been asked to contribute to a few of these things. There’s usually one question that really piques my interest and gets me thinking. In this piece it was the question:

“Do you think DevOps will become more standardized?”

Software Fragmentation

As this piece was sponsored by AppDynamics I initially started thinking about software and services. In the early days of the DevOps movement there was a real danger that it would get written off as merely Continuous Integration. That focus spawned a plethora of software and plug-ins. Then when the #Monitoringsucks movement gained momentum it heralded a wealth of new approaches to monitoring tools and solutions, Outlyer (formerly known as Dataloop.IO) started in earnest, Datadog and the Application Performance Management tools started to really gain in popularity. Splunk and ELK become the standards for log-processing. All throughout this period various Cloud solutions launched. Redhat integrated various open-source solutions into an entire private cloud solution together which it put forward as a competitor to VMWare.

I was just putting an answer together about how at the start of major new trends in IT there’s always a profusion of software solutions around as organisations try to capitalise on all the press attention and hype when it occurred to me that the approach organisations have taken to adopt DevOps practices have become highly standardised.

Regardless of the exact mix of software and cloud solutions we always try to automate the building, testing and deployment (and more testing) of software onto automatically provisioned and configured infrastructure (with yet more testing). What has varied over the years is the precise mix of open and closed source software and the amount of in-house software and configuration required.

2nd Generation

The second generation of DevOps organisations including Netflix and Etsy were the first generation to be able to piece together pre-existing solutions built around CI tools like Jenkins and Teamcity to build their DevOps approaches. That’s not to say they didn’t have to build a lot of software, they did but Netflix were able to build on top of Amazon’s Cloud and Etsy relied on Jenkins and Elasticsearch among other tools. In technology circles Netflix are as famous for their codebase as they are for their video-on-demand service. While Etsy are perhaps a little less famous for their code their contributions are arguably much more influential. Where would we be without Statsd? Etsy’s Deployinator was responsible for driving a lot of the conversation about continuous integration in the early days of DevOps.

3rd Generation

I think we’re getting towards the end of the 3rd generation of DevOps. The current approach to abstracts us even further away from the hardware. It’s built on the concepts of containerisation, microservices and serverless functions. We’re trying to build immutable services, very rapidly deployed and automatically orchestrated. The goal is to support the most rapid test and deployment cycle possible, to minimise the need for separate functional and integration test cycles, to make the most efficient use of resources and allow us to scale very rapidly to meet burst demand. The software we’re forced to build at this point tends to be integration software to make one solution work with another and expose the results for analysis. Rather than building our solutions around CI tools we’re building them around orchestration mechanisms. We still need the CI tools but their capabilities aren’t the limiting factor anymore.

4th Generation

I think the next generation of DevOps will bring data management and presentation under control. Too often we choose the wrong data stores for the types of data we’re storing and the access patterns. There’s still a tendency to confuse production data stores with business intelligence use-cases. We’re still putting too much business logic in databases forcing us to run large data structure updates and migrations. We also need to lose the idea that storage is expensive and that everything must be immediately consistent, we also need to manage data like we do software in smaller independant files.

Standards

From the perspective of the daily lives of engineers and their workflows DevOps appears highly standardised. We haven’t standardised around a small number of software solutions that define how we work as previous generations of engineers did we have standardised our workflow. We’ve been able to do that thanks to the open source movement which has given us so many flexible solutions and made it easier for us to contribute to those solutions.

Synopsis

As an organisation becomes product-focused and teams begin to focus their attention on their suite of microservices the organisation must find a way to ensure there is still appropriate focus on the quality of the service being provided to customers.

The power of a product focused approach

I became convinced of the power of a product-focused approach to DevOps when I was working at EA Playfish. Our games teams were product-focused cross-functional and, to an extent, multi-discipline. They were incredibly successful. They had their problems but they were updating their games weekly with a good cadence between content updates and game changes. They reacted to changing situations very quickly and they even pulled together on programme-wide initiatives well. Playfish’s games teams were my inspiration for wanting to scrap the old development, test and ops silos and create teams aligned behind building and supporting products and services.

The insidious risk of having a product-focus

In Next Gen DevOps I examine the implications of taking a product-focused approach to a DevOps transformation. The approach has since been validated by some high profile success stories that I examine in the second edition of the book. Some more recent experiences have led me to conclude that, while the approach is still the right one I, and several others, have missed a trick.

In breaking down a service into microservices or even starting out by considering the bounded contexts of a problem it’s easy to lose sight of the service from the customer perspective.

Background

One of the functions the Ops teams of the past assumed was maintaining a holistic view of the quality and performance of the service as a whole. It’s this trait more than any other that led to friction with development teams. Development teams were often tasked with making changes to specific aspects of a service. Ops would express their concerns about the impact of the proposed changes on performance or availability. The Developers were then caught between an Ops team expressing performance and reliability risks and a Product Manager pushing for the feature changes they need to improve the service. None of the players had the whole picture, very few people in those teams had much experience of expressing non-functional requirements in a functional context and so the result was conflict. While this conflict was counter-productive it ensured there was always someone concerned with the availability and performance of the service-as-a-whole.

Product-focused DevOps is now the way everyone builds their Technology organisations. There are no Operations teams. Engineers are expected to run the services they build. The most successful organisations hire multi-discipline teams so while everyone codes some people are coding tests, some infrastructure and data structures and some are coding business logic. The microservices movement drove this point home as it’s so much easier to deliver smaller, individual services when teams are aligned to those services.

Considering infrastructure configuration, monitoring, build, test execution, deployment and data lifecycle management as products is a logical extension of the microservices or domain-driven development pattern.

However here there be dragons!

When there’s clear ownership of individual microservices who owns the service-as-a-whole? If the answer to that question is everyone then it’s really no-one.

A few years ago the DevOps community got into a discussion about the KPIs that could be used to measure the success of a DevOps transition. As with most such discussions we ended up with some agreement around some common sense measures and a lot of debate about some more esoteric ones.

The ones most people agreed with were:

Mean Time To Recovery.

Time taken to deploy new features measured from merge.

Deployment success (or failure) rate.

Due to the nature of internet debate most of the discussion focussed on what the KPIs should be and very little discussion was had about how the KPIs should be set and managed.

This takes us to the trick I missed when I wrote Next Gen DevOps and the trick many others have missed when they’ve tackled their DevOps transitions.

In our organisations we have a group of people who are already concerned with the whole business and are very focussed on the needs of the customers. These people are used to managing the business with metrics and are comfortable with setting and managing targets. They are the c-level executives. Our COOs are used to managing sales targets and conversation rates, our CFOs are used to managing EBITDA and Net Profit targets. CMOs are used to managing Net Promoter Scores and CTOs are used to managing infrastructure cost targets.

I think we’re making a mistake by not exposing the KPIs and real-time data we have access to so that our executives can actively help us manage risk, productivity and quality of service.

The real power of KPIs

In modern technology organisations we have access to a wealth of data in real-time. We have tools to instantly calculate means and standard deviations from these data points and correlate them to other metrics. We can trend them over time and we can set thresholds and alerts for them.

Every organisation I’ve been in has struggled to manage prioritisation between new feature development and non-functional requirements. The only metric available to most of the executives in those organisations has been availability metrics and page load times if they’re lucky.

Yet it’s fairly logical that if we push new feature development and reduce the time spent on improving the performance of the service service performance will degrade. It’s our job as engineers to identify, record and trend the metrics that expose that degradation. We then need to educate our executives on the meaning of these metrics and give them the levers to manage those metrics accordingly.

Some example KPIs

Let’s get right into the detail to show how executives can help us with even the most complex problems. Technical debt should manifest as a reduction in velocity. If we trend velocity then we can highlight the impact of technical debt as it manifests. If we need to reduce new feature production to resolve technical debt we should be able to demonstrate the impact of that technical debt on velocity and we should be able to see the increase in velocity having resolved the technical debt.

For those organisations still struggling with inflexible infrastructure and software consider the power of Mean time between failure (MTBF). If you’re suffering reliability problems due to under-scale hardware or older software and are struggling to get budget for upgrades MTBF is a powerful metric that can make the point for you.

A common stumbling block for many organisations in the midst of their DevOps transformations is the deployment pipeline. Two words that sum up a wealth of software and configuration complexity. Often building and configuring the deployment pipeline falls to a couple of people who have had some previous experience but no two organisations are quite the same and so there are always some new stumbling blocks. If you trend the Time taken to deploy new features measured from merge. You can easily make the point for getting some help from other people around the organisation to help build a better deployment pipeline.

The trick with all of this is to measure these metrics before you need them so you can demonstrate how they change with investment, prioritisation and other changes.

Implementation

Get your technical leadership team to meet with the executive team, discuss the KPIs that matter to you and some that don’t matter yet that might. Educate everyone in the room about the way the KPIs are measured so the metrics have context and people can have confidence in them. Create a process for managing the KPIs and then start measuring them, in real-time and display them on dashboards. Set up sessions to discuss the inevitable blips and build a partnership to manage the business using the metrics that really matter.

Last year Linuxrecruit published an ebook called DevOps Journeys. Over it’s 30 pages various thought leaders and practitioners shared their thoughts and experiences of implementing DevOps in their organisations. It was a great read for anyone outside the movement to understand what it was all about. For people inside the movement it presented an opportunity to learn from the experiences of some of the UK’s foremost DevOps luminaries.

Linuxrecruit have recently published the follow-up: DevOps Journeys 2.0. This one’s even better because I’m in it!

A year on we’re in a different place, most organisations now have DevOps initiatives, we’re in the midst of a critical hiring crisis, new technologies are on the hype train and large companies are jumping on last years band wagons. More companies are now starting to encounter the next generation of problems that arise from taking a product focussed approach to DevOps.I’ve worked with several of the contributors to DevOps Journeys 2.0 and they are some seriously capable people. If you’re interested in the challenges facing organisations as they embark on or progress along their DevOps Journey’s DevOps Journeys 2.0 is a great read.

I’ve been working at the Department for Work and Pensions (DWP) for the best part of a year now. If you’re not aware the DWP is the largest UK Government department with around 85,000 full time staff augmented by a lot of contractors like myself.

DWP is responsible for:

Encouraging people into work and making work pay

Tackling the causes of poverty and making social justice a reality

Enabling disabled people to fulfil their potential

Providing a firm foundation, promoting saving for retirement and ensuring that saving for retirement pays

Recognising the importance of family in providing the foundation of every child’s life

Controlling costs Improving services to the public by delivering value for money and reducing fraud and error

The DWP paid more than 22 million customers around £164 billion in benefits and pensions in 2013-14.

After decades of outsourcing it’s Technology development and support the government decided that it should provide it’s own Technology capability. Transforming such a large Technology organisation from, what was primarily, an assurance role to a delivery role Is no mean feat.

Having been a part of this journey for almost a year I thought it might be useful if I shared some of things that have worked well and some of the challenges we haven’t yet overcome.

Today I want to talk about the Technical Design Authority (TDA). I’ve never worked anywhere with a TDA before and I didn’t know what to expect. Established by Greg Stewart CTA / Digital CTO at DWP not long after he joined the DWP the TDA has a dual role.

The TDA hold an advisory meeting where people can introduce new projects or initiatives and discuss them with peers and the Domain Architects. In an organisation as large as the DWP this really helps find people with similar interests and requirements. It reduces the chance of accidental duplication of work and introduces people who are operating in similar spaces. Just finding out who people are who are working in a similar space has been tremedously valuable.

The TDA also hold a governance session where they review project designs. The template they provide for this session is really useful. It forces the architect or developers to consider data types stored, data flows including security boundaries and high-availability and scaling mechanims. That;s not to say every project needs those things but the review ensures that a project that does need them has them.

I can’t list the number of projects I’ve been involved with over the years that would have benefitted from a little forethought about non-functional requirements.

A TDA is a must have for an enterprise DevOps transformation. It makes sure Technology people working on similar projects in different parts of the organisation are aware of and can benefit from each other’s work. It ensures that projects pay adequate attention to the non-functionals as well as the functional requirements and it ensures that where standards are required they are promoted and where experiments are needed they are managed appropriately.

I’m often asked about my super-hero origin story. Everyone has one they’re not all as interesting as Superman’s baby crash landing on a farm prologue. My journey to DevOps started way back in my CompuServe days with my boss’ insistence that everything be scripted and automated. But another start point is the PDF I’ve included below.

I’ve written before about how AOL embraced Agile and had Thoughtworks help our development teams. I’ve also mentioned that the Ops team (my team) and the test team were left to fend for ourselves and figure out what Agile meant to us. One of the way I did that is with the excellent post below. I have failed to find a link to it on the internet hence presenting it in this format here. If anyone knows Ken Macleod from below please thank him from me. My thanks have already been given to the guy in my team who found this document. 14 years on it still makes excellent reading. Enjoy:

If you’d like the PDF complete with my scribbles from 2003 you can download it here: Extreme_Sys_admin

NEXT GEN DEVOPS: Creating the DevOps Organisation is getting a second edition!

I’ve been working on it for a while but it’s been my sole focus since I published the NEXT GEN DEVOPS TRANSFORMATION FRAMEWORK. The first edition came out around a year ago and a lot has changed since then.

The conversation now seems to be how organisations should approach DevOps rather than whether they should consider it. Friends and I are now talking about dropping the term DevOps because we feel it’s just good software engineering practice. Patterns that I dimly glimpsed two years ago are now clearly defined and have several supporting case studies.

The core theme of the book hasn’t changed. In fact none of the existing content has changed at all. I’ve corrected a few formatting mistakes here and there and I’ve been able to add some great photo’s that I think really bring the history chapter to life. Everyone who’s spoken to me about the book has commented that it’s their favourite chapter and now it’s even better!

All new content!

Apart from a redesigned cover I’ve added several new chapters the first of which is entitled The only successful DevOps model is product-centric which looks at the four organisations that are most frequently held up as DevOps exemplars Etsy, Netflix, Facebook and Amazon to see what they have in common and what lessons other organisations can learn from their successes and failures. I wrote this chapter to address a comment I’ve heard from several readers that they wanted more explicit instructions about how to transform their teams and organisations.

That’s also the reason I added the next new chapter: The Next Gen DevOps Transformation Framework this chapter provides explicit instructions describing how my DevOps Transformation Framework can be used to transition a business towards DevOps working practices. It impossible to re-format a framework designed to be used interactively on an HD screen to a 6×9″ paperback but I’ve been able to provide some supporting contextual information as well as providing some example implementations. This combined with the Playfishcase-studies I’ve published here on the blog should provide people with everything they need to begin their journey to DevOps.

The final bit of new content is an appendix to the history chapter. I learned a lot while I was researching the history chapter, far more than I could include without completing losing the thread of the chapter. What interested me most of all was the enormous role played by women in the development of the IT profession. I’ve worked with some great men and women in my 20 years in IT but I’ve only met two female Operations Engineers. Where are the rest? At Playfish I worked with a lot of female developers but whenever I was hiring I never met any women interested in careers in Operations. I couldn’t shake the thought that something was wrong with this situation. Over the past year I’ve read a lot about the decliningnumbersof women in IT so I decided to share what I learned while writing the history chapter and do a little research of my own.

Reduced price!

I need to eat some humble pie now. I think I made a mistake when I initially priced the book. When I was writing the book my focus was not on book sales. I know a couple of people who have authored and co-authored books and read numerous articles about how writing will not make you rich so I was under no illusions about my future wealth. I chose the price because I felt that it would lend credibility.

I didn’t factor in that ebooks and self-publishing has changed the market. When I published the book Amazon displayed a little graph demonstrating that $9.99 was a sweet spot for pricing and that I’d make more money publishing at that price. I’m not doing this for the money so what do I care?

That’s where I made a mistake. I don’t care about the money but I do want my message to get out. I think my book is unique because very few authors have spent 17 years operating online services and very few authors had the unique opportunity to work on one of the first examples of continuous delivery.

So the 2nd edition will be priced at $9.99. I understand that people who have paid the higher price for the first edition will quite rightly feel a little put out by this so I intend to publish the second edition as an update to the first this means that those people who bought the book on Kindle can just update their copy to get the second edition.

I can’t update the paperback version and I can’t give them away for free but I do have a plan. I’m going to be publishing a PDF edition of the second edition and selling it through my own online store. I can’t get details of who bought my book so I’m going to do my best to operate an honour system. If you bought a paperback version of Next Gen DevOps and want a PDF copy of the second edition email grant@nextgendevops.com and I’ll send you the PDF version.

While I don’t know who has bought paperbacks I do know how many paperbacks I’ve sold so once I’ve given away that many PDF copies the giveaway will be over so email me asap to ensure you get your free copy.

The second edition will be published in the next couple of weeks and will be accompanied by a formal press-release.

Recap

Last week I used Playfish, as it was in 2010, as a case study to show how the Next Gen DevOps Transformation Framework can be used to assess the capability of an organisation. We looked at Playfish’s Build & Integration capability, which the framework classified at level 0, ad hoc and 3rd Party Component Management which the framework classified at level 4.

This week

We’ll take a look at what the Next Gen DevOps Transformation Framework recommends to help Playfish improve its Build & Integration capabilities and we’re going to contrast that with what we actually did. We’ll also look at why we made the decisions we made back then and talk a little about where that took us. Finally I’ll end with some recommendations that I hope will help organisations avoid some of the (spring-loaded trapped and spiked) pitfalls we fell into.

Build & Integration Level 0 Project Scope

Each level within each capability of the Next Gen DevOps Transformation Framework has a project scope designed to improve an organisations capability, encourage collaboration between engineers and enable additional benefits for little additional work.

The project scope for Build & Integration Level 0 is:

Create a process for integrating new 3rd party components.
Give one group clear ownership and budget for the performance, capability and reliability of the build system.

These two projects don’t initially seem to be related but It’s very hard to do one without the other.

Modern product development demands that developers use many 3rd party components, sometimes these are frameworks like Spring, more often than not they’re libraries like JUnit or modules like the Request module for Node.js.

Having a process for integrating new 3rd party components ensures that all engineers know the component is available. It provides an opportunity to debate the relative merits of alternatives and reduces unnecessary work. It also, crucially, provides the opportunity to ensure that 3rd party components are integrated in a robust way. It’s occasionally necessary to use immature components, if there’s a chance that these may be unavailable when needed then they need to be staged in a reliable repository or an alternative needs to be maintained. Creating a process for integrating 3rd party components ensures these issues are brought out into the open an can be addressed.

Having talked about a process for integrating 3rd party components an organisation is then in a great place to decide who should be responsible for the capability and reliability of the build system. Giving ownership of the build system to a group with the skills needed to manage and support it enables strategic improvement of the build capability. Only so much can be engineers, no matter how talented and committed they are, without funding, time and strategic planning.

How Playfish improved it’s build capability

I don’t know how Playfish built software before I joined but in August 2010 all the developers were using an instance of Bamboo hosted on JIRA Studio. JIRA Studio is a software-as-a-service implementation of a variety of Atlassian products. I have’t used it for a while but back in 2010 it was a single server hosting whatever Atlassian components you configured. Some Playfish developers had set up Bamboo as an experiment and by August 2010 it had become the unofficial standard for building software. I say unofficial because Operations didn’t know that it had become the standard until the thing broke.

Playfish Operations managed the deployment of software back then and that meant copying some war files from an artefact repository updating some config and occasionally running SQL scripts on databases. The processes that built these artefacts was owned by each development team. The development teams had a good community and had all chosen to adopt Bamboo.

Pause for context…

Let’s take a moment to look at the context surrounding this situation because we’re at that point where this situation could have degenerated into one of the classic us vs. them situations that are so common between operations and development.

When I was hired at Playfish I was told by the CEO, the Studio Head and the Engineering Director that I had two missions:

Mature Playfish’s approach to operations

Remove the blocks that slowed down development.

Playfish Operations were the only group on-call. They were pushing very hard to prevent development teams from requesting deployments on Friday afternoons and then running out of the building to get drunk.

I realised straight away that the development teams needed to manage their own deployments and take responsibility for their own work. That meant demystifying system configuration so that everyone understood the infrastructure and could take responsibility for their code on the live systems. I also knew that educating a diverse group of 50-odd developers was not a “quick-win”.

This may explain why I didn’t want operations to take ownership of the build system even though that’s exactly what some of the my engineers wanted me to do. Operations weren’t building software at the level of complexity that required a build system back then. Operations code, and some of it was quite complex, wasn’t even in source-control, so if we’d taken control of the build system we’d have been just another bunch of bureaucrats managing a system we had no vested interest in.

…Resume

When the build system reached capacity and everyone looked to operations to resolve it I played innocent. Build system? What build system? Do any of you guys know anything about a build system? It worked, to a certain extent, and some of the more senior developers took ownership of it and started stripping out old projects and performance improved.

While this was happening I was addressing the education issues. Everyone in the development organisation was telling me that the reason deployments were so difficult was because there were significant differences between the development, test and live environments. Meanwhile the operations engineers were swearing blind that there were no significant differences. Obviously there will always be differences, sometimes these are just difference access control criteria but there are differences. As the Playfish Operations team were the only group responsible for performance and resilience they were very protective about who could make changes to the infrastructure and configuration. This in turn led them to being unwilling to share access to the configuration files and that led to suspicion and doubt among the development teams. This is inevitable when you create silos and prevent developers from taking responsibility for their environments.

To resolve this I took it on myself to document the configuration of the environments and highlight everywhere there were differences. This was a great induction exercise for me (I was still in my first couple of weeks). I discovered that there were no significant differences and all the differences were little things like host names and access criteria. Deployment problems were now treated completely differently. Just by lifting the veil on configuration changed the entire problem. We then found the real root cause. The problem was configuration it just wasn’t system configuration it was software configuration. There was very little control on the Java properties and there were frequent duplications with different key names and default values. This made application behaviour difficult to predict when the software was deployed in different environments. There was then a years long initiative to ban default values and to try and identify and remove duplicate properties. This took a long time because there were so many games and services and each was managed independently.

Conclusion

I won’t head into the subject of integration for now as there’s a whole other rabbit hole and we have enough now to contrast the approach we took at Playfish with the recommendation made in the framework.

The build system at Playfish had no clear ownership. Passionate and committed senior developers did a good job of maintaining the build system’s performance but there was no single group who could lead a strategic discussion. That meant there was no-one who could look into extending automated build into automated testing and no one to consider extending build into integration.

This in-turn meant that build and integration were always seperate activities at Playfish. This had a knock-on effect on how we approached configuration management and significantly extended the complexity and timescales of automating game an service deployment.

The Next Gen DevOps Transformation Framework supports a product-centric approach to DevOps. In the case of the Build & Integration it recommends an organisation treats it’s build process and systems as an internal product. This means it needs strategic ownership, vision, roadmaps, budget and appropriate engineering. At Playfish we never treated build that way, we assumed that build was just a part of software development, which it is, but never sufficiently invested in it to reap all the available rewards.