Production Engineering at OVO

Introduction

Jon Dodkins

It’s been 8 months since our Tech Director, Ed Conolly, wrote a blog post about our technology culture at OVO, describing the way we enable both product and technical autonomy through our take on a host of Agile and Lean patterns. This month I open recruitment for a new team and function at OVO, Production Engineering, and I wanted to share how it is intended to support the continued success of that culture.

Non-functional concerns can be addressed through many different approaches. DevOps as a cultural movement shows that complete separation of responsibilities, i.e. ‘throwing code over the wall’ for an ops team to run, is now unnecessary, inefficient and ineffective. The techniques described in Google’s SRE book show how far our industry has come from a time when the technologies and levels of abstraction (and thus the skills required) differed massively between the worlds of developing and operating software. This is no longer the case.

Production Engineering represents our take on SRE (Site Reliability Engineering), an approach that originated at Google but is now adopted by companies everywhere. It’s defined as “what you get when software engineers design operations functions” and is purported to be the answer to problems of reliability, scalability and performance - along with challenges faced in deployment & operations - through a team of highly skilled software engineers with a determined, problem-solving mindset. Whilst at OVO we’re on board with this in principle, the different flavours you see in execution show that the devil is very much in the detail.

There are obvious pitfalls in introducing any extrinsic support or service for a team: missing the mark could have ripple effects that compromise clear ownership and cultural values, such as those that underpin the very autonomy, freedom and responsibility you’re trying to bolster. Broadly speaking, mistakes occur where an organisation essentially ‘cargo cults’ its design, i.e. copies it by the letter without considering what actually fits its company & values.

For OVO, we’ve identified how a clearly targeted function along the lines of SRE could in fact be a key enabler for those values and practices that exist between teams; a lubricant that ensures the autonomy we grant works, both now and as we scale.

Our approach

Production Engineering at OVO is balanced between clearly-scoped hands-on assignments with teams, and building internal tools & products to benefit any team in their SDLC, should they wish to use them.

Team assignments

Production Engineers are not responsible for running any team’s services in production, however much they live and breathe this domain. Assignments do involve mucking in and experiencing production conditions with the rest of the team. The difference is that they are not assigned just as an extra pair of hands for a team’s ops work or to plug a skills gap. They’re there with a clear mission and a set of target outcomes, to ensure the greatest improvements are made and a lasting impact is felt.

It’s like doing all your own car maintenance - you might need some help from a mechanic mate from time to time, either for a big job or to help with an area you haven’t worked on much yourself before. You’re still responsible for running the car and the safety of its passengers. Your mechanic mate doesn’t live with you, and if you blow a tyre on the motorway you don’t call your mate - you grab the jack and wrench!

Tools & platforms: engineering for engineers

We intend to make our engineers and teams happy and productive. They’re our customers.

Production Engineering create and order their backlog based on value, just like any other Product team. Their ‘customer research’ will come from engineering community discussions and from tapping the feedback loops of incident management & post-mortem activities. Indeed, the longer-term effectiveness of such communities and ceremonies could lie in ensuring there’s a rapid cycle around the ideas and challenges raised, giving them the time and focus they warrant without putting feature development at risk.

Sourcing tools and services will play a big role, as will socialising them and making experimentation easily accessible. Our bigger vision is to create an engineering platform that is optimised for covering all the core performance concerns, specifically tailored to building services that handle data-streaming energy operations at scale.

How we differ

So to exaggerate some differences slightly for a moment, our take on all this is:

Simpler than Google’s (for now), but SRE at its heart. Their SREs are responsible for running services in production and do this longer term from within a team. They use a series of rules and indicators to avoid misaligned incentives and to ensure reliability gets built in.

Facebook match the mindset we’re going for functionally. They also see the role as a bit looser and broader than Google do, dropping ‘reliability’ from the title. They’re interested in having a positive impact at any level necessary. So are we. Facebook operate their own data centres, whereas we span multiple IaaS providers, so our levels of abstraction will differ in practice.

Netflix are building the kind of OSS we’d want to use, or build ourselves. Their take on creating an optional ‘paved road’ is spot on for us too; we want to make it easier to adhere to the core items in the ‘Service Checklist’ that Ed describes.
Going beyond this, Production Engineering could also build in the ability to leverage new patterns and dev features, making them rapidly available to all teams. This especially benefits teams whose circumstances leave them unable to spend the time researching and implementing from scratch themselves. Teams can jump on board quickly if experiments are working well elsewhere.

Interested? Get in touch!

Production Engineering at OVO is all about engineers; we’re interested in the challenges they face, the tools in their toolbox and how they use them to build and run the best products.

We want skilled engineers from a variety of backgrounds who can have real impact at all levels of the stack. The most important thing is to have the kind of mindset and intrinsic drive that gets excited about immersion in these kinds of challenges and that wants to build something great.

At OVO we’re energy industry disruptors, so naturally we’re interested in extreme DevOps: having teams own their full service lifecycle. As with others doing this, taking such a position doesn’t negate the need for a shared services function, but it does require new thinking to navigate what’s right for your organisation. We’re experimenting and learning so no doubt our model will evolve too.

Production Engineering is one of many ways in which Tech Operations at OVO are evolving their centralised function in a decentralised world. The old Service Management command-centre monolith is no more, with many of those activities & competencies now distributed across ‘loosely-coupled’, specialised Product teams. The old model of adding middle-man orchestration roles now signifies waste and is a symptom of anti-patterns in the system. Those anti-patterns need to be overcome in order to realise the full benefits of cultural value-structures like ours at OVO. Whilst there’ll always be challenges, those challenges are transformed and require the entire E2E to be viewed through a different lens.