Today we’re featuring one of our roles within software engineering: the Site Reliability Engineer, or SRE. We sat down with Andrew Widdowson, who joined Google in 2007, to give you an inside look into the role. If you’re interested in applying for a Site Reliability Engineer position, please apply for our general Software Engineer position, then indicate in your resume objective line that you’re interested in the SRE role.

Andrew in the lobby of the Google Australia office
Photo credit: Jurij Smakov

Tell us about your path to Google.
Andrew Widdowson: I grew up in Oley, PA, a small town that has more cows than people. I attended Carnegie Mellon University in Pittsburgh, majored in computer science, minored in physics, and completed a Fifth Year Scholars Program in four years. After school, I spent two years in Boston as a research-oriented software engineer at Bose developing a music recommendation service. I moved to the Bay Area to join Google in 2007.

How do you define the term “Google Site Reliability Engineer”?
AW: SREs at Google are the software engineers responsible for ensuring that all of Google’s services are super reliable and super fast, all of the time.

Wait, did you say “software engineer”?
AW: Yes! Many people think that SRE is just a candy-coated term for “operations.” It’s not. We’re doing planet-scale engineering here, and that requires solid engineering principles and people. SREs typically start out as rock star software engineers interested in becoming rock star systems engineers, or vice versa. And unlike most operations groups, SREs are a volunteer army: they are free to transfer to other compatible software engineering teams at any time if they don’t like the work or the environment.

OK, what are SREs actually like?
AW: To use an analogy, we’re not the actors on stage; we’re the folks behind the scenes wearing the headsets and making sure everything is running smoothly. Alternatively, our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100 mph. As individuals, we’re a mix of software engineering and large-scale systems engineering experts. It takes a group with a diverse collection of expertise to maintain the stability, reliability and performance of Google services while enabling the company’s developers to be agile and to make feature changes every day.

What does your team do?
AW: My team is one of the oldest SRE teams at Google. We work on everything involved in web search, from encrypted search to image search. Like most software teams at Google, we also handle emergency alerts when anything goes down for web search. While the systems we work with are complex, we have the benefit of working with Google infrastructure that’s highly instrumented and fault tolerant, allowing us a lot of leeway when we need to make behind-the-scenes changes. If we do our job right, no one should notice any interruptions of service.

What advice can you give aspiring SREs?
AW: Learn CS fundamentals and get as much experience as you can. I started early on; in high school I ran my own business doing web design, server administration and writing web apps. When I got to college, I looked for ways outside of class to apply my knowledge. I served as the IT Director at my college radio station, where I kicked servers around and applied the theory I learned in class in the real world. Beyond data structures and algorithms, get a very good applied understanding of the Linux operating system. Work on a server farm running Hadoop, intern with your campus network engineering group, or launch a user-facing service on your favorite public cloud computing environment.

To apply, please visit google.com/students and apply for the Software Engineering opening. After you apply, the recruiter will ask which of our Software Engineering roles you are interested in; just indicate your interest in Site Reliability.

