
Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content, please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo).

In this interview, Ben Treynor shares his thoughts with Niall Murphy about what Site Reliability Engineering (SRE) is, how and why it works so well, and the factors that differentiate SRE from operations teams in industry. READ MORE

Digital Rebar is the open, fast and simple data center provisioning and control scaffolding designed with a cloud native architecture.

Our extensible stand-alone DHCP/PXE/iPXE service has minimal overhead, so it can be installed and provisioning machines in under 5 minutes on a laptop, RPi or switch. From there, users can add custom or pre-packaged workflows for full life-cycle automation using our API and CLI or a community UX.
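As a rough illustration of driving a provisioning service's API, here is a hedged Python sketch. The endpoint path, port, and bearer-token auth are assumptions for illustration, not confirmed details of the Digital Rebar API; check the project docs for the real endpoints.

```python
import urllib.request


def machines_request(base_url, token):
    """Build (but do not send) an authenticated request to list machines.

    The /api/v3/machines path and Bearer auth are illustrative assumptions;
    consult the Digital Rebar documentation for the actual API surface.
    """
    return urllib.request.Request(
        f"{base_url}/api/v3/machines",
        headers={"Authorization": f"Bearer {token}"},
    )


# Against a live endpoint, sending it would look like:
#   with urllib.request.urlopen(machines_request(url, token)) as resp:
#       machines = resp.read()
```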

A cloud native bare metal approach provides API-driven infrastructure-as-code automation without locking you into a specific hardware platform, operating system or configuration model.

For physical infrastructure provisioning, Digital Rebar replaces Cobbler, Foreman, MaaS or similar with the added bonus of being able to include simple control workflows for RAID, IPMI and BIOS configuration. We also provide event driven actions via websockets API and a simple plug-in model. By design, Digital Rebar is not opinionated about scripting tools so you can mix and match Chef, Puppet, Ansible, SaltStack and even Bash.

Next version: release of v3.1 is anticipated on 9/4/2017.

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.


This week, we launched our new RackN website to provide more information on our solutions and services as well as provide customer examples. Click over to our new site and let us know your thoughts.

To ensure websites and applications deliver consistently excellent speed and availability, some organizations are adopting Google’s Site Reliability Engineering (SRE) model. In this model, a Site Reliability Engineer (SRE) – usually someone with both development and IT Ops experience – institutes clear-cut metrics to determine when a website or application is production-ready from a user performance perspective. This helps reduce friction that often exists between the “dev” and “ops” sides of organizations. More specifically, metrics can eliminate the conflict between developers’ desire to “Ship it!” and operations’ desire to not be paged while on-call. If performance thresholds aren’t met, releases cannot move forward. READ MORE

Rob Hirschfeld, Co-Founder and CEO of RackN, shares his thoughts on how operators are equivalent to developers and work together to accomplish the critical task of keeping the infrastructure running and available amid constant change in the data center.


TL;DR: SRE makes Ops more Dev like in critical ways like status equity and tooling approaches.

In Datanauts 089, Chris Wahl and Ethan Banks help me break down the concepts from my “DevOps vs SRE vs Cloud Native” presentation from DevOpsDays Austin last spring. They do a great job exploring the tough topics and concepts from the presentation. It’s almost like an extended Q&A, so you may want to review the slides or recording before diving into the podcast.

02:00 History of the SRE term at Google vs Sys Ops – if the site was not up, money was not flowing. SRE culture fixed pay equity and the career ladder: ops got automation/dev time, and devs were on the hook for errors.

03:00 Google took a systems approach with lots of time for automation and coding

05:00 We’re seeing SRE teams showing up in companies of every size. Replacing DevOps teams (which is a good thing). Rob is hoping that SRE is replacing DevOps as a job title.

06:10 Don’t fall for a title change from Sys Op to SRE without actually getting the pay and authority.

06:45 Ethan believes that SRE is transforming to have a broad set of responsibilities. Is it just a new System Admin definition?

07:30 Rob thinks that the SRE expectation is for a much higher level of automation. There’s a big thinking shift.

08:00 SREs are still operators. You have to walk the walk to know how to run the system. Not developers who are writing the platform.

08:30 Chris asks about the Ops technical debt

09:00 We need to make Ops tooling “better enough” – we’re not solving this problem fast enough. We have to do a better job – Rob talks about the Wannacry event.

10:30 Chris asks how to fix this since complexity is increasing. Rob plugs Digital Rebar as a way to solve this.

11:00 People are excited about Digital Rebar but don’t have the time to fix the problem. They are running crisis to crisis so we never get to automation that actually improves things.

12:00 At best, Ops is invisible. SRE is different because it includes CI/CD with ongoing interactions. There’s a lot coming with immutable operating systems and constant churn.

13:00 The idea that a Linux system has been up for 10 years is an anti-pattern. Rob would rather have people say that none of their servers has been up for more than a week (because they are constantly refreshed)

13:19 Chris & Ethan – SECTION 1 REVIEW

SRE is not new, it’s about moving into a proactive stance (automatically reacting)

19:00 Ethan adds the insight: if you don’t have small steps, then you don’t really understand your process.

20:00 Platform as a Service is not really reducing complexity, we’re just hiding/abstracting it. That moves the complexity. We may hide it from developers but may be passing it to the operators.

21:00 Chris asks if this can be mapped to legacy? Rob agrees that it’s a legacy architectural choice that was made to reduce incremental risk. Today, we’re trying to make our risk into smaller steps which makes it so that we will have smaller but more frequent breaks.

22:40 The way we deliver systems is changing to require accepting changes at a much faster pace.

23:00 SREs are data driven so they can feed information back to devs. They can’t (shouldn’t) walk away from running systems. This is an investment requirement so we can create data.

24:00 We let a lot of problems lurk below the surface that eventually surface as a critical issue. Cannot let toothaches turn into abscesses. SREs should watch systems over time.

25:20 If you are running under performance in the cloud, then you are wasting money.

26:00 Cloud Native, an architecture? What is it? It means a ton of things. For this preso, Rob made it about 12 factor and API driven infrastructure.

26:50 “If you are not worried about rising debt then we are in trouble.” We need to root cause! If not, problems snowball and operators just run from fire to fire. We need to stop having operators be heroes / grenade divers because it’s an anti-pattern. Predictable systems do not create a lot of interrupts or crises. Operators should not be event driven.

28:40 Chris & Ethan – SECTION 2 REVIEW

Chris: Being data driven combats complexity

Ethan: Breaking down processes into smaller units reduces risk.

30:00 Cloud First is not Cloud Only. CNCF projects are not VM specific; they are about abstractions that help developers be more productive. Ideally, the abstractions remove infrastructure because developers don’t want to do any infrastructure. We should not care about which type of infrastructure we are using.

31:30 The similarities between the concepts are in their common outcomes/values. Cloud First wants to be infrastructure agnostic.

32:30 Chris asks how important CI/CD should be. Is it still important in non-Cloud environments? Rob thinks that Cloud Native may “cloud wash” architectures that are really just as important in traditional infrastructure.

34:00 Cloud Native was a defensive architecture because early cloud was not very good. CI/CD pipelines would be considered best practices in regular manufacturing.

35:00 These ideas are really good manufacturing process applied back to IT. Thankfully, there’s really nothing unexpected from repeatable production.

36:30 Lesson: Pay Equity. Traditionally, operators are not paid as well as developers, and that means we’re giving them less respect. HiPPO (highest paid person’s opinion) is a very real effect that can create a respect gap.

38:00 Lesson: Disrupt Less. We love the idea of disruption, but disruptions are very expensive, and disproportionately so for operators. A change that is small for developers may have big impacts on operators. More disruptive changes actually slow down adoption because they create resistance. SREs should be able to push back and insist on migration paths.

40:00 Rob talks about how RedFish, while a good replacement for IPMI, will take a long time to displace it. There are pros and cons.

Charity Majors is one of my DevOps and SRE heroes*, so it was great fun to be able to debate SRE with her at Gluecon this spring. Encouraged by Mike Maney to retell the story, we got to recapture our disagreement about “Is SRE a Good Term?” from the evening before.

While it’s hard to fully recapture a debate fueled by adult beverages, we were able to recreate the key points.

First, we both strongly agree that we need status and pay equity for operators. That part of the SRE message is essential regardless of the name of the department.

Then it gets more nuanced. Charity, who’s more of a Silicon Valley insider, believes that SRE is tainted by the “Google for Everyone” cargo cult. She has trouble separating the term SRE from the specific Google practices that helped define it.

As someone who merely commutes to Silicon Valley, I do not see that bias in the discussions I’ve been having. I do agree that trying to simply copy Google (or other unicorns) in every way is a failure pattern.

I think Google did a good job with the book by defining the term for a broad audience. Charity believes this signals that SRE means you are working for a big org. Charity suggested several better alternatives, such as Operations Engineer. In the end, the danger seems to be when Dev and Ops create silos instead of collaborating.

Consensus: Job Title? Who cares. The need is to make operations more respected and equal.

What did you think of the video? How is your team defining Operations titles and teams?


The book Site Reliability Engineering helps readers understand how some Googlers think: It contains the ideas of more than 125 authors. The four editors, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, managed to weave all of the different perspectives into a unified work that conveys a coherent approach to managing distributed production systems.

Site Reliability Engineering delivers 34 chapters—totaling more than 500 printed pages from O’Reilly Media—that encompass the principles and practices that keep Google’s production systems working. The entire book is available online at https://landing.google.com/sre/book.html, along with links to other talks, interviews, publications, and events.


The tension between Ops and Dev goes way back and has been a source of confusion for me and my RackN co-founders. We believe we are developers, except that we spend our whole time focused on writing code for operations. With the rise of Site Reliability Engineers (SRE) as a job classification, our type of black swan engineer is being embraced as a critical skill. It’s recognized as the only way to stay ahead of our ravenous appetite for computing infrastructure.

I’ve been writing about Site Reliability Engineering (SRE) tasks for nearly 5 years under a lot of different names such as DevOps, Ready State, Open Operations and Underlay Operations. SRE is a term popularized by Google (there’s a book!) for the operators who build and automate their infrastructure. Their role is not administration, it is redefining how infrastructure is used and managed within Google.

Using infrastructure effectively is a competitive advantage for Google and their SREs carry tremendous authority and respect for executing on that mission.

Meanwhile, we’re in the midst of an Enterprise revolt against running infrastructure. Companies, for very good reasons, are shutting down internal IT efforts in favor of using outsourced infrastructure. Operations has simply not been able to compete with the capability, flexibility and breadth of infrastructure services offered by Amazon.

SRE is about operational excellence and keeping up with the increasingly rapid pace of IT. It’s a recognition that we cannot scale people as quickly as we add infrastructure. And, critically, it is not infrastructure specific.

Over the next year, I’ll continue to dig deeply into the skills, tools and processes around operations. I think that SRE may be the right banner for these thoughts and I’d like to hear your thoughts about that.

Sunday, I found myself back in front of the Board talking about the challenge that implementation variation creates for users. Ultimately, the question “does this harm users?” is answered by “no, they just leave for Amazon.”

I can’t stress this enough: it’s not about APIs! The challenge is twofold: implementation variance between OpenStack clouds and variance between OpenStack and AWS.

Consider the basic node bootstrap steps, and how well OpenStack supports each one:

1. Get the PUBLIC address for the node [NO, most OpenStack clouds do not have external access by default]
2. Log into the system using the node SSH key [PARTIAL, the account name varies]
3. Add a root account with the Rebar SSH key(s) and remove password login [PARTIAL, does not work on some systems]
4. Remove the node-specific SSH key [YES]
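The steps above can be sketched by generating the shell commands they imply. This is a minimal illustration, not a definitive implementation: the login user, key paths, and address are placeholders, and (as noted above) the account name varies by cloud image.

```python
def node_access_commands(node_ip, node_key, rebar_pub_key, login_user="ubuntu"):
    """Mirror the four bootstrap steps as shell commands (placeholders only).

    login_user varies by cloud image: ubuntu, centos, cloud-user, etc.
    """
    target = f"{login_user}@{node_ip}"
    return [
        # 1. confirm the PUBLIC address is reachable (may need a floating IP)
        f"ssh -i {node_key} {target} true",
        # 2-3. install the Rebar key for root, then lock password login
        f"ssh -i {node_key} {target} 'sudo tee -a /root/.ssh/authorized_keys' < {rebar_pub_key}",
        f"ssh -i {node_key} {target} 'sudo passwd -l root'",
        # 4. discard the node-specific key once the Rebar key works
        f"rm -f {node_key}",
    ]


for cmd in node_access_commands("203.0.113.10", "node.pem", "rebar.pub"):
    print(cmd)
```

Printing the commands first makes the sequence reviewable before running it against a real node.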

These steps work on every other cloud infrastructure that we’ve used. And they are achievable on OpenStack – DreamHost delivered this experience on their new DreamCompute infrastructure.

I think that this is very achievable for OpenStack, but we’re going to have to drive conformance and figure out an alternative to the Floating IP (FIP) pattern; IPv6, port forwarding, or adding FIPs by default could all work as part of the solution.

For Digital Rebar, the quick answer is to simply allocate a FIP for every node. We can easily make this a configuration option; however, it feels like a pattern fail to me. It’s certainly not a requirement from other clouds.
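For the FIP-per-node option, the two standard `openstack` CLI calls can be sketched as below. The external network name is a placeholder, and in practice the allocated IP must be parsed from the output of the first command before the second can run.

```python
def fip_commands(server, external_net="public"):
    """Build the `openstack` CLI calls that give one node a floating IP.

    external_net is a placeholder name; ALLOCATED_IP stands in for the
    address returned by `floating ip create`, which a real script would parse.
    """
    return [
        f"openstack floating ip create {external_net}",
        f"openstack server add floating ip {server} ALLOCATED_IP",
    ]


for cmd in fip_commands("node-01"):
    print(cmd)
```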

I hope this post provides specifics about delivering a more portable hybrid experience. What critical items do you want as part of your cloud ops process?