Mozilla IT & Operations

First off, some good news for those of you who felt the 3-month rotation for LDAP passwords was too short: most accounts have now moved to a 6-month rotation period. Of course, all that really means is that in a given year, one is merely 50% less likely to get locked out after having forgotten to change their password. Fortunately, resets of locked accounts are now totally self-service, so getting locked out is now much less irritating than it used to be.

Speaking of improving end-user experience, NetOps did some work on the wireless networks in both Paris and Taipei, moving them into larger configuration groups so as to standardise their configurations. They are now easier to manage, and thus, easier to diagnose and optimise.

On the graphing and visualisation front, a big shout-out to Ali and Anurag from the Metrics team, who have been hard at work on implementing a new visualisation tool, an example of which can be seen here. Their new self-service framework allows the rest of us to quickly generate graphs and dashboards for just about anything we might be interested in. In addition, Ben Sullins has been hard at work on implementing Tableau to help people build tools for analysing and displaying data, which Jacques Uber from WebOps has used to set up a hardware warranty summary graph, which has already helped us to better understand and plan for upcoming expiration dates across thousands of machines.

On the topic of machines, the Storage and Virtualisation team spent the better part of a week braving the noise, temperature, and glaring fluorescent lights of the data centre during their push to expand our back-end NetApp infrastructure. Thanks to their efforts, we now have room for some 300 additional virtual machines, which is going to give us some much-needed breathing room in that area. This is going to be a big help going forward as we decommission those aforementioned out-of-warranty machines and replace them with spiffy new VMs.

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org. See you next time!

Nagios World Conference 2013 was held from Sep 30th to Oct 3rd in St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended about 10 talks in all and spent more time discussing setups and best practices.

The biggest draw at the conf this year was Nagios 4.0 that was announced at last year’s keynote. 4.0 brings in some long awaited and much needed rocket power to Nagios. The changelog has detailed information about the big features but the ones that interested me the most were:

Core Workers – I have been researching how to scale up service check execution on some of our bigger instances. Until now mod-gearman has been the tool of choice, but with Core Workers, Nagios natively steps up to the task. The legacy fork-for-each-check model was unsurprisingly hitting limits in some places, and 4.x replaces it with worker processes that get check execution delegated to them. There is a massive performance gain and I’m looking to leverage that rather than integrating with mod-gearman.

Query Handlers – This feels like baked-in MK Livestatus. It’s made available via a socket. Unlike Livestatus, it doesn’t yet have a lot of fancy features and is mostly basic at the moment. I’d expect it to get a lot of attention in future versions.
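To make the Core Workers idea a bit more concrete, here is a minimal sketch of the pattern: a fixed pool of long-lived workers that have checks delegated to them, instead of one fork per check. This is not Nagios code. The names `make_check` and `run_checks` are made up for illustration, and threads stand in for the worker processes that Nagios 4.x actually uses.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Conventional Nagios plugin return states.
OK, WARNING, CRITICAL = 0, 1, 2


def make_check(name, value, warn, crit):
    """Build a hypothetical check: a callable returning (name, state, worker)."""
    def check():
        if value >= crit:
            state = CRITICAL
        elif value >= warn:
            state = WARNING
        else:
            state = OK
        # Record which worker ran the check, to show that workers are reused.
        return name, state, threading.current_thread().name
    return check


def run_checks(checks, workers=4):
    """Delegate every check to a fixed pool of workers (threads, as a stand-in)."""
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="worker") as pool:
        futures = [pool.submit(check) for check in checks]
        # Collect results in submission order.
        return [future.result() for future in futures]


checks = [make_check(f"load-{i}", value=i, warn=5, crit=8) for i in range(10)]
results = run_checks(checks, workers=4)
states = [state for _, state, _ in results]
workers_used = {worker for _, _, worker in results}
```

Ten checks run here, but never on more than four workers, which is the point: the cost of starting a worker is paid once rather than once per check.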

Among other things, I’m looking forward to integrating Multisite in our infrastructure. We have close to a dozen Nagios instances here at Mozilla and our primary interface to each is via IRC bots. As one would imagine, it doesn’t scale well and isn’t ideal for dealing with mass changes. This is where Multisite comes in very handy. Along with Livestatus, Multisite provides a supercharged way to deal with multiple instances and multiple services/hosts within each. Do try out the demo, because it’s hard to put awesome into words.
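For a feel of what talking to Livestatus looks like, here is a rough sketch that builds a Livestatus query (LQL) for all services in the CRITICAL state and sends it over a Unix socket. The helper names are my own, and the socket path depends entirely on how each instance is configured, so treat it as a parameter rather than a known location.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Build a Livestatus query string; a blank line terminates the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {condition}" for condition in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def query_instance(socket_path, query):
    """Send one query to a Livestatus Unix socket and decode the JSON reply.

    socket_path is an assumption: it is whatever path the instance is
    configured to expose.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        # Signal that the request is complete so the server sends its answer.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# Services currently in the CRITICAL state (state = 2).
critical_services = build_query(
    "services", ["host_name", "description", "state"], ["state = 2"]
)
```

Running the same `critical_services` query against each instance’s socket in a loop is, very roughly, the kind of aggregation that a multi-instance front end does for you.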

Nagios at Etsy by Avleen Vig, who had an eventful road trip to the conference, covered some cool things Etsy has done, particularly measuring alert fatigue by correlating alerts with sleep data from the Fitbits worn by on-call staff. He also spoke at length about “monitoring hygiene” and how Etsy went from 300 alerts/day to 45 alerts over the course of two years.

In all, it was a great conference, like last year. Looking forward to a year of 4.x and trying to get the in-house Puppet module out on GitHub.

I was playing around with MySQL Workbench earlier in the week, and ran across the “clean up SQL” feature, which I thought was neat. Here’s a picture-based demonstration – you can click on the pictures to make them bigger, so they are more readable.

Here is a typical complex query that looks pretty good formatted in the results from a performance schema query:

Simply click the “broom” icon and watch as your SQL is cleaned up, with one field in the SELECT per line and the JOINs indented and formatted prettily:

For those curious what IT has been up to lately, wonder no longer – here’s a quick status update from the past week, highlighting just some of the great stuff the Mozilla IT team has been working on recently.

First up, our team is growing:

Welcome Chris Knowles, who will be officially helping us out as a Storage and Virtualisation admin (unofficially, he’s going to help us with an even greater challenge: safely landing our Kerbals on the Mun).

A big hello to the entire OpSec team, who are now part of the IT team proper. As separate entities we already had a close and fruitful working relationship – now that we’re all together, things are only going to get better! Expect more updates about all the interesting stuff we’ll be working on together – as long as it’s not classified top secret.

Our fabulous SRE team had an epic bug squashing session during which they reduced their open bugs by over 40%. Given how diverse their queue is at any given moment, this is quite the accomplishment – great work guys!

Speaking of SREs, our very own Dumitru Gherman, along with developer Emma Irwin, made the long trip from Mountain View out to London to host a session on “Hacking your online safety” at MozFest.

Local Windows machine deployments at our San Francisco office used to take hours, but thanks to the efforts of Mike Poessy at the SFO service desk, via a combination of templating and new imaging techniques, this has been reduced down to as little as 20 minutes. This newly streamlined process will almost certainly find its way to our other offices as well.

And last, but most certainly not least, the Release Operations team has been very busy:

Jake Watkins wrote a module for Windows called “metric-collective” that polls system stats and forwards the results to graphite/statsd. This is going to be rolled out to all the Windows build and test systems during the next release cycle. Combined with a related initiative to roll out collectd across all of the OS X and Linux machines, we’re going to have a whole new level of graphing and trending available across the entire release infrastructure.

Mark Cornmesser and Q Fortier have been hard at work on a brand new imaging and management mechanism for our Windows 2008 build hosts. The biggest change is a move away from our old, manually maintained monolithic image, to a modular image complete with proper change and configuration management. This new platform is currently being tested on a number of project branches and is expected to be rolled out to mozilla-central, try, and inbound in the next few weeks.

Dustin Mitchell worked with the Auto Tools team to set up an independent Puppet instance, itself destined to automate the management of the entire QA Mozmill CI infrastructure. He also set up local Python and NPM mirrors for use by Mozillians, providing a compelling model and implementation for other use-cases going forward.
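As a rough illustration of the “poll system stats and forward to statsd” idea mentioned above for metric-collective, here is a small sketch. It is not the module’s actual code: `collect_stats`, the metric names, and the sample values are all hypothetical. The only parts taken from statsd itself are the gauge line format (`name:value|g`) and the default UDP port, 8125.

```python
import socket


def format_gauge(name, value):
    """Encode one statsd gauge in the line format name:value|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Send one metric to a statsd daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def collect_stats():
    """Hypothetical collector; a real one would poll the operating system."""
    return {"cpu.percent": 12.5, "disk.free_gb": 87}


# Prefix each stat with a made-up host name so graphs can be grouped per machine.
payloads = [
    format_gauge(f"build.buildhost01.{key}", value)
    for key, value in sorted(collect_stats().items())
]
```

Actually shipping the numbers is then a matter of calling `send_metric(payload)` for each entry in `payloads` on whatever polling interval suits the machine.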

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org. See you next time!

If you would like to learn MySQL from the ground up, consider joining us. This is for beginners – if you have no experience with MySQL, or if you are a developer who wants to learn how to administer MySQL, or an administrator who wants to learn how to query MySQL, this course is what you want.

If you are not a beginner, you are welcome to join too – maybe you need a refresher, or maybe you just want to test your knowledge or earn badges. That’s OK too!

The format of a virtual self-study group is as follows:

Each participant acquires the same textbook (Learning MySQL, the “butterfly O’Reilly book”, published 2007). You can acquire the textbook however you want (e.g. from the library or from a friend) but if you buy the book, we ask that you buy it from our Amazon Store, to help pay for meetup fees.

Each participant commits to read one chapter per week, complete the exercises and post a link to the completed work. Tweet using the hashtag #mysqlmarinate.

Acquire the book (the only item that may cost money). Get your hands on Learning MySQL – see if your local library has it, if someone is selling their copy, or buy it from our Amazon Store (this helps pay for meetup fees).

When your book arrives, start your virtual learning by reading one chapter per week. Complete the exercises; if you have any questions, comments or want to learn more in-depth, that’s what the forums are for!


FAQs:

Q: How long will the course last?

A: We will cover 12 lessons (chapters) in the book, so 12 (twelve) weeks starting October 1st, though we will have one week that is a break so that you can catch up if you need to or you have a week off if you need it. Refer to the MySQL Marinate Season 3 Master Discussion List for specific dates.

By January 1st, 2014, you will know MySQL!!

Q: Can I get ahead?

A: Sure! This is go-at-your-own-pace. To prevent spoilers, please put comments in the appropriate chapter threads.

Q: Does this cover the Percona patch set or MariaDB forks?

A: This covers the basics of MySQL, which are immediately transferable to Percona’s patched MySQL or MariaDB builds.

Q: What do I need in order to start the course?

A: All you need is the book and access to a computer, preferably one that you have control over. Installing MySQL is chapter 2, so really, all you need is the book and a computer to start, you don’t have to worry about any prerequisites. If you do not have the book yet, you can still do the first week by using the online material from “Browse Contents” at the

During yesterday’s MySQL Connect conference, Brandon Johnson and I gave a joint talk about how Mozilla uses Puppet to manage hundreds of MySQL servers efficiently. We also released our MySQL Puppet module on GitHub, so that all can benefit. The slides for the talk are at http://bit.ly/puppet-mysql-slides.

I had the privilege of attending and speaking at the Agile 2013 conference on August 6th – 9th, in Nashville, TN. Agile 2013 was the 12th annual conference of the Agile Alliance. It was a huge conference, with over 2000 people in attendance, quite a bit bigger than I’m used to attending or presenting at. It was also my first time visiting Nashville, so that was a treat too.

DevOps Track:

This year they added a DevOps track, to which I submitted a talk about Mozilla IT WebOps’ work on enabling a devops culture through self-service tools, in particular our work building an in-house PaaS offering based on ActiveState’s Stackato product.

The DevOps track was excellent and attracted a great crowd of speakers and attendees interested in what DevOps is about. I attended almost all the DevOps track talks and also got some great hallway track time with folks like Gene Kim, John Willis, Andrew Clay Shafer, Karthik Gaekwad, Gareth Bowles, Dominica DeGrandis, and many more. I thoroughly enjoyed all the talks, but I wanted to highlight a few in particular.

Talk: How DevOps changed Everything

Karthik Gaekwad, a web engineer at Mentor Graphics Embedded, shared their devops transformation in a talk entitled “How DevOps changed Everything”. He explained how they started with an environment where Waterfall development was being done and the Dev, Ops, QA, etc. teams were all silos, and this wasn’t working. To try to improve this they started by adopting Scrum with two-week iterations. This helped improve things on the dev side and brought some improvements with IT as well, but IT was still mostly thinking “how can we be agile when it takes weeks just to order servers?”. At the same time, management started asking about this “Cloud” stuff, so now IT was trying to figure out how to be more “agile” and also how to do “Cloud”. Enter DevOps! They focused on two things: improving culture and automation.

They worked on culture by committing to do Scrum, and even though they knew it’d be difficult, they stuck with it. “Our 1st sprint(s) sucked!” To improve automation they focused on building a Platform for their developers, one that would have automation, APIs, instrumentation, easy deploys, etc. baked right in. They found that “Devs and ops understand good architecture” and a platform helped everyone speak the same language.

Finally, Karthik offered some thoughts on what DevOps looks like as you make progress with it.

DevOps 101

Config Management – find easy wins, config files

Monitoring – use tools that make it easy, CloudKick, New Relic, StatsD

He closed by re-iterating that focusing on creating small wins and getting people working together/speaking the same language was the key to their devops transformation. “Let’s work together, and solve the problems that our business wants us to!”

They started with some history of Salesforce.com, which in 2000 was three people and did four major releases per year. Fast forward through seven years of rapid growth on all fronts, and delivering on releases had gotten harder: 2006 saw one major release. In 2007, they embarked on an agile transformation which they coined Adaptive Development Methodology (ADM), essentially their take on agile development. This turned things around for development, but what about infrastructure? ADM didn’t have the same impact for their TechOps teams as it had for development, because making your infrastructure agile poses different challenges from writing code. At first they tried scaling through hiring, but hiring takes time, and meanwhile the rapid success and growth continued.

So how do they innovate at scale? It’s now 2012 and there is this DevOps thing going on. So how can they blend ADM and DevOps? Since DevOps encompasses agile development, automation and “infrastructure as code”, a lot of code needs to be written, so they secured time from developers to support Operations and build the tools Ops needed. This helped, but they quickly found that competing priorities slowed down the development Ops needed. So they started creating cross-functional teams with skills ranging from data center and hardware to infrastructure to dedicated developers, and unleashed them on the most needed improvements.

Automation: Automation is key, dedicated developers and cross-functional teams ensure the right automation is being made and improved

Metrics: “You can’t change what you can’t measure” and “Measure everything”. Graphite is a big part of their metrics strategy

Sharing: Salesforce Chatter is at the center of the company’s social atmosphere. Essentially an internal social network site that people enjoy using.

So why are they succeeding? First, they found their “DevOps Kata”, which consists of:

Daily Standup – “Encourage effective two way communication and other means to drive out fear throughout the organization so that everybody may work effectively and more productively for the company.”

Sprint Retrospective – “Institute a vigorous program of education and self-improvement”

Sprint Demo – “Break down barriers between departments. People in research, design, sales, and production must work as a team, in order to foresee problems of production and usage that may be encountered with the product or service.”

Two key things for getting people interested in and spreading the “devops culture” were holding regular Hack Days and an internal DevOps Mini-Conference. Some other key points were:

Building an Infrastructure Development Lifecycle – infrastructure code is developed with the same process and rigor as application code

Virtualization – They invested in the tooling and infra to make it so folks can develop for both infrastructure and application code locally, using Vagrant

They’re extremely happy with their progress and DevOps transformation so far, but still recognize that their present and future challenges include:

Bringing Agile into traditional IT Ops

Bringing IT Ops in with Infra Eng and R&D

Re-educating workforce

Recruiting

Scaling Securely

Andrew moderated a panel on Tuesday evening which he called the “DevOps AMA” (Ask Me Anything), which included John Willis, Gene Kim, and Mandi Walls. Andrew did a fantastic job of including the audience and the conversation ranged from “what is a good definition of devops?” to “doing devops in the enterprise” to “how can I do continuous delivery with mainframes?”.

Gene wanted to answer the question “Where Do High Performers Come From?” and so began studying this. He began working with the IT Process Institute, which has been studying high performing IT organizations since 1999, and the first result of this work was The Visible Ops Handbook, which is described as “a methodology designed to jumpstart implementation of controls and process improvement in IT organizations needing to increase service levels, security, and auditability while managing costs”.

IT Operations has traditionally had the role of “fixing fragile artifacts” and is extremely interrupt driven, so it’s no surprise that terms like “special snowflake server” are widespread. Unfortunately, we’ve also seen a world where “IT Ops and Devs are at War!”. Surely there must be a better way.

The first way is Flow. Seek to understand your flow of work. You need to define your work and make it visible. Create a one-step environment creation process; this improves development, testing, and QA. Change the Agile sprint policy: “At the end of each sprint, we must have working code and the environment it runs in.” Deploy Smaller Changes, More Frequently.

The second way is Feedback. Seek to understand and respond to the needs of all customers, internal and external. Developers are IT Ops’ customers; help them get quicker feedback on their code. Google requires that developers maintain their services themselves for the first six months and has a very thorough acceptance process before Ops takes over maintaining a service. Metrics and monitoring improve feedback and situational awareness. With continuous delivery, failures must result in automated tests in the continuous deployment pipeline (Release, Config, Change).

The third way is Continual Experimentation And Learning. Foster a culture that rewards Experimentation (taking risks) and learning from failure. Repetition is the prerequisite to mastery. You need a culture that keeps pushing into the danger zone and the habits that enable you to survive in the danger zone. You Don’t Choose Chaos Monkey … Chaos Monkey Chooses You! Allocate 20% Of Cycles To Technical Debt Reduction.

In closing, if Gene could wave a magic wand, everyone would:

See the dead bodies in IT, and have confidence that your intuitions have been right all along…

Become conversant with DevOps and recognize the practices when you see them

Be energized about how practitioners can contribute in this organizational journey

Leave with some concrete steps to get some great outcomes

Help create a team that starts putting DevOps practices into place

Gene’s keynote ended the conference on an excellent high note, really drove home the importance of DevOps to the whole business, and demonstrated that adding a DevOps track to the Agile conference was relevant and very smart.

Parting thoughts

My talk, entitled “Enabling DevOps: The road to a better culture”, started by providing some context and history around Mozilla IT’s growth, the formation of the Web Operations team, and the evolution of our technology and tools. I then covered our current efforts to build self-service tools, spent a fair amount of time discussing why and how we’re building an in-house platform as a service offering, and talked about the bigger picture of the devops culture we’re building.

My talk was also well received. People really enjoyed hearing about how Mozilla IT is approaching building a devops culture, the tools we’re building and deploying, and our challenges and successes along the way. In particular, I got a few comments that it seems like we’re taking a very pragmatic approach and finding small wins that really help, which I definitely think is how WebOps approaches all this and I was glad this came through in my presentation and conversations.

Agile 2013 was a great conference. I enjoyed getting to present to a different audience, the conversations I had with folks who aren’t in the Operations world, and the opportunity to share what Mozilla IT is up to.

I am doing a quick blog post to announce that I have put an indexing talk online*. Most recently, I delivered this indexing talk at Confoo and Scale 11x.

The talk is on YouTube at Are You Getting the Best Out of Your MySQL Indexes? There are also PDF slides.
From the official conference description, if you want to know more:
MySQL indexes are often used to make performance better. However, they can make performance suffer if you are not using them properly. Oracle ACE Director Sheeri Cabral explains the pitfalls to avoid with indexes and how to utilize compound indexes to maximize index availability with the least amount of write overhead.
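To give a small taste of the compound index point, here is a sketch showing how one index on two columns can serve more than one kind of lookup, so a separate single-column index (and its write overhead) is not needed. It uses Python’s built-in SQLite module purely as a convenient stand-in, not MySQL, and the table and index names are made up. MySQL applies the same leftmost-prefix rule, but its EXPLAIN output looks different.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the talk itself is about MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, email TEXT)"
)
# One compound index instead of separate indexes on each column.
conn.execute("CREATE INDEX idx_name ON people (last_name, first_name)")


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Leftmost column only: the compound index is usable.
by_last = plan("SELECT email FROM people WHERE last_name = ?", ("Cabral",))
# Both columns: the compound index is usable.
by_both = plan(
    "SELECT email FROM people WHERE last_name = ? AND first_name = ?",
    ("Cabral", "Sheeri"),
)
# Second column only: the index does not help, so the table is scanned.
by_first = plan("SELECT email FROM people WHERE first_name = ?", ("Sheeri",))
```

Printing the three plans shows `idx_name` in the first two and a plain table scan in the third, which is why column order in a compound index matters.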

*I know I have not been posting blogs for a long time. This was a very busy year, and I took March through July off from conferences in order to buy a house and move.