All Things Dork

Lately I've become extremely interested in accident analysis techniques. These techniques are used largely in the manufacturing and transportation industries, but there has been a growing trend to adopt these types of practices in the technology arena. Think Kanban, Lean Startup, and the Theory of Constraints, to name a few.

But accident analysis and safety dig deep into the nature of failure within a system. Some of my favorite thinkers in the field, like Sidney Dekker and Nancy Leveson, have been forcing me to go beyond the surface of an issue and to dig deeper into the organizational issues that are equal contributors to failure.

Root Cause Analysis (RCA) is something that gets touted all the time in technology. When a system goes down, we're desperately trying to find out what caused things to go bad. Despite our best efforts, we never seem to go far enough with RCAs.

One of my favorite mantras regarding root cause is that "Root cause is simply where we stop looking." We go far enough down the rabbit hole that we simply can't explain further, don't have the will to explain further, or we've reached a politically acceptable answer. (Leveson, Engineering a Safer World)

So why do we go through the theater of Root Cause Analysis in technology? Because we need to explain the unexplainable. Because if we can't explain something, how can we possibly give assurances it won't happen in the future?

The technology field has done a great job of pretending that everyone has their shit together. No one should ever have a failure that goes undetected. Anyone who isn't alerted before a problem happens is an idiot. These are all worthwhile goals, but they are so far away from the reality of where we are in technology. Yet thanks to blogs, social media and the wisdom of hindsight, the prevailing belief is that gaps in system and failure monitoring are largely the result of unqualified staff. This belief is held by management, furthered by people in the industry who talk a good game and then ultimately internalized by those with imposter syndrome. So we stress and we agonize over the root cause of a thing. And that's not entirely a bad thing, but here's the rub.

Let's say we get to a point where we find that the root cause was setting X not being set to a reasonable or correct value. That's it. We change setting X, then explain how that ripples up the chain of events and how everything would have been fine had X been set to a sane value. But why was X set to the value it was set at? Easy, the person before us was an idiot. But is that really the case? What factors went into that decision? What organizational pressures were present that forced a more conservative value? Could we not spend the money for extra hardware that would better utilize X? Was there no time to do performance testing so we settled on the low value for X? Did our predecessor not get the training necessary to understand the impact of X? These are all questions that need to be answered to truly be able to do root cause analysis. And the truth of the matter is, most organizations don't have the stomach for it.

Companies have a hard time looking themselves in the mirror and assessing themselves in an honest light. How many projects weren't given enough time to be done right? How many projects have to skip a vital phase of the testing process due to time constraints? These are the problems you run into throughout your career, across countless companies, leading one to believe it's less about the company and more about the human condition. But regardless of the source of these problems, they are all things that contribute to the cause of failure in our systems. Organizations that have operational excellence are the orgs who aren't afraid to look at themselves honestly and follow the root cause of failure, no matter where it takes them.

Next time you participate in an RCA, take it to the next level. Don't stop at the 5th "why?" Go to the 100th, or the 1000th or however long it takes to be able to show organizationally where change needs to happen. Don't absolve yourself of all responsibility, but make sure everyone knows that the failure is not yours, but the organization's as a whole.

People have taken turns beating up on the Zack Snyder film Man of Steel. I'm coming in on this quite late, but now that I've got children, I find myself reliving parts of my life and then reflecting on them.

My daughter is a huge fan of Superman. (Full disclosure, she's only 2 years old) As I thought about how to further expose her to one of our greatest superheroes, I naturally turned to the body of film work on the character. At no point did Man of Steel cross my mind as a film to view. Working out why I didn't want to show it to my daughter helped me to understand what I didn't like about the film.

Superman is one of those characters that embodies an idea: the honorable boy scout, the powerful guardian and, most importantly, the uncompromising moralist, to name a few. These traits combined are what give us the ability to have light moments with the character. Watching Superman walk through a hail of automatic gatling gun fire without so much as a scratch on his suit is awesome! It fills you with joy as the bad guys get theirs, and it makes you want to cheer at the top of your lungs for Superman!

I missed those moments in this new incarnation of the character. I missed the joy and cheer that came with the previous Superman films. But beyond the joy that was lost, it was the loss of some of the central tenets of the character that really made it difficult for me. Nothing illustrates this more than the killing of Zod.

To understand why killing Zod is such a major problem for me, you have to understand my feelings on Superman. He's not just any hero. He is an all powerful, unstoppable hero. His weaknesses are Kryptonite and the loved ones around him. That's it. When someone this powerful is flying through your skies, it's difficult to trust them. It's even more difficult to believe they have your best interests at heart.

But that's the beauty of Superman. His beliefs are uncompromising. And if you agree with his belief system, it gives us, the protected, the trust necessary to bestow upon him the role of protector. When Superman kills Zod, not only does he betray his belief system, but he also destroys the trust that's been built with the people. If he'll kill Zod, what's to stop him from deeming others a risk worthy of killing? (I know this gets muddied with the whole Doomsday thing, but I submit that Doomsday was a mindless, non-sentient killing machine. The equivalent of a robot. You may have dissenting views)

With the murder of Zod, now Superman is not this symbol of hope, altruism and unwavering morality. He's now this guy that protects us as long as we don't step out of bounds. As long as we don't cross the line that he's set, we're OK. But if our views differ, we may become dangerous enough to be killed, which limits his ability to be a completely trustable character, the way Superman in my view should be.

So that's my 2 cents on the film. I plan to re-watch it. I've also taken some of my friends' advice and started watching the old Superman: The Animated Series cartoon, which is also on Amazon Prime if you're a member. The show better embodies the way I want my daughter to think of Superman for the time being. When she's older, I'll let her make her own choices. :-)

When we're performing Puppet changes, we try to work within the framework of some sort of SDLC. We're in the process of migrating to Stash from SVN, so our processes are a bit in flux. But one problem we run into constantly is how to balance long running feature branches and separation of Puppet code that is still in development or testing.

The systems team works very closely with the developers, sometimes with our changes being dependent on one another. An example is when we move a file from being hosted on our web servers to being served by S3. It requires a coordinated change of the Apache config to handle the redirects and the development code that automatically handles the population of S3. While this is being tested, we need to have two different copies of the Apache configuration for the site.

A) - Copy in production where files are located on the web server itself.

B) - Copy in test that handles a redirect to the S3 location.

The testing that takes place may take a day or it may take weeks. The longer testing takes, the more drift there is between my Puppet development branch and the master branch which is getting pushed to production regularly. I could just be a rebase monster while this is being tested, but sooner or later, I'll fail in my responsibility and I'll have some awful merge waiting to happen. I needed a better way and the best thing I could come up with was some form of Feature Toggle.

With a feature toggle, I have the ability to release some code, without all nodes receiving that code path. More specifically for my use case, I can commit code to master, without fear of it actually being executed. This is often leveraged in Continuous Integration environments to prevent incomplete code from impacting production.

With Puppet I decided to implement something very similar using if blocks and Puppet Enterprise console variables. When I'm developing something I put my resource declarations in a block like so

Then in the Puppet Enterprise console, I'll assign the variable sitemap_redirect_feature the value enabled. If you're not a Puppet Enterprise customer or aren't using the console, you could also specify it in a hiera lookup, with a default value.

hiera('sitemap_redirect_feature', 'disabled')

This makes it easier to assign to groups of servers based on your hiera configuration.

Because of the way Puppet variables are evaluated, any node that doesn't explicitly set the variable will follow the else path.
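To make the default behaviour concrete, here's a minimal sketch of the toggle logic written in Python rather than Puppet. The function name, the returned labels and the node dictionaries are hypothetical stand-ins for what a node receives from the console or hiera; only the sitemap_redirect_feature key and the 'disabled' default come from the setup above.

```python
def apache_config_for(node_data):
    """Pick which Apache config a node gets based on the feature toggle.

    node_data is a hypothetical dict standing in for the variables a node
    receives from the console or from hiera.
    """
    # A node that never sets the variable falls back to 'disabled',
    # mirroring the default value passed to the hiera lookup.
    toggle = node_data.get("sitemap_redirect_feature", "disabled")
    if toggle == "enabled":
        return "redirect-to-s3"    # new code path, still under test
    return "serve-from-webserver"  # existing production behaviour


test_node = {"sitemap_redirect_feature": "enabled"}
prod_node = {}  # toggle never assigned

print(apache_config_for(test_node))  # redirect-to-s3
print(apache_config_for(prod_node))  # serve-from-webserver
```

The point of the default is that forgetting to set the toggle is the safe case: an unconfigured node keeps doing what production already does.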

The plus side to this is while you're figuring out exactly how resources should be laid out, you can still commit to master without fear of breaking anything. (Just make sure you do all your static analysis so that your Puppet code is at least valid)

Once your testing is complete and you're ready to push the changes to production, you simply move your updated resource declarations out of the if/else block so that they're always executed. Delete the if/else block and push your code.

I've been using this pattern for a few weeks now and so far it is working out pretty well. I may refine the approach as I run into new hurdles.

In the technology arena, things are constantly changing and new technologies are being spun out at a rapid rate. The problem is that as technologists, we're eager to try out the new hotness, with Docker being the new darling child. Just ask Google about the hype cycle behind Docker.

I'm not going to debate the anointed position of Docker. It is a very cool and incredibly useful technology. But what I do take issue with is using Docker for the sake of using Docker, without any real examination of the problems you're trying to solve. Docker gets trotted out as a strategy, rather than taking its rightful place as a solution for a strategy.

Containerization of your application may or may not be a straightforward exercise. You could spend weeks getting things tuned and set up so that you can now deploy your application via Docker. You're living the dream of developing on your desktop and having that same container move all the way through your pipeline into production. But if your build still takes 90 minutes, is it worth the effort? Have you actually solved your pain point?

I'm not dismissing the other intangibles that Docker offers, but I'm a big fan of the Theory of Constraints. Optimizing for anything other than the bottleneck is just a waste.
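A quick back-of-the-envelope sketch shows why the bottleneck matters. Only the 90-minute build comes from the example above; the other stage names and durations are invented purely for illustration.

```python
# Stage durations in minutes. Only the 90-minute build comes from the
# text above; the other figures are invented for illustration.
pipeline = {"build": 90, "test": 20, "deploy": 10}


def bottleneck(stages):
    """Return the name of the slowest stage."""
    return max(stages, key=stages.get)


def total_after_halving(stages, stage):
    """Total pipeline time if one stage is made twice as fast."""
    return sum(stages.values()) - stages[stage] / 2


print(bottleneck(pipeline))                     # build
print(total_after_halving(pipeline, "deploy"))  # 115.0
print(total_after_halving(pipeline, "build"))   # 75.0
```

With these made-up numbers, doubling the speed of the deploy stage shaves 5 minutes off a 120-minute pipeline, while doing the same for the build saves 45.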

It sounds like I'm picking on Docker, but it's just an easy example because of its current popularity. But I'll give an example closer to home.

I'm working on a Fantasy Football site in my spare time. One of my strategies is to collect information from all of the various sites that provide fantasy data projection.

Notice how my strategy is devoid of any specific technology or implementation. That's how a strategy should be defined. In clear terms that don't hint towards a specific solution or direction.

Well, I lost sight of that and immediately jumped to the solution. I wrote a series of scrapers to go out to various websites and pull down the information, without any thought to my actual strategy. I jumped to the solution because it's an easy thing to do as an engineer.

Fast forward a few weeks and I'm spending more time fixing the scrapers and coding defensively against changes to the source website, instead of continuing development of my application. But if I think about my strategy I could probably come up with a few quick solutions.

Mechanical Turk - I could probably pay someone less than $10 to manually enter the data into a CSV document. Writing a CSV importer is a lot simpler than an HTML scraper.

Fantasy Data - While a bit pricier, I could also pay for an API end point to provide me with a bunch of data. ESPN, CBS, and Yahoo all have similar services available at varying prices.

Between the 3 options that I briefly described (Mechanical Turk, Fantasy Data and a custom scraper), the Mechanical Turk option makes the most sense for me. It's inexpensive, delivers the value I'm looking for and has the lowest amount of effort on my side, allowing me to focus on my core product.

The moral of the story is, remember to evaluate why you want to implement a technology. The strategy should be separate from the solution so that you can make sure you're addressing your pain points.
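To give a sense of how little code the CSV route needs, here's a rough sketch of an importer. The column names (player, position, projected_points) and the Projection class are assumptions for the sake of the example, not the layout of any real data source.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Projection:
    player: str
    position: str
    projected_points: float


def load_projections(csv_file):
    """Read projections from an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    return [
        Projection(
            player=row["player"].strip(),
            position=row["position"].strip().upper(),
            projected_points=float(row["projected_points"]),
        )
        for row in reader
    ]


# Inline sample data so the sketch runs on its own; the rows are made up.
sample = io.StringIO(
    "player,position,projected_points\n"
    "Player A,qb,21.5\n"
    "Player B,rb,14.0\n"
)
for projection in load_projections(sample):
    print(projection)
```

There's no HTML parsing and nothing that breaks when a source site changes its markup, which is the whole appeal.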

I'm relatively new to the Rails community. I come from the Python/Django world, but I've been enjoying the transition, except for one minor part: Models.

When I dig around looking for info on how to structure my code, I keep running into Best Practices that advocate for a skinny controller/fat model pattern. The idea being that the model contains most of the program logic. I feel like an ass-hat because I'm the new guy but this sounds crazy to me, and others definitely agree. Why limit ourselves to three class types?

I've started to move some of my logic into separate classes that are not connected to a model or a controller. They're utility classes that deal with external sources of data that don't need to be persisted and definitely don't fit the role of the controller. In fact, their primary purpose is fetching of data from other sources, to be consumed elsewhere. With that use case in mind, I was a bit surprised when I mentioned this to a few programming buddies and it seemed like they hadn't thought of it. While we couldn't come up with a compelling reason why this was wrong, I was a little perturbed that it wasn't something regularly done. So now I have a generically named folder 'classes' to house some of these items.

I've been doing some research on MVC purely as a design pattern and I realize that I've been making one fatal mistake that's limiting my usage of the pattern.

Model != Persistence.

The problem I often run into is that my models are shaped based on how I store them in the database. But sometimes how I store an object isn't necessarily how I want to interact with the object. I end up traversing a bunch of relationships via the ORM. Then if my actual storage strategy changes, I suddenly have to update code everywhere, including code that doesn't care how the data is saved. But in reading more about MVC, my model doesn't have to mirror my storage, as long as the model knows how to persist the object.

I'll be playing around with inserting an additional layer of abstraction for my models to allow me to interact with the object in its logical form, as opposed to its actual form in the database.
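Here's a minimal sketch of what that extra layer might look like, written in Python since that's where I'm coming from. The Roster and RosterStore names are hypothetical, and a plain list stands in for the database tables.

```python
from dataclasses import dataclass, field


@dataclass
class Roster:
    """Logical form: the shape the rest of the application works with."""
    owner: str
    players: list = field(default_factory=list)


class RosterStore:
    """Persistence layer: the only place that knows the storage layout.

    Rows are kept in a plain list here as a stand-in for database tables.
    """

    def __init__(self):
        self._rows = []  # one (owner, player) row per roster slot

    def save(self, roster):
        # Replace any rows previously stored for this owner.
        self._rows = [r for r in self._rows if r[0] != roster.owner]
        self._rows.extend((roster.owner, p) for p in roster.players)

    def load(self, owner):
        players = [p for (o, p) in self._rows if o == owner]
        return Roster(owner=owner, players=players)


store = RosterStore()
store.save(Roster(owner="alice", players=["Player A", "Player B"]))
print(store.load("alice").players)  # ['Player A', 'Player B']
```

If the storage strategy changes, only RosterStore has to change; everything else keeps working with a Roster.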

This has been a week of conference bliss for me. I attended Puppet Camp Chicago earlier in the week and I'll spend the rest of the week at Linux Con. I've never been a big conference attendee in the professional aspect of my life, so it was a bit of a first. I have to tell you it's an awesome experience.

My experience has left me with a single question: Why are managers not pushing harder for employees to attend conferences? I'm paying for Linux Con out of my own pocket, but conference attendance is something bosses should embrace. It may seem like a scheme for employees to get a week off with paid expenses, but I assure you, it's more than that.

The energy at a convention is like nothing you've experienced before. The space is filled with upbeat professionals who are tackling problems both incredibly similar to and radically different from your own. The conference talks usually run the gamut in terms of experience levels. As an attendee you'd be hard-pressed to not find something you're interested in. Here's my lineup for Day 1 of the conference. This doesn't include all of the talks I had to skip because of timing conflicts.

Linux Performance Tools - There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This talk summarizes the three types of performance tools: observability, benchmarking, and tuning, providing a tour of what exists and why they exist. Advanced tools including those based on tracepoints, kprobes, and uprobes are also included: perf_events, ktap, SystemTap, LTTng, and sysdig. You'll gain a good understanding of the performance tools landscape, knowing what to reach for to get the most out of your systems.

Tuning Linux for Your Database - Many operations folk know the many Linux filesystems like EXT4 or XFS, they know of the schedulers available, they see the OOM killer coming and more. However, appropriate configuration is necessary when you're running your databases at scale. Learn best practices for Linux performance tuning for MySQL, PostgreSQL, MongoDB, Cassandra and HBase. Topics that will be covered include: filesystems, swap and memory management, I/O scheduler settings, using the tools available (like iostat/vmstat/etc), practical kernel configuration, profiling your database, and using RAID and LVM.

Solving the Package Problem - In the beginning there was RPM (and Debian packages) and it was good. Certainly, Linux packaging has solved many problems and pain points for system admins and developers over the years -- but as software development and deployment have evolved, new pain points have cropped up that have not been solved by traditional packaging. In this talk, Joe Brockmeier will run through some of the problems that admins and developers have run into, and some of the solutions that organizations should be looking at to solve their issues with developing and deploying software. This includes Software Collections, Docker containers, OStree and rpm-ostree, Platform-as-a-Service, and more.

From MySQL Instance to Big Data - MySQL is the most popular database on the web but how do you grow from one instance on a single LAMP box to meet the needs of high availability, big data, and/or 'drinking from the fire hose' without losing your sanity. This presentation covers best practices such as DRBD, read/write splitting, clustering, the new Fabric tool, and feeding Hadoop. 80% of Hadoop sites are fed from MySQL instances and it can be frustrating without guidance. MySQL's Fabric will manage sharding and provide more flexibility for your data. And using the memcached protocol to access data as a key/value pair can be up to 9 times faster than SQL (but

All of these talks are items that can help my career and my employer today. The conference has given me a level of enthusiasm that I haven't had in quite some time. Now imagine if you could give that level of education, motivation and enthusiasm to every member of your team.

My conference buddy and I have already identified several technologies we want to look at implementing, as well as developed contacts with people who are already using them. We've met with some great people at Puppet Labs, like Lindsey Smith, the Puppet Enterprise product owner, who listened to our real world problems and pain points. He also got us set up with the Puppet Labs Test Pilot Program so that we can be involved in the direction of Puppet Enterprise.

We grabbed a few beers with Morgan Tocker, the MySQL Community manager at Oracle. We shared stories, talked about some of our struggles with MySQL and just generally had a good time. We also got a ton of insight into potential pain points in the future as well as features to leverage in upcoming releases.

When we get back to the office on Monday, we've got a ton of things to discuss, evaluate, re-evaluate and expand upon. That's the power of conferences, and if you're a manager, it's why you should consider the next request for conference funds a little more carefully.

I was in attendance at Puppet Camp Chicago today and had some really awesome conversations with people. It's always worthwhile to hear how people are approaching similar problems to yours. It was also nice to get a chance to meet some of the developers of my favorite Puppet modules, but I digress.

One of the conversations that came up was what our local development process looked like for Puppet. Many people are attempting to find the right mixture of process and tools to help develop their infrastructure. With this in mind, I figured it might be worthwhile to share my developer setup. YMMV.

VIM - VIM is my editor of choice. Of course saying you use VIM is like saying "I have a car". Nobody just uses VIM these days. There are always some plugins mixed in there, and my setup is no different.

vim-ruby - VIM Ruby is a nice plugin for all types of fun, helpful bits. Check it out.

NERDTree - A great plugin that adds some file browsing capabilities to VIM. Well worth it to avoid buffer hell.

Powerline - A great add-on for VIM, zsh, and bash that adds an awesome status bar to your VIM interface. The git status in the toolbar is extra helpful.

tmux - Tmux isn't really a VIM plugin, but it is essential to my workflow. It lets me create multiple windows, split panes and easily navigate amongst them with keyboard shortcuts.

Custom VIM Functions - I have one main custom function that I use heavily for linting. The function determines whether the file is a Puppet (*.pp), JSON (*.json) or ERB (*.erb) file and runs the appropriate linter. Below is a copy of it.
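The same dispatch-by-extension idea can also be sketched outside of VIM. This Python version is only an illustration and not the VIM function itself; the lint_command name is hypothetical and the commands in the table are assumptions about commonly used checks, so swap in whatever linters you actually run.

```python
import os
import shlex

# Assumed lint commands per file extension; adjust to your own tooling.
LINT_COMMANDS = {
    ".pp": "puppet parser validate {path}",
    ".json": "python -m json.tool {path}",
    ".erb": "erb -P -x -T '-' {path} | ruby -c",
}


def lint_command(path):
    """Return the shell command to lint path, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    template = LINT_COMMANDS.get(extension)
    if template is None:
        return None
    # Quote the path so file names with spaces stay a single argument.
    return template.format(path=shlex.quote(path))


print(lint_command("manifests/init.pp"))  # puppet parser validate manifests/init.pp
print(lint_command("README.md"))          # None
```

The sketch only builds the command; actually running it is left to the caller.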

Virtual Machine Setup

My local development environment consists of two virtual machines, a Puppet master and a Puppet client. I'm using Virtualbox for virtualization, but really any VM tool should be fine.

The nice thing with having a virtual puppet client on your desktop is that you can snapshot it to get your VM back to an initial state. So before you do any development on the Puppet client, make sure you take a snapshot so that you can get back to a clean starting point.

On the Puppet Master VM you'll want to create a shared folder in Virtualbox or your VM Manager of choice. Point the shared folder to whatever folder holds your Puppet manifests on the local machine. Now mount the shared folder in your VM so that it's accessible within the Virtual Machine. You should now have access to your Puppet manifests on your local machine, via the Virtual Machine.

Last but not least, modify your modulepath in the virtual Puppet master and add the shared folder path to the modulepath. By adding the shared folder to your modulepath, you can develop your Puppet manifests on your local machine, with all your tools without the need to develop inside the VM or to sync files from your local machine to your VM.

Remote Puppet Development

Occasionally you might hit a use case that isn't testable on a local machine and you need to test it on a Puppet master in your pre-prod environment. (You do have a pre-prod environment, right?) When this situation comes up it's nice to have Puppet Environments set up. Most people use them in a dynamic fashion, but you can definitely use them statically. (And with SVN) After you've created the environments, it's just a matter of getting your files to the path on the remote server. Rsync is a great tool for this as it allows you to get your files to the remote server for testing, without the need to actually commit code that you're not sure will work yet. (Which in some environments might trigger a long, time-consuming series of automated checks and builds)

That's pretty much it for my development environment. I should also mention that if you're working on a Mac, it might be worth checking out Dash, which is an awesome developer documentation tool. It basically sucks down the Docsets of various programming languages and tools. (Puppet being one of them)

At some point I'll probably write a follow up post to detail our actual development and deployment workflow. Hope this helps some poor soul out there on the web.

I feel like every team I talk to decides at some point that they need to blow up their Puppet code base and apply all of the things they've learned to their new awesome codebase. Well, we're at that point in my shop and there's a small debate going on about how to organize our Puppet modules.

This is really not meant to be a mind-blowing blog post, but more of a catalog of thoughts for me as I make my argument for separate repositories for each Puppet module. A few background items.

We'll be using the Roles/Profiles pattern. What I'm calling "modules" are the generic implementations of technologies. These are the modules I'm suggesting go into separate repositories. I'm OK with profiles and roles co-existing in a single repository.

We're coming from a semi-structured world where all modules lived in a single SVN Repository. Our current deployment method for Puppet code is a svn up on the Puppet Master.

We'll have multiple contributors to the code base in 2 different geographic locations. (Chicago and New York for now) The 2 groups are new to each other and haven't been working together long.

I think that's all the housekeeping bits. My reasons for keeping separate Git repositories per module are not at all revolutionary. They're some of the same arguments people have been writing about on the web for a while now.

Separate Version History

As development on modules moves forward, the commits for these items will be interspersed between commit messages like "Updating AJP port for Tomcat in Hiera". I know tools like Fisheye (which we use) can help eliminate some of the drudgery of flipping through commit messages, but you know what else would help? Having a separate repo where I can just look at the revision history for the module.

Easier Module Versioning

With separate repositories, we can leverage version numbers for each release of the module. This allows us to freeze particular profiles that leverage those modules on a specific version number until they can be addressed and updated. With two disparate teams, this allows them to continue forward with potentially disruptive development, while other profiles have time to update to whatever breakage is occurring.

Access to Tools

Tools like Librarian-Puppet and R10K are built around the assumption that you are keeping your Puppet modules in separate repositories. I haven't done a deep dive on the tools yet, but from what I can tell, using them with a single monolithic repository is probably going to be a bit of a hurdle.

Easier Merging/Rebasing and Testing

The Puppet code base is primarily supported by the Systems team. The world of VCS is still relatively new to the Systems discipline. As we get more comfortable with these tools, we tend to make some of the same mistakes developers make in their early years. The things that come to mind are commit size and waiting too long to merge upstream. (Or REBASE if that's your thing) Keeping the modules in separate repositories tightens the problem space you're coding for. If you need to create a new parameter for a Tomcat module, a systems guy will probably:

Create the feature branch

Modify the Tomcat module to accept parameters

Modify the profiles to use the parameters

Test the new profiles

Make more tweaks to the tomcat module

Make more tweaks to the profiles

Test

Commit

Merge

With separate modules, the problem space shrinks to "Allow Tomcat to accept SHUTDOWN PORT as a parameter". We've removed the profile from the problem scope for now and have just focused on Tomcat accepting the parameter. Which also means you need a new, independent way to test this functionality, which means tests local to the module. (rspec-puppet anyone?) This doesn't even include the potential for merge hell that could occur when this finally does get committed and pushed to master.

Now I'm not naive. I know that this could theoretically happen even with separate modules, but I'm hoping the extra work involved would serve as a deterrent. Not to mention that gut feeling you get when you're approaching a problem the wrong way.

In favor of a Single Repository

I don't want to discount that there could be some value in managing all your code as a single repository. Here are the arguments I've heard so far.

It complicates deployments of Puppet Code

True that. Nothing is easier than having to execute an svn up command... except for running a deploy_puppet command. Sure you'd have to spend cycles writing a deployment script of some sort, but if that's the objection then we're just being lazy. I might be terribly optimistic but it doesn't seem like a hard problem to solve.

In addition, I've always preferred the idea of delivering Puppet code (or any code for that matter) as some sort of artifact. Maybe we have a build process that delivers an RPM package that is your Puppet code. A simple rpm -iv or yum update and we've got the magic.

It complicates module development

Sometimes when people are developing modules, their modules depend on other modules. I strongly dislike this approach, but it is a reality. You would now have to check out two separate modules and all of their dependencies in order to develop effectively.

Truth is, this sounds like bad workflow. A single module shouldn't reach into other modules. In the rare event that it does (say your module leverages the concat module) then these dependencies shouldn't just be inferred by an include statement. They should be managed by some tool like librarian-puppet, because in reality, that's all it is: dependency management. Every other language has an approach to solving dependency management. (pip, gradle, bundler and now librarian)

What's Next?

With my thought process laid out with pros and cons, I still feel pretty strongly about separate repositories for each module. Another solution might be to create some scripts that manage the modules in a way that can fool the users that care into thinking that they're dealing with a single repository. But this tends to defeat some of the subliminal messaging I hope to gain from separate modules. (Even though that's probably a pipe dream)

I'll be sure to post back what becomes of all this and if the team has any other objections.

I'm a little late in posting this, but better late than never. The panel that I moderated for C2E2 finally has video footage up. It was a great discussion with a bunch of really great people. Scott Snyder was so incredibly humble and gracious. He's lucky I hadn't started reading Batman Eternal before the panel. I'm so in love with that book I'm not sure I could have prevented myself from bear-hugging him. Anyways, I digress. Check out this great talk.

Stumbling through the web, I found this book club called ReadOps. It's an amazing idea and our first book to read is In Search of Certainty by Mark Burgess, a seriously smart man. We're reading the book in sections and with a bit of effort, I was able to get through part 1. Below is the writeup I did for ReadOps. If you're in the IT Field, ReadOps might be worth checking out.

Part 1 of this book was rough, but I promise that it gets better in the later chapters. The principal issue I have is the amount of depth that Burgess goes into to set up his arguments. There are significant correlations between his work in physics and its history, but I don't find it useful beyond the 2nd or 3rd paragraph. Now with that being said, there are a few points that I find absolutely stellar.

How We Measure Stability

The sections on stability really challenged my thinking on how we measure stability. In general, I've always measured stability through the ITSM Incident/Problem Management processes. But Burgess struck home for me when he said that what we're actually measuring in these processes are catastrophes, not stability.

If I had the right models for stability, I'd recognize the erratic memory usage patterns of the Java Virtual Machine (JVM). Those swings/fluctuations would give me an idea on the stability of the JVM. (I'm also mixing a concept he talked about in regards to scale, but that's a whole different thing) Instead, I don't pay attention to the small perturbations that lead up to the eventual OOM or long pause garbage collection. Instead the OOM triggers an incident ticket, which then gets tracked and mapped against our stability, when truth be told, stability was challenged much earlier in the lifecycle.

I'm not sure if ITSM has controls to deal with these types of situations or not. In my specific case above, an incident ticket might not have been warranted, since memory usage can be self-healing through garbage collection. But without defined thresholds and historical data to trend against, it would be easy to miss an incident or situation where memory usage went from 40% -> 75% -> 44%. Sure, memory usage dropped significantly, but it's still up 4 points from where we started. What do I do with this information? I guess that's where clearly defined thresholds come into play.

A Stability Measurement

With all that being said, I wonder if there's the possibility to measure stability as some abstract value or number. (Maybe this has already been done and I'm late to the party?) I think a lot about Apdex and how much I love it as an idea. But for me, as a Systems guy, it's at the wrong scale and it introduces components that I have no control over. (Namely the client browser and everything that happens inside it) What would be incredibly useful though would be some sort of metric for ServerDex. I'm imagining taking a range of values, deciding which ones fall outside the desired thresholds and applying some sort of weighted decaying average to it. (I'm literally just spitballing here) That could give Systems folk some sort of value to track against. It'd be nice to be able to take several of these measurements and combine them for an approximation of the stability of our system as a whole.
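To make the spitballing a little more concrete, here is a minimal sketch of what a ServerDex number could look like. Everything in it is an assumption on my part: the names serverdex and combine, the 1.0/0.0 scoring for samples inside or outside the threshold band, and the choice of an exponentially decaying average with a decay of 0.5. It is not an established metric, just one way to turn "in or out of threshold, weighted toward recent samples" into a value between 0 and 1.

```python
def serverdex(samples, low, high, decay=0.5):
    """Score a series of readings (oldest first) between 0.0 and 1.0.

    Hypothetical metric: each sample scores 1.0 when it sits inside the
    [low, high] band and 0.0 when it falls outside. The scores are then
    folded into an exponentially decaying average, so recent samples
    count more than old ones.
    """
    if not samples:
        raise ValueError("need at least one sample")
    score = None
    for value in samples:
        hit = 1.0 if low <= value <= high else 0.0
        # The first sample seeds the average; later ones are blended in.
        score = hit if score is None else decay * score + (1 - decay) * hit
    return score


def combine(scores):
    """Naive system-wide figure: the plain mean of several ServerDex scores."""
    return sum(scores) / len(scores)


# Memory usage from the example above, with an assumed ceiling of 70%.
memory_score = serverdex([40, 75, 44], low=0, high=70)
system_score = combine([memory_score, 1.0])
```

With these made-up thresholds, the 75% excursion still drags the memory score down after usage has dropped back to 44%, which is the kind of early signal an incident ticket would never capture.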

Wow, I can't believe that C2E2 is almost here! The con has definitely become one of my favorite events of the year and I'm glad that CNSC got the opportunity to work with them again.

This year I'm also excited to be moderating a panel entitled "Opening the Clubhouse Doors: Creating More Inclusive Geek Communities". If you're going to be at C2E2, for selfish reasons, I highly recommend you check it out.

I'm migrating to Postgres for one of my Django projects. (From MySQL) I'm writing this more as a note for myself, but if someone else finds it useful, go for it. If you've done this on a Mac, you may have seen the following errors.

Error: pg_config executable not found.

If you've found that error, then you may not have Postgres installed. If you do have Postgres installed, then make sure the install's bin directory is in your path. If you don't have Postgres installed, the easiest way is Postgres.app. After installing Postgres, drop to a terminal and add a new value to your PATH.

Don't feel discouraged when this fails again. Because it probably will. The message is extremely helpful if you've got your Ph.D.

clang: error: unknown argument: '-mno-fused-madd' [-Wunused-command-line-argument-hard-error-in-future]
clang: note: this will be a hard error (cannot be downgraded to a warning) in the future
error: command 'cc' failed with exit status 1

It may not be downgraded in the future, but today is not tomorrow. So let's hack this bad boy.

If you already had Xcode installed and it's version 4.2 or earlier, skip ahead to step X. If you downloaded Xcode in step 2, you'll need to install some additional compiler tools that were removed from Xcode. The best way is to use Homebrew. (You are using Homebrew RIGHT?)

brew install apple-gcc42

Once that's complete. If this works, you're done. If not (and it probably won't) move on to step 4.

This post is REALLY late, but I think the topic is still relevant, even if the trigger events have faded in our memory.

The Information Technology field is completely devoid of any ability at self-reflection. The whole damn thing, from companies to boards of directors, to developers, to system admins. How easily and quickly we can wag our finger when someone else fails, yet when we ourselves fall down, there’s a “perfectly logical explanation”.

In case you were under a rock last Friday, many of Google’s services went down for an extended outage. I know that in our fast-paced world of hyper-connectivity, 25 minutes without email or documents is the end of the world. There’s the entrepreneur who finally got his chance to pitch in front of a venture capital firm, but couldn’t get to his presentation. The college kid that was trying to print his assignment before making a mad dash to beat the deadline. I get it, these services impact our lives in major ways.

But it’s alarming to see how the people who should understand most are the first to pile on. Yahoo just couldn’t help themselves and tweeted about the issue multiple times. They have since apologized, but honestly, at this point, who cares?

But as the Twitterverse collectively freaked out everyone in my office was calm as a cucumber. Sure we couldn’t access email, but we knew Google would fix the problem and be back up as soon as possible. How did we know?

Because it’s what we would do.

News flash. Sometimes people make mistakes. Sometimes process fails. Sometimes gaps we didn’t know about are found. Sometimes test cases are missed. As a developer, tester or system admin, have you never made a mistake? Have you never let a bug slip into production? Have you never under-estimated the impact of a change? If you’re perfect, then this message isn’t for you. But if you’re like the other 99.999% (see what I did there?) of people in our field, I’m sure we can agree on a few things.

Google’s uptime is pretty damn good.

Google is run by some pretty smart people.

Even smart people can be fallible.

Downtime is a human tragedy. We should treat it with respect.

That last one sounds crazy, but seriously. For someone on that Site Reliability Team, the outage wasn’t a laughing matter. It probably doesn’t feel good to know that the Internet is collectively dismayed and disgusted by a mistake you made, even though 50% of people wouldn’t understand the mistake if you explained it to them. Instead of ridicule, we should encourage open dialogue about how mistakes like this are made, so everyone, not just Google can learn from them.

Outages are learning opportunities for everyone. Why did it happen? Was it a tools failure? I’m sure others would like to know if it’s a tool they use as well. Was it a process failure? Open dialogue about the failures of traditional IT Operations shops had a huge hand in forming the DevOps movement. Was it human error? Why did that person think the action they took was the right one? If it made sense to them, it will make sense to someone else, which means you might have a documentation or a training issue.

All of these problems are correctable but only if we feel comfortable talking about our failures. This constant ridicule and cynicism our industry has when someone fails threatens the dialogue necessary.

Google has shared some details about the outage, and I’m happy to say it seems to be a growing trend among companies, but what about at a lower, more personal level?

I challenge those in our field.

Be fallible

Be open with your failures

Get to the heart of why the failure happened. Don’t just call it a derp moment and move on.

Recognize when someone is trying to do these things and encourage it.

I was on the train today and the worst possible thing on the planet that could happen to me happened. My phone died. Normally I have a contingency plan for that, but all of them fell through and I was forced to ride the train in total silence, with nothing but my imagination to pass the time. This is the perfect example of wasted time. And despite all the iPhone’s ways of keeping me connected to the world, the single greatest accomplishment of the mobile era, is helping me reclaim those wasted moments in my life.

I’m a productivity nut. But how I categorize productivity might be different than the usual definition. Things I categorize as being productive that might be surprising are:

Reading books

Watching television

Playing video games

Why are these productive? Because living a good life also requires having some fun. And in this new, always on world, it can be easy to lose sight of that. So in order for me to live my life balanced, I actually have to put these fun things in my to-do app because if I don’t, I may not make the time for them. So watching 30 minutes of the Daily Show during a train ride is a huge win for me. I’ve now made that wasted time productive, and without the added guilt of thinking “there are 40 other things I could be doing right now.”

In the same light, there are some things that I categorize as unproductive yet they still need to be done for various reasons.

Cleaning the house

Grocery shopping

Cooking dinner

These are things that HAVE to be done in everyone’s life. But they’re unproductive to me because

I can find other ways to get the same result. (Take out?)

The end result doesn’t faze or impact me in a really meaningful way. (So there’s dishes in the sink. I don’t mind, I’ll wash the dish I need when I need it)

Fortunately I have a wife that very much cares about these things. She saves me from my own slothfulness. But I make these tasks bearable by combining them with mobile devices to also make them productive. I’ll wash dishes, cook dinner and grocery shop all in the same day if I can also listen to my podcasts or audiobooks. Because those activities are deemed productive and therefore have saved me from what would be wasted time. (Although that time is greatly appreciated by the Mrs.)

I try my best to guard against this wasted time. It’s so bad that I carry 3 devices just to make sure it doesn’t happen.

iPhone 5 - My go-to device for reclaiming wasted time. Works in just about any scenario. Sometimes excessive use prevents you from having all the juice you need when you need it. So in case my iPhone dies, I have as a backup ….

iPad 3rd Generation - Next best thing to the iPhone. I don’t have the 4G model, but if my phone has enough juice I can tether. This allows me to do some work if I need to or just catch up on some digital magazines (Wired, Newsweek, Time) or read my RSS Feed at a more comfortable scale. But the 3rd Gen iPad is heavy (First world problem) and can be a pain to use if I don’t have a seat on the train. Not to mention the frigid Chicago Winters can make operating the touch screen a challenge on some mornings. For those days I have ….

Amazon Kindle E-Reader (2nd Gen) - The Kindle is a great lightweight device. Battery life is great and with buttons I can operate it with gloves on in the cold. It’s also incredibly easy to use one handed.

Despite these gadgets, the perfect storm occurred today and I was forced to stand in silence, watching others be productive. I mentally created a mind map for this post, which made me feel a little better, (Especially since I’m actually writing/posting it) but all-in-all it was a frustrating experience.

Good or bad, the mobile revolution is pushing us to be busybodies both in our work and our leisure. The conversations about mobile are always around being connected or disconnected from the world around us. For me, it’s just about getting shit done.

Every so often you come across that task that you think is going to be insanely easy. But alas, you roll up your sleeves, get into the nitty-gritty and discover a twisted twisted web of minor problems. That's been my experience with the Postfix mailq.

I wrote a mailq log parser in Python. The mail queue is where Postfix and Sendmail dump emails that can't be sent or processed for various reasons. It's a running log of entries that can be dumped out in plaintext via the mailq command. I thought this task would be simple, but I ran into a few hurdles.

1. Records span a variable number of lines.

A record could span multiple lines. That means you can't simply iterate through the file line by line. You need to figure out where a record begins and ends. So far from my testing it appears the number of lines is based on the reason the mail is in the queue. Which leads me to item #2.

2. Figuring out why the record is in the queue

This was trivial, but it was an odd design choice. An item could be in the queue for various reasons. There is a special character at the end of the queue ID that tells you the reason it's in the queue. * means one thing, ? means another and the lack of a special character means yet another. Once you've parsed the queue id out, it's trivial to check the last character, but why not just make it a separate field?

3. Different versions of Postfix have slightly different outputs

As soon as I ran a test in our QA environment I learned that different Postfix versions have slight modifications to the output of the mailq command. The annoying part is that the updates aren't substantive at all, but just change for change's sake as far as I can tell. Now the email address is wrapped in <> characters. The count of items in the queue is at the beginning of the file instead of the end. And the text describing that number changes its wording just a tad. "Total Requests" instead of "Requests in Queue". Not very useful.

4. The date doesn't include a year

I mean...really? And the date isn't formatted in a MM-DD-YYYY format. It's more like "Fri Jan 3 17:30". So now you're converting text in order to find the appropriate date. This post is timely too because the beginning of the New Year is where this is really a pain. The fix I'm using so far is to assume it's the current year and then test the derived date to see if it's in the future. If it is, fall back to the previous year. This assumes you're processing the queue regularly. It's an auditor's nightmare.
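The year fallback described above can be sketched roughly like this. The function name infer_arrival and the MONTHS table are my own for illustration, not the actual MailQReader code, and it assumes English month abbreviations and a timestamp shaped like "Fri Jan 3 17:30" (with or without trailing seconds). The current time is passed in rather than read from the clock so the behavior is easy to check.

```python
from datetime import datetime

# English month abbreviations, assumed to match what mailq prints.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def infer_arrival(stamp, now):
    """Attach a year to a year-less timestamp such as 'Fri Jan 3 17:30'.

    Tries the current year first. If that date would be in the future
    (or does not exist, e.g. Feb 29), falls back to the previous year.
    """
    _weekday, month_name, day, clock = stamp.split()
    hour, minute = (int(part) for part in clock.split(":")[:2])
    for year in (now.year, now.year - 1):
        try:
            candidate = datetime(year, MONTHS[month_name], int(day), hour, minute)
        except ValueError:
            continue  # date is not valid in this year
        if candidate <= now:
            return candidate
    raise ValueError("could not place %r within the last two years" % stamp)


# Early January: a late-December entry has to belong to the previous year.
now = datetime(2014, 1, 5, 9, 0)
recent = infer_arrival("Fri Jan 3 17:30", now)
older = infer_arrival("Tue Dec 31 23:50", now)
```

Like the approach it mirrors, this only holds up if nothing sits in the queue for close to a year.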

None of it was terribly difficult, but more difficult than it needed to be. It's as if they wrote the mail queue to only be parsed by humans. I'll be working on my MailQReader for a little bit because I have a need at work.
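For hurdles 1 and 2, a rough sketch of the grouping and the queue ID check might look like the following. This is not the MailQReader code. It assumes a layout where a record starts on a non-indented line, continuation lines are indented, and header or summary lines start with a dash. As noted in item 3, other versions word things differently and would need their own handling. The function names and the sample lines are made up for illustration.

```python
def group_records(lines):
    """Group mailq output lines into records (one list of lines per record).

    Assumed layout: a record starts on a non-indented line, its
    continuation lines are indented, and blank lines or lines starting
    with '-' (header/summary) end the current record.
    """
    records, current = [], []
    for line in lines:
        if not line.strip() or line.startswith("-"):
            if current:
                records.append(current)
                current = []
        elif line[0].isspace():
            if current:  # ignore stray indented lines before any record
                current.append(line.strip())
        else:
            if current:
                records.append(current)
            current = [line.strip()]
    if current:
        records.append(current)
    return records


def split_queue_id(token):
    """Split a trailing status marker (e.g. '*') off the queue ID."""
    if token and not token[-1].isalnum():
        return token[:-1], token[-1]
    return token, ""


sample = [
    "-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------",
    "A1B2C3D4E5*    1234 Fri Jan  3 17:30:00  sender@example.com",
    "                                         rcpt@example.org",
    "",
    "F6E5D4C3B2     5678 Fri Jan  3 17:31:10  other@example.com",
    "          (connect to mx.example.org: Connection timed out)",
    "                                         someone@example.org",
    "",
    "-- 6 Kbytes in 2 Requests.",
]
records = group_records(sample)
queue_id, marker = split_queue_id(records[0][0].split()[0])
```

What each marker actually means is left to the caller, since that mapping depends on the mail server and version.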

I’ve heard a lot about the term DevOps. Mostly from employers or from technical recruiters looking to fill these roles. When I hear people talk about DevOps, they’re largely talking about Chef, Puppet, CFEngine or just more general configuration management. While I believe the whole Infrastructure as Code movement is extremely helpful, that’s not the end-all be-all of DevOps.

The more I learn about the DevOps movement, the more I realize it’s already falling victim to the bandwagon types in our industry. People are scrambling to build DevOps teams in addition to their development and operations teams, which defeats the purpose of DevOps entirely. DevOps is not a position, but a mindset. It’s an organizational structure. It’s a methodology. The idea of taking developers and operations staff and destroying the barriers and silos that exist between them is key to the DevOps movement.

People are finding different ways of doing this, but I think one of the best solutions might be embedding operations staff members into stream teams. This will go a long way to ensuring that the needs of the Operations staff are being considered during the development cycle and maximizing collaboration across the organization. It also pleases the separation of duties concerns of auditors, regulators and general compliance wonks.

Building a new DevOps team doesn’t work for a few reasons. First, it simply replicates the issues that currently exist, namely silos of staff members who often have disconnected goals and incentives. The goal of Operations is to keep a stable system. What better way to stabilize a system than to thwart change? The goal of developers is to deliver new features and functionality to the end user. You can’t do “new” without “change”, so these two goals are instantly at odds with one another.

If you add a DevOps silo to this picture, it will naturally land right between these two tribes. The DevOps team’s existence lends itself to the idea that it’s supposed to bridge the divide, but in reality it will become a dumping ground. Developers don’t need to worry about integration or its impact on the system, because that’s DevOps’ job. Operations doesn’t need to worry about how new deployments will impact production, because that should be vetted out by DevOps in the QA phases. Before long DevOps is reduced to simply serving as a QA Operations team and Release Engineering. This doesn’t solve the problem though. The poison still runs through the veins of the organization. You’ve just taken the antidote and diluted it.

Like a lot of problems in the world, the issue boils down to incentives. Not just monetary incentives, but the actual goals of the team and the company. The more silos you build, the more you fragment the overall goals of the organization. Silos put blinders on people, hiding the goals of others. Imagine a team of 4 people building a car. The goal is to make a car that is “fast and cheap”. Then you split the group into 2 teams and give one group the responsibility of “fast” and the other group the responsibility of “cheap”. You can imagine the outcome.

Whether you agree with my approach to DevOps or not, what is undeniable is that DevOps is not a job title. To implement DevOps in your organization, it does require a very specific skill-set. You need developers who understand systems and the impact design and coding choices have on the servers. You need administrators that understand code at a deeper level than simple if/then/else constructs. But rebranding your people under the DevOps moniker, or worse, creating a new DevOps team is no solution.

Not a very helpful error if you're new to Python. I'm not 100% sure what the problem is here, but it appears that Python version 2.5.2 (and older) have a problem with list expansion combined with a keyword argument.

But running this same code on Python 2.6.8 (which is in the Redhat Repos) doesn't produce the problem at all.

So the easy fix is to upgrade your version of Python. I've reported the bug to the S3cmd team to address. My guess is they'll just require a newer version of Python. Their current version test only looks for 2.4 or better, which is probably out of date.
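For anyone stuck on the older interpreter, here is a minimal illustration of the call shape involved. It is not the S3cmd code; the upload function and its arguments are hypothetical stand-ins. As far as I know, a keyword argument written after an expanded list (*args) is only accepted from Python 2.6 onward, while passing the keyword through ** expansion also parses on 2.5.

```python
def upload(bucket, key, retries=3):
    # Hypothetical stand-in for the real call site.
    return (bucket, key, retries)


args = ["my-bucket", "backup.tar.gz"]

# List expansion followed by a keyword argument.
# Fine on Python 2.6 and later; rejected as a syntax error on 2.5.
newer_style = upload(*args, retries=5)

# Same call with the keyword passed through ** expansion instead,
# a shape that older interpreters also accept.
older_style = upload(*args, **{"retries": 5})
```

Both calls produce the same result, so the second form is a workable patch if upgrading isn't an option.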

Thanks to the flexibility of virtual machines, you've probably found yourself with a clone of a production machine being deployed to a test environment. There are a variety of reasons to do this. Maybe you're preparing for an application upgrade, tracking down a particularly nasty bug or building a clone of your production environment for QA.

The fear is always "How do I prevent the clone from acting on production?" It's a very real fear, because it's easy to miss a configuration file. In an ideal scenario you'd have the test environment on a different network segment that has no connectivity to the production environment. But if you're not that lucky, then there's iptables to the rescue!

With iptables you can use the below command to prevent your test host from connecting to production.

iptables -A OUTPUT -m state --state NEW -d PROD_IP -j DROP

This command will prevent any new connections from being initiated FROM your test server to the server specified by PROD_IP (the address of the production host). This is handy because it still allows you to make a connection FROM the banned production box to the target test box. So when you need to copy those extra config files you forgot about, it won't be a problem. But when you fire up the application in the test environment and it tries to connect to prod, iptables will put the kibosh on it.

If you're super paranoid, you can execute

iptables -A OUTPUT -d PROD_IP -j DROP

This will prevent any outbound packets at all from going to PROD_IP.

Don't forget that the iptables command doesn't persist by default, so a reboot will clear added entries. To save the entries and have them persist, execute:

I was working on a quick and dirty mysql backup script today. It was nothing complex, yet for some reason I could not for the life of me get it to execute. Here's the script in its entirety, minus the obvious redactions.

I continuously received an "Access Denied" error message for the user. Just so I knew I wasn't crazy, I echo'd the command that was being executed, copied and pasted it and voila. Backups. WUT!?!

The current password was pretty long and contained spaces, so I figured, maybe the spaces were causing problems. I created a brand new user on sql.

grant select, lock tables on 'databasename'.* TO 'databaseuser'@'localhost' IDENTIFIED BY 'password with no spaces'

Same results. Works when I copy and paste the command, but doesn't execute through the script.

So to debug the thing, I removed the -p from my command line so that I'd be prompted for a password. DISCO! It worked. WUT!?!?

Now that I switched the password to be a password with no spaces, I decided for shits and giggles to remove the single quotes. Suddenly I'm in backup heaven.

I don't claim to be an expert on the evils of string variables in bash, but my understanding was that a single quote inside a double-quoted string is kept as a literal single quote. Based on the output when I echo'd the command line variable, that's EXACTLY what was happening. But for some reason mysqldump just didn't care for that.
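One likely explanation, assuming the script built the command in a string variable and ran it by expanding that variable unquoted: the shell splits an expanded variable on whitespace but does not go back and strip quotes inside it, so mysqldump receives the single quotes as part of the password. Pasting the echo'd text into a terminal is different, because there the shell parses the quotes and removes them. The sketch below imitates the two cases in Python. The command string is made up, str.split stands in for the whitespace splitting of an expanded variable, and shlex.split stands in for a shell parsing a typed command line.

```python
import shlex

# Hypothetical command as it might be stored in the script's variable.
command = "mysqldump -u backupuser -p'secret' exampledb"

# Roughly what an unquoted variable expansion does:
# split on whitespace only, quotes stay in the argument.
as_expanded = command.split()

# Roughly what happens when the same text is typed or pasted:
# the quotes are parsed and removed.
as_typed = shlex.split(command)

password_arg_from_script = as_expanded[3]  # "-p'secret'"
password_arg_when_pasted = as_typed[3]     # "-psecret"
```

That would also explain why removing the single quotes fixed it once the password had no spaces.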

Odd, but solved. I saw a bunch of people reporting the same problem on the web with no answers, so I thought I'd post my experiences.