Luke Kanies Wants to Modernize System Administration

Luke Kanies spent some time as a system administrator. That job can have its own tedium -- keeping machines up to date, building new machines, and managing dozens to thousands of individual configurations. Clever administrators automate.

Luke is the primary developer of Puppet, a clever project designed to automate away the tedium of system administration and configuration management, letting you describe what you want rather than telling machines what to do. It's all part of a plan to drag system administration kicking and screaming out of the 20th century.

Luke recently spoke with O'Reilly about Puppet, system administration, and how to provide actual measurable business value to your organizations.

I want to talk to you about Puppet, obviously, but more about the state of
operations management, change management, and all of that system administration
stuff when you have more than one machine. You're on record as saying that a
lot of the tools for monitoring are pretty good, but a lot of the tools we have
for managing instances of machines, or managing large banks of machines, server
farms, desktop farms, whatever, are still stuck in a 20-year-old
mentality.

Yep.

How so?

If you look at the way that most people think about management, they take an
"SSH and a for loop" mentality. Of the few tools that don't do that, Cfengine
is probably the best example of a tool that most people seem to know about. A
lot of people will either start using it, or they'll look at it and decide that
it just isn't worth the headache. To them it does seem to alleviate the SSH and
for loop problem, in that it allows you to kind of describe what you want your
machines to look like, but at the same time, it's really, really low level. If
you'd like to provision a given service across multiple machines and multiple
platforms, which is a really common problem, [it] forces you to essentially
consider that same service to be a completely different service, because you
have to deal with all these details, you know, "where is the file located and
what is the actual service name?" There's no ability to say, I want to step up
a level and ignore those details and really focus on the service itself.

SSH is a great example. Everyone runs SSH. It's produced by the same
application. It's always the same service. It's always the same configuration
files. But if you look at file location, on OS X they're in /etc. On
FreeBSD they're in /usr/local/etc, and on most Linuxes, they're in
/etc/ssh. If you look at the service name, sometimes it's ssh,
sometimes it's sshd, sometimes it's openssh, sometimes it's opensshd. So you
want to be able to say, I'd like to talk about the SSH service, and most tools
say, "I don't know what you mean. I've only got OpenSSH."

In terms of what we're looking at, the existing state of
administration, when you're calling this a 20-year-old mentality, the thinking
is: I have a whole lot of shell scripts that know how to execute files, I have
to hard-code the paths to my configuration files, and if I make a change I have
to scp or rsync all of these configuration files to the right bank of machines.
The resource concept in Puppet lets you step back and say, "Run SSH; the
platform and system details, we'll take care of those." Elsewhere you say, "I
want SSH running, and I want it to look like this."
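That declaration might look something like the following sketch in Puppet's language. This is illustrative rather than authoritative: the class name is made up, and real manifests usually parameterize the package and service names per platform rather than hard-coding one.

```puppet
# Describe the SSH service once; Puppet's providers map the abstract
# 'package' and 'service' types onto each platform's actual package
# name, service name, and init system.
class ssh {
  package { 'openssh-server':
    ensure => installed,
  }

  service { 'sshd':
    ensure  => running,
    enable  => true,
    require => Package['openssh-server'],
  }
}
```

The point is the level of the description: you say "SSH should be running," and the per-platform details live in the providers, not in your configuration.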

The majority of your configuration files across the machines are actually
the same. Look at the password file as an example. You don't think of the
password file as a configuration file anymore, but clearly it is. 90% of the
users in that file are the same users, the system users. And so we have these
good tools to manage part of those configuration files, the parts that are
different for us. But for most configuration files, we don't have those tools,
so we're forced to write our own tools to manage the differences and the
similarities. You look at how people distribute config files, and what they'll
do is have 78 versions of a given file: this one goes to this host, this one
goes to this [other] host, and this one goes to all the hosts in this data
center, or something like that. For one, you have this huge duplication, right?
Because you've got a lot of similarity among all these files. But really, the
more nefarious problem there is that you can look at the configuration that
says this host gets this file, and what you don't know is what's actually
going on.

You don't know why that host gets that file.

There's the why, but then there's also [the question], what's actually
different between this host's file and the default file? You're missing the
intent, but you're also missing the information itself. All you know is that
the host has a different configuration file. In fact, what most people do is
say, "Here's the list of files that a host could receive," and it will be this
ordered list that says look for a host-specific file, then a class-specific
file, then an operating-system-specific file, and then the default file. In
that situation, literally looking at the configuration, you have no idea if a
file even exists for this host, much less what's going to be in that file.

It's completely non-deterministic. It's horrendous. Or, it's deterministic,
you just can't, from a given place, know what's going to happen. You have to
look in three different configuration locations to determine what's going to
happen, and even then, once you say, "ah, there's a host-specific file," you
have to diff that against all the other files to see what's different about
this host.
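For what it's worth, Puppet's file resource can make that fallback list explicit and declarative in one place. A hedged sketch, where the module path and fact names are illustrative:

```puppet
# The first source that exists wins, and the whole search order is
# visible in one place in the configuration, instead of being implied
# by a directory layout.
file { '/etc/ssh/sshd_config':
  ensure => file,
  source => [
    "puppet:///modules/ssh/sshd_config.${hostname}",
    "puppet:///modules/ssh/sshd_config.${operatingsystem}",
    'puppet:///modules/ssh/sshd_config.default',
  ],
}
```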

This is a direct consequence of being stuck in the "everything in
system administration is a shell script or a configuration file" mentality,
[where] all we can do is work with these specific files as they are. It's a
result of that mentality.

Exactly. Then of course, you can throw something like NetInfo into the mix.
You've got all these great file tools, and somebody says, "I've got a Mac and I
want to be able to manage some users," and you're like, "Those aren't files." You
know? Or LDAP, or anything like that! I've got APIs instead of files. I don't
know what to do!

We can make the argument that the Unix file structure is in fact an
API, but I take your point. That's a very good point. When did you have the
realization that modeling these behaviors and these services and these
configurations as resources in Puppet was the right way to go? Was it
something you knew all along? Was it something you picked up from Cfengine? Or
did it come about gradually?

About five or six years ago, I had rewritten a tool called ISconf, by a guy named Steve Traugott. He wrote a seminal paper, I
believe in '99, for LISA, that kind of went through and said if you're going to
have a managed infrastructure, there are nine core services that your
infrastructure needs to provide in order to make your infrastructure work. He
did some work at Morgan Stanley and other places. What he built was ISconf, a
very simple make-based tool.

His idea was to store all the work you did in make targets, and then execute
those make targets in a specific order, and thus you could always recreate the
systems. So this was my first interaction, my first exposure to somebody else's
tool to do this kind of work. I experimented with this tool and decided that
what he had, which was kind of a typical combination of make, Perl, and shell,
was just not quite sufficient. I rewrote the whole thing, but quickly found
that all the work was still being done in make. So you've got a package to
install, and you go, OK, I want to install this package, so you make a make
target for that. And then you're like, "well, now I want to install another
package," and the make target looks almost exactly the same, except for the
package.

I started extracting out what it meant to install a package, and I'd create
these sort of default make targets. So it'd be like package/%, and using
GNU Make, you could do all this... what are they called? Pattern rules. You
could extract the part that matched the percent sign and have your package
stanza know what package to install, based on how the stanza was called.

In one place you'd say "I want package/openssh," and in another place
you'd have a package/% stanza that knew how to extract "openssh"
from the target name. This was kind of the start, in that it
was accidental, because make wasn't very powerful. The way to
communicate and the way to extract things in make was really
simplistic. What I found pretty quickly was that I had stanzas for all
the major things I needed to manage. I had package stanzas and
resource stanzas and cron stanzas and host stanzas, and they knew
how to do all this stuff.

The problem with this, though, is that packages are easy. They're either
installed or not installed. Things like cron jobs or hosts are harder, because
they have multiple [states]. You don't want to try to extract all the fields of
a cron job from the make stanza name. That's just not pleasant. I mean, doing
this kind of stuff in make is already unpleasant. So in one place I'd build a
table of all the resources I had, basically a data dump, a stash of all the
cron jobs that I'm ever going to install anywhere on my infrastructure. What my
make stanzas would do was just say, I want that named cron job. I'd name each
of the cron jobs, and then I could extract one of them and say, I want that one
installed, I want that one removed. I had this symbolic representation of a
resource. You know, I wasn't using the term "resource" back then, but... it was
a resource. Then in another place I'd say, "...and this host gets that
resource."
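That symbolic representation is essentially what became Puppet's built-in cron type, where every field is a separate parameter instead of something parsed out of a stanza name. A minimal sketch; the job name and command here are made up:

```puppet
# A cron job as a named, structured resource: installing or removing
# it is just a matter of flipping 'ensure' between present and absent.
cron { 'logrotate':
  ensure  => present,
  command => '/usr/sbin/logrotate /etc/logrotate.conf',
  user    => 'root',
  hour    => 2,
  minute  => 0,
}
```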

Were you able to write out instructions to the makefile? make
install ssh or make package openssh? Were you able to
describe in code, or maybe a shell script or configuration file, exactly what
to install and configure when you set up a new host?

ISconf is kind of weird, in that what you did with a host was say,
"here are the N make stanzas it takes to turn you into what
you're supposed to be." If you're an Apache host, there are, say, 25 make
stanzas associated with being an Apache host, and they all get run in the
correct order every time, so it kind of does each thing in turn. It was a
really complicated system. I wrote this paper in, I believe '01, or, no, it
was '02 I think, that basically said ISconf wasn't quite sufficient, and it's
because make is a horrible API. So I integrated ISconf with Cfengine.
Initially most of the work was still in ISconf, but what I found is that all
of these models I had built up, as packages and hosts and resources,
translated really well to Cfengine.

Cfengine has a similarly crappy API, potentially, but if you want to
integrate Cfengine with an external tool, your only real integration point is
[shell script]. With Cfengine I had this similar kind of thing, where I had this
database of resources, and I'd collect what all the resources were and how they
were configured, and I'd drive Cfengine with that, and then I'd make decisions
about what resources I wanted. If I'm in this class, then add these five hosts
or these five cron jobs to the host. If I'm in this class, add these five
packages to the host. I realized, after a few years of doing Cfengine and doing
consulting, that I had a split: all my decisions were in Cfengine, but all my
actual work was in another application entirely, and there really wasn't much
consistency to that application. I'd do cron jobs one way and I'd do packages
another way.

When I realized I wanted to fix that, what I really wanted to do was teach
Cfengine how to understand abstract resource types. I wanted to add a cron
resource type to Cfengine. I wanted to add a service resource type, or any of
those things. Then you look at the Cfengine code and you realize that it's
60,000 lines of spaghetti C code, and that there are 25 unique syntaxes in the
language, and things like that.

You have the one maintainer and he keeps pretty tight watch over
it.

He does. There was this great quote from him around 2005, where he said that
he thought version control was overhead. Of course, he's a CS professor, right?
So he would know. That was kind of frightening, in that you're like, "Look,
let's collaborate on this," and he's like, "That stuff's all unnecessary," even
though we're getting... regressions and things like that. But then you also
have issues where he would say, "This is how it should be." Anybody who
starts writing C especially, like most languages, starts with gigantic case
statements scattered throughout the whole system. Anybody who does that very
much, for very long, learns that it doesn't scale well. It doesn't work well.

You need to extract... what is it? Replace conditional with
polymorphism.

Exactly. He thought that that was the stupidest thing he'd ever heard. If
you wanted to add a new resource type to Cfengine, and Cfengine calls them
actions, then you had to find every one of these case statements and add
support for your action in that case statement, throughout the whole system.
Which meant, of course, there's no way you could ever write a redistributable
module that could just be dynamically loaded.

No. You're writing a patch against the whole thing.

Right. When you look at that, you [think] "yes, I could refactor all of
Cfengine to make this work, or I could take a language that I'm more fluent in,
that I'm just going to be more productive in in general, because it's a much
higher level language than C." That's when I started experimenting with a
separate tool. With Puppet, the core things that I came in with were: I wanted
to have this core idea of a resource, and I wanted anybody to be able to add a
new resource type. It couldn't just be up to me what resource types existed. I
really wanted to separate the language, the way we're talking about resources,
from the code that defines them, so that you could add in your resource type
and not have to modify the parser. Which seems obvious, but in the Cfengine
world every action needed direct support in the parser.

You mentioned the same thing with your make-based solution as well.

Yeah, you had to go in and modify the makefiles, and you had to add your
code, and all three places had to be updated at once.

You worry about shell script quoting and all sorts of other rules there.

In fact, I had this really fun [time]. I was running this on, at the time I
think it was, Solaris and HP-UX, and that's probably all. It was pleasant
enough. We were doing basically a bootstrap, where the first make stanza
installed gmake, and then the next stanza just exited if gmake wasn't in use
yet. You had to have this massive bootstrap.

That was definitely one of my goals going into Puppet: let's make
bootstrapping really simple. If you've got Puppet installed, you shouldn't need
to do anything else. You know, there are authentication aspects, so you've got
to go through this whole key-signing process, but key signing in Puppet is
really, really simple. There's a story, if you heard the RedMonk podcast with
myself and Nigel Kersten from Google last week: he talks about being at an
Apple conference and listening to Jeff McCune talk about Puppet and how awesome
it is. During the course of Jeff's actual talk, Nigel VPNs into Google, brings
up a virtual machine, sets up Puppet on it, and is managing machines by the end
of the conference. That's what you want.

There's a competitor to Puppet called bcfg2, and its author likes
to say that it's very reasonable to get a bcfg2 installation up and
running in as little as three days. And I'm like, three days?

Three days is pretty good. It's an improvement anyway. We'll call it that.

The resources were a big part of what led me to discard the existing tools,
and I looked at a lot of the tools out there. You've got LCFG, which is probably the oldest and most
mature tool, in use by the University of Edinburgh. It's got some really great
features, but it's also still just a bunch of shell scripts. A bunch of
whatever scripts you want.

In the end, you've got SmartFrog out of HP, which is again really
interesting. It's got a great language, but there are no resources, no
higher-level abilities at all. CERN has this tool called Quattor that was dead
for a while, but apparently it's been resurrected recently, now that
the Large Hadron Collider is actually producing data; I guess they needed to
worry about it again.

It kind of has the same thing. I went to all these tools and I said, I want
abstraction, and they said, "we've got great facilities for something or
other." Well, that's not really sufficient.

I'm sure any large organization or company has some sort of home
grown, if not multiple home grown solutions to this.

That's my 90% competition: what people build internally. Fortunately, most
people don't like what they have. You talk to most IT guys and you ask, "So
what do you use?" And they go, "Ah, God, I've got this junky thing that I wrote
myself, and ugh, I'm embarrassed. I could never publish it because it's such
horrible code." But you can come to them and you can say, "I've got this great
tool that everyone loves."

In just three days you can be up and running!

Or in just an hour.

Then people seem to be a lot more interested. Everyone's really
concerned that system admins are going to have so much ego attached to their
own implementation, but the truth is that system admins are pragmatists in
almost all cases. Their goal is: can I go home earlier? Is there some way I
can get more done in less time? They're not going to publish their own
code anyway, and Puppet is actually a good enough tool, it makes it easy
enough to write simple, readable code, that people are more likely to publish
Puppet code than they are their own tools.

I understand you're working on a repository of resources and models
that people can reuse and redistribute.

One of our goals for the year is to get what's recognizable, especially to
you, as a CPAN of Puppet models. Most
people's problems aren't unique. Everyone has some unique problems, but for a
given person's problems, most of those problems are shared by a large number of
other people. So you can download solutions to those problems, rather than
having to figure it out yourself every time. Or not even figuring it out: it's
easy, right? It doesn't take intellectual effort. It just takes you typing out
the stupid code, debugging it, making sure it all works across all your
platforms, and then you hopefully never have to think about it again. Our goal
is to make it so you can download these things. We figure sometimes it's going
to make sense to allow people to share solutions on there, as opposed to just
downloading them. I don't know about that kind of stuff, but it'll be an
interesting experiment anyway.

I imagine at this point that people are reading or listening to this
and thinking, you know, wait a second. I can just make an image of an installed
server, or I can just have a virtual machine and clone that. Why do we need
this configuration management to help me set things up? I suspect I know what
you're going to say, but can you respond to that?

This comes up a lot in the cloud thing. People say, you know, I don't need
Puppet in the cloud, because I've got virtual machines and I can just clone as
many virtual machines as I want. And that can work, but if you look at, say,
your standard LAMP stack, at the very minimum you're going to have a load
balancer in front, a web server, an application server, and a database server,
and this is assuming a single application. Now you've got four different images
that you've got to maintain. Anytime you want a new one, you bring up a new
copy of that image. But if you ever want to change that image, let's say you
bring out a new version or you add a new user to your system, how do you add
that user? Well, you modify the image, you upload the image to EC2 or whatever,
and then you reboot all of the machines on your network to get that new user.

What if you want to add a new user to all four of your images? Now you're
opening up all four of your images, adding that new user and then rebooting
your whole network, just to get a new user.

But I can just do this with a for loop over a shell script. Right?

Exactly. Now you're back to square one. The truth is that images make a lot
of things a lot easier, but when it all comes down to it, VMware is great for
managing the outside of a box. I've been told this is a horrible analogy, but
the way I think of it is, all of these virtual machine systems -- they're
really good at producing and managing eggs, you know, these self-contained,
sealed eggs of functionality. But they're not very good at getting inside
the system. They can't get inside the egg and manage what's going on in there.

You need a tool like Puppet to get inside the machine and say, all right,
now that I'm up and running, let's differentiate this host. If you look at
most organizations, what they do is use something like Kickstart or JumpStart
or Ignite to get the machine from bare metal to functional with as little work
as possible: a kind of base operating system that's completely undifferentiated.
Then that installation gets Puppet on it, and from there Puppet does all the
rest of the work.

The great thing about this is, a machine built six months ago has
all the changes that you've made to your configurations since, and it will
still be the exact same configuration as a machine built today. If you add a
user to your configuration, all the right machines get that user. If you
update a security package, all the machines get that security package. If
you're using images, your question is always, which image is that guy running?
Especially with real hardware, you have the issue that you don't really want to
reboot your real hardware all that often.
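With a declarative tool, "add a user to your configuration" is one resource declaration that every machine converges on, with no image rebuilds and no reboots. A hedged sketch, where the username and uid are made up:

```puppet
# Declared once in the configuration; every machine that applies it
# ends up with this user, whether the machine was built today or six
# months ago.
user { 'jdoe':
  ensure     => present,
  uid        => 1205,
  shell      => '/bin/bash',
  managehome => true,
}
```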

With virtual machines, in really simplistic cases, images are great. If
you've got a cluster of 1,000 nodes that are exactly the same, who cares,
right? You spend as much time as you want tuning that image, and then you can
reboot your whole cluster at night, and all the scientists who are using it,
or whoever, they don't care, they're gone, and that works great. But if you've
got four or five different images, or worse, images that are not quite the
same but are pretty similar, then you've got a real problem on your hands.

You can use Puppet not just for installation and set up, but for
continuing maintenance.

Oh yeah, you would. The idea with Puppet is that you would use it from the
day you turn the machine on until the day you turn it off. You should
never have to log onto that machine unless it's to consume the service it
provides, rather than logging in to administer it. Most importantly -- and
[inaudible] does this too, but Puppet does it better -- the key feature is
that you write this application that manages your entire network. You have this
infrastructure application that knows how to provision and maintain all the
services on your network. As you update that application, Puppet brings your
network into sync. If you change what your Apache server needs to be, then
Puppet changes your Apache server to meet that definition, or it provides
helpful suggestions saying, you know, that package doesn't exist, or whatever
the error is. So you should never have to worry about services being out of
sync, even if one machine was built one way and another machine was built
another way, or Johnny built this computer and Billy built that computer and
therefore they're configured differently. This application manages all of your
services, for their entire lifecycle.
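The "infrastructure application" idea might be sketched like this: one class defines what an Apache server is, and node definitions assign it to machines. The class, node, and package/service names here are illustrative (the httpd names are Red Hat-style), not a definitive layout.

```puppet
# The definition of 'an Apache server', written once.
class apache {
  package { 'httpd':
    ensure => installed,
  }
  service { 'httpd':
    ensure  => running,
    enable  => true,
    require => Package['httpd'],
  }
}

# Assigning the service to a machine: updating the class later brings
# every node that includes it back into sync.
node 'web01.example.com' {
  include apache
}
```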

And keeps them up to date.

Let's talk about operations. As I said before, you're very much on record as
saying we're stuck in an 80's mentality. I very much see how Puppet and other
tools like it are leading us away from that. Are there areas besides
configuration management and change management where you see operations needing
to "update or die"?

I think the biggest problem with operations right now is that there's a real
disconnect between the people who are doing all the work and providing all the
value, and the people who are writing the checks. I was talking to some people,
including Tim O'Reilly, earlier this week: you can pick up any system
administration book off the shelf, and I defy you to find where it talks about
metrics. How to provide useful metrics to your bosses. How to provide useful
metrics to your executives, indicating the value that you're providing. How
much have you done to demonstrate that you've reduced the error count for your
network? To be able to say, sure, we've got five nines or whatever -- that's
nice! But how much have you really done to say, "Here's what we're doing.
Here's what we've done."

I mean, even simple information... I wish this were a joke: I've never
joined a company that knew how many computers it had. Never. Let's start with
a really simplistic metric. If you can provide that to your employers, that's
a start. But then you go on from there to much better metrics. What kinds of
things are we doing? How many trouble tickets are we responding to? How many
changes were made? Were the changes successful? What is the root cause of all
the failures we're having? If you get that information, then you begin really
validating your presence.

That's a really simplistic metric, but there are other really
great metrics you can provide. Obviously, all the changes you're
making: while you're making those changes, are they in response to
exceptions? Are they adding new features? Are they customer requests?
These are the kinds of metrics that executives need to know. They need
to know: why are you spending all of this time doing this? Where does
all of this time go? I know system admins constantly
complain [and say], "Oh, I go to my boss and I need this or I need that
and they just won't give it to me." I ask them, "Well, how did you
try to convince them?" They say, "Well, I said it very loudly."

What you need to do is show up with the same kinds of graphs that the sales
guys show up with, that the other organizations show up with, and say, "Look,
here's data showing what I need." There are very few bosses who are going to
turn you down.

One of my first big goals with Puppet is to build an ecosystem of tools that
allows me to really manage what's happening on the network, in a way that I can
explain, that I can provide a clean interface to, not only for my immediate
bosses in the IT organization but for the high-level executives: "Look, we
deployed 1,000 servers this week. We deployed 1,000 services. We did this. We
did that. Here's an RRD graph of the changes that are going on in the network.
Here's where we have this spike of errors, because someone forced us to do
this, this quickly: an upgrade that we recommended against." Things like that.
That direct feedback between the people writing the checks and making the
high-level strategic decisions, and the people doing the actual work on the
ground, is really missing today.

You look at application organizations [and] software organizations, and a
lot of what they've been doing is asking, "How can we connect our software
development to our business needs?" You listen to their podcasts, you read
their books. It's all about that. What does the business need? How can we
best meet those needs? And then demonstrate meeting those needs.

Are we as operations people -- I'm not really an operations person
anymore, thank goodness -- are the operations people not doing this because
they don't know what's necessary? Because it doesn't fit our engineer-brain
personalities? Because we've never really had the tools? Or is it some
combination of all three?

I really think it's a combination of the three. Especially as you get
feedback. You go in, you come to a conclusion, you're convinced it's right,
and the first time you get told, "no, that's not right, justify it," [you
think] "I shouldn't have to; it's just obvious. I've just convinced you via
strong logic."

Here's my engineer brain. It's obvious, just trust me.

Right. There's this real belief, especially in the engineering community,
and especially in the system admin community, that you shouldn't need to
sell something. I shouldn't need to convince you via anything other than
argumentative logic. To some extent that's true, but data isn't necessarily
sales. Using good graphs, good information, and good research to demonstrate
something isn't really sales. And even if it is sales, it's probably one of
the most important things that system admins can do. I'm not that good at it,
but if you do a really good job of selling the value of your organization to
the leaders of your organization, you'll be more successful than with any
tools you could ever write.

I'd love to tell you differently, but if you sell well internally, you'll
get access to buy whatever tools you want. If you don't sell well internally,
then you're going to be starved for resources, you're going to be fighting with
users instead of enabling them, and the truth is, if you're a system admin, and
you're not there for your users then why are you there? If you're not there for
the organization, then why are you there?

It's great to build all these computers, but they're not hiring you to build
computers. They're hiring you to solve their business needs.

Do you agree with the assertion that IT is just a sunk cost and that
it doesn't really matter?

Oh, no! Not at all. If that were the case, then you couldn't differentiate
on it, you know. Clearly organizations like Google and Amazon do -- Amazon
especially, because it's an operations story. They're not a sunk cost. They've
spent all this money on IT, and not only did they save tons of money on their
organization -- they've got a fabulous organization operationally -- but their
operation was so fabulous that they were able to start selling it to other
people. If they had viewed their IT as a cost, they would have stopped
investing in it when it was sufficient, and there would be no EC2. And
who's talking about Amazon? Why is everyone talking about Amazon.com now? It's
certainly not their storefront, right? It's their EC2. It's their S3. That's
where all their press is coming from today. They basically brought the cloud
to the forefront, and it was because they had the vision to not view
IT as a cost.

Tim likes to talk about IT operations as a competitive advantage, and in
this and many other cases, it's a clear-cut example of somebody saying, "If we
do it right, we're out to compete. It's not just a question of reducing
costs, or of how cheap I can make IT. It's a question of: how can I make IT
work for me? How can I make my services deploy faster, with fewer exceptions,
requiring less human input, and with a faster ability to respond to issues?"
If you've got all those things and you invest in that, then when things do
need to scale, you're ready.

Why don't people see this now? Is it again the lack of
salesmanship? Or has no one really asked that question before? Or does everyone
try to put their operations in a corner and say, "Okay, we hope we're able
to compete on this, but we're going to keep what we're doing a secret,
like all the quants on Wall Street"?

I really just think it's an uninteresting and unsexy space, and people don't
want to talk about it. I've been trying to convince investors and developers
and anyone who will listen, anyone who will let me talk for five minutes, that
this is an interesting problem and that it's worth spending time on. I
seriously get crickets. I was at the Web 2.0 conference in '05, mostly so I
could learn how to steal some of their neat ideas and use them in system
infrastructure, and I told them what Puppet did, and they were like, "Wait,
what kind of consumer application is this?" And you go, "No, no, no. It's the
stuff that makes web applications work." And they go, "Well, that's just
unnecessary."

It's not sexy. It's not fun. It's not interesting. But, because they're not
interested, it's harder to sell to them. Because it's harder to sell to them,
the system admins feel like they shouldn't have to sell, and because system
admins don't try very hard to sell, what they're doing is uninteresting and the
executives go, "Well that's not a very interesting sales pitch."

"Solved problem. They have their shell scripts and makefiles and
they're happy."

Certainly there's an expense there, too. Everyone I talk about Puppet to,
especially anyone who's an executive or in sales, says, "Great! So I can
install Puppet and then fire a bunch of people." And you go, "No, what happens
if you install Puppet is that the service your IT organization provides you
gets ten times as good." What ten times as good means is that they can actually
start selling to you. If you look at what it takes for a system admin in an
organization to implement the tools it takes to do a good job of demonstrating
the value of IT, it's a ton of work. It's a ton of integration and, today,
often a ton of development.

How do you get the resources to do all that work and all that development?
Well, you have to do sales. How do you sell that to the organization without
any data? It's really hard. You have these kind of nasty feedback loops, where
it's up to some person internally. Usually the only way to get out of this loop
is to have at least one really kick-ass system admin just say, "You know what?
I'm not going to do what I'm supposed to do. I'm going to do what I know I
should do, and I'm going to focus on sales. I'm going to focus on convincing my
organization that this is awesome." Then you need somebody in the managerial
chain who says, "I see the value in that. I understand the logic of the sale
you're trying to make, and I'm going to give you the time and space that you
need to get it done." As long as that time and space is six months or a year,
it's reasonable. They're not going to give you five years, you know.

The best example I know of is Xev Gittler, who at Morgan Stanley in the '90s
built this cool tool called, I believe, Aurora. It's not really a tool, it's an
infrastructure. It was published in the proceedings of LISA in, I think, '96.
They're still using it to this day, and they haven't really changed it much.
Somehow, he got the right, internally, to just rebuild their entire
infrastructure based on this great design, and as a result, they've had a
completely automated infrastructure for ten years. I think they're the only
bank that has this across-the-board automated infrastructure. I talk to the
bank and I go, "God, that would be awesome." They've had so much success with
it that now they're looking to replace it. They're trying to build a new
version of this to last the next ten years, and there's no discussion
internally. I mean, everyone knows just how they want to do it. No one talks
about bringing in big commercial companies to solve the problems, because
they've had so much success building it themselves that the sales job is much
easier. Once you get that first bit of success, the rest of your sales come
that much easier.

After we get this part solved, selling IT, selling operations to
people with checkbooks, what's the next step for system administration, for
operations, for integration and configuration? What happens then? What's the
next problem to solve?

I think the next problem to solve is making things more manageable. Every
system admin can sympathize with being given a tool to manage and trying to
figure out, how can I automate this thing? Every Oracle person will tell you,
"Oh, you cannot automatically install Oracle," or, "You can only do it under
these certain circumstances." And you just learn: you know what, I'm not going
to listen to you. I'm going to find my own way to automate this tool. Right?
Oracle has these things (I don't remember what they call them now) that they
claim can do automatic installs, but only under certain circumstances. Every
tool has this problem. Every new application introduces this kind of
complexity.

As we get better tools and more sophisticated ecosystems, your monitoring
system and your configuration management system stop being silos and start to
communicate. You don't get an autonomous system, but you start to be able to
say, "Okay, Puppet upgraded SSH last night, and SSH is broken on all my
machines. I wonder if there's a relationship there." Once you have that kind of
work being done automatically, you want tools that provide better management
interfaces in the first place.

I was just talking to a pretty big software company this week. They're the
first vendor to ever call me and say, "How can I make my software more
manageable?" This has been a dream of mine. I want every vendor to ask
themselves this question, "How can I make my software more manageable? How can
I make it easier to plug my software into an integrated management system?" As
you get a completely automated system, as you get the ability to manage more of
your network, more of your infrastructure, what you'll find is, partially
you're going to always choose your software based on what it does for you.
Right? But you're also going to be choosing your software by how manageable it
is.

Imagine you can start choosing between two essentially equivalent pieces of
software in a relatively commoditized space. One of them is easily manageable
and plugs right into your infrastructure, or possibly even ships with the
Puppet code to manage it. The other does not; the other is, well, you've got to
put the tarball in that place because it was written by DJB and he's really
insistent. When you have those choices, it becomes an easy choice.
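To make that concrete, a vendor-shipped module might look something like this minimal Puppet sketch. The "acmedb" package, service, and file path are invented for illustration; the point is the classic package-file-service pattern that lets the software plug straight into an existing Puppet-managed infrastructure:

```puppet
# Hypothetical manifest a vendor could ship with its software.
# Installing the module gives the admin a working, managed service
# instead of a tarball and a README.
class acmedb {
  # Install the software from the platform's package system.
  package { 'acmedb':
    ensure => installed,
  }

  # Manage the configuration file; shipped by the module itself.
  file { '/etc/acmedb/acmedb.conf':
    ensure  => file,
    source  => 'puppet:///modules/acmedb/acmedb.conf',
    require => Package['acmedb'],
  }

  # Keep the service running, and restart it when the config changes.
  service { 'acmedb':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/acmedb/acmedb.conf'],
  }
}
```

A site would then just declare `include acmedb` on the relevant nodes, rather than reverse engineering how the software wants to be installed, configured, and restarted.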

At that point, you should need fewer people on the ground doing the nuts and
bolts infrastructure kind of development that says, "Here's how you manage that
software, and here's how you manage this software." Then it really becomes a
question of how pieces of software relate and how you plug them together, and
it looks a lot more like a jigsaw puzzle than the needle-and-thread,
sewn-together mummy it is right now.

Or, here's a pile of wood. Here's a jigsaw. Go ahead and build your
own puzzle.

Exactly. I think when that happens, you'll naturally see a split in system
admins between the people who swap out hard drives and are really the
on-the-ground, nuts-and-bolts folks, and the people who are much higher level.
Right now, those are the infrastructure architects and the infrastructure
developers, but in the future they'll be all about integrating these pieces
rather than trying to reverse engineer how to manage them.

Every company is reverse engineering those right now. But they won't
have to in five years.

That would be the hope wouldn't it?

I don't ever want to do that again, personally.

Exactly! I keep listening to system admins talking about it on the Puppet
channel and I go, "Ugh, I'm so glad I'm not doing that anymore." If you're a
sysadmin listening, this is one of the ways out. Write your own automation
software to allow other people to solve their problems, and they will hopefully
give you enough of a living, so that you don't have to solve it yourself
anymore.

Five years later we'll interview you!

Exactly.

Luke, I really appreciate
your time. This has been fascinating, and I believe and hope that it's
useful to our readers and listeners.

I hope so too.
