1. That new car smell

We all remember that time. You know when you drove out of the dealership with your brand new wheels. Driving down the road, you feel on top of the world. You start dreaming of all the fun times you’ll have in your new car. For days and weeks afterward you walk out to your car, open the door & sit inside. It all feels special. You kind of hang out there for a few minutes enjoying the smell before you drive off. Right?

Let’s be vigilant to remember the same thing happens, or rather is happeing in technology all the time. As we automate our infrastructures with Ansible, Puppet & Chef, deploy continuous integration with Jenkins, Travis or Codeship, we should give pause. Each of these tools has it’s own syntax, it’s own bugs, it’s own community, it’s own speed of development & change, it’s own life.

2. A lot of rushing

Google tells me the synonyms of agile are, nimble, lithe, supple & acrobatic. So in a fast moving world it’s no wonder agile is so big. Anything that allows us to respond to customers quicker & evolve our product faster is a good thing. Yes it is.

Over the years I’ve worked with a lot of clients & customers. Some right out of the gates are in a hurry. There is a sense of urgency even from the initial meeting. Although not in every case, sometimes these are the sign of the perpetually late. They end up throwing money around, throwing technology around, and all in a desperate attempt to plug a leaking ship.

3. Hosting, what’s that?

For many of the startups I work with today, they’ve never deployed on anything but Amazon. There was no rack of computers in a closet & a T1 line, circa 1997. There was no rackspace hosted servers or a colo in New Jersey circa 2005. Right from the beginning it was all on-demand computing.

This shift has surely brought a lot of benefit. But no one can argue it isn’t still very new. And with newness there is a learning curve. And bugs & surprises.

4. More complexity in troubleshooting

The wild ride really begins when you’re troubleshooting performance problems. Running your database on RDS you say? How the heck do I get to the terminal and run “top”? Can I do an iostat?

And what does iostat output really mean in multi-tenant Amazon, where your disk is an EBS volume across an unknown & unfriendly network. Who knows why it just slowed to a crawl, then sped up dramatically a few minutes later.

Even fetching the relevant logfile can be complicated. For all the problems the cloud eliminates, it sure introduces a few of it’s own. And who is the expert, and how to find them?

There are new message queues like NSQ & programming languages like Markdown, Coffeescript & Clojure. Even Java. Are people still building web apps in Java. No please no!

While it’s wonderful to see such an explosion of innovation, I look at this from an operations perspective. In five years, when the first & second wave of developers at your startup have left, picture yourself trying to find talent in a long since out-of-fashion language like Dart or Swift. What’s more how do you untangle the mess you’ve now built?

Application infrastructure is not something we learned in my college, and it’s definitely not something I will learn anytime soon in my current job (I work as a mobile developer for a mid-sized startup). I also think it’s not something you can just goof around with in your own computer. Do companies prepare their software engineers when hiring infrastructure engineers, or do they all expect you to know your skills and tools? Also: Is automation killing old-school operations
For example, My guess is that Facebook has a huge infrastructure team making the site usable and fast for as many people as possible. Where can you learn that skills, or get prepared for that time of job? Do you think it is possible to self-learn those skills?

Here’s my take on some of this. Since the invention of Linux, experimenting with infrastructure has been within reach. In the present day there are some even better reasons to experiment & teach yourself about this important aspect of devops & backend server management.

Early Linux circa 1992

Before Linux (in the 80’s we’re talking about) it was a lot harder. Into the 90’s Linux came on the scene and you could cobble together parts, video, motherboard, memory, ide or scsi bus & disks & build a 486 tower. You could then start building linux. I mean because of course everything had to be hand rolled (compiled by hand & debugged usually)!

Present day virtualization

What to learn

Start learning Vagrant. It automates the provisioning of virtual machines on your own desktop. You can boot those linux boxes to your hearts content, network between them, hack them, run services on them, build your skills.

I’d also recommend digging into docker. It is the lightening fast younger brother to Virtualization.

Truth is I have heard whispers of this before. I was at a meetup recently where the speaker claimed “With more automation you can eliminate ops. You can then spend more on devs”. To an audience of mostly developers & startup founders, I can imagine the appeal.

1. Does less ops mean more devs?

If you’re listening to a platform service sales person or a developer who needs more resources to get his or her job done, no one would be surprised to hear this. If we can automate away managing the stack, we’ll be able to clear the way for the real work that needs to be done!

This is a very seductive perspective. But it may be akin to taking on technical debt, ignoring the complexity of operations and the perspective that can inform a longer view.

2. What happens when developers leave?

I would argue that ops have a longer view of product lifecycle. I for one have been brought in to many projects after the first round of developers have left, and teams are trying to support that software five years after the first version was built.

That sort of long term view, of how to refresh performance, and revitalize code is a unique one. It isn’t the “building the future” mindset, the sexy products, and disruptive first mover “we’re changing the world” mentality.

It’s a more stodgy & conservative one. The mindset is of reliability, simplicity, and long term support.

3. What’s your mandate?

That word I believe is “risk”. Devs have a mandate from the business to build features & directly answer to customer requests today. Ops have a mandate to reliability, working against change and thinking in terms of making all that change manageable.

5. ORM’s and architecture

If you aren’t familiar, ORM’s are a rather dry sounding name for a component that is regularly overlooked. It’s a middleware sitting between application & database, and they drastically simplify developers lives. It helps them write better code and get on with the work of delivering to the business. It’s no wonder they are popular.

There is broad agreement among professional DBA’s. Each query should be written, each one tuned, and each one deployed. Just like any other bit of code. Handing that process to a library is doomed to failure. Yet ORM’s are still evolving, and the dream still lives on.

And all that because devs & ops have a completely different perspective. We need both of them to run modern internet applications. Lets not forget folks. 🙂

1. performance & scalability

As your grows like Birchbox, your customer growth curve may begin to look like a hockey stick. That’s a good problem to have. Will your web application be able to keep up with the onslaught of traffic those customers bring?

Getting performance and scalability just right, will mean fewer site crashes during those key moments when all eyes are on your site.

2. Operations is key to architecture

Developers will always have strong opinions on architecture. However they may be heavily influenced by their own mandate, features, deliverability & deadlines. So it’s no surprise that they may sometimes choose to build on ORM’s, the middleware brought to you by Hibernate, Cake PHP, Active Record & the like.

And while these technologies seem a necessity in todays modern architectures, they play havoc with your long term scalability. Strong technical operations teams mean a better vision in this area. Heading off your reliance on these technologies will mean managing technical debt before it takes down your country.

3. Operations informs strategy

Did you build in those operational switches to turn off the heaviest code, when your site gets overloaded? Operations strategy can help you see these problems on the horizon before they overwhelm you.

Have you considered building a browse only mode for your site? If you’ve ever visited Facebook or Yelp after hours you may have been greeted with the message “We can’t save your comments. Please try again later”. A small innocuous message to end users doesn’t disrupt their enjoyment of the site terribly. But from an technical operations perspective it’s huge. It means teams can perform backups, upgrades and maintenance without interrupting day-to-day activity on the site.

5. Operations means technical strength

At the end of the day, getting technical operations right, means you can move from strength to strength. It means building on a solid foundation the likes of Google, Facebook, Foursquare & Etsy. It means you can evolve & grow with your customers, and meet their needs confidently.

In this second part, let’s dig into deploying applications in the cloud, and day to day operations skills. There’s a lot of material here. We recommend picking a few questions out of the bunch and focusing on those questions, rather than trying to cover all of them.

1. Deploying in the Cloud

Deploying applications into virtual or cloud datacenters involves understanding and evaluating providers. Many just deploy on Amazon EC2 as it is far and away the largest cloud hosting solution, with the most robust offering.

There are probably two things that set Amazon apart from other cloud infrastructure solutions. EBS or elastic block storage being one. Although the others have storage solutions, and Rackspace is working on their own virtualized storage, Amazon seems to be the furthest ahead with their offering. It is fully virtual, allows arbitrary chunks of storage to be attached to instances, and allows instances to boot of ebs volumes.

The other major point is that since Amazon has grown so large, so quickly, it has more datacenters, in more geographically dispersed areas than other providers. Since these are organized into logical resources, and can be accessed through API, it makes your application infrastructure truly virtual.

Everyone who has managed operations, has worked with vendors at one point or another. For example if you’ve worked with Rackspace you know that it’s pretty easy to get a human on the line. Amazon on the other hand allows you to do-it-yourself for everything, and only later added on a support service option. So their service pattern and history are different.

There isn’t really a right or wrong answer to this question, but it’s a nice starting point to discussion. It can also help illustrate a candidates communication skills, and how specifically they walk through solving a problem. What problem they choose as an illustration, and how they work through to a resolution is an important indicator of operations experience.

[quote]
Pros and cons of Amazon versus Rackspace, configuration management & automation and cloud management solutions like Scalr and Rightscale… these and other skills are a important for a cloud deployment expert.
[/quote]

o What is puppet and chef?

Puppet is a configuration management system which allows ops teams to build templates for servers, and deploy many servers based on those templates. It further allows centralized control of configuration, to automate the management of a large number of servers.

Chef grew out of frustrations of Puppet, and is a sort of next generation configuration management system.

The term infrastructure as code may be thrown around. Since all cloud resources can be provisioned through API calls, everything in server deployment can be *theoretically* done via code, from spinup of servers, to installing packages, to configuring, code checkout, seeding databases and more.

o What are some of the pros and cons of configuration management for operations?

Pros include allowing a smaller team to automate the deployment of a large fleet of servers, standardization, and consistency. Cons include complexity when needing to do surgical, urgent changes, and complexity when coming into an existing environment that you’ve inherited.

o How is rightscale different? What does it provide?

Rightscale is a layer on top of your cloud provider. They provide a common interface and dashboard from which to deploy servers. Templating, automation, and multi-cloud support make it a great solution for teams that have less technical expertise on staff or less hands to manage things.

2. Day to day skills

o What type of programming experience do you have?

The answer is that every ops guy or girl should be able to code, just as every developer should have some basic operational experience. Should and does are often two different things, so ask for some examples.

o shell scripts

Bash, csh, Perl and Python are all part of the Linux administrators toolbox. Writing backup scripts, log rotation, automating routine tasks and so forth are all common needs of an operations expert.

Regular expressions are a part of Unix and used in scripting to search files, cronjobs, and ETL jobs. Ask for some basic examples.

o What is continuous integration?

The old model of code deployment was called waterfall, and allowed long careful planning, coding of new features, testing, and finally deployment. The cycle could take weeks or months and iterative change took a lot of time. Continuous integration also known as agile deployments, allows a much more frequent in some cases many times per day deployment of changes.

o What are metrics good for?

Just like in website visitor tracking, and business analytics, server level analytics and tracking is possible. Collecting server metrics such as load averages, memory, disk and cpu usage over time can be invaluable. When an application slows or server stalls, checking historical metrics can often quickly reveal problems or causes.

What are some examples? nagios, ganglia, cacti, munin, opennms

o What is unit testing?

This allows for software to be build in small testable compontents. When the compontents are coded, tests are also written that test whether they are operating properly, and whether dependencies are also installed and working.

[quote]
Metrics, monitoring, load testing, firewalls, security & patching, Saas, Paas and IaaS there is a wide swath of skills needed to be competent as a web operations engineer. You’ve got your work cut out for you!
[/quote]o What is load testing?

By performing some benchmarks, load testing can make estimates about how the application and code will perform when more users are hitting it.

o Security & networking

Sometimes a systems administrator is a generalized admin and sometimes there is a networking specialist on staff who doesn’t allow anyone else to touch that domain.

o What are firewall rules?

Unix services use port numbers to expose those services to the world. Since all servers on the internet are identified by IP addresses, firewall rules are defined around IP addresses or groups of them, and the ports they’re allowed to access.

o What is DNS?

DNS stands for domain name services. This is the sort of yellow pages of the internet. DNS allows a server name to be converted to it’s underlying IP address. It’s a very important service for any network, and generally includes many backup servers for when the primaries experience problems.

o What is a virtual private network?

A VPC provides a network link between a physical datacenter or your offices network, and your cloud provider. It allows you to elastically grow your existing datacenter using virtual resources, while treating those new boxes more like servers in your existing datacenter. IP addresses and subnets are controlled by your existing network rules and admins.

o Why is security important in web operations?

Since your business assets are primarily stored in digital form, the security of those assets depends on the security of your computer systems. Passwords, firewalls and encryption are all relevant.

o Why is patching software important?

Since security is a moving target, and vulnerabilities are constantly being discovered in software, patching and updates are important. Staying fairly current in applying patches means you network and systems will be more secure.

o What is intrusion detection?

Bugs in software open up vulnerabilities and ways into systems. Intrusion detection attempts to detect that such intrusions and avoid further damage.

o What is Saas – Software as a Service?

An example is dropbox, and other so-called hold-my-data type solutions fall into this category.

o What is Iaas – Infrastructure as a Service?

This is raw iron, the virtualized datacenters, hosting providers such as Amazon, GoGrid, Joyent, and Rackspace.

o What is Paas – platform as a service?

Solutions such as heroku, squarespace, wpengine and engineyard fall into this category. Some provide a platform such as the WordPress CMS, with arbitrary scaling options. Others like Heroku and EngineYard allow Ruby applications to be deployed without the need for a lot of fuss at the operational level.

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk averse operations team.

It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing starts.

If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?

Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability.

DevOps – Why can’t we all live together?

The DevOps movement is an attempt to bring all these folks together. For instance by providing insight to developers about the implications of their work on performance and availability, they can better balance the onslaught of feature demands from users with the business’ need for up-time.

Operations teams can work to expose operational data to the development teams. Metrics collection and analytics aren’t just for the business units anymore. Employing tools like Cacti, OpenNMS or Ganglia allow you to communicate with developers and other business units alike about up-time, and the impact of deployments on site availability, and ultimately the bottom-line.

Above all, business goals and customer needs should underscore everything the engineering team is doing. Bringing all three to the table makes for a more cohesive approach that will carry everyone forward.

How to spot a DevOps person – Finding the sweet spot

The DevOps person is someone with the right combination of skill, knowledge and experience that places him or her in the sweet spot where quality assurance, programming skills and operations overlap.
There are also a few distinguishing characteristics that will help identify such an ideal candidate.

Look for good writers and communicators

Imagine the beads of sweat forming when a developer tells you: “We’ve made the changes. Nothing is broken yet.”
This is like stepping on glass because it implies something will actually break. The point is savvy developers should be aware that the majority of people do not think along the same lines as they do.
Assuming your candidate has all the required technical skills, a programmer with writing skills tends to be better at articulating ideas and methods coherently. He or she would also be less resistant to documentation and be able to step back somewhat from the itty-bitty details. Communication, afterall is at the core of the DevOps culture where different sides attempt to understand each other.

Pick good listeners

Even rarer than good writers are good listeners. Being able to hear what someone else is saying, and reiterate it in their own terms is a key important quality. In our example, the good listener would probably have translated ‘nothing is broken yet’ into “the app is running smoothly. We didn’t encounter any interruptions but we’ll keep watch on things.”

Lean towards pragmatists and avoid the fanatics

We all want people who are passionate about something but when that passion morphs into fanaticism it can be unpleasant. Fanaticism suggests a lower propensity to compromise. Such characters are very difficult to negotiate with. Analogously in tech, we see people latch on to a certain standard with unquestioning loyalty that’s bafflingly irrational. Someone who has had their hand in many different technologies is more likely to be technology agnostic, or rather, pragmatic. They’ll also have a broader perspective, and are able to anticipate how those technologies will play together. Furthermore a good sense of where things will run smoothly and where there will be friction is vital.

Pay attention to extra-curricular activities

Look at technology interests, areas of study, or even outside interests. Does the person have varying interests and can converse about different topics? Do they tell stories, and make analogies from other disciplines to make a point? Do they communicate in jargon-free language you can understand?

Sniff out those hungry for success

As with any role, finding someone who is passionate and driven is important. Are they on-time for appointments? Did they email you the information you requested? Are they prepared and communicative? Are they eager to get started?

Hiring usually focuses on skills and very well-crafted resumés but why do you still find some duds now and then? By emphasizing personality, work ethics, and the ability to work with others, you can sift through the deluge of candidates and separate the wheat from the chaff for qualities that will surely serve your business better in the long run.

One of the great things about the Internet is how it has made it easier to put great ideas into practice. Whether the ideas are about improving people’s lives or a new way to sell and old-fashioned product, there’s nothing like a good little startup tale of creative disruption to deliver us from something old and tired.

We work with a lot of startup firms and we love being part of the atmosphere of optimism and ingenuity, peppered with a bit of youthful zeal – something very indie-rock-and-roll about it. But whether they are just starting out or already picking up pace every startup faces the same challenges to scale a business. Recently, we were reminded of this when we watched Inc’s video interview with Birchbox founders, Hayley Barna and Katia Beauchamp. Continue reading “Scale Quickly Like Birchbox – Startup Scalability 101”

Deploying new code that includes changes to your database schema doesn’t have to be a process fraught with stress and burned fingers. Follow these five tips and enjoy a good nights sleep.

1. Deploy with Roll Forward & Rollback Scripts

When developers check-in code that requires schema changes, that release should also require two scripts to perform database changes. One script will apply those changes, alter tables to add columns, change data types, seed data, clean data, create new tables, views, stored procedures, functions, triggers and so forth. A release should also include a rollback script, which would return tables to their previous state. Continue reading “5 Tips for Better Database Change Management”

Software development has always made use of libraries, off-the-shelf components that are shared between different projects. These allow you to stand on the shoulders of others and build bigger things. Frameworks do the same thing, they provide a context from which to build on. Ruby on Rails for example provides a great starting framework from which to build web applications, managing sessions in an elegant way. Continue reading “5 Scalability Pitfalls to Avoid”