Adam Wiggins on Building Heroku on Top of Amazon EC2


Bio Adam Wiggins is an entrepreneur, Rubyist, open source advocate, and - let's be honest here - a serious bad-ass. His current venture is Heroku, the instant deployment platform for Ruby and Rails. He blogs about technology and business at: http://adam.blog.heroku.com/

QCon is a conference that is organized by the community, for the community. The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.

We have no physical servers, and this is the first company where that's been true. I have started a bunch of companies over the years, and I have spent a lot of my career in colocation facilities with screwdrivers and rackmount servers. While it was fun at the time, I am glad we have been able to do something different this time. It was a happy coincidence that the day we started the company was pretty much the day Amazon EC2 went into public beta. We said, let's give it a try; starting a new company, it's easy to do that, because no one is depending on you yet. We always assumed that we would need to get our own servers at some point, or run a mix, or switch completely, but it worked so well that we never looked back.

While there are still weaknesses in what you can get out of cloud infrastructure compared to traditional systems, the benefits are so massive that I feel they more than offset them. So we have no physical servers, but we're one of Amazon's largest customers: we have a huge quantity of instances provisioned through them, and we have built a lot of management tools to manage those. The way we think about managing our systems now, the way we do ops, is all software based. We can automate things; it's all about building images and how you spin those up quickly. Capacity planning is different, because you can scale up and down so quickly as needed. It's a whole different world, I am really enjoying it, and I think it definitely points the way to where everyone is going to be in the future.

Sure, I think of cloud, in its simplest form, as virtualization as a service, so all the cons of virtualization are ones you have in the cloud as well: you are cutting your server's resources into smaller pieces. That's actually OK for most purposes, but the one place that tends to be pretty weak is I/O, so certain kinds of I/O-intensive operations, with SQL databases being one of the main areas, can be a challenge to optimize, though there are tricks for that. Luckily one of my co-founders, Orion Henry, has come up with some pretty crazy magic for making databases fast in the cloud.

But aside from a few cons like that, the benefits are just massive: not having to think ahead and plan out six or eight or ten weeks for what servers you need, so that you can order them, get them installed, and get them into the colo. Once upon a time, I think six weeks of lead time was no big deal; what is going to happen in six weeks? But we are in an era of web scale now, and the speed with which you can go from zero to a hundred thousand is astonishing.

I especially see that with our customers, usually with something like a social media application, a Facebook application, things like that. It has happened over and over again: someone builds an application in two days, puts it on a platform, and goes from zero traffic and zero users to millions and millions of page views a day. One app in particular went to almost a million users in about a week; it was just incredible. And they built it in two days. It used to be that spending six weeks planning out your capacity wasn't a big deal, because you were spending years building your software and it took months and years to grow. But now that you can build software so quickly, and grow so quickly, suddenly that six weeks is a horrible anchor dragging you down.

So there is the capacity planning aspect of being able to be really dynamic, expanding your capacity as you grow or scaling it down as you shrink. If you have a seasonal business, you get a sudden spike up and then need to scale down afterwards; everyone needs to deal with that, so having that capability is great. Another huge benefit I want to mention is what I am going to call the transient nature of cloud resources. I am actually really glad we started on EC2 towards the beginning, because instances used to be less reliable then. That may sound like a con, and we did see it as a con at first: there was no persistent disk store (EBS was only introduced later), and servers went down a lot.

And of course that is part of the contract with them, or not the contract exactly, but the way it works: you can lose a server at any time. The servers are transient, and the data on them is transient. So we had to architect our cloud right from the start to deal with that. We built something that was self-healing, with no single points of failure. It taught us not to build something where you set up a server, get all the files in the right place, and then leave it alone for three years: don't touch it, no one jiggle it! Instead it gets jiggled every single day, so you need to build something that's really resilient to that, where all the state for a server is built out from templates, from the recipes that describe how the server is to be configured.

So because we built our cloud on this very transient infrastructure, we have something that is really solid in terms of dealing with the fact that there is always going to be a time when you lose a server. It's going to start misbehaving, and we've built something where that is just no big deal: we go into our little operations panel, we click the little x next to the server, it deletes itself out of the list, and everything, all the routes, magically adjusts. So the transient nature of it seems like a con at first, but I would list it as a benefit.
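The "click the little x" idea above can be sketched in a few lines of Ruby. This is not Heroku's actual code, just a hypothetical illustration of the principle: routing state is always derived from an instance registry, never hand-maintained, so decommissioning a misbehaving server and rebuilding every route is a single operation.

```ruby
# Hypothetical sketch: servers are transient, so the routing table is
# recomputed from the registry rather than edited by hand.
class InstanceRegistry
  def initialize
    @instances = {}   # id => { app: ..., address: ... }
  end

  def register(id, app:, address:)
    @instances[id] = { app: app, address: address }
  end

  # The "click the little x" operation: drop the instance; callers
  # then re-read routes, which are derived from what's left.
  def decommission(id)
    @instances.delete(id)
    routes
  end

  # Routing table as derived state: app name => list of addresses.
  def routes
    @instances.group_by { |_, info| info[:app] }
              .transform_values { |pairs| pairs.map { |_, info| info[:address] } }
  end
end

registry = InstanceRegistry.new
registry.register("i-1", app: "blog", address: "10.0.0.1")
registry.register("i-2", app: "blog", address: "10.0.0.2")
registry.register("i-3", app: "shop", address: "10.0.0.3")

registry.decommission("i-2")   # i-2 started misbehaving
puts registry.routes.inspect   # {"blog"=>["10.0.0.1"], "shop"=>["10.0.0.3"]}
```

Because nothing here is a single point of failure by design (any instance can vanish), the same mechanism handles both deliberate decommissioning and an instance that simply disappears.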

I would have to say that cloud is really a panacea that just solves everything. I am just kidding. It's easy to do that; people get excited in technology. We see the next new thing coming around the horizon, we recall the last time the next thing came around, and we remember how it really revolutionized the way we did things: it reduced costs, it reduced our stress, it let us grow more, it let us do cool, fun things. I am glad that everyone gets excited about the next thing when they see it coming around the corner, and cloud is a great example. Everyone is talking about it, but it has just begun its adoption for serious use. I do really believe it holds huge promise.

Obviously I started one of the very first companies that could carry the title "cloud", because I saw that and because I believed in it. That's even more true today, as more and more companies get into it, the market matures, and people start to put serious applications on it. That said, there is a lot to be done to fully take advantage of it. It's always the case with a new technology: people see its promise, then try to adapt their old applications and old processes to it, and maybe the fit isn't great, but they push on through anyway. There is a pretty long arc in the time it takes for a technology to become fully articulated and for those benefits to really come out.

So I think in the next year, two, three years, we are really going to see it start to come into its own. But it's not "take exactly what you currently have, transplant it into the cloud, and now everything is great"; it's a different way to think about things. Your applications need to be developed a bit differently, and in the underlying software infrastructure (your database and memcached, other caching layers, your message bus, all that), certain things that worked really well in a traditional environment don't work so great here. But there are new projects coming along that are designed to take advantage of the horizontal scalability, the transience, and the shardability that you get in the cloud world.

I have been surprised how quickly that is happening, maybe because everyone is so excited about it. Honestly, if you have a huge, sprawling enterprise application, I wouldn't necessarily say you should start porting it over today; looking for small slices to bring over where it makes sense is a good idea. But if you are starting a new application, you should be thinking about cloud, because if you start today on traditional stuff, by the time you are ramped up you are going to be behind the curve.

The big one by far is EC2. EC2 is a simple interface for creating, running, and destroying server instances. In that respect it is the basic thing you would expect of any cloud computing provider, including an internal one; a lot of companies are building cloud departments to build their own internal cloud products, and that should be the core of what they offer. So that is really our focus. The other one, EBS, is handy if you happen to write data onto disk and need it to stay around, although as I said, we built our cloud originally not to need that at all, not to rely on any data stored permanently to disk.
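The "create, run, destroy" interface Wiggins describes can be sketched as a tiny Ruby class. This is an in-memory stand-in, not the real AWS API; with the aws-sdk gem you would make the equivalent RunInstances, DescribeInstances, and TerminateInstances calls, but the shape of the interface is the point here.

```ruby
# Hypothetical in-memory stand-in for the minimal cloud interface:
# boot instances from an image, inspect them, destroy them.
class FakeCloud
  def initialize
    @next_id = 0
    @instances = {}
  end

  # Analogous to EC2 RunInstances: boot `count` instances from an image.
  def run_instances(image_id, count: 1)
    Array.new(count) do
      id = "i-#{@next_id += 1}"
      @instances[id] = { image: image_id, state: "running" }
      id
    end
  end

  # Analogous to DescribeInstances (raises KeyError if unknown).
  def describe(id)
    @instances.fetch(id)
  end

  # Analogous to TerminateInstances: the instance, and any data on its
  # local disk, is simply gone.
  def terminate(id)
    @instances.delete(id)
  end
end

cloud = FakeCloud.new
ids = cloud.run_instances("ami-myapp", count: 2)
puts ids.length                        # 2
cloud.terminate(ids.first)
puts cloud.describe(ids.last)[:state]  # running
```

Note that `terminate` removes the record entirely: modelling the instance's disk as vanishing with it is exactly the transience the interview keeps returning to.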

The one that I really like, and I think its benefits are kind of subtle compared to something like EC2, is one we are using more and more, and I definitely recommend everyone use it: asset stores. Amazon's asset store is S3. It's basically a key-value store for large binary data, where they take care of it for you. Large binary data is a hard problem, but they are solving it once for everyone. There are similar products as well, like Rackspace's Cloud Files, which is up and coming, so it's not specifically S3 that I am advocating, although it's obviously a great service.

But in general, asset stores are a great way to solve something that is always a point of pain. People get binary data, whether it's images or PDFs or something like that, and throw it on the file system, but file systems don't scale. Or they throw it in the database as a binary blob, but SQL databases are just not designed to handle binary data. OK, so what do I do? Asset stores, to me, are really the ideal solution to that problem. I encourage everyone to use them, even for a small application.
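The asset-store idea is just an opaque key-value interface over binary blobs. The sketch below is a toy in-memory implementation, purely illustrative: in production the same three-method interface would be backed by S3 (for example via an AWS SDK gem) or Rackspace Cloud Files, rather than a Hash.

```ruby
# Hypothetical asset store: keys map to opaque binary blobs.
# MemoryAssetStore is an in-memory stand-in for S3/Cloud Files.
require "digest"

class MemoryAssetStore
  def initialize
    @blobs = {}
  end

  # Store bytes under a key; return a checksum the caller can verify.
  def put(key, bytes)
    @blobs[key] = bytes.dup
    Digest::SHA256.hexdigest(bytes)
  end

  # Fetch the bytes back (raises KeyError if the key is unknown).
  def get(key)
    @blobs.fetch(key)
  end

  def delete(key)
    @blobs.delete(key)
    nil
  end
end

store = MemoryAssetStore.new
pdf_bytes = "%PDF-1.4 ...fake bytes..."
store.put("uploads/report.pdf", pdf_bytes)
puts store.get("uploads/report.pdf") == pdf_bytes  # true
```

The application only ever sees keys and bytes, which is why swapping the backing service (S3, Cloud Files, or local disk in development) doesn't touch application code.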

No. One of the things we try to emphasize is that Amazon is a vendor we use, but we don't expose that in any way to our users. We are offering a platform that is a standard Ruby stack, a standard Linux place to run your application, and a standard SQL database. We use Amazon because they have the best, most mature cloud offering as of now; that may change in the future, and we might choose to change cloud providers, mix in traditional hardware, or use some kind of heterogeneous mix.

And that should be totally invisible to our users; they are a vendor to us. In the same way that what kind of steel a car manufacturer used certainly affects your experience of the car, you don't specifically care where they bought it; you just care that you get a good end product. So we treat Amazon the same way: while I extol their virtues as much as possible, in terms of what our product is, we abstract that away.

That's a great question. I think we are one of the few companies that is both a consumer of a cloud product and a provider of a cloud product: we consume infrastructure as a service and turn around and provide a platform as a service. That's interesting just because you have two very new areas, basically two new layers, combining. In terms of challenges, I think more in terms of enabling factors: one of the reasons we are able to provide a really new kind of platform is that the core infrastructure we rely on works in a new way. We wouldn't be able to provide this product, and have it look exactly the way it looks now, if we didn't have access to this stuff.

I suppose the challenge is that we are definitely still at the cutting edge. Obviously people don't want a platform that is anything other than very stable, reliable, and rock steady; we have worked very hard on that and have achieved a lot of success. But at the cutting edge of technology there are some challenges: some things are changing a lot, and we need to keep a handle on that in a way that keeps our users' applications running without a hitch.

That is a really interesting topic. The way our operations are different in this venture from past ventures I have worked on is itself pretty interesting: we use cloud services rather than our own colocated hardware. Then you go up an even higher level of abstraction: part of the benefit you get with Heroku is a sort of outsourced sysadmin, an automated system that does a lot of the rote system administration tasks for you.

But I think the role of ops people is just as important as ever; it's just going to shift where their attention is. Management becomes less about "let me call my colo facility and send my guys down there to shuffle some hardware around" and more about "now I am managing this from software; I have control panels, I have ways to get overviews of everything that is going on, system activity and monitoring." When a server misbehaves, it's a matter of "OK, call this API to do something different with the server" rather than sending someone somewhere to take care of it. It moves everything up to a higher level.

And I think that is going to be really empowering for ops and sysadmin people, because you can deal with things at a larger scale, you can manage more at the same time, and you can think at a higher level of abstraction. It becomes a higher-level managerial role rather than an in-the-trenches, "let me get out my screwdriver" kind of role.

Obviously monitoring and management are key to keeping and maintaining a running system. Internally we have built a huge amount of that ourselves, as well as using existing tools. It is sometimes a challenge to use what are considered standard ops management or metrics tools, like Nagios; a lot of times they have built-in assumptions, such as that the IP address of a server is going to stay the same for a reasonable amount of time.
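One way around the fixed-IP assumption Wiggins mentions is to key monitoring on a stable instance ID and resolve the (transient) address at check time. The sketch below is hypothetical; `resolver` stands in for whatever inventory API your cloud provider exposes.

```ruby
# Hypothetical monitor that never stores an IP address in its config:
# addresses are looked up from an inventory at every check, so an
# instance that moved, or vanished, is handled gracefully.
class Monitor
  def initialize(resolver)
    @resolver = resolver   # callable: instance_id -> current IP, or nil
    @watched = []
  end

  def watch(instance_id)
    @watched << instance_id
  end

  # Returns a status per watched ID; a nil address means the instance
  # is gone, which in a transient cloud is normal, not an alert storm.
  def check_all
    @watched.map do |id|
      ip = @resolver.call(id)
      [id, ip ? "checking #{ip}" : "gone"]
    end.to_h
  end
end

# Toy inventory; in real life this would query the provider's API.
inventory = { "i-7" => "10.1.2.3" }
monitor = Monitor.new(->(id) { inventory[id] })
monitor.watch("i-7")
monitor.watch("i-8")            # already terminated
puts monitor.check_all.inspect  # {"i-7"=>"checking 10.1.2.3", "i-8"=>"gone"}
```

The design choice is the indirection itself: tools like Nagios that bake an IP into a host definition fight the cloud, while tools that resolve identity to address at check time work with it.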

And we find that it can be a challenge to adapt those things, so we use them where we can, but we have also built our own stuff in a lot of cases. We find that the higher level of abstraction you get from a cloud service actually makes building that stuff a lot easier than you might think. Building a bunch of monitoring tools from scratch is a lot of work, and it is, but it's less work than you might think because of the power of the cloud, which is definitely nice.

So we have a huge number of those tools for graphing, looking at our server instances, and getting access to them. We also run multiple clouds: we have our production cloud and a staging cloud, which has a number of users, including some really large-traffic ones, where we can deploy changes and look for any problems. We can manage and monitor it in exactly the same way, because these tools are designed to deal with multiple clouds of servers. And then developers can spin up their own individual clouds and run those, and do the same kind of load testing with the exact same tools we use in production, which in itself is really powerful.

We certainly encourage people to use external services; Pingdom and many others offer lightweight external services for keeping track of what's going on with your site. We also have internal tools that give you access to logs and insight into other things like traffic, and then we have a bunch of add-ons. New Relic is one of my personal favorites, a great performance monitoring tool that installs as an add-on, as well as exception monitoring tools and other things like that.

But I think probably one of the biggest challenges, converging both of those things, is that when you are running a platform and an application has an error, whose fault is it? It's not an easy question. Sometimes the user didn't run their migration, or they have a bug in their code, and you need to make sure that is displayed to them in a way that says, "This is your error; you need to take action on it."

But at the same time, occasionally things go wrong in our system, and in that case we need to make sure they know: "Don't worry, this wasn't your fault, we've been notified, here's what's going on, just sit tight and we'll take care of it." Teasing those two apart is actually very tricky. I think we've done a pretty good job so far; we still need to work on it, but it's been a really interesting part of building the platform. New Relic is an add-on, so you just type heroku addons:add newrelic; you can choose from the different plan levels, and I think you get the bronze level for free, since we've got a deal with them. That just pops it right in. Or you can do it yourself if you want: install the plug-in and go get an account with them. But certainly the add-on makes it much, much easier.

The web-based IDE was part of our original vision for the platform, which was very all-encompassing. In the process of building it, we discovered pretty quickly, within a few months, that people were a lot more excited about the deployment aspect. The way we actually discovered this is that there was an import tab in the web editor, "Upload a tarball of your code". We intended it to be a one-time operation: you import an app, it's there, and then you just edit it in the web editor. What we discovered is that lots of people would write a command-line command to tar up their whole app and do a REST POST to push it into this form, because what they loved was that they could just push their app up and have it running. But they still wanted to use their local tools: Eclipse or TextMate or whatever, their local revision control, all that. They just liked the idea of a platform where they could put their code and it just runs.

And I think it's always the case when you start a start-up that you need to discover what the market really wants, where the real thirst is. It turned out that while a lot of people loved our web-based editor (it's still running, and people do still use it), there was a lot more interest in the deployment aspect. So we got really focused on the production side, the scaling side, and all that kind of stuff, and switched to a model that is more API driven: you push code up with git, which is a more elegant version of what those people were doing with manual tar-ing. We still have the web editor running at herokugarden.com; it's mentioned in the first pages of the O'Reilly "Learning Rails" book, and it's a great way to just get started without having to deal with setting up a local environment. But we really encourage people who are able to learn version control and use their own local tools to use Heroku.com, the production platform, which is our product.

Yes, absolutely, that process is crucial, and the bigger your site is, the more that's the case. That's actually one of my favorite aspects of using Heroku as a developer: you can make exact duplicates of your application. So you can have one that serves as an integration or prototype system, then another app set up as a git remote, which might be staging, and another which is production, and that is maybe where your main domain name points.

And what you can do is set different permissions on each one of those. All the developers have access to the integration application, so they can push stuff out and stakeholders can look at it. Then someone else, someone who owns the production deployment for example, is responsible for looking at the difference between that and staging: "We've got new features, let me review these", and now you are getting more serious QA there. When that is ready, you pull from staging and push to production. It's all a very smooth workflow, and you are guaranteed that each of those applications will have exactly identical setups.

And it's often the case that no matter how big and well organized an organization is, its staging servers are just so often a little different, and the further back in the chain you go, particularly to developer setups, the less of a good view developers have of what the code will be like when it runs in production. The way Heroku works, it's very easy to set up a new app and push out to it, so any developer can have their own running version and see exactly what it's going to be like in production. I think that's one of the main benefits of the cloud, and certainly of a platform like Heroku: the ability to have as many environments as you need, all identical to each other.

As a practitioner I have been suggesting deployment templates instead of using a service like Heroku. Coming from an enterprise background, I believe (maybe wrongly) that enterprises are better off running directly on AWS EC2 instead of something like Heroku, which:

1) Is Rails-specific only
2) Doesn't cater to the spectrum of stacks enterprises use
3) Seems better suited for startups and SMEs, perhaps with just a RoR stack

Also, I have come across requests to use something like Heroku over a VPC. Incidentally, Ruby deployment templates are the subject of my next book, "Amazon Cloud Computing with Ruby".

You're correct: using a platform-as-a-service provider only makes sense when the stack of runtime technologies the provider offers matches what you're using for your app.

For Google App Engine, this means Python or Java, and Bigtable. For Microsoft Azure, this means C# and SQL Server. For Heroku, this means Ruby and PostgreSQL.

As existing PaaS providers expand the choices available in their technology stacks, and as new providers appear on the market, the result will be a proliferation of choices for language, database, and other components in the technology stack.

InfoQ is best known as a debate platform between practitioners, and in that spirit I would like to present my closing arguments.

When we look at something like AWS and then at GAE or Azure, there is a big difference in how companies would go about adopting them. If a billion-dollar company adopts Azure, Heroku, and GAE wholesale, it in effect puts the focus on technology, whereas we believe technology is an enabler, a means to achieve business ends. That step brings the focus back onto IT and technology. It then affects the company in turn, dividing it into an organization with teams focused on particular technologies, and so the corollary of Conway's Law kicks in: the structure of the organization reflects the structure of the code/system and vice versa, which is acknowledged to produce substandard results.

AWS adoption, on the other hand, is more of an operational process of integrating an external cloud with internal clouds/systems, and the focus stays away from technology and on business and operations. It also doesn't divide a company along technology lines, which I find very attractive from the perspective of a corporate and technology strategy consultant.

Though I acknowledge that the success of the Heroku platform is second to none, I wanted to present my points so that readers can understand the implications of these choices.

If you want to take advantage of these benefits, you have to accept that you will need to live with a few restrictions. I think this is where corporate environments start to find things difficult: corporations tend to take defensive positions and like to maintain complete control, even if it means spending millions of dollars to do so.