Friday, November 21, 2008

You know what else Twitter is good for? Quotations. As my good friend and colleague Piyush Ranjan said:

Recession is [a] good time for innovation. Best innovation happens in crunch situations. Otherwise people just throw money at the problem!

At Cleartrip, as with any other company, we're always looking for ways to reduce our costs. Now with the travel industry in India hit hard, and the general economic slowdown, it's very interesting for us techies to innovate in ways that will help us cut costs.

One of the recent things we've been working on, and the stuff I want to talk about here, is the area of bandwidth savings. The benefits of bandwidth savings are usually twofold. Firstly and obviously, it reduces our bills. Secondly, it usually manifests itself as a better experience for users, since they don't have to wait as long for stuff to transfer from our servers to their browsers every time they move between pages.

Now, our site is fairly dynamic. Almost every page on the site (well, the significant ones at least) is so dynamic that it cannot be cached even for a couple of minutes. So, caching HTML doesn't suit us. What we can cache is everything else - images, CSS and JavaScript. Also, since we try to adhere to web standards as much as possible while being as Web 2.0 as possible (whatever that means), our use of CSS and JavaScript is pretty heavy.

Phase 1

So, a couple of months ago, we made our first set of optimizations. We noticed that our images change very rarely - some of them hadn't changed since we first made them. There was no point sending down copies of the images to our users frequently. These could be cached pretty aggressively. We decided on an arbitrary time of 1 month for the caching of images.

Next came the JavaScript and the CSS. This posed a considerable problem. We make JS and CSS changes very frequently on our site - as frequently as a couple of times a day. We couldn't afford to have them aggressively cached at the client end. We needed the agility to update our users' caches with the new versions of our code. Again, arbitrarily, we decided that the cache time for JS and CSS files would be 2 hours. That way, a file wouldn't be cached very aggressively, but it would stay in a user's cache for about the length of a usage session. At the same time, it would take at most 2 hours for our changes to propagate to all users.
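To make the Phase 1 policy concrete, here is a minimal sketch in Python of how a file's type could map to a cache lifetime. It is not our actual server configuration: the extension list, the function name `cache_control_for`, the choice of 30 days for "1 month", and the use of `no-cache` for everything else (our dynamic HTML) are all assumptions for illustration.

```python
import posixpath

ONE_HOUR = 60 * 60
ONE_DAY = 24 * ONE_HOUR

# Assumed mapping from file extension to cache lifetime in seconds.
# Images: roughly one month (taken as 30 days). JS and CSS: 2 hours.
CACHE_SECONDS = {
    ".png": 30 * ONE_DAY,
    ".gif": 30 * ONE_DAY,
    ".jpg": 30 * ONE_DAY,
    ".jpeg": 30 * ONE_DAY,
    ".js": 2 * ONE_HOUR,
    ".css": 2 * ONE_HOUR,
}


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for the requested path."""
    extension = posixpath.splitext(path)[1].lower()
    seconds = CACHE_SECONDS.get(extension)
    if seconds is None:
        # Anything else is treated as a dynamic page (assumption):
        # the browser must revalidate before reusing it.
        return "no-cache"
    return f"public, max-age={seconds}"
```

In practice this kind of rule usually lives in the web server's configuration rather than in application code; the function only shows the decision being made.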

So, with this configuration change rolled out (it's not at all hard to do if you understand caches well) we sat back and monitored our savings. Turns out, the savings were pretty amazing. Hrush, our founder, even blogged about how our data center providers were surprised to the extent that they said that it was "just not possible" that we have such low bandwidth consumption for the traffic we get!

The problems

How ironic that our data center guys said that, because just around that time, we were looking at how we could optimize bandwidth usage even more. And why did we decide to optimize further?

Because you can

Some of our files never changed at all - core JS libraries, for example. Why bother downloading them again after 2 hours?

Because this setup was hard to work with. Remember I said we patch stuff to production very frequently? Since a change takes two hours to propagate, a bug could stay broken for some users for two hours even after we'd put up the fix.

Because it gets very hard to make changes that have dependencies with other resources. For example, if the JS needs certain CSS changes, and certain markup changes, in what order do you put the changes up on the server?

This usually meant that we had to plan in advance and we would still have some backwards-compatibility cruft in our JS to manage this 2 hour transition. Sometimes it would take an entire day to make even a minor patch since it had to be batched with breaks of two hours. One quick-fix would be to rename the file and all references to it, but it's easy to see how painful that is to do on a file-by-file basis, not to mention that we'd run out of file names pretty soon!

Phase 2

One thing was for sure, though. If we could somehow manage the file name issue, we could solve all the problems I listed. That's because every time we made a patch with a different file name, it would instantly be available to all users, irrespective of their cache status. This gives rise to another interesting possibility: we didn't have to restrict ourselves to the 2 hour window anymore. We could potentially increase our cache expiry time to 20 years for all it matters. That should give us MUCH more savings than we get currently, and our users will download as little as necessary.

Except, we didn't know yet how to solve the file name problem gracefully. Ideally, there shouldn't be a human being deciding which files have changed, what the file names are, and going all over the place and changing them. That would make it tedious and error prone, not to mention boring. Then of course you have to think about how these file name modifications would work with source control. You don't want to mess up your clean source code management history.

The solution to the source control mess was rather easy. We fake the file name. Here's what we do. We take a file name and append a random number to it. This makes the client believe that the file name has changed, so it requests the file from the server again. Meanwhile, at the server, we have a rewrite rule that transforms the file name into something that maps to a real file on our server. Sounds simple enough. We tried it, and it worked like a charm.
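Here is a rough sketch of the fake file name trick in Python. The exact naming scheme is an assumption (this post doesn't spell out where the number goes); the sketch puts it just before the extension, so `/js/search.js` becomes `/js/search.4211.js`. The function names `add_version` and `strip_version` are made up for the example, and `strip_version` stands in for whatever rewrite rule the web server applies.

```python
import posixpath
import re

# Matches paths like /js/search.4211.js and captures the parts around the number.
_VERSIONED = re.compile(r"^(?P<stem>.+)\.(?P<version>\d+)\.(?P<ext>js|css)$")


def add_version(path: str, version: int) -> str:
    """Build the fake name sent to the browser, e.g. /js/search.4211.js."""
    stem, ext = posixpath.splitext(path)
    if not ext:
        raise ValueError(f"path has no file extension: {path!r}")
    return f"{stem}.{version}{ext}"


def strip_version(path: str) -> str:
    """Map a fake name back to the real file, like a server rewrite rule would."""
    match = _VERSIONED.match(path)
    if match is None:
        # Not a versioned name; serve the path as requested.
        return path
    return f"{match['stem']}.{match['ext']}"
```

One caveat with this particular pattern: a real file whose name already ends in a number before the extension (say, a library with its version in the name) would be rewritten too, so such files need a different pattern or an exclusion.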

Now, to crack the real problem - how do we generate these numbers in a sensible way? The number essentially had to be such that it would never repeat (at least for any given file), it would be global in that two pages shouldn't be using different numbers for the same resource, and when the number changes it should instantly reflect site wide. Now that we knew how the number should behave, the quest was on to come up with a mechanism to generate these numbers.

The solution

After a lot of thinking, we had a shameful duh moment, and it suddenly all made sense. We didn't need to invent these numbers at all. We just needed to use our source-control revision numbers! The revision numbers match all the characteristics of the number we want. Why bother with complex systems to generate and track numbers when one was already available, even if well disguised?
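As an illustration only (this is not how we actually wired it up), here is a small Python sketch of the "one global number" idea. It assumes a single shared object holds the current source-control revision and that every page builds its static URLs through it; the class name `AssetVersion`, its methods, and the placement of the number before the file extension are all hypothetical.

```python
import posixpath


class AssetVersion:
    """Holds the one revision number that every page uses for static URLs."""

    def __init__(self, revision: int) -> None:
        self._revision = revision

    @property
    def revision(self) -> int:
        return self._revision

    def update(self, revision: int) -> None:
        # Called once per deployment (assumption); every URL built after
        # this point carries the new number.
        self._revision = revision

    def url(self, path: str) -> str:
        """Return the path with the current revision placed before the extension."""
        stem, ext = posixpath.splitext(path)
        if not ext:
            # No extension to anchor on; leave the path untouched.
            return path
        return f"{stem}.{self._revision}{ext}"


# One shared instance, so two pages can never disagree about the number.
assets = AssetVersion(revision=4211)
home_page_script = assets.url("/js/search.js")
results_page_script = assets.url("/js/search.js")
```

Because the number comes from source control, a new commit always yields a number that hasn't been used before, and calling `assets.update(...)` with it changes every static URL on the site at once.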

I'll save you the implementation details about how we made this available site wide, and how we made it possible to have instant global changes to this number. That definitely wasn't the tough part, and I'm sure you'll figure out the details. Who knows, maybe Piyush might just release a plugin for Rails to do it automatically for you. However what surprises me is that it's very hard to find such gems of knowledge on the net. I'm now beginning to think that maybe this design pattern should be used for distribution of all static resources on the web. We're definitely not the first to invent this pattern - why is no one else talking about it?

We've only just rolled this out on cleartrip.com and are still to get a decent sample of data to see how it has impacted our bandwidth consumption. But any fool could guess that our bills should reduce significantly with this change.

One of the things that is interesting is that project managers, traditionally, are brought on because you have a team of yahoos - and this is just as true in construction, or in building an oil rig, or in any kind of project as it is in the making of anything - making a new car at General Motors, or designing the new Boeing 787 Dreamliner - as it is in the software industry. Project managers are brought in because management says: "Hey, you yahoos! You're just working and working and working and never get the thing done and nobody knows how long it's going to take." If you don't know how long something's going to take and you can't control that a little bit then this really sucks from a business perspective. I mean, if you think of a typical business project - you invest some money and then you make some money back. The money you make back - the return on investment - might be double the amount of money you invest and then it's a good investment. But if the investment doubles because it took you twice as long to do this thing as you thought it would then you've lost all your profit on the thing. So it's bad for businesses to make decisions in the face of poor information about how long the project is going to take and so keeping a project on track and on schedule is really important.

It's so important that they started hiring people to do this and they said: "OK, you're the project manager - make sure that we're on track." These project managers were just bright college kids with spreadsheets and Microsoft Project and clipboards. They pretty much had to go around with no authority whatsoever and walk around the project and talk to the people and find out where things were up to and they spent all their time creating and maintaining these gigantic Gantt charts - which everybody else ignored. So the Gantt charts, and the Microsoft Project files, and all those project schedules, and all that kind of stuff, was an artifact created by a kind of low level person. It might be accurate, depending on how good that low level person was, but it was still an output only thing from the current project: Where are we up to? What have we done? How much time have we spent? What's left? Who is working on what?

Then, for some reason, these relatively low level people, who were not actually domain experts (if they were at Boeing they didn't know anything about designing planes, if they were on the software team they weren't programmers - they were project managers, and they didn't know anything about writing code), started getting blamed when things went wrong and they started clamoring for more responsibility, more authority to actually make changes and to actually influence things and say: "Hey, Joe's taking too long here - we should get Mary to do this task, she's not busy." The truth is that they started getting frustrated because they were low level secretarial-like members of their teams and they wanted to move their profession up the scale so they created the Project Management Institute - or whatever it's called - and they created this thing called... ah, I don't even know! But they created a whole professional way to learn to be a professional project manager and they decided to try to make it something a little bit fancier than just the kid with the clipboard that has to maintain these Gantt charts all day long. You can tell this is what happened because the first thing project managers will tell you about their profession is that the most important thing is that they have the authority to actually change things and that they are the ones that actually have all the skills that can get a project back on track, or to keep a project on track, and therefore they need to have the authority to exercise these skills otherwise they'll never get anything done, they'll never be able to keep the project on track, they don't just want to be stenographers writing things down.

The trouble is, they don't actually have the domain skills - that's why they are project managers. If you are working on a software project you know how to bring it in on time and you've got to cut features, and you know which features to cut, because you understand software intrinsically and you know what things are slow and what things are fast and where you might be able to combine two features into one feature, where you might be able to take a shortcut. That's the stuff a good developer knows, that's not the stuff a project manager knows. In a construction project it's the architects and the head contractors who know where shortcuts can be taken and how to bring a project in on time, not the project manager. The project managers don't have any of the right skills to affect the project and so they inevitably get really frustrated and everybody treats them like secretaries, or treats them like 'annoying boy with clipboard', when they really don't have a leadership role in the project - and they're not going to be able to because they don't have the domain expertise. No matter how much they learn about project management, no matter how many books they read, or how many certificates they get, no matter how long they've been doing project management: if they don't know about software, and software development, if they don't have that experience, they are always going to be second class citizens and they're never going to be able to fix a broken project.