Today is the 10th anniversary of Ubuntu’s first release. This is slightly nostalgic as I was employed by them at the time. I was actually the very first employee of MRS Virtual Development (the legal name of the entity at the time), Robert Collins (mentioned in the above-linked article) being the second. The first two things Mark wanted in his company were a good bug tracker and a good source control system. My involvement with Mozilla as a volunteer at the time is actually how I came to be involved with it, as I’d been heavily involved with the Bugzilla bug tracking system at that point. Robert had been heavily involved with GNU Arch, an up-and-coming source code management system which was eventually forked by Canonical to become Bazaar.

The biggest thing I remember about my time working for Canonical (as the corporate entity eventually became known) was that I spent 2 weeks at a time in London approximately every 2 months. I spent almost a quarter of that year in London. This, of course, was pretty hard on me, since it meant being away from my family so much.

Although I thought the Ubuntu OS was a fantastic idea, and loved the way it was being built (and I still use it to this day on most of my computers), in the end, I didn’t really fit in well with the other people working on it. Almost everyone else Mark had hired to work on it came from a Debian background, which I had had almost zero involvement with prior to this experience, and the culture was very different from anything I’d ever dealt with before. I much preferred the culture among the volunteer community at Mozilla. Fortunately, Firefox was released about 3 weeks later, and the Mozilla Foundation suddenly had money as a result. I left Canonical and was hired by Mozilla the day before Firefox 1.0 was released, and I am still at Mozilla today. This means I will also be celebrating my 10th anniversary working at Mozilla in just a few weeks.

Much of this post is taken from a message I posted to the developers list a few days ago, so my apologies in advance to anyone reading it again. I’ve expanded on a few things and added the information about the upcoming meeting, so it’s probably worth re-reading.

To Frédéric: Thanks again (and again!) for all your hard work over the years! As stretched as I’ve been for time myself it has been a true godsend to have you picking up my slack the last few years. You will be missed!

To everyone else: For the time being, I’ll be handling approval requests, so if you have something up for approval and it’s not getting attention, I’m the one to pester.

This is sort of the end of an era for the Bugzilla project… Both Frédéric and Max (who left to work at Google a couple years ago and stepped down from his position earlier this year for lack of time) have been with the project for much longer than most people ever stick with a single employer in IT-related jobs (to which an open source project of this magnitude has a lot of similarities). For an open source project, that’s outright amazing, as people tend to come and go a lot in most projects. It’s kind of surprising that I’ve been around longer than them, but I’m kind of a “lifer” in some ways, and in reality I’ve had a good break from the project for the last few years because Max and Frédéric have been mostly taking care of everything while I’ve been busy with other things.

So it’s time to begin a new era. Since I’ve had a good break to clear the monotony I’m going to be trying to get more involved myself again (which I’ve been saying ever since Max left, but I have a lot more incentive now). I’d also like to kickstart a new team to lead the project, and kind of re-organize if you will. We have a number of positions within the project for various functions, which we’ve never really paid attention to as people moved on. So some questions we’ll be asking at our upcoming meeting are:

What positions in our existing structure do we have open?

Do we still need them all?

Are there new positions we have a need for that we should create?

Who should fill them?

Do the existing holders of positions that we still need, and that haven’t been vacated, want to keep them?

We also have some other “reinventing the project” type topics while we’re at it. There are a number of things we’ve been talking about doing for a long time that we never really moved on, and some of the big elusive dreams (the big UI overhaul!) have actually been making progress as well, lately. When we’re in the middle of big changes like this, I think it’s a good time to review where we are, get everyone on the same page, and tackle some of these things we keep talking about.

We also have a lot of new useful technology at our disposal since the last time we had a project meeting. We’re going to experiment with using Google Hangouts for the meeting this time, and using their feature to stream the Hangout via YouTube for those who want to watch without participating. We’ll also keep our usual meeting IRC channel open so people who don’t have a Google+ account and don’t want to get one can still participate and ask questions via IRC.

The preliminary agenda and participation instructions have been posted at https://wiki.mozilla.org/Bugzilla:Meetings. The meeting will be held on Wednesday, July 17th, at 14:00 UTC. And before anyone complains about the time, this was the best time to avoid inconveniencing the largest number of people. The Bugzilla Project has a global pool of contributors, and we have active contributors in a wide variety of time zones. The time chosen puts the meeting in the middle of the night for the fewest number of people. Those on the west coast in the US will probably have to get up a little early, and those in eastern Australia will be up a little later.
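For anyone who wants to check the time zone arithmetic, here is a minimal sketch that converts the 14:00 UTC start into a few local times. Everything beyond the date and time is an assumption: the post doesn't state the year (2013 is used only because July 17 falls on a Wednesday that year), and the regions use fixed summer UTC offsets rather than real time zone rules.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the year is not given in the post. 2013 is used here only
# because July 17 is a Wednesday in that year.
MEETING_UTC = datetime(2013, 7, 17, 14, 0, tzinfo=timezone.utc)

# Assumption: fixed mid-July UTC offsets for a few illustrative regions.
OFFSETS = {
    "US west coast (UTC-7)": timedelta(hours=-7),
    "US east coast (UTC-4)": timedelta(hours=-4),
    "Central Europe (UTC+2)": timedelta(hours=2),
    "Eastern Australia (UTC+10)": timedelta(hours=10),
}


def local_times(meeting_utc, offsets):
    """Map each region label to the meeting time expressed in that offset."""
    return {
        label: meeting_utc.astimezone(timezone(offset))
        for label, offset in offsets.items()
    }


if __name__ == "__main__":
    for label, local in local_times(MEETING_UTC, OFFSETS).items():
        # Prints the local weekday and 24-hour time for each region.
        print(f"{label}: {local:%A %H:%M}")
```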

A lot of the emails and comments I’ve gotten since Frédéric’s announcement have been really positive, so I’m encouraged by the number of people who are still committed to keeping Bugzilla vibrant! We’ll see you at the meeting on Wednesday!

It’s been a few years since I’ve posted anything here… Mozilla IT has a blog now, and most of my work-related posts have been going there. Most of my personal stuff has been going on Facebook or Twitter. Just the way things go as technology evolves I suppose.

When an animal dies that has been especially close to someone here, that pet goes to Rainbow Bridge. There are meadows and hills for all of our special friends so they can run and play together. There is plenty of food, water and sunshine, and our friends are warm and comfortable. All the animals who had been ill and old are restored to health and vigor. Those who were hurt or maimed are made whole and strong again, just as we remember them in our dreams of days and times gone by. The animals are happy and content, except for one small thing; they each miss someone very special to them, who had to be left behind.

They all run and play together, but the day comes when one suddenly stops and looks into the distance. His bright eyes are intent. His eager body quivers. Suddenly he begins to run from the group, flying over the green grass, his legs carrying him faster and faster.

You have been spotted, and when you and your special friend finally meet, you cling together in joyous reunion, never to be parted again. The happy kisses rain upon your face; your hands again caress the beloved head, and you look once more into the trusting eyes of your pet, so long gone from your life but never absent from your heart.

I recently did up a diagram of how our Bugzilla site was set up, mostly for the benefit of other sysadmins trying to find the various pieces of it. Several folks expressed interest in sharing it with the community just to show an example of how we were set up. So I cleaned it up a little, and here it is:

[Diagram: Bugzilla @ Mozilla site setup]

At first glance it looks somewhat excessive just for a Bugzilla, but since the Mozilla Project lives and dies by the content of this site, all work pretty much stops if it doesn’t work, so it’s one of our highest-priority sites to keep operating at all times for developer support. The actual hardware required to run the site at full capacity for the number of users we get hitting it is a little less than half of what’s shown in the diagram.

We have the entire site set up in two different datacenters (SJC1 is our San Jose datacenter, PHX1 is our Phoenix datacenter). Thanks to the load balancers taking care of the cross-datacenter connections for the master databases, it’s actually possible to run it from both sites concurrently to split the load. But because of the amount of traffic Bugzilla does to the master databases, and the latency in connection setup over that distance, it’s a little bit slow from whichever datacenter isn’t currently hosting the master, so we’ve been trying to keep DNS pointed at just one of them to keep it speedy.

This still works great as a hot failover, though, which got tested in action this last Sunday when we had a system board failure on the master database server in Phoenix. Failing the entire site over to San Jose took only minutes, and the tech from HP showed up to swap the system board 4 hours later. The fun part was that I had only finished setting up this hot failover setup about a week prior, so the timing couldn’t have been any better for that system board failure. If it had happened any sooner we might have been down for a long time waiting for the server to get fixed.

When everything is operational, we’re trying to keep it primarily hosted in Phoenix. As you can see in the diagram, the database servers in Phoenix are using solid-state disks for the database storage. The speed improvement when running large queries that is gained by using these instead of traditional spinning disks is just amazing. I haven’t done any actual timing to get hard facts on that, but the difference is large enough that you can easily notice it just from using the site.

I have an HTC Evo running Android 2.2 (Froyo). I absolutely love the thing. The only downside is the short battery life. One really cool feature it has is the WiFi HotSpot utility, which lets you turn the phone into a WiFi access point, sharing the 3G or 4G internet connection to the wifi. It turns out that the Evo has a really good WiFi antenna in it, too. It has better range as an access point than my Linksys WRT54G at home (even without the fancy antennas sticking out of the back). This really comes in handy when I’m at my parents’ house up north in the boonies, where there is no broadband internet available, and the cell coverage is spotty at best. I have to wander the house in search of a spot that manages to maintain a signal long enough to do anything useful with it. Gone are the days of trying to hold your laptop in some weird spot for the data card to get enough signal. Now once I find a spot where the phone can keep a signal, I can just leave the phone there, and go wherever I want with the laptop to sit in comfort and use it via the WiFi hotspot.

Now, being that I’m out in the boonies, even the best spot in the house to leave the phone at still has a spotty connection that comes and goes. Staying somewhere within visual range of the phone so I can watch the signal strength display (and going and waking the phone back up when it puts the screen to sleep) is a pain. I’d love it if there was an app I could run which would serve up the current phone signal strength and data status over http on the phone’s internal wifi interface (even if it’s on some random port, since you can’t use port 80 without rooting the phone). For the variety of stuff that’s in the marketplace, it would surprise me if such a thing didn’t already exist. But it’s also hard enough to describe that any keywords I can think of to search with give me several dozen unrelated hits. So, Lazyweb, anyone know of such a thing?

We’re finally at the point where I can say we’re ready to upgrade Bugzilla @ Mozilla this weekend. We’re aiming for Sunday evening (probably 6pm PST). I’ll post again when I know how long it’ll be down for (and that’ll be included in the eventual downtime notice on the IT blog as well).

There’s a staging copy set up at https://bugzilla-stage-tip.mozilla.org/ and I would appreciate people playing around with it and finding anything that might be broken before we get it to production. Before filing bugs, make sure to check the detailed status linked from the red box at the top of every page to make sure it’s not already listed (and you can also see my progress on cosmetic issues and so forth, there).

It will be down for a while at some point tonight when I reload it with an up-to-date snapshot of the production server (and that’ll be my test to find out how long it’ll take to upgrade it, too). I’m super excited because this has been a long time coming.

At Mozilla, since our server farm has gotten so big, we’ve gotten reloading and upgrading machines kind of down to a science. We have a PXE server in each colo, and installations and upgrades run over the network. It’s a great system until you get to one of the following two situations:

The machine you need to upgrade is in someone else’s colo, with no local PXE server you have control over and no one on-site to do CD flipping for you. -or-

The machine you need to upgrade *is* the PXE server.

We have situation #2 coming up soon, but I had a machine with situation #1 that I recently experimented on to get this all working without PXE or a CD.

We had a major push to get the entire RHEL4 portion of our infrastructure either reloaded or upgraded to RHEL5 about a year ago. Several machines didn’t get upgraded at the time, for one of a few reasons: machines that were due to be replaced soon anyway, machines that ran the RHN Proxies (because RHN Proxy didn’t run on RHEL5 yet), machines running software that for some reason didn’t work on RHEL5 yet (Zimbra with clustering support in our case), and machines in remote locations with no PXE and nobody to flip CDs.

RHN has a remote kickstart capability. I’ve experimented with it before, but never had too much success getting it working. I probably just didn’t play enough. But going on the general concept of how it worked, I discovered it was quite easy to get booted into the Anaconda installer from grub… Just copy the isolinux directory off the CD/DVD into /boot, and set up an entry in grub.conf for it that looked very similar to the one in pxelinux.cfg on the PXE server.

The big problem comes with where to have it locate the install media. Conventional wisdom says I’ll get the most bang for the time spent if I put it right on that machine’s hard drive. Only problem: Anaconda doesn’t know how to talk LVM until after it gets stage2.img loaded, and it needs to already have access to the install media to load that. Almost all of these machines have LVM for the main partitions. In the end, I put an “allow from” line in the Apache config on our main kickstart server to allow it to be accessed from this machine’s IP address via the Internet, and just loaded everything over the Internet. Slowed it down quite a bit, but it worked… almost.

Once it got to the point of locating an existing installation to upgrade, it died with “Upgrading between major versions is not supported.” Say what? Last year we upgraded a good couple dozen RHEL4 boxes to RHEL5 in place and it worked just fine (with a few caveats – a few things break, but it’s easy to clean up in %post in your kickstart). Well, RHEL is currently at 5.3; it was at 5.1 when we did it before. I’d long since discarded the 5.1 images off the kickstart server. So I downloaded the 5.1 DVD again, staged it on the kickstart server again, adjusted the kickstart file to install 5.1 instead of 5.3, and set it off to do its thing again. This time, success. The machine is now upgraded to RHEL 5.1. Now I just have to use yum to update it the rest of the way to RHEL 5.3. yum and glibc need to be upgraded first, then everything else.

Since I know someone will try to point it out (someone always does), yes, it is generally better to cleanly install RHEL5 from scratch than to try to upgrade from RHEL4 in place (this is probably why Red Hat disabled it on the newer installers). Some situations make that not easy (like this one, where I have no PXE available and nobody local to the machine to flip CDs for me, not to mention the lengthy amount of time the machine would be down if I had to restore a few dozen GB of backups remotely over the Internet after a clean reload). This is another one of those times that makes me wonder why Red Hat can’t be as easy to upgrade as Ubuntu is, which not only has a way to upgrade from one release to the next in place without even needing to boot directly into an installer, but fully supports and encourages that upgrade method. Red Hat has some learning to do here.

We only have 6 machines left on RHEL4 now, and one of those is a build machine that can just go away once we’re no longer having to support RHEL4. Almost there.

Most people reading this probably already know that Mozilla utilizes a network of volunteers hosting downloads of Firefox (and other Mozilla products). All of these volunteer sites are listed in a database with a numeric weight that shows, relative to the other sites in the database, how much traffic they can handle. When you click the download link for Firefox on the mozilla.com site, you get sent to download.mozilla.org, which picks one of the sites out of that list at random (the chance of a given site getting picked is its weight divided by the sum of the weights of all of the available sites) and redirects you to that site to download the file.
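To make the weighting concrete, here is a minimal sketch of that kind of weighted random pick. It is not the actual download.mozilla.org code; the mirror names and weights are made up for illustration.

```python
import random

# Hypothetical mirror list as (site name, weight) pairs. These names and
# weights are invented for illustration only.
MIRRORS = [
    ("mirror-a.example.org", 100),
    ("mirror-b.example.org", 50),
    ("mirror-c.example.org", 10),
]


def pick_probability(mirrors, name):
    """Chance of one site being picked: its weight / sum of all weights."""
    total = sum(weight for _, weight in mirrors)
    return dict(mirrors)[name] / total


def pick_mirror(mirrors, rng=random):
    """Pick one mirror at random, in proportion to its weight."""
    names = [name for name, _ in mirrors]
    weights = [weight for _, weight in mirrors]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(pick_mirror(MIRRORS))
    # 100 / (100 + 50 + 10)
    print(pick_probability(MIRRORS, "mirror-a.example.org"))
```

With these made-up weights, the first mirror gets picked 100 out of every 160 times on average, so bumping a site's weight up or down directly changes its share of the redirects.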

When you think about the sheer number of people using Firefox these days (These 9-month-old stats say we have 60 million active daily users – I’m sure it’s probably grown since then), and Firefox’s built-in application update functionality that notifies users that a new version is available and installs it for them, that means when we release a security update, we’re going to have at least 60 million downloads of it (mostly via the automatic update service) within that first 24 hours after release. The amount of bandwidth required to host that many downloads in one day is staggering. We don’t have that much bandwidth available in Mozilla’s datacenters, which is why we rely on our network of mirror sites for the downloads. Each one of these sites may only be able to handle a small number of downloads, but when you add them all together, there’s a lot more capacity than our datacenters have.

Recently, as the number of Firefox users continues to grow, even our network of download mirror sites was starting to feel the pinch from the sheer volume of downloads during our security releases. During both the Firefox 3.0.6 and 3.0.7 releases, we ended up having to enact a throttling mechanism on the update service in order to slow down the number of downloads being requested to a point where we weren’t completely burying all of our volunteer download sites, many of whom also host downloads for other open source projects besides Firefox. When a Firefox application would check to see if there’s an update available, a percentage of users were told there wasn’t one, even though there really was. If you manually picked “Check for Updates” from the Help menu, you always got it, though. It was only the automatic checks that were throttled. This mechanism is always our last resort. When we have a security update, we want it in the end-user’s hands as quickly as possible. Delaying it for a day for a percentage of users is completely counter to that goal, and so we try everything to avoid having to use it.
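As a rough illustration of the throttling idea (not the update service's actual code, which isn't shown here), the decision could look something like the sketch below. The function name and the `throttle_fraction` parameter are hypothetical, and picking users by a random draw is an assumption about how "a percentage of users" might be chosen.

```python
import random


def should_offer_update(manual_check, throttle_fraction, rng=random):
    """Decide whether an update check is told about the available update.

    manual_check: True when the user picked "Check for Updates" by hand.
    throttle_fraction: hypothetical knob, the share of automatic checks
        (0.0 to 1.0) that are told there is no update.
    """
    if manual_check:
        # Manual checks are never throttled.
        return True
    # Assumption: automatic checks are throttled by a simple random draw.
    return rng.random() >= throttle_fraction


if __name__ == "__main__":
    rng = random.Random(42)
    offered = sum(should_offer_update(False, 0.25, rng) for _ in range(10000))
    print(f"{offered} of 10000 automatic checks were offered the update")
```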

On Wednesday this last week, when the firedrill started that became the Firefox 3.0.8 release, I sent out an email to all of our download mirror admins, warning them that Firefox 3.0.8 was imminent. I also pointed out how we had ended up needing to throttle updates during the 3.0.6 and 3.0.7 releases, and saying I still didn’t think we had enough capacity on the download mirror network to handle the release. Paraphrased, “If you know anyone who’s not mirroring Mozilla yet, and would like to, get them in touch with me.”

The community came through with flying colors. In the 48 hours following that email, we increased the capacity of the download mirror network by more than half. We left the peak traffic period of the first full day of Firefox 3.0.8 downloads about 6 to 8 hours ago, and I’m quite happy to report that we never had to throttle the updates at all for the Firefox 3.0.8 release. Every Firefox browser that checked in to see if there was an update available got its update notification. Not only that, but I had mirror admins telling me on IRC during the peak traffic hours “hey, my site can still handle more traffic, go ahead and bump my weight up some.” Quite a welcome change from all the reports of dead servers during the last two releases.

Now, to be fair, we did have one other thing going for us. This release happened in the afternoon on a Friday. This effectively splits most of the download traffic between the home users on Saturday and the business users this coming Monday. Given the way we performed today, though, I’m pretty confident we could have handled the release happening on another weekday anyway.

But, the weekend lull brings up one other point that was raised on IRC by Mike Beltzner yesterday… As I mentioned above, when we have a security update, the goal is to get it into the hands of the end user as fast as possible. Mozilla’s QA signed off on the release in the early morning hours on Friday. The bits were released out to the download mirror sites shortly afterwards. Enough of those mirror sites had picked up the files by late morning to handle the normal release traffic, but we had to wait until mid-afternoon to release because of a long-standing tradition based on capacity planning to schedule the release at a time of day that will cause the fewest simultaneous downloads to avoid overloading the download mirror network. Quite simply: the Pacific Ocean covers a lot of timezones. Enough of them that there’s a general lull in internet traffic when it’s daytime hours over the Pacific Ocean. If we schedule the release to happen as the west coast of North America is going offline for the day, then we start picking up traffic one timezone at a time as Asia and then eastern Europe start coming online for the new day.

In the interest of getting the update out to the end users as quickly as possible, wouldn’t it be great if we had enough capacity on our download mirror network that we didn’t have to wait for the lull in Internet traffic caused by the Pacific Ocean to do a release? If we released in the early morning hours in the US, we’d have a large number of timezones online for the day at release time and would get a much larger percentage of the users in the first few hours after release.

So, how about it, community? You guys pulled off some awesomeness this last week. Let’s see if we can do a little bit more, and be able to handle a release without regard for the time of day it happens! If you know anyone who might be willing to host downloads for us, have them check out our Mirroring Instructions Page. All the info about how to set up and get included in the download pool is listed there.

Here are some numbers: For a normal Firefox security update, I’ve been saying that we need an availability rating of 35000 or higher to handle the release traffic. That number is the sum of the weights of the available mirrors that currently have the release files. There’s a loose perception of that number being tied to the amount of available Mbit of download bandwidth, but that’s not really accurate, and depends on a lot of factors. During both of the Firefox 3.0.6 and 3.0.7 releases, that number was hovering around 26000 most of the time. For most of the Firefox 3.0.8 release so far, it’s been somewhere between 45000 and 55000. I’m betting we could probably handle the traffic that would be generated by a morning release if we got that above 65000 consistently.
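For the curious, the availability rating is simple to compute. The sketch below is only an illustration: the `Mirror` record and the sample mirrors are hypothetical, and only the two thresholds (35000 and 65000) come from the numbers above.

```python
from dataclasses import dataclass

# Thresholds quoted in this post.
NORMAL_RELEASE_TARGET = 35000
MORNING_RELEASE_TARGET = 65000  # my guess for a morning release


@dataclass
class Mirror:
    """Hypothetical record for one download mirror."""
    name: str
    weight: int
    available: bool
    has_release_files: bool


def availability_rating(mirrors):
    """Sum the weights of available mirrors that have the release files."""
    return sum(
        m.weight for m in mirrors if m.available and m.has_release_files
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Mirror("mirror-a.example.org", 20000, True, True),
        Mirror("mirror-b.example.org", 18000, True, True),
        Mirror("mirror-c.example.org", 9000, True, False),   # still syncing
        Mirror("mirror-d.example.org", 12000, False, True),  # offline
    ]
    rating = availability_rating(sample)
    print(rating, rating >= NORMAL_RELEASE_TARGET)
```

Mirrors that are offline or still syncing the release don't count toward the rating, which is why the number moves around during a release.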

Well, to Canada at least. Canadian cable network YTV started broadcasting the English dub of the original Pretty Cure series yesterday morning on Canadian television.

From the reviews I’ve seen, it sounds like they did a pretty good job of localizing it without dumbing it down, while still faithfully telling the original story. It would make my day if some TV network in the US picked it up.