Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius – and a lot of courage – to move in the opposite direction. — Albert Einstein

Mizmo came to #fedora-admin yesterday to see about getting Drupal with a specific plugin that puts a more web-forum-like interface on top of Mailman. This spawned a big discussion about a wide array of things. I’m posting a bit about it here to give it more exposure and to try to separate out the different threads that ran through it.

Part 1: Fedora Infrastructure

First, something that has become very apparent within Fedora Infrastructure but isn’t so apparent to people outside: Infrastructure is starved for people. Unfortunately, simply throwing more people at infrastructure doesn’t help as much as it does in other parts of the project. Here’s what happens: within infrastructure, we have very few people who are trusted to do work on all of the infrastructure boxes (the so-called sysadmin-main group). These people can log in to all but one of the servers, make changes as needed, access the database, vote on change requests during release freeze, and basically have rights to fix any problems that may crop up. With great power comes great responsibility, and new members of this group need to be present in the general Fedora sysadmin community for a good while, doing a lot of things, before they are admitted.

Outside of this group we have several others with varying degrees of power over various critical items. The sysadmin-noc group monitors all the servers and has a limited ability to diagnose and help fix routine issues that may crop up (although they often need to call on someone with more access to perform actual fixes). The sysadmin-hosted group can work on the servers that keep fedorahosted up and running. The sysadmin-web group can work on the main app servers that make infrastructure services go. The sysadmin-cvs group deals with the cvs server for Fedora packages. The sysadmin-db group is in practice the same as sysadmin-main because it has access to sensitive information. We also have satellite groups in the form of committers to the applications that we write (the packagedb, fas, python-fedora, bodhi, elections, and mirrormanager committers). These applications are written by infrastructure coders to meet needs identified by the infrastructure group for Fedora. In addition, a few groups interact very closely with infrastructure. The releng team deploys the release and needs to coordinate closely with infrastructure on mirror space and on the times when we can make changes. Some of the groups that develop applications that we run (members of the transifex, fedoracommunity/moksha, and zikula communities) work with us to solve issues and bugs in our deployments and, to greater and lesser extents, help us maintain the apps.

So, where’s the bottleneck in infrastructure getting new people? There are actually three places, two of which are related:

1. We need more people involved who want to solve specific coding problems for infrastructure. These people would need to be willing to be a jack-of-all-trades. They need to be members of infrastructure who get involved with upstream projects. Sometimes there might be a performance issue that we need to have addressed. Sometimes we might identify a security problem and need to get a fix out quickly. Other times we might identify a high-value feature that would help Fedora contributors and need someone to develop it. The people doing any of this work would need to be able to sit down and involve themselves with both infrastructure and any upstreams to get commit rights or, at least, be trusted enough to get their patches looked at and added. They’d need to be able to dive into unfamiliar code, get an idea of how it works, and produce working patches.

2. We need more people to maintain (not just deploy) applications. In many ways these people end up doing the same sorts of jobs as the people in #1. However, the emphasis is slightly different. Where the first set of people are primarily coders, these people are primarily system administrators. Now, in Fedora Infrastructure we do have a lot of system administrators who come by to be a part of the team. Where we run into problems is that most of them aren’t able to commit to being part of the team over a long period. This is difficult for us because we are able to deploy more projects, but as we do, the people who are committed to maintaining the services get more and more stretched. A question that non-sysadmins often ask here is: isn’t deploying the application the hard part? After it’s deployed, it should just work, right? The answer is almost always no. All software has bugs, so there’s always going to be a need to do updates. Updates are not always backwards compatible, so we always need to test them and, when we apply them, update the configuration and code that we’ve built to help us manage the software. Critical bugs (often security related) do get discovered after a piece of software has been deployed, which makes for some late nights rushing around on a fix that must be applied to the production instance ASAP. As the service gets used more (or other software running on the same host gets used more) we can run into scaling issues that weren’t apparent when we first deployed. All of these things contribute to the maintenance burden of deploying a new service, and all of them are helped immensely by having someone who can maintain the software as a long-term commitment.

3. We need to get more people who can gateway changes to many things. These are the members of the core teams: sysadmin-web, sysadmin-db, cvsadmins, and ultimately, sysadmin-main. This ties in heavily with #2. In order to be sponsored into one of these groups you need to build up trust within the sysadmin community. You’ll be given access to services that can bring down Fedora in any number of ways, from simply making mistakes that cause outages to maliciously causing problems for core services, so we have to trust that you’ll do the right thing with your responsibilities. Building trust is not an instant thing. It takes many hours of hard work, being in the right place to do something helpful, and generally showing that you are not just someone with valuable talents, but also someone who is responsible enough to use those talents for the benefit of everyone and not just a few.

Over the past couple of years we’ve tried a variety of things to alleviate these issues, none with a great deal of success. Commitment is hard when you have mouths to feed. It’s hard to be effective at working on issues when you don’t have access to deploy your apps in a production environment. Feel free to bring some suggestions (the best suggestions come with prototypes! :-) to #fedora-admin or the infrastructure@fp.o mailing list and we’ll see if next year we can look back and say this was the year we figured out how to grow infrastructure at a sustainable rate.

Group 1 seems to be the group that could most easily be filled with “opportunistic” volunteers (by that I mean people who can’t commit for the long term but are willing to help *right now*).

I wanted to join the infrastructure group some time ago (I even sent my introduction to the list), but I quickly realised I would never be able to commit the required amount of time. Most of the time, I can’t even attend the meetings.

As such, I chose one of the infrastructure projects which seemed fun to hack on, sufficiently important to the infrastructure, in need of help, and which provided some nice challenges: Bodhi (more of a random choice though, all other projects seemed to fit these points). I then started contributing small patches, I got commit access, and now I’m trying to implement some of the changes asked by QA in the Bodhi TG2 rewrite. But even then, finding time to sit and work on it becomes increasingly hard.

All in all, I (and I think many volunteers) could do group 1 (obviously within the limits of my Python knowledge, which even if increasing daily is still very far from the expertise you guys have) if the expectation is that I will be *reasonably* available (as opposed to an on-call worker who will be expected to be always available), but not any of the others.

I really don’t see how volunteers could commit to maintaining apps over the long term. This is something we do in our free time, and even if it’s something we like and actually *want* to do, other things will inevitably get in our way ($dayjob for instance, or family, friends, other activities,…) :(

I think you’ve managed to make a success of your infrastructure involvement and I see a few other people who have done so as well by following the same basic path :-).

I do know that it is possible for volunteers to go through the group 2 phase and make it into group 3… I did it as a volunteer. jds2001 and nigelj are two more recent examples. However, I agree that it is very hard. At one point I had to let infrastructure know that I had a new job and had to cut my involvement a bit short. Luckily for me, I was hired to work on infrastructure full time shortly after that. Perhaps what we need most is to get multiple people involved with the projects so that when one person has a shortage of time, someone else can step up to the maintenance tasks.

That’s true. One thing about FES, though, is that we need to provide new coders to the distribution as well as to infrastructure. So we need to make FES able to stand on its own and throw coders at problems that need fixing inside of the distro as well. (For instance, the FES ticket for addressing bundled libraries within Fedora hasn’t been touched :-(