A fellow developer has started work on a new Drupal project, and the sysadmin has suggested that they should only put the sites/default subdirectory in source control, because it "will make updates easily scriptable." Setting aside that somewhat dubious claim, it raises another question -- what files should be under source control? And is there a situation where some large chunk of files should be excluded?

My opinion is that the entire tree for the project should be under control, and this would be true for a Drupal project, rails, or anything else. This seems like a no-brainer -- you clearly need versioning for your framework as much as you do for any custom code you write.

That said, I would love to get other opinions on this. Are there any arguments for not having everything under control?

18 Answers

I would say that the minimum that source control should contain is all of the files necessary to recreate a running version of the project. This even includes DDL files to set up and modify any database schema, and in the correct sequence too. Minus, of course, the tools necessary to build and execute the project as well as anything that can be automatically derived/generated from other files in source control (such as JavaDoc files generated from the Java files in source control).

@EdWoodcock: You're right, getting the order correct can be a real pain, but sometimes you want to re-create a particular state of the database, or optionally apply certain changes when testing, rather than dropping and recreating the whole thing. I find it varies by project.
– FrustratedWithFormsDesigner Nov 18 '11 at 16:08


Point taken, there's a level of pragmatism required for that one.
– Ed Woodcock Nov 18 '11 at 16:16


@JayBazuzi: Workstation setup guides (in source control) should outline the necessary tools and dependencies, how to set them up, and where to get them. Maintaining a usable toolkit is important, but it is not the purpose of source control. I suppose if you REALLY wanted to, you could add the installer file/.msi and some instruction files, but that might not be feasible in many workplaces. Would you really want to check VisualStudio Pro 2010 or IBM RAD, XMLSpy, etc., into your source control system? Many workplaces have controlled deployments for these tools.
– FrustratedWithFormsDesigner Nov 18 '11 at 21:56


@artistoex: That's splitting hairs. It's generally assumed that the build box has the same libraries as the dev boxes. If the two differ, something is wrong with the IT management. All you would (ideally) need is the source code. For some projects this isn't applicable, but for most it should be.
– Mike S Dec 9 '11 at 10:24

Make sure that the "certain documentation" isn't dependent on a particular tool. I've run into a number of projects that used something like the SunOS version of Frame for docs; they checked in all of the .mif files, but not the resulting .ps or .pdf files. Now that SunOS and Frame are relegated to the dustbin of history, a lot of design docs only exist as treasured paper copies.
– Bruce Ediger Nov 18 '11 at 16:22


@BruceEdiger In that case I would personally want both the output and the tool-specific information. If the tool disappears, you at least still have a static electronic copy :)
– maple_shaft♦ Nov 18 '11 at 16:26

And don't forget to put all database code in source control as well! This would include the original create scripts, the scripts to alter tables (marked with which version of the software uses them, so you can recreate any version of the database for any version of the application), and scripts to populate any lookup tables.
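A minimal sketch of that convention (file names and layout are assumptions): number the committed scripts with zero-padded prefixes so plain lexical glob order replays them in sequence.

```shell
set -eu
mkdir -p db/migrations
# Committed, numbered scripts; the prefix makes glob expansion apply them in order.
printf 'CREATE TABLE widget (id INT);\n' > db/migrations/001_create_widget.sql
printf 'ALTER TABLE widget ADD name TEXT;\n' > db/migrations/002_add_name.sql
for f in db/migrations/[0-9]*.sql; do
  # A real deployment would pipe each file to the database client here,
  # e.g. mysql "$DB_NAME" < "$f"
  echo "would apply $f"
done
```

Because every script is in source control and ordered, any historical version of the schema can be rebuilt by stopping at the right prefix.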

The only things that I do not put under source control are files that you can easily regenerate or that are developer-specific. This means executables and binaries compiled from your source code, documentation generated by reading/parsing files under source control, and IDE-specific files. Everything else goes into version control and is appropriately managed.

Hard won experience has taught me that almost everything belongs in source control. (My comments here are colored by a decade and a half developing for embedded/telecom systems on proprietary hardware with proprietary, and sometimes hard to find, tools.)

Some of the answers here say "don't put binaries in source control". That's wrong. When you're working on a product with lots of third party code and lots of binary libraries from vendors, you check in the binary libraries. Because, if you don't, then at some point you're going to upgrade and you'll run into trouble: the build breaks because the build machine doesn't have the latest version; someone gives the new guy the old CDs to install from; the project wiki has stale instructions regarding what version to install; etc. Worse still, if you have to work closely with the vendor to resolve a particular issue and they send you five sets of libraries in a week, you must be able to track which set of binaries exhibited which behavior. The source control system is a tool that solves exactly that problem.

Some of the answers here say "don't put the toolchain in source control". I won't say it's wrong, but it's best to put the toolchain in source control unless you have a rock-solid CM system for it. Again, consider the upgrade issue mentioned above. Worse still, I worked on a project where there were four separate flavors of the toolchain floating around when I got hired -- all of them in active use! One of the first things I did (after I managed to get a build to work) was put the toolchain under source control. (The idea of a solid CM system was beyond hope.)

And what happens when different projects require different toolchains? Case in point: After a couple of years, one of the projects got an upgrade from a vendor and all the Makefiles broke. Turns out they were relying on a newer version of GNU make. So we all upgraded. Whoops, another project's Makefiles all broke. Lesson: commit both versions of GNU make, and run the version that comes with your project checkout.

Or, if you work in a place where everything else is wildly out of control, you have conversations like, "Hey, the new guy is starting today, where's the CD for the compiler?" "Dunno, haven't seen them since Jack quit, he was the guardian of the CDs." "Uhh, wasn't that before we moved up from the 2nd floor?" "Maybe they're in a box or something." And since the tools are three years old, there's no hope of getting that old CD from the vendor.

All of your build scripts belong in source control. Everything! All the way down to environment variables. Your build machine should be able to run a build of any of your projects by executing a single script in the root of the project. (./build is a reasonable standard; ./configure; make is almost as good.) The script should set up the environment as required and then launch whatever tool builds the product (make, ant, etc).
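A minimal sketch of such an entry point (the paths and variable names are assumptions, not a prescribed layout); it pins the environment and prefers a make binary committed with the project over whatever happens to be on PATH:

```shell
#!/bin/sh
set -eu
# Environment the build depends on lives here, not in developers' shells.
export CFLAGS="${CFLAGS:--O2 -Wall}"

pick_make() {
  # Prefer the version-controlled toolchain over the system one.
  if [ -x ./tools/bin/make ]; then
    echo ./tools/bin/make
  else
    echo make
  fi
}

MAKE=$(pick_make)
echo "building with $MAKE"
# A real script would now hand off to the build tool: "$MAKE" "$@"
```

This is exactly the "commit both versions of GNU make" lesson from above: each checkout carries, or selects, the tool version its Makefiles were written against.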

If you think it's too much work, it's not. It actually saves a ton of work. You commit the files once at the beginning of time, and then whenever you upgrade. No lone wolf can upgrade his own machine and commit a bunch of source code that depends on the latest version of some tool, breaking the build for everyone else. When you hire new developers, you can tell them to check out the project and run ./build. When version 1.8 has a lot of performance tuning, and you tweak code, compiler flags, and environment variables, you want to make sure that the new compiler flags don't accidentally get applied to version 1.7 patch builds, because they really need the code changes that go along with them or you see some hairy race conditions.

Best of all, it will save your ass someday: imagine that you ship version 3.0.2 of your product on a Monday. Hooray, celebrate. On Tuesday morning, a VIP customer calls the support hotline, complaining about this supercritical, urgent bug in version 2.2.6 that you shipped 18 months ago. And you still contractually have to support it, and they refuse to upgrade until you can confirm for certain that the bug is fixed in the new code, and they are large enough to make you dance. There are two parallel universes:

In the universe where you don't have libraries, toolchain, and build scripts in source control, and you don't have a rock-solid CM system.... You can check out the right version of the code, but it gives you all kinds of errors when you try to build. Let's see, did we upgrade the tools in May? No, that was the libraries. Ok, go back to the old libraries -- wait, were there two upgrades? Ah yes, that looks a little better. But now this strange linker crash looks familiar. Oh, that's because the old libraries didn't work with the new toolchain, that's why we had to upgrade, right? (I'll spare you the agony of the rest of the effort. It takes two weeks and nobody is happy at the end of it, not you, not management, not the customer.)

In the universe where everything is in source control, you check out the 2.2.6 tag, have a debug build ready in an hour or so, spend a day or two recreating the "VIP bug", track down the cause, fix it in the current release, and convince the customer to upgrade. Stressful, but not nearly as bad as that other universe where your hairline is 3cm higher.

With that said, you can take it too far:

You should have a standard OS install that you keep a "gold copy" of. Document it, probably in a README that is in source control, so that future generations know that version 2.2.6 and earlier built only on RHEL 5.3, and 2.3.0 and later only on Ubuntu 11.04. If it's easier for you to manage the toolchain this way, go for it, just make sure it's a reliable system.

Project documentation is cumbersome to maintain in a source control system. Project docs are always ahead of the code itself, and it's not uncommon to be working on documentation for the next version while working on code for the current version. Especially if all your project docs are binary docs that you can't diff or merge.

If you have a system that controls the versions of everything used in the build, use it! Just make sure it's easy to sync across the whole team, so that everyone (including the build machine) is pulling from the same set of tools. (I'm thinking of systems like Debian's pbuilder and responsible usage of python's virtualenv.)
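For the Python case, a sketch of that discipline (file names assumed): the pin list is committed, while the environment itself is disposable and regenerated from it on any machine.

```shell
set -eu
# requirements.txt is the committed artifact; real projects pin exact versions
# in it (left empty here only to keep the sketch self-contained).
printf '' > requirements.txt
# .venv is derived, never committed: anyone can rebuild it from the pin list.
python3 -m venv .venv
./.venv/bin/pip install -q -r requirements.txt
```

The same split applies to pbuilder: the configuration that selects package versions is versioned; the chroot it produces is regenerable.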

The use case for source control is: what if all of our developers' machines and all of our deployment machines were hit by a meteor? You want recovery to be as close to "check out and build" as possible. (If that's too silly, you can go with "hire a new developer.")

In other words, everything other than OS, apps, and tools should be in VCS, and in embedded systems, where there can be a dependency on a specific tool binary version, I've seen the tools kept in VCS too!

Incomplete source control is one of the most common risks I see when consulting -- there's all sorts of friction associated with bringing on a new developer or setting up a new machine. Along with the concepts of Continuous Integration and Continuous Delivery you ought to have a sense of "Continuous Development" -- can an IT person set up a new development or deployment machine essentially automatically, so that the developer can be looking at code before they finish their first cup of coffee?

Drupal uses git, so I will use git's terminology. I would use subrepos for each module to be able to pull down module updates from Drupal's official repos while still preserving the structure of individual deployments. That way you get the scriptability benefits without losing the benefits of having everything under source control.
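A self-contained demonstration of the idea using git submodules (a local stand-in repository replaces a real Drupal module URL, and the `sites/all/modules` path is just the conventional Drupal layout):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for a module's upstream repository.
git init -q module-upstream
git -C module-upstream -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m "module release"

# The site repository pins the module at an exact commit via a submodule.
git init -q site
cd site
git -c protocol.file.allow=always \
    submodule add -q "$tmp/module-upstream" sites/all/modules/demo
git -c user.email=a@example.com -c user.name=a \
    commit -qm "pin demo module"
```

Updating a module is then a `git pull` inside the submodule followed by committing the new pinned commit in the site repository, which is what makes updates scriptable without giving up version control of the deployment.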

Configuration files, if they include configuration options that are different for each developer and / or each environment (development, testing, production)

Cache files, if you are using filesystem caching

Log files, if you are logging to text files

Anything that, like cache files and log files, is generated content

(Very) Large binary files that are unlikely to change (some version control systems don't like them, but if you are using hg or git they don't mind much)
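The exclusions above map naturally onto an ignore file; a sketch for a Drupal-style layout (the exact paths are illustrative):

```shell
# Write an ignore file covering the categories listed above.
cat > .gitignore <<'EOF'
# Per-environment configuration (commit a template instead)
sites/default/settings.php
# Generated content
cache/
*.log
EOF
```

Everything not matched by these patterns is tracked, which keeps the "new team member checks out a working copy" property intact.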

Think of it like this: every new member of the team should be able to check out a working copy of the project (minus the configuration items).

And don't forget to put database schema changes (simple sql dumps of every schema change) under version control too. You could include user and api documentation, if it makes sense to the project.

@maple_shaft raises an important issue with my first statement regarding environment configuration files in the comments. I'd like to clarify that my answer is to the specifics of the question, which is about Drupal or generic CMS projects. In such scenarios you typically have a local and a production database, and one environment configuration option is the credentials to these databases (and similar credentials). It's advisable that these are NOT under source control, as that would create several security concerns.

In a more typical development workflow, however, I do agree with maple_shaft that environment configuration options should be under source control to enable a one-step build and deploy of any environment.
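One common compromise for the credentials case, sketched with assumed file names: commit a placeholder template, and keep the real per-environment copy out of version control.

```shell
set -eu
# Committed template with placeholders, no real secrets.
printf '$db_url = "mysql://USER:PASS@localhost/drupal";\n' > settings.example.php
# Each environment copies the template and fills in real credentials;
# the copy itself is listed in .gitignore and never committed.
cp settings.example.php settings.php
```

This keeps the shape of the configuration versioned (so new environments are reproducible) while the secrets themselves stay local.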

@maple_shaft In the context of the question (a Drupal project or generic CMS web project), a "one-step build and deploy of any environment" is a highly unlikely scenario (will you put production database credentials in with everything?). I'm answering the question, not providing general guidelines on what should be put under version control. - But your downvote is welcome :)
– Yannis Rizos♦ Nov 18 '11 at 16:11

Anything that you need to work and can change needs to be versioned some way or another. But there is rarely a need to have two independent systems keep track of it.

Anything generated in a reliable way can usually be attached to a source version -- therefore it doesn't need to be tracked independently: generated source, binaries that are not passed from one system to another, etc.

Build logs and other stuff that probably nobody cares about (but you never know for sure) are usually best tracked by whatever is generating them: Jenkins, etc.

Build products that are passed from one system to another need to be tracked, but a Maven repo is a good way to do it -- you don't need the level of control that source control provides. Deliverables are often in the same category.

Whatever remains (and at this point, there should be little more than source files and build server configuration) goes into source control.

Source control is a change tracking mechanism. Use it when you want to know who changed what and when.

Source control is not free. It adds complexity to your workflow and requires training for new colleagues. Weigh the benefits against the cost.

For example, it can be tough to control databases. We used to have a system where you had to manually save definitions in a text file and then add those to source control. This took a lot of time and was unreliable. Because it was unreliable, you could not use it to set up a new database, or to check at what time a change was made. But we kept it for years, wasting countless hours, because our manager thought "all things should be in source control".

Source control is not magic. Try it, but abandon it if it doesn't add enough value to offset the cost.

Are you serious? Source control is bad because it requires training for new colleagues? Are you actually saying you'd prefer to work long-term with people who don't know how to use source control and aren't willing to learn? Personally I'd rather flip burgers.
– Zach Nov 19 '11 at 1:40


My point is, even if you're only using it for some things (cough source code cough), your colleagues should already know how to use it, so training them shouldn't add overhead when you use it for something else.
– Zach Nov 19 '11 at 1:55

The SDK, even though it's in the same directory. If I make a patch to the SDK, I should make it a separate project, since it would be per framework instead of per app.

3rd-party libraries. Leftovers from migration, backups, compiled code, code under another license (perhaps).

So I don't do an hg addremove, for instance, since I make a new clone every once in a while when the SDK updates. That also makes me do a complete backup every time the SDK updates, and check that a fresh clone from the repository works properly.

I do not agree with one piece of the accepted answer provided by @FrustratedWithFormsDesigner, where he advocates not placing into version control the tools necessary to build the project. Somewhere in source control (adjacent to the code being built) should be the build scripts for the project, and those scripts should run from a command line alone. If by tools he means IDEs and editors, they should not be required to build the project whatsoever. They are good for active/rapid development, and setting up that type of environment could be scripted as well, or the tools downloaded from another section of SCM or from some type of binary management server; such IDE setup should be as automated as possible.

I also disagree with what @Yannis Rizos states about not placing configurations for environments in source control. The reason is that you should be able to reconstruct any environment at will using nothing but scripts, and that is not manageable without having configuration settings in source control. There is also no history of how configurations for various environments have evolved unless this information is placed into source control. Now, production environment settings may be confidential, or companies may not want to place them in version control, so a second option is to still place them in version control, so that they have a history, and give that repository limited access.