It seems that most source control systems still use files as the means of storing version data. Vault and TFS use SQL Server as their data store, which I would think would be better for data consistency as well as speed.

So why is it that SVN, CVS, and (I believe) Git still use the file system as, essentially, a database instead of using actual database software (MSSQL, Oracle, PostgreSQL, etc.)? I ask because our SVN server just corrupted itself during a normal commit.

EDIT: I think another way of asking my question is "why do VCS developers roll their own structured data storage system instead of using an existing one?"

What do you think most databases use as their basic backing? Most use files (a few use direct access to hard disks, however). You can have all the features of a database by using "just files".
–
Joachim Sauer, Nov 3 '12 at 15:56

1

@JoachimSauer Fair point, though of course you'd have to create a database yourself then. Which is silly if your desired feature set is close to that of existing solutions and you don't have very good reasons not to use any of those.
–
delnan, Nov 3 '12 at 16:01

Until persistent fast memory becomes a reality, the persistence afforded by disk drives (hence files) is the only real alternative.
–
Oded♦, Nov 3 '12 at 16:14

2

@delnan Transactional support and internal consistency. We are now restoring our SVN repository from tape because the SVN server didn't properly write to all the files it was supposed to. Also, searching huge volumes of data. My point is, why try to reinvent the wheel?
–
Andy, Nov 3 '12 at 17:00

6

Every major operating system comes with a file system built in. All these file systems have the same basic functionality (files, folders, persistence of same). Basically, a database is one extra dependency that the end user needs to install and keep updated. Source control isn't most people's primary business (unless you are SourceForge or GitHub). VC is often installed on servers through the command line by the newest member of the team. Ease of installation and setup is important.
–
GlenPeterson, Nov 3 '12 at 19:04

7 Answers
7

TL;DR: Few version control systems use a database because it isn't necessary.

To answer a question with a question: why wouldn't they? What benefits do "real" database systems offer over a file system in this context?

Consider that revision control is mostly keeping track of a little metadata and a lot of text diffs. Text is not stored in databases more efficiently, and indexability of the contents isn't going to be a factor.

Let's presume that Git (for argument's sake) used a BDB or SQLite DB for its back-end to store data. What would be more reliable about that? Anything that could corrupt simple files can also corrupt the database (since that's also a simple file with a more complex encoding).

Following the programmer's principle of not optimizing unless it's necessary: if the revision control system is fast enough and works reliably enough, why change the entire design to use a more complex system?

TLDR? Your answer was twice as long and the question was real short as it is!
–
Brad, Nov 3 '12 at 16:29

18

@Brad The three words following the TL;DR are the abridged version of the answer, not a statement that the question is too long and he didn't read it before answering.
–
delnan, Nov 3 '12 at 16:31

6

@Andy Mercurial also has "grep in history", and git likely has it as well. It's also lightning fast already. As for leaving things to experts: the people who develop VCSs are experts.
–
delnan, Nov 3 '12 at 17:10

3

Just want to add that I do see your point; if a VCS writes bad data, it doesn't matter if it's writing that data to a file or a database. The flip side, though, is that file-based repos are probably writing to more than one file at a time, and normally there's no transactional support for that, so if one file writes but another fails, your VCS is now corrupt, whereas multiple table writes within a database transaction will commit or fail as a unit. I feel as though a group of devs building database software has more experience with this than the people writing SVN... but maybe I'm wrong.
–
Andy, Nov 3 '12 at 19:19

3

Your choice of git "for argument's sake" is an important point here: git has a very good model for writing its objects, but many tools don't. With git, if the computer is powered off in the middle of a commit, you'll have written some of the objects to the filesystem and they'll merely be unreachable. With other VCSs, you may have appended the changes to half the files (and confusion ensues.) You could argue that other version control tools are poorly designed (and you'd be right), but when you're writing a VCS, it's a lot easier to just use a SQL transaction and let it do the right thing.
–
Edward Thomson, Nov 4 '12 at 1:10

You seem to be making a lot of assumptions, possibly based on your experience with SVN and CVS.

Git and Mercurial are basically like SVN and CVS

Comparing git and CVS is like comparing an iPad and an Atari. CVS was created back when dinosaurs roamed the Earth. Subversion is basically an improved version of CVS. Assuming that modern version control systems like git and Mercurial work like them makes very little sense.

A relational database is more efficient than a single-purpose database

Why? Relational databases are really complicated, and may not be as efficient as single-purpose databases. Some differences off the top of my head:

Version control systems don't need complicated locking, since you can't do multiple commits at the same time anyway.

Distributed version control systems need to be extremely space efficient, since the local database is a full copy of the repo.

Version control systems only need to look up data in a couple of specific ways (by author, by revision ID, sometimes full-text search). Making your own database that can handle author/revision ID searches is trivial, and full-text searches aren't very fast in any relational database I've tried.

Version control systems need to work on multiple platforms. This makes it harder to use a database that needs to be installed and running as a service (like MySQL or PostgreSQL).

Version control systems on your local machine only need to be running when you're doing something (like a commit). Leaving a service like MySQL running all the time just in case you want to do a commit is wasteful.

For the most part, version control systems never want to delete history, just append to it. That may lead to different optimizations, and different methods of protecting integrity.

Relational databases are safer

Again, why? You seem to be assuming that because data is stored in files, version control systems like git and Mercurial don't have atomic commits, but they do. Relational databases also store their databases as files. It's notable here that CVS doesn't do atomic commits, but that's likely because it's from the dark ages, not because it doesn't use a relational database.
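To illustrate how an atomic update is possible with nothing but files, here is a minimal sketch of the common "write to a temporary file, then rename" technique. It is not taken from the source of any particular VCS; the function name `atomic_write` is made up for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data`: readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace renames over the destination; on POSIX filesystems
        # the rename is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, drop the temporary file; the old `path` is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies before the rename, the original file is still intact and only a stray temporary file is left behind.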

There's also the issue of protecting the data from corruption once it's in the database, and again the answer is the same. If the filesystem is corrupted, then it doesn't matter which database you're using. If the filesystem isn't corrupted, then your database engine might be broken. I don't see why a version control database would be more prone to this than a relational database.

I would argue that distributed version control systems (like git and Mercurial) are better for protecting your database than centralized version control, since you can restore the entire repo from any clone. So, if your central server spontaneously combusts, along with all of your backups, you can restore it by running git init on the new server, then git push from any developer's machine.

Reinventing the wheel is bad

Just because you can use a relational database for any storage problem doesn't mean you should. Why do you use configuration files instead of a relational database? Why store images on the filesystem when you could store the data in a relational database? Why keep your code on the filesystem when you could store it all in a relational database?

There's also the fact that open-source projects can afford to reinvent the wheel whenever it's convenient, since they don't have the same kinds of resource constraints that commercial projects do. If you have a volunteer who's an expert at writing databases, then why not use them?

As for why we would trust the writers of revision control systems to know what they're doing... I can't speak for other VCSs, but I'm pretty confident that Linus Torvalds understands filesystems.

Why do some commercial version control systems use a relational database then?

Most likely some combination of the following:

Some developers don't want to write databases.

Developers of commercial version control systems have time and resource constraints, so they can't afford to write a database when they have something close to what they want already. Also, developers are expensive, and database developers (as in, people who write databases) are probably more expensive, since most people don't have that kind of experience.

Users of commercial version control systems are less likely to care about the overhead of setting up and running a relational database, since they already have one.

Users of commercial version control systems are more likely to want a relational database backing their revision data, since this may integrate with their processes better (like backups for example).

Yikes; you say I make a lot of assumptions, and then proceed to do so as well. Really, it's not possible to have two people commit to the same repo at the same time, so a VCS doesn't need to worry about that? I don't assume SVN or others don't have atomic commits; but I also doubt that they only need to touch one file to record a commit, and I'm also not sure they take advantage of transaction support offered by the file system (assuming the file system the VCS is installed on even offers that). There are quite a few other assumptions you've made, but I don't think dozens of comments is the way to go.
–
Andy, Nov 4 '12 at 18:36

1

@Andy And my point is that you can handle those exact same scenarios without a full-blown relational database. If two people commit at the exact same time, the server can do one after another. That's not a complicated feature to implement. If you want to do that with a local user, just have a lock file. When you start a commit, get a lock on the file. When you end a commit, release the lock. If you want to allow commits to multiple branches at once, use a lock file for each branch. Sure, SQLite would do this for me, but it's not necessary.
–
Brendan Long, Nov 5 '12 at 18:08
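The lock-file approach described in the comment above could look roughly like the following sketch. The name `repo_lock` and the lock path are hypothetical; a real tool would also need a policy for stale locks left by a crashed process.

```python
import os
from contextlib import contextmanager


@contextmanager
def repo_lock(lock_path):
    """Hold an exclusive lock file for the duration of a commit."""
    # O_CREAT | O_EXCL makes creation fail if the file already exists,
    # so only one process at a time can create the lock file.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)  # release the lock
```

A second process calling `repo_lock` on the same path while the lock is held gets a `FileExistsError` and can wait and retry. Using one lock path per branch gives the per-branch variant mentioned in the comment.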

1

@BrendanLong Great points. Appreciate the discussion. Just to be clear, I think there are advantages and disadvantages to both kinds of backing stores; I don't believe there's just one correct answer. However, I was kinda surprised there seem to be only three (four if you count Vault and Veracity separately) that use SQL and the vast majority do not, that's all.
–
Andy, Nov 6 '12 at 14:48

Actually svn used to use BDB for repositories. This was eventually gotten rid of because it was prone to breakage.

Another VCS that currently uses a DB (SQLite) is Fossil. It also integrates a bug tracker.

My guess at the real reason is that VCSes work with lots of files. Filesystems are just another kind of database (hierarchical, focused on CLOB/BLOB storage efficiency). Normal databases don't handle that well because there's no reason to -- filesystems already exist.

Didn't know Fossil used a database for all data storage, thanks for that info. As far as BDB (or even SQLite) goes, I'm not sure how that stacks up in reliability vs. things like Oracle or MSSQL.
–
Andy, Nov 3 '12 at 19:22

1

BDB wouldn't exactly count as reliable -- like SQLite it's an in-process database. That said, I think the reliability of Oracle/MSSQL/MySQL/Postgres, depending on how you configure them, is not much different from filesystems. The main problem is that RDBMS are not built for the hierarchical & graph structures that VCSes commonly work with. And in that case, filesystems just win.
–
Mike Larsen, Nov 4 '12 at 1:36

3

@Andy: Fossil was created by the creator of SQLite. It's not really that surprising :-)
–
Jörg W Mittag, Nov 4 '12 at 2:28

1

@Andy: I'd trust SQLite much more than Oracle or MSSQL. It's no wonder that it's the most used SQL database out there, by a huge margin. Also, it's the one ported to the most different architectures, each one with its own set of challenges, making the shared code incredibly bullet-proof.
–
Javier, Nov 5 '12 at 2:11

3

@Andy, that's what transactions are for. No matter at what point you kill a good DB engine, a given transaction is either committed or not. SQLite's implementation of atomic commits (sqlite.org/atomiccommit.html) is a particularly sophisticated one.
–
Javier, Nov 5 '12 at 18:53
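The "commit or fail as a unit" behaviour discussed in these comments can be sketched with Python's built-in `sqlite3` module. The table layout and the `record_commit` helper are invented for illustration and do not reflect the schema of Fossil or any other tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (id INTEGER PRIMARY KEY, message TEXT)")
conn.execute("CREATE TABLE files (rev_id INTEGER, path TEXT NOT NULL)")
conn.commit()


def record_commit(conn, message, paths):
    """Insert a revision and its file rows as one transaction."""
    # `with conn` commits if the block succeeds and rolls back
    # everything if an exception escapes it.
    with conn:
        cur = conn.execute(
            "INSERT INTO revisions (message) VALUES (?)", (message,)
        )
        for path in paths:
            conn.execute(
                "INSERT INTO files (rev_id, path) VALUES (?, ?)",
                (cur.lastrowid, path),
            )
```

If any insert fails, for example because a path violates the `NOT NULL` constraint, the revision row written earlier in the same call is rolled back too, so the two tables never disagree.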

A filesystem is a database. Not a relational database, of course, but most are very efficient key/value stores. And if your access patterns are well-designed for a key-value store (eg, the git repository format), then using a database probably doesn't offer significant advantages over using the filesystem. (In fact, it's just another layer of abstraction to get in the way.)
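As a rough illustration of the filesystem acting as a key/value store, the sketch below files each blob under a name derived from its own hash. It is loosely modeled on the idea behind git's object directory, but it is a simplification: git's real format adds a header and compresses the data, and the function names here are made up.

```python
import hashlib
import os


def put_object(root, data):
    """Store `data` under a key derived from its own SHA-1 hash."""
    key = hashlib.sha1(data).hexdigest()
    # Fan out into subdirectories named after the first two hex digits
    # so that no single directory grows too large.
    directory = os.path.join(root, key[:2])
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, key[2:])
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(data)
    return key


def get_object(root, key):
    """Look a blob up by its key: one path computation, one file read."""
    with open(os.path.join(root, key[:2], key[2:]), "rb") as f:
        return f.read()
```

The key is the only index needed, which is why a general-purpose query engine adds little for this access pattern.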

A lot of the database features are just extra baggage. Full text search? Does full text search make sense for source code? Or do you need to tokenize it differently? This also requires that you store full files at every revision, which is uncommon. Many version control systems store deltas between revisions of the same file in order to save space, for example Subversion and Git (at least, when using pack files.)
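To show what "storing deltas" means, here is a small line-based sketch built on Python's `difflib`. It is not the delta encoding that Subversion or Git actually use; `make_delta` and `apply_delta` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe `new` (a list of lines) as copies from `old` plus added lines."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))     # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("add", new[j1:j2]))  # store only the new lines
        # "delete" needs no entry: those old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the newer revision from the older one and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

Only the changed lines are stored in full, which is also why a full-text index over every revision would first have to reconstruct the complete files.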

The cross-platform requirements make using a database more challenging.

Most version control tools are built to run on multiple platforms. For centralized version control tools, this only affects the server component, but it is still difficult to rely upon a single database server since Unix users cannot install Microsoft SQL Server and Windows users may be unwilling to install PostgreSQL or MySQL. The filesystem is the least common denominator. However, there are several tools where the server must be installed on a Windows machine, and thus require SQL Server, for example SourceGear Vault and Microsoft Team Foundation Server.

Distributed version control systems make this more challenging still, since every user gets a copy of the repository. This means that every user needs a database to put the repository into. This implies that the software:

Is limited to a subset of platforms where a particular database exists

Targets a single database backend that is cross-platform (eg, SQLite).

Targets a pluggable storage backend, so that one could use whatever database they wished (possibly including the filesystem).

Most distributed version control systems, therefore, just use the filesystem. A notable exception is SourceGear's Veracity, which can store in a SQLite database (useful for local repositories) or a relational database like SQL Server (possibly useful for a server.) Their cloud hosted offering may use a non-relational storage backend like Amazon SimpleDB, but I do not know this to be true.
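The "pluggable storage backend" option can be sketched as a small interface with two interchangeable implementations. This is not Veracity's API; the class names, method names, and table layout are all assumptions made for the example.

```python
import os
import sqlite3
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """The only storage operations the rest of the tool relies on."""

    @abstractmethod
    def put(self, key, data): ...

    @abstractmethod
    def get(self, key): ...


class FileStore(ObjectStore):
    """Backend that keeps one file per object."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key, data):
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class SQLiteStore(ObjectStore):
    """Backend that keeps all objects in one SQLite database file."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS objects "
                "(obj_key TEXT PRIMARY KEY, content BLOB)"
            )

    def put(self, key, data):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO objects VALUES (?, ?)", (key, data)
            )

    def get(self, key):
        row = self.conn.execute(
            "SELECT content FROM objects WHERE obj_key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return bytes(row[0])
```

Code written against `ObjectStore` does not care which backend is in use, at the cost of maintaining and testing every backend.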

Just as a devil's advocate comment perhaps, most people who ask these types of "why not use a database" questions appear to mean "why not use an RDBMS?" with all the ACID compliance and other issues involved, having already discarded the fact that all file systems are databases of their own ilk.
–
mikebabcock, Nov 6 '12 at 16:18

I would say it's because the primary data structure of a version control system is a DAG, which maps to databases very poorly. A lot of the data is also content addressable, which also maps to databases very poorly.

Data integrity isn't the only concern of a VCS, they are also concerned with version history integrity, which databases aren't very good at. In other words, when you retrieve a version, you not only need to make sure that version has no current flaws, but also that nothing in its entire history has been surreptitiously altered.
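One way to picture version history integrity in a content-addressed DAG is the sketch below, where each commit ID is a hash over its content and its parents' IDs. It is a simplified model and not the object format of git or Mercurial; `commit_id`, `verify`, and the dictionary layout are assumptions made for the example.

```python
import hashlib


def commit_id(parent_ids, content):
    """Hash the content together with the IDs of all parent commits."""
    h = hashlib.sha256()
    for parent in parent_ids:
        h.update(parent.encode())
        h.update(b"\n")
    h.update(content)
    return h.hexdigest()


def verify(commits, tip):
    """Recompute every ID reachable from `tip`; return False if any differs.

    `commits` maps a commit ID to a (parent_ids, content) pair.
    """
    stack, seen = [tip], set()
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if cid not in commits:
            return False  # a referenced ancestor is missing
        parents, content = commits[cid]
        if commit_id(parents, content) != cid:
            return False  # content or ancestry was altered
        stack.extend(parents)
    return True
```

Because every ID depends on the parents' IDs, changing anything in an old commit changes the ID it should have, and the check fails when walking back from the tip.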

VCS are also a consumer product in addition to an enterprise product. People use them in small, one-man hobby projects. If you add the hassle of installing and configuring a database server, you are going to alienate much of that part of the market. I'm guessing you don't see a lot of Vault and TFS installations at home. It's the same reason spreadsheets and word processors don't use databases.

Also, this is more a reason for DVCS, but not using a database makes it extremely portable. I can copy my source tree onto a thumb drive and reuse it on any machine, without having to configure a database server process.

As far as corrupting during commits, VCS uses the exact same techniques as databases to prevent simultaneous access, make transactions atomic, etc. Corruptions in both are very rare, but they do happen. For all intents and purposes, a VCS data store is a database.

"maps to databases very poorly" Yet Vault and TFS do just this. "Data integrity isn't the only concern of a VCS, they are also concerned with version history integrity, which databases aren't very good at." I fail to see how storing version history lends itself into files over a database especially since I've named products that do just that. ". Corruptions in both are very rare, but they do happen." None of those results in the first page talk about the Vault server database being corrupt. The one link that even talks about the Vault software the problem is the WC got corrupted.
–
Andy, Nov 3 '12 at 19:12

"For all intents and purposes, a VCS data store is a database." Well... that's my point. Why not just stick the data in a real database system instead of rolling your own?
–
Andy, Nov 3 '12 at 19:13

2

@Andy Yes, it's a database, but not all databases are substitutable for one another. Each database has a certain view on the world (for example, SQL DBs basically implement the relational model). As this answer details, the data a VCS stores and the way that data is used doesn't fit the relational model. I'm not sure if some NoSQL db does better, but they're rather new and are yet to prove their superiority (I recall reports of serious integrity issues for some). And then there are all the other issues atop of that.
–
delnan, Nov 3 '12 at 20:30

DAGs are only used in DVCS (unless you consider a linear history an exceptionally simple DAG, which it is, but that's not really a helpful abstraction.) When your history is linear, with monotonically increasing changesets, a SQL database makes a lot more sense.
–
Edward Thomson, Nov 3 '12 at 22:45

Monotonically increasing version numbers don't make a lot of sense for VCSes. I've used a fair number of them, and the ones with centralized version numbers (CVS & SVN being the 2 I'm most familiar with) tend to be a pain to merge with. And even those use DAGs when they attempt to do merging. Just because their storage representation isn't based around it doesn't mean it isn't used.
–
Mike Larsen, Nov 4 '12 at 1:43

This answer is just bad. The only really true point is lowering the number of dependencies. Both backing systems should be on par, as you should be doing proper backups; debugging DB applications is no more difficult than debugging applications that write files; and a text editor is always available? I don't even know what your point is there, as the VCS isn't itself going to use a text editor, and there ARE other DB servers out there (SQLite, PostgreSQL, MySQL, etc.), so if you WANTED a DB-backed solution, lack of a DB server shouldn't be a factor.
–
Andy, Nov 5 '12 at 17:46

1

@Andy ...the programmers are going to use a text editor. You know, text editing is still available as a secondary function even in your favourite IDE.
–
ZJR, Nov 5 '12 at 19:39

1

@Andy SQLite is the only possible alternative to text files, given the vast number of distributed scenarios modern DVCSs serve. (idk, maybe you missed the "distributed" part of DVCS.) Anything else would be too cumbersome (configuration + firewalling + license) or even silly to distribute. Then again, doing a worst-case-scenario postmortem on an SQLite database might prove hard.
–
ZJR, Nov 5 '12 at 19:40

@ZJR How is editing code in a text editor relevant to the backing store of a VCS? Are you suggesting manually editing, say, SVN's database? Also, my question is not limited to DVCS, so I don't know why you're harping on it.
–
Andy, Nov 6 '12 at 14:52