Why Perforce is more scalable than Git

Posted six years ago

Okay, say you work at a company that uses Perforce (on Windows). So you're happily tapping away using Perforce for years and years. Perforce is pretty fast -- I mean, it has this "nocompress" option that you can tweak and turn on and off depending on where you are, and it generally lets you get your work done. If you change your client spec, it synchronizes only the files it needs to. Wow, that blows the mind! Perforce is great, why would you ever need anything else? And it's way better than CVS.

Suddenly you have to clone something with git, and BAM! The world is changed. You feel it in the water. You feel it in the earth. You smell it in the air. Once you've experienced git, there is no going back, man. Git is the stuff man. You might have checked out firefox -- but have you checked out firefox ooon GIT?

So many really obvious things are missing in p4. Want to restore your source tree to a pristine state? "git clean -fd". Want to store your changes temporarily to work on something else? "git stash". Share some code with a cube-mate without checking in? "git push". Want to automatically detect out of bounds array accesses and add missing semicolons to all your code? "git umm-nice-try"
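For anyone following along at home, here's the sort of thing that's a one-liner in git. This is a throwaway demo in a temp directory (file names made up), safe to run anywhere:

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email demo@example.com
git config user.name demo

echo 'int main(void) { return 0; }' > main.c
git add main.c
git commit -qm "initial"

# Restore a pristine tree: blow away untracked build junk
touch main.o a.out
git clean -fdq

# Park an in-progress edit, then get it back later
echo '/* WIP */' >> main.c
git stash push -q
git stash pop -q
```

After the `clean`, the object files are gone; after the `pop`, the WIP edit is back in `main.c` exactly where you left it.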

Branching on git is like opening a new tab in a browser. It's a piece of cake. You can branch for EVERY SINGLE BUGFIX. And you wrote the code, so you get to merge it back in, because you are the expert.
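The branch-per-bugfix flow, sketched (the branch name is hypothetical):

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email demo@example.com
git config user.name demo
echo 'v1' > app.c
git add app.c
git commit -qm "initial"
base=$(git symbolic-ref --short HEAD)   # whatever your default branch is

git checkout -q -b fix-null-deref       # branching: instant, local, cheap
echo 'v2' > app.c
git commit -qam "fix null deref"

git checkout -q "$base"
git merge -q fix-null-deref             # you wrote it, you merge it
git branch -q -d fix-null-deref         # and the branch just evaporates
```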

Branching on Perforce is kind of like performing open heart surgery. It should only be done by professionals: experts in the art who really know what they are doing. You have to create a "branch spec" file using a special syntax. If you screw up, the entire company will know and forever deride you as the idiot who deleted "//depot/main". The merging is done by gatekeepers. Hope they know what they're doing!
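(In case you've never seen one, a branch spec is a little form like the following -- the depot paths and names here are made up:)

```
Branch: main-to-dev
Owner:  jdoe

View:
        //depot/main/... //depot/dev/...
```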

Now, if you have been using git for a few days you might discover this tool called "git-p4". "AHA!" you might say, "I can import from my company's p4 server into git and work from that, and then submit the changes back when I am done," you might say. But you would be wrong, for a number of reasons.

git-p4 can't handle large repositories

Really. It's just a big python script, and it works by downloading the entire p4 repository into a python object, then writing it into git. If your repo is more than a couple of gigs, you'll be out of memory faster than you can skim reddit.

But that problem's fixable. I was able to hack up git-p4 to do things a file at a time in about an hour. The real problem is:

Git can't handle large repositories

Okay this is subjective because it depends on your definition of large. When I say large, I mean about 6 gigs or so. Because your company's source tree is probably that large. If you have the power, you will use it. Maybe you check in binaries of all your build tools, or maybe for some reason you need to check in the object files of the nightly builds, or something silly like that. P4 can handle this because it runs on a cluster of servers somewhere in the bowels of your company's IT department, administered by an army of drones tending to its every need. It has been developed since 1995 to handle the strain. Google also uses Perforce, and when it started to show its strain, Larry Page personally went to Perforce's headquarters and threatened to direct large amounts of web traffic up their executives' whazzoos until they did something about it.

Git has none of that. The typical git user considers the linux kernel to be a "large project". If you've looked at Linus's git rant on Google code, take a listen to see how he sidesteps the question of scalability.

Don't believe me? Fine. Go ahead and wait a minute after every git command while it scans your entire repo. It's maddening because it's long enough to be annoying, but not enough time to skim Geekologie.

The solution

You know what? I don't think many people really use distributed source control. The centralized model is here to stay. Most git users (especially those using Github) use the centralized model anyway.

Ask yourself this: Is it really that important to duplicate the entire history on every single PC? Do you really need to peruse changelist 1 of KDE from an airplane? In most cases, NO. What you really want is the other stuff: easy branching, clean, and stash, and the ability to transfer changes to another client. The distributed stuff isn't really asked for, or needed. It just makes it hard to learn.

Just give me a version control system that lets me do these things and I'll be happy:

Let me merge changes into my coworker's repos, without having to check them in first.

Make branching easy.

Don't waste 40% of my disk space with a .git folder, when this could be stored on a central server.

If you have a 6gb repo or you're checking in object files, you are pants-on-head crazy anyway. Git follows the philosophy of providing "enough rope to hang yourself". If you do actually hang yourself, you're probably doing the world a favour (ok, so the metaphor breaks down here).

::Want to store your changes temporarily to work on something else? "git stash".

--> This has been possible with the "p4 shelve" command since P4 2009.2.

::Share some code with a cube-mate without checking in? "git push".

--> There are ways to do this, but creating a branch for every person or code fix isn't a typical way of doing business in P4.

Re: Branching, git vs. P4

::Branching on Perforce is kind of like performing open heart surgery. It should only be done by professionals: experts in the art who really know what they are doing. You have to create a "branch spec" file using a special syntax.

--> This really has never been true. Branch specs are helpful but not required. If you understand branching strategy for your team/group/company, this isn't difficult at all. Merging, on the other hand, can be ugly if you do it wrong and submit the changes. That's true with any SC system.

::If you screw up, the entire company will know and forever deride you as the idiot who deleted "//depot/main".

--> You can't really delete a branch by branching. By merging, sure. This is what rollback is for.

I _love_ having the full history available; it's why, whenever I expect to work on an svn project, I check out the specific folders with git-svn. I very often do whatchanged -p to check other people's checkins, perhaps grepping it, etc. And the log too. I haven't tried Perforce, but doing svn log on an sf.net repo is slower than just loading their viewvc web page. More than enough time to lose track of the task at hand; git lets me check logs without pulling me out of my flow.

(And for those of us outside the USA, being able to work offline is a must, but I can see how not everyone will care about that.)

Another place where Git comes up short is the inability to lock files. Many programmers seem to see that as a feature of Git. But, for anyone (artists, designers) who works extensively in binary files where changes can't be merged, the ability to lock a file for editing, or to know that someone else has already locked it for editing is the single most important feature of version control.

Wow. So Git handles Gnome, KDE and Android among others, and you're saying Git can't scale. Your argument is based on large media files and a repository set up the way Perforce likes it. It's not a problem with Git. The problem is with the way you've configured your repository. I don't blame you, since you're coming to DVCS from a centralized mindset. Change the way you structure your repository and you'll find things are actually much better. You might want to look into submodules.

Finally, you're talking about disk space. Mind telling me why I need to have double the disk space available with Perforce just to be able to switch quickly between any 2 branches? I have actually run out of disk space just because of this and have lost valuable productive time trying to free up enough space to check out another branch. Never again.

We used git for an AAA videogame that had good success. The repository grew to 110-120GiB, and of course the working copy got larger and larger as it accumulated cruft on your computer. We had it mixed with SVN (for the artists), and there were lots of binary files. With the right mix of SSDs, common sense and configuration, git worked just perfectly.

On the other hand, I'm using perforce right now. Turns out that even a simple merge, check-in or branch operation is slow. The client continuously polls the server, sometimes crashes if you leave it running long enough, and must rely on the network and servers for every little thing you want to do. Yes, shelving relies on the server; the server even keeps track of what I have and what I don't, with the obvious desynchronization issues.

I'm gnashing my teeth at perforce - it wants to download over a gigabyte of files that are already exactly in the place it wants to download them to, because (unlike git) it doesn't scan the file system and doesn't md5sum large media files - it prefers to download them all over again.

I've had the same experience trying to use git as a front end to our company's giant p4 repo. Unlike most of the complaints I've read, we don't store big binary blobs in the repo. One or two here and there, but most of it is source files -- 600MB and 27k files worth of source code. Due to really bad design choices stemming from an uncouth history in SourceSafe :) things are pretty strongly interconnected, so it doesn't make much sense to just split them up into several repos. Git on that repo was just frustratingly slow, even compared to p4 over a VPN. I've also never managed to get git-p4 to work.

I really want the local branches and the lack of needing to check out files, and we gave up on updating our server back in 2002 (it was that or health insurance -- that bad), but it's just become too big a time sink for me to even investigate it anymore.

I was heavily modding Fallout 3 with files from fallout3nexus.com. After a while I wanted to be able to switch back and forth to different mods in a way that FOMM (Fallout Mod Manager) was unable to do well.

So I thought, "hey! I'll just take a fresh install and make it a git repo." This worked to some degree but some of the mods had large files. Eventually when I went to switch to a different branch it just died with an out of memory error.

This is because git has to hold the whole file in memory to process it. The machine I was using has 6GB of RAM, but on Windows most builds of git are 32-bit.

Bam. Dead in the water. I had to actually boot up Ubuntu on a live disc, apt-get install the 64-bit version of git just to swap branches. Fail; plain and simple.

It sucks to have a designer create a great tool like git only to have him also be too lazy to solve some edge cases for others.

* File sizes > RAM? This should be doable in a slower way only when needed.

* File sizes > 32-bit version capabilities? Again fix it but have it use the slower algorithm only when needed.

* 32-bit only version.... Seriously, most new computers other than netbooks have 64-bit capability these days. Just make 64-bit the default.

Being too stuck up to solve a problem that would obviously increase adoption of your tool just seems dumb. And those who say a >6GB repo means you're doing something wrong, or who don't have large repos or don't revision large files, obviously haven't run across a business need to do so; when your paycheck requires it, you'll be singing a different tune.

I used Perforce when I worked at Google and will likely use it again in my next company for which I just got hired. I like it but I know I am going to miss features from a DVCS. I used Bazaar at my last company and it was quite nice but also suffers from the same problem as git and I believe hg.

Often Git and Mercurial are sold as DVCS being the killer feature. To me branching being a first-class and easy operation is the killer feature, and that doesn't require DVCS. We use Accurev, which has first class branching and is server based. A full checkout of our codebase is about 10GB, although most people only checkout 4-6GB of that. The depot history goes back to 2004. With those sizes it works just fine.

That's exactly my case. We tried to migrate a WebMethods repository containing lots of services (corporate scale, all currently used/deployed, and impossible to split into submodules/subtrees). It contains something like 100k files, and doing a simple git status took about 10 minutes of disk IO while it was scanning for changes.
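For what it's worth, newer versions of git grew some knobs for exactly this pain. A hedged sketch -- the untracked cache landed around git 2.8, and the builtin filesystem monitor around 2.37, so check your version before relying on either:

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
# Cache the results of the untracked-file scan between status runs (git 2.8+):
git config core.untrackedCache true
# On git 2.37+ you can also enable the builtin filesystem monitor, which
# watches for changes instead of rescanning 100k files each time:
#   git config core.fsmonitor true
git status --short   # prints nothing in this empty repo
```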

git is clearly designed for what I would call "small" projects like the Linux kernel. If you want to do another project, you do not add it to an existing git repository, you make another one. This best fits with pushing and pulling a single project. But if you have a large system that is composed of many such smaller projects, you have to use something other than the source control system to synchronize their dependencies.

To be honest, I think you should write a new article that refers to these comments here as well. Some even claim that having 6 gigs of source code is the wrong thing to do... what kind of project produces 6 gigs of source code? Not even Java is that verbose... ;-)

>"I don't need distributed source control, so I know nobody out there will need it, as I don't see why they should. But they WILL need to move 6gb repos, because I do, so that's what normal people need."

In short, different people different needs. I'm the happiest SCM user since I switched to git for my <6gb projects, which doesn't mean it does have to fit everyone and every possible project, for the same reason I don't use vim to edit jpg files.

Cheap branching is still broken and doesn't work properly under CVS, Subversion, or Perforce.

It works well under DVCS tools such as Git and Mercurial (though a lack of branch naming is sometimes an issue depending upon the tool) - it works absolutely blindingly under Clearcase - unfortunately Clearcase is expensive IBM software, and the hardware constraints on that tool (particularly for dynamic views) make it a compromise as well.

Perforce, CVS, and Subversion are cut from the same cloth, however - they are light-years behind the branching capabilities of DVCSes, and also of Clearcase, which has had fantastic branching semantics available since the mid-90s.

I wish our data sets were small enough to check in to P4. Or does it handle a few TB of uncorrelated sensor data with ease? There's always an upper bound. P4 seems to hit the sweet spot for game design; but for raw code, I'll stick with git.

BTW, rather than store the data in the repo, we've started storing the git hashes with the data. Works nicely.
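The trick that commenter describes -- content-address the data, version only the hash -- can lean on git's own blob hashing, which works even outside a repository. A minimal sketch (the file name is made up; `git hash-object` without `-w` computes the hash without storing anything):

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
printf 'hello\n' > frame-0001.raw      # stand-in for a big sensor file
# Hash it exactly as git would (blob header + SHA-1), without storing it:
hash=$(git hash-object frame-0001.raw)
echo "$hash"   # -> ce013625030ba8dba906f756967f9e9ca394464a
```

Check the recorded hash into the source repo, and keep the bulk data wherever big files live happily.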

The whole point of git is that you work on only what you need to and leave the rest to the others. When you're working on what you need to, yes it is amazing to have all the history since day 1, especially since that day 1 code could have been written by somebody else thinking something else.

Why would you ever expect git to work well in a centralized usage scenario? Would you expect p4 to work well in a distributed use case? Honestly, dude...Apples and Oranges.

And what happens when that central server is inaccessible? or when you're travelling to a trade show with a demo and you have a really cool idea on the plane you'd like to try out? P4 can be a real pain in the proverbial wazoo in those circumstances.

I agree that Perforce is probably better for most people. Having used it at a previous job, I wish I could go back to it. However, we're using git here because the Perforce prices have gotten sky-high! $900/user just to get in the door is ridiculous. If I wanted to pay that kind of money, I'd get a real tool like ClearCase...

It's been a long time since I used git-p4, because, well, it couldn't handle the depot. But I've written a more efficient and more targeted importer (as well as a plugin mechanism for the core git); if you're building git yourself, you can pull "git://iabervon.org/git.git p4-clean" and try it. I use it at work with our large perforce depot and it does a good job for all of the parts of the depot I happen to work on. Exporting is left as an exercise for the reader (and if you do it, let me know), but it's great for figuring out what actually happened in the recent history and for previewing tricky merges so that you can check whether you're doing them right in p4 afterwards.

You'll want to get some p4api and set P4API_BASE to the directory where you untar it; this lets the plugin use the C++ bindings for perforce instead of running the command-line client.

Look at Documentation/vcs-git-p4.txt for how to configure it; you generally end up actually getting data simply with "git fetch origin" (or "git fetch" if you apply the bugfix I forgot to send back from work).

Binary files are a known problem and are receiving some attention in git. There are some ideas in the cooking pot that may make a big difference. On the mailing list it was asked whether anyone has a repository that could be experimented upon.

Thanks for this post. I've been hearing such great stuff about git, and like you, the commands it offers seem absolutely killer. But I was concerned it would suffer from the same problems as all the other open source SCMs: it dies horribly with large files.

I'm in the games biz myself and we ran into these problems with svn. Once we got past a certain size team and asset base, it started to really choke. I wrote up a little postmortem at scottbilas.com about our experience with it (search for 'svn').

We tried really hard to make svn work because of the astronomical price of P4. A price that we all grudgingly pay again and again in this industry because everything else is so much worse.

My current plan is to clone the commands from git into our command line p4 extension tool we have (it does things like auto-creating Crucible code reviews and such). For example, 'stash' should be pretty easy to implement. Actually, it already exists. Search the p4 public depot for 'p4tar'. I haven't tried it out yet.

Anyway the other commands should be implementable with a tool on top of p4 using p4api.net. If I only had some spare time.. :)

There's a bit of confusion in this piece. Firstly, what systems like Perforce do is collect many projects in one place and give you a timeline for them. So when looking at "repository size", consider that you don't normally keep every project in the same repository with git.

Of course with the Perl Perforce repository, the size was something like 450MB in Perforce and 70MB in Git, once the crazy metadata format used by perforce's insane integration system was appropriately grokked.

I mean, don't get me wrong, I think Perforce is a great product - beats SVN hands-down in design and was around many years before - it's just too complex. Integration is badly modelled, hardly anyone understands it properly. So in that respect, Perforce doesn't scale to very large teams because the branching model is too hard to work with.

Yes of course Git doesn't do a lot of that product release cycle development / Software Configuration Management. It's unix: it does one thing and does it well.

Unenlightened personal thoughts on storing binary info in Perforce, or any other version control tool....

IMHO, a version/revision control tool, with all its diff, 3-way-merge, and compressed delta storage goodies, is at its best when it's storing editable source. Storing binary data, especially binary data that can be recreated from the version-controlled source, is not the ideal use for this kind of system. That said, I've done it too, because I also believe that every version of the source should include the tools used to process the source into the product shipped to the customer. But I would like to consider the use of a different paradigm for the archiving of binary data, especially mongo BLOBs. I would like to consider a system more ideally suited to storing Big Honkin binary files, and have a reference to those BLOBs in the version control system. Now I wonder what would work.....

I must be missing something. Wouldn't a personal branch work just fine for this?

{quote}Make branching easy. {/quote}

Branching in Perforce is difficult for users who don't understand the nuances of client workspace mapping. When you understand how the repository is structured, and how your local hard drive is laid out, it becomes so much easier. If you don't know the structure of the repository, which contains the family jewels, please turn in your coder's badge. If you don't know how your own disc is structured, please turn in your computer.

{quote}Don't waste 40% of my disk space with a .git folder, when this could be stored on a central server. {/quote}

Good idea. I'm curious -- let's say we had a multi-TB repository, with 80k files on just one tip, tens of thousands of branches, 1600 coders, 11 locations, 8 time-zones. If we were using GIT and I wanted to work disconnected from the network for a couple of days, what would be "gotten" onto my laptop?

If you have that large a repo, it's probably because you're stuffing large binary blobs into git. If you're stuffing large binary blobs into git, you need to look into the .gitattributes file so that git won't try to diff/compress said large binary files. It's got some heuristics to try and recognize them, but making its work a bit easier is sure to show you some gain.
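For the curious, a minimal `.gitattributes` along those lines might look like this (the patterns are hypothetical; `binary` is a built-in macro for `-diff -merge -text`, and `-delta` tells git not to attempt delta compression when packing):

```
# .gitattributes -- keep git from diffing/delta-compressing big blobs
*.psd  binary -delta
*.pak  -delta
*.wav  binary -delta
```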

There was quite a bit of research done a while ago investigating the size of the average dev team. The number was <10. Kind of surprising, but true. There are relatively few places in the world where enormous, cross-referenced project repositories are needed: Microsoft, Google, Siemens, Philips, government agencies, etc.

However, for 99% of the software developers out there, git (or one of its DVCS brethren) just works. In those cases, the benefits of being entirely mobile, having near-zero time cost for most actions, and the ability to easily experiment with the contents of the repository are game-changing wins. For the top 1%, there are tools like Clearcase and Perforce.

@masukomi... you can set up P4 proxies to help alleviate the pain if you have a lot of data to transfer. But you are right, it has to keep a database on a single server, the size of which is dependent on the number of clientspecs/branches.

actually.... p4 doesn't run on a cluster of servers. That's one of its biggest shortcomings. It runs on ONE server -- one really big, beefy, freaking server if you have lots of stuff and users. Google, for example, was having serious problems with the speed of, well, everything p4, until they went out and bought one of the most powerful computers they could. Then all was well again.

So yeah, it's scalable, but only in direct proportion to the size of the server it's on.

It seems as if there is a specific problem with Git: namely, it doesn't handle large binary files well (large images, artwork, etc.).

Has anyone actually taken this specific use-case to the Git developers on the mailing list?

Second, it seems like your problem could be solved by having a separate machine to run Git just for your Binary assets. When you need to make a build, you just dump all those files to the machine, have it version the directory, and then include that 'version' into your Git source repo.

I just wanted to let everyone know that this post is dead-on. I work at a software company that is entirely based on P4. The repositories are huge because they contain a lot of non-source files, like Photoshop, videos and such. Trying to push to git has been painful because it is massively slow on any large repository. The insert alone can take several hours.

Any web company with non-source code in their repo will run into the same thing. I'm surprised more people haven't pointed out this glaring problem with the git model.

You can't simply measure performance against the size of the repository and call that "scalable".

The number of simultaneous clients that can be doing operations is just as important, if not more so. P4 was notorious for holding locks far longer than necessary, and clients would queue up for minutes at a time (I remember syncs that would take more than half an hour on a fairly small repository because there were a hundred other clients trying to sync).

P4 does *not*, in fact, scale well (although I admit that more recent versions of P4 are better than what I was using in 2004).

I feel pretty confident your assessment of git would be different if you had 1,000 coworkers using your P4 repository at the same time.

Versioning artwork against code is just as important as versioning one code change against another or one artwork change against another, and having your artwork and your code in different version control systems, even when they're both structured around atomic changes (which Alien Brain isn't), causes problems.

So most teams just dump artwork, intermediate data files, and all sorts of things in the same p4 depot that their code is in. And it works like a champ. Except that p4 is missing so many of the cool features that git gives you.

As a CLI-proficient user of both git and p4 (also having hacked git-p4 to restore some sanity), I can state with full confidence that git beats p4's CLI like a redheaded stepchild eight days a week and thrice on Sundays. Any perceived benefits or "power" that p4 gains from being adept with binary blobs of redonkulous girth is irrelevant when the command line tool is worse than friggin' CVS and all of the GUIs suck.

"we store all of our code and data in p4 because it's the Right thing to do"

Well you may have identified another use case where Git is not ideal - really large binary blobs. I think the problem is Git has to checksum (sorry SHA1) all files it scans - and that would take some time on a 36GB file.

To be fair, Git has always been advertised as a SCM - i.e. a source-code management system - and for that use-case it absolutely rocks IMO. Personally I would still investigate a hybrid approach where you have the option of pulling just the source down to your lappy with Git, so if you are on the plane and you DO want to look at change-set 1 at least you can!

The one feature of a DVCS that I really really really like is the ability to use it as a sneakernet. Not all of the machines I develop on are connected to a network, or connected to the same network that the central/blessed repository is on.

Bypassing the central repository to share patches... meh. This I do not see as a feature -- if there's a central repository, it should be used as the mechanism of communication between developers.

On the other hand, "stashing" stuff is really nice. And branching (and merging) *should* be easy. I'm all over those two requests.

As for wasting my disk space... meh. Sometimes I care, sometimes I don't (disk is cheap, but disk fills up faster still). Having an option for git to use either a local or a remote (central/blessed) repository would be nice.

Disclaimer: I still use CVS, I've used Perforce (and liked it), and I use git (and like it), and I don't currently have any repositories that approach the sizes discussed in the article.

Perforce may scale well with regards to data size. In my experience it doesn't scale well over a distributed network. Between having to check files out to work on them and tight integration with Visual Studio, if your link to the Perforce server goes down you practically have to stop work.

My one experience of Perforce was doing work with another company remotely. Our VPN was unfortunately a bit dodgy. Combining that with Perforce led to an incredibly frustrating experience.

Let me get this straight: you're saying perforce is faster than git for large projects? This surprises me because most git operations are completely off-line since all the data is local. I thought that operations which require network I/O are the slower ones. Care to back up your claim with a specific use case and some data? (It's an honest question btw, I don't use git or perforce so I'm not defending git here.)

@zzz, FWIW, we store all of our code and data in p4 because it's the Right thing to do. We make video games; PS3+BluRay == massive content. At any given time, our data works with our code. If I need to sync back a month to look at some issue, I need the specific data to be sync'ed back too.

Really, for us, p4 works great. It stays out of our way, it's faster than anything out there. It's not distributed, but we don't care about that.

I've done some testing on big p4 repositories. Specifically, 36GB. Git was awful. p4 continues to haul ass. A p4 sync takes less than one second, if no files have changed on the server. Just doing a git status was on the order of minutes.

Git, plain and simple, does not scale to large repositories. That's OK, I guess, it's not really designed to handle that use case.

Git was designed to be a version control tool - not a quasi-file-server 'repository', which is how most other tools like Subversion and Perforce are actually used. It was also not designed to track a whole set of unrelated projects - say, a team's entire code-base - something that both Linus and then Randall made pretty clear.

The solution? Track each project as a single Git repository, and if you need to tie them together, create a master repository that includes each one as a submodule. The flexibility you gain from 'setting free' your individual projects is enormous, as is the smart use of a master repository that uses branches to create different mash-ups of your overall code-base.
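A minimal sketch of that master-repository idea (repo names are hypothetical, and note that git 2.38+ requires `protocol.file.allow=always` to add a submodule from a local path):

```shell
set -e
demo=$(mktemp -d)

# A standalone project repo:
git init -q "$demo/libfoo"
git -C "$demo/libfoo" -c user.email=a@b -c user.name=demo \
    commit -q --allow-empty -m "libfoo: initial"

# The master repo ties projects together as submodules:
git init -q "$demo/master"
cd "$demo/master"
git -c protocol.file.allow=always submodule --quiet add "$demo/libfoo" libfoo
git -c user.email=a@b -c user.name=demo commit -qm "add libfoo submodule"
```

The master repo now records a specific commit of libfoo in `.gitmodules` plus a gitlink entry, so each mash-up branch can pin its own combination of project versions.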
