
Git, the open source distributed version control system created by Linus Torvalds to handle Linux's decentralized development model, is being used for a rather surprising project: Windows.

Traditionally, Microsoft's software has used a version control system called Source Depot. This is proprietary and internal to Microsoft; it's believed to be a customized version of the commercial Perforce version control system, tailored for Microsoft's larger-than-average scale. Over the years, Redmond has also developed its own version control products. Long ago, the company had a thing called SourceSafe, whose propensity to corrupt its database gave it a reputation as the moral equivalent of tossing all your precious source code in a trash can and then setting it on fire. In the modern era, the Team Foundation Server (TFS) application lifecycle management (ALM) system offered Team Foundation Version Control (TFVC), a much more robust, scalable version control system built around a centralized model.

Much of the company uses TFS not just for version control but also for bug tracking, testing, automated building, and project management. But large legacy products, in particular Windows and Office, stuck with Source Depot rather than adopting TFVC. The basic usage model and theory of operation between Source Depot and TFVC are pretty similar, as both use a centralized client-server model.

Since 2013, Microsoft has been integrating Git into TFS, and today TFS and Visual Studio offer full support for centralized version control using TFVC and distributed version control using Git. With this first-party support for the system, Git adoption has spread within the company, most visibly in open source projects such as ChakraCore, the JavaScript engine used in the Edge browser, but also in closed source products—including, as it turns out, Windows itself.

We've written about OneCore, Microsoft's restructuring of Windows and unification of the operating system across phones, tablets, Xbox, PCs, servers, HoloLens, and beyond. Before OneCore, Microsoft had multiple incompatible forks of Windows, each with its own development stream, causing substantial duplication of effort. With OneCore, the common parts were brought together, and the unique customizations (things like the Xbox dashboard and HoloLens's 3D interface) were cleanly isolated and layered on top.

Just as Windows' development had become complex and fragmented, so too did the company's internal systems for things like source control, issue tracking, testing, building, code analysis, and all the other tasks that fall under the application lifecycle management umbrella. And just as Windows' development was unified as OneCore, the company has embarked on an effort to unify its ALM and develop what it calls One Engineering System (1ES).

The cornerstone of 1ES is TFS, but for 1ES, the company wanted to do more than just standardize on TFS; it wanted to switch to a single version control system. TFVC, Source Depot, and Git were the obvious contenders, though other options such as Mercurial were also considered. In the end, the company standardized on Git.

However, this decision came with some complexity. The Windows codebase, for example, is large, with decades of history. It has millions of files, taking hundreds of gigabytes of storage. In a centralized version control system, this isn't too big an issue; only the central server needs to store all of this data, with each developer only needing to store the latest source code on their local systems. But decentralized systems don't work this way; by default, making a local working copy of a remote repository in Git requires replicating everything, including the decades of history. This is key to its decentralized nature—every repository contains all the history of all the files, making them all equal peers. For Windows, this meant that every developer would need to fetch millions of files and hundreds of gigabytes. The initial clone of the repository took hours, and even simple tasks such as checking to see if all files are up to date took many minutes.
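To get a concrete sense of the difference, here is a minimal sketch using a throwaway local repository (the repository name and commit messages are invented for illustration). A normal clone replicates every commit, while a shallow clone, one of stock Git's own mitigations for large histories, fetches only the tip:

```shell
# Build a small throwaway repository with three commits of history
git init -q demo && cd demo
git config user.email "dev@example.com" && git config user.name "dev"
for i in 1 2 3; do
  echo "revision $i" > file.txt
  git add file.txt && git commit -qm "commit $i"
done
cd ..

# A normal clone replicates the full history; --depth 1 fetches only the tip
# (--no-local forces the transport machinery a remote clone would use)
git clone -q --no-local demo full
git clone -q --no-local --depth 1 demo shallow

git -C full rev-list --count HEAD      # 3 commits
git -C shallow rev-list --count HEAD   # 1 commit
```

Shallow clones trade away history for size, though, which is part of why this stock feature wasn't enough for Windows: developers still need the full history available on demand.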

Accordingly, Microsoft has been working to enhance Git to improve the way it handles vast repositories. Central to this effort is a new project, released (in part) as open source: the Git Virtual File System (GVFS). The premise of GVFS is straightforward enough: rather than fetching all the data at once, only a bare skeleton of the repository is populated up front. The virtualized file system then retrieves additional data on demand, as it's needed. Building one particular Windows component, for example, will cause GVFS to fetch the files that make up that component, along with anything that the component depends on, but it will stop short of fetching all the many hundreds of gigabytes the repository contains.
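GVFS does its work by virtualizing the file system, but stock Git's sparse checkout (available in recent Git versions, 2.25 or later, via `git sparse-checkout`) gives a rough feel for the "bare skeleton" idea: the full history is still fetched, but only the directories you ask for are populated in the working tree. A minimal sketch, using an invented two-component repository:

```shell
# Throwaway repository with two "components"
git init -q project && cd project
git config user.email "dev@example.com" && git config user.name "dev"
mkdir -p kernel shell
echo "core" > kernel/core.c
echo "ui" > shell/ui.c
git add . && git commit -qm "initial layout"
cd ..

# Clone, then restrict the working tree to the kernel component only;
# shell/ disappears from the working tree but remains in the repository
git clone -q --no-local project partial
git -C partial sparse-checkout set kernel
ls partial    # kernel only; shell/ is not populated
```

The key difference from GVFS is that sparse checkout must be told which paths to populate, whereas GVFS fetches whatever the build or the developer actually touches.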

This work requires changes to Git itself, which Microsoft is working to contribute back to the Git project; those changes are naturally open source. So too is a large portion of GVFS itself. But a key portion is not: while the code for fetching files and interacting with a remote Git repository is all open, the actual file system component that runs in kernel mode is not.

FUSE for Windows on the horizon?

Currently, that file system driver is available as a preview under a restrictive license. Microsoft says that the driver isn't yet ready for prime time; you should only test GVFS in a virtual machine or similar disposable environment. But the driver itself may turn out to be useful for more than GVFS, and in so doing could fill a longstanding gap in Windows' functionality.

Developing file system drivers is complex on any platform; if a file system driver crashes, you have the double inconvenience of crashing the machine with a blue screen or kernel panic and the specter of data loss from mishandling how data is read from or written to the disk. But Windows makes it particularly awkward, because Windows has no first-party, supported equivalent to FUSE ("file system in userspace"), a framework for developing file systems without having to write kernel code.

FUSE is available on macOS, Linux, FreeBSD, Android, and more. It can be used to develop full file systems that store data on disks, but just as often, it's used for "virtual file systems" of the very kind that Microsoft has created with GVFS. With GVFS, files are stored locally on a regular NTFS disk or remotely on a Git server. GVFS doesn't manage the actual on-disk layout of how that data is stored; it just provides a sort of intercept layer. If a program tries to open a file that hasn't yet been cached locally, GVFS will fetch it from the remote Git repository and store it locally on NTFS before allowing the open operation to proceed.
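That intercept-and-fetch pattern can be caricatured in a few lines of shell. This is a deliberately crude sketch: GVFS performs the interception transparently in a kernel-mode driver, whereas here every access must go through a hypothetical `open_file` wrapper. Files are materialized from the repository only on first access:

```shell
# A throwaway repository standing in for the remote Git server
git init -q origin-repo && cd origin-repo
git config user.email "dev@example.com" && git config user.name "dev"
echo 'int main(void) { return 0; }' > main.c
git add main.c && git commit -qm "add main.c"
cd ..
mkdir cache

# Hypothetical intercept layer: fetch a file into the local cache
# only when it is first opened; serve all later opens locally
open_file() {
  if [ ! -f "cache/$1" ]; then
    git --git-dir=origin-repo/.git show "HEAD:$1" > "cache/$1"
  fi
  cat "cache/$1"
}

open_file main.c    # first access: fetched from the repository
open_file main.c    # second access: served from the local cache
```

A real FUSE (or GVFS) implementation hooks the operating system's own open path, so unmodified programs get this behavior for free; that is exactly what the kernel-mode driver provides.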

There are many FUSE file systems that work in much the same way, transparently fetching files from, for example, cloud storage or remote systems connected by SSH, and copying them back to the remote system whenever the local file is modified.

Lacking FUSE, Windows has no good way of building this same kind of virtual file system. This is unfortunate. Windows 8.1 included a neat way of using OneDrive: all your cloud files "appeared" local, but the data would only actually be fetched when you attempted to open a file. However, Windows 8.1 didn't use a file system driver for this OneDrive integration. While attempts to open cloud files from within Explorer (and within certain applications) were properly intercepted, causing only a slight delay while the file was downloaded before it could be opened, Windows 8.1 didn't intercept attempts to open files made from the command line or through low-level Win32 APIs. This made the OneDrive integration rather uneven: in some places, it worked as it should, transparently fetching and saving files as you worked with them, but in others it just produced error messages. As a result, Microsoft removed the feature in Windows 10.

(In contrast, Dropbox's new Project Infinite capability, which has recently become available to business users, does use a file system driver and so should offer much greater compatibility.)

Microsoft describes the GVFS driver as the "moral equivalent of the FUSE driver in Linux." If it truly is the moral equivalent of FUSE, it suggests that Windows will at last get the same kind of extensibility and scope for user mode file systems that Unix users have enjoyed for many years. It might even provide the basis for a better reimplementation of the OneDrive cloud storage feature that was taken away.

Microsoft isn't alone in facing scaling limits with existing version control systems; a few years ago, Facebook switched from a combination of Git and Subversion to Mercurial. Facebook felt that neither Git nor Subversion offered the scalability that it needed; it considered modifying Git but concluded that it would be easier to extend and improve Mercurial instead.

I'm impressed that Microsoft would standardize around Git if they didn't already have a solution ready to overcome the many-hours-long download it would take to clone a repository. This new Microsoft is very foreign to me.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Ah, brings back memories of when I first decided to try Git on a new project. At one point, a team member and I both made a couple of mistakes that caused a bunch of conflicts, and I spent hours trying to merge and fix everything.

As someone who uses Git every day, I hope that this patch is not accepted. This is not core to how people use Git.

Just truncate the history already.

I have never seen a Git repo that purposely rewrites the history, especially to that extent. If anything, that goes against how people use Git. If you're talking specifically about not including the old history when converting to Git, what do they do in 10 years when they run into this issue again? But anyway, the issue seems to be not just about the history, but also the massive number of files even at the current head alone.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

Dude, MS hasn't had a clean-sheet build of Windows since, well, ever. A project that started in the 1980s as a graphical shell for DOS, and then as a separate (but largely compatible) branch that was a quasi-clone of VMS, isn't exactly the kind of thing where modern software engineering design principles would have been likely to be found.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

Legacy reasons, as explained in the article. Their previous source control system required it, and fixing it would probably break everything.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

Dude, MS hasn't had a clean-sheet build of Windows since, well, ever. A project that started in the 1980s as a graphical shell for DOS, and then as a separate (but largely compatible) branch that was a quasi-clone of VMS, isn't exactly the kind of thing where modern software engineering design principles would have been likely to be found.

Which brings an interesting question. Are we ever going to see another modern full-featured OS written from scratch ever again? Or have the OSs become too complex to develop from scratch?

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

Dude, MS hasn't had a clean-sheet build of Windows since, well, ever. A project that started in the 1980s as a graphical shell for DOS, and then as a separate (but largely compatible) branch that was a quasi-clone of VMS, isn't exactly the kind of thing where modern software engineering design principles would have been likely to be found.

Which brings an interesting question. Are we ever going to see another modern full-featured OS written from scratch ever again? Or have the OSs become too complex to develop from scratch?

Sure, why don't you start writing the kernel while I set up a GitHub repo for it. /s

But seriously, I guess there hasn't been a need for one; Linux is customizable enough that it's just simpler to adapt it to whatever use you need.

As a non-developer, I wonder what it's like to browse the Windows source code. Legacy support has always been paramount in Windows, so I bet you can find code from Windows 95 hiding somewhere. And that's not to count MS's penchant for coding around hardware quirks to accommodate all kinds of exotic hardware.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

Dude, MS hasn't had a clean-sheet build of Windows since, well, ever. A project that started in the 1980s as a graphical shell for DOS, and then as a separate (but largely compatible) branch that was a quasi-clone of VMS, isn't exactly the kind of thing where modern software engineering design principles would have been likely to be found.

Which brings an interesting question. Are we ever going to see another modern full-featured OS written from scratch ever again? Or have the OSs become too complex to develop from scratch?

As someone who uses Git every day, I hope that this patch is not accepted. This is not core to how people use Git.

Just truncate the history already.

I have never seen a Git repo that purposely rewrites the history, especially to that extent. If anything, that goes against how people use Git. If you're talking specifically about not including the old history when converting to Git, what do they do in 10 years when they run into this issue again? But anyway, the issue seems to be not just about the history, but also the massive number of files even at the current head alone.

This is very much a common pattern. After all, DVCS means that history is copied all over the place. Do you really need to look at years-old commits, let alone 20- or 30-year-old commits?

It disturbs me that MS would seek to weigh down Git with practices that are far outside the mainstream.

I would gather that MS keeps old commit history because of backward compatibility. I've read about cases where application developers would find undocumented API calls, or use an API in an unintended manner, and MS would leave the functionality because it would break client apps. Without the history I'd imagine someone new would come along and rewrite a block of code not realizing they were breaking what now is expected functionality.

1. This seems like a step backwards for Git, and a step towards further centralization of it. I don't particularly like this development.
2. Why GVFS? That name is already in common use by the GNOME Virtual File System (albeit with different capitalization).
3. There is already a FUSE equivalent for Windows: Dokan.

It says a great deal about MS that they saw non-centralization as a bug that needs to be fixed. Seems rather authoritarian to me.

Oh, get your cryptoanarchist bull out of here. Microsoft had a unique use case which very few people shared, and incidentally also the developmental expertise to solve it. Now you get new tools, which you don't actually have to use if you don't need. The hell is your problem?

As a non-developer, I wonder what it's like to browse the Windows source code. Legacy support has always been paramount in Windows, so I bet you can find code from Windows 95 hiding somewhere. And that's not to count MS's penchant for coding around hardware quirks to accommodate all kinds of exotic hardware.

There are still exposed pieces that haven't changed from 3.1 and DOS that you can find if you know where to look, I'm sure there are a lot more in the internals.

1. This seems like a step backwards for Git, and a step towards further centralization of it. I don't particularly like this development.
2. Why GVFS? That name is already in common use by the GNOME Virtual File System (albeit with different capitalization).
3. There is already a FUSE equivalent for Windows: Dokan.

It says a great deal about MS that they saw non-centralization as a bug that needs to be fixed. Seems rather authoritarian to me.

Oh, get your cryptoanarchist bull out of here. Microsoft had a unique use case which very few people shared, and incidentally also the developmental expertise to solve it. Now you get new tools, which you don't actually have to use if you don't need. The hell is your problem?

While I don't necessarily agree with the "authoritarianism" point, I don't agree with this either. "You don't have to use it" is rarely a valid argument in software due to the network effect: other people's choices to use a certain technology can have ramifications for you as well, and in many cases force you into the same choice.

The reality is that to many people who already don't understand the decentralized model of Git, this is just "something to make Git faster", and it will become commonly used, and Git hosting services will start recommending it, and then things will start being built against it, and now everybody's stuck with it.

While I don't believe that there's ill intention from Microsoft on this one (and that's a rare view for me to hold), I do think they should be very careful in how they present this to the public, and include the appropriate warnings about it being a specialized solution that shouldn't be generally applied.

If this is indeed viewed by people who don't share your decentralisation ideals as something that merely makes Git faster, why stop them? Because Stallman decreed so?

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

That is mentioned in the article.

Windows is a single project, so it uses a single repository.

In that project are multiple sub-projects that are worked on as independent units.

Git by default copies the entire project whenever you choose to work on a single subsystem. One of the changes MSFT has made is to copy only the subsystem files currently needed, then copy other files on demand.

This allows Windows to be managed as a massive, fully integrated project while development groups can focus on just that part that they work on.

Given the long history, you can expect a lot of spaghetti embedded in the source. The lack of FUSE is just one example of why this happens.

1. This seems like a step backwards for Git, and a step towards further centralization of it. I don't particularly like this development.
2. Why GVFS? That name is already in common use by the GNOME Virtual File System (albeit with different capitalization).
3. There is already a FUSE equivalent for Windows: Dokan.

It says a great deal about MS that they saw non-centralization as a bug that needs to be fixed. Seems rather authoritarian to me.

Oh, get your cryptoanarchist bull out of here. Microsoft had a unique use case which very few people shared, and incidentally also the developmental expertise to solve it. Now you get new tools, which you don't actually have to use if you don't need. The hell is your problem?

While I don't necessarily agree with the "authoritarianism" point, I don't agree with this either. "You don't have to use it" is rarely a valid argument in software due to the network effect: other people's choices to use a certain technology can have ramifications for you as well, and in many cases force you into the same choice.

The reality is that to many people who already don't understand the decentralized model of Git, this is just "something to make Git faster", and it will become commonly used, and Git hosting services will start recommending it, and then things will start being built against it, and now everybody's stuck with it.

While I don't believe that there's ill intention from Microsoft on this one (and that's a rare view for me to hold), I do think they should be very careful in how they present this to the public, and include the appropriate warnings about it being a specialized solution that shouldn't be generally applied.

Learn to use the features you need. Encourage projects you work on to abandon Git features you dislike. This is simply the pressure you refer to, being applied by you.

The pressure is applied because someone chooses to apply the pressure. Nothing stops you from being that someone.

Each feature in Git is there because someone, somewhere found a need for it. This does not make them needed or commonly used. Learn to use the features you need, then take a little time to examine other parts so that you will at least be aware that a feature exists should you join a project that needs one of the exotic capabilities... Windows, for example...

Nice move, Microsoft. It shouldn't be surprising, as Microsoft has been using Git for public-facing projects for a while now. The source code for Visual Studio Code, ASP.NET MVC, .NET Core, and Entity Framework has all been in Git for years.

Likewise, Team Foundation Server, Visual Studio, and VS Online treat Git like a first-class citizen. Later versions of VS promote GitHub integration, and Microsoft has been migrating projects off a variety of platforms to GitHub. So the writing has been on the wall for a while now. Still, I am glad to see this embrace of Git seems to be enterprise-wide and covers internal-facing repositories as well.

Mmm, I think this is probably because of how Windows deals with lots of small files (hundreds of thousands of them), which, incidentally, Git uses a lot. I don't know if it is a problem of NTFS vs. ext4 or something deeper in the OS, but it is true that Windows has performance issues in these scenarios. Just compare the time it takes to clone the Linux kernel (https://github.com/torvalds/linux) on a Windows vs. a Linux machine. Imagine that with 300GB of source code files.

I'm curious why the entire Windows source code is in one monolithic repository. Is it not broken down into myriad subsystems, modules, etc.? There is no need for one giant repository; it should be many (many) smaller, more focused repositories. If it needs to be integrated for a build, have that happen centrally somewhere.

Quite odd.

They say they've tried submodules and other approaches and they don't work as well. Sometimes you need to make a change that spans modules, and you want that to be atomic, I guess.