I am wondering whether there is any system that subtly alters source trees checked out from version control in ways that are hard to discover (e.g. whitespace at the ends of lines, perhaps even changing some minor variable names) to "watermark" the code, so that it would be possible to find out who pulled it from the repository. Do you think such a system would deter developers from uploading the code online?

Edit:
To address the "that's a trust issue" argument, I'd like to point out that this is a hypothetical question. I was thinking of it as a way to prevent a developer from leaking the source code at a huge company, where lots of developers have access to the whole tree.


3

Are you trying to prevent your developers from leaking your code to the outside world? That's a trust issue... If you can't trust them with your code, why are they in your network?
–
John May 6 '13 at 19:53

5

You are trying to solve a personnel problem with technology. Moreover, you rely on them not being able to run the code through a simple formatter. If you can't find the leak you can always try decimation.
–
Deer Hunter May 6 '13 at 19:54

@d33tah Duude! Are you even reading the comments? All companies think about that, you didn't bring a novel idea. This has been extensively discussed here, here and here.
–
Adnan - Adi May 6 '13 at 20:13

1

d33tah, if there's stuff sensitive enough to consider watermarking, it is usually compartmentalized and access narrowed to the fewest number possible.
–
Deer Hunter May 6 '13 at 20:16

2 Answers
2

Before this gets out of hand with the edits and the comments, I wanted to say this.

This is not a hypothetical situation. Almost all software companies face this problem, and it all boils down to a trust issue. This topic has been extensively discussed on other StackExchange sites, and most of the answers can be summarized as:

1- Hire people you trust.

2- Make them sign an NDA.

3- Give them access only to the code they need.

I'll just expand on the last point to address the situation you're wondering about. In the case of Windows, I'm 99% sure that not all Microsoft employees have access to the Windows source code. Not even the employees who work on Windows have access to all of the Windows code. Each employee should have access only to the parts of the code he's responsible for maintaining.

Developers on Project A don't need access to Project B's code, and so on.

Ignoring for a moment the wisdom of manipulating the source code, ask yourself how this watermark should work. You need to have different files for every developer. This means some automated code transformation that doesn't change the meaning of the code, yet isn't reversible. Oh, and it shouldn't make the code difficult to maintain. At the end of the day, it's supposed to be valuable code, right?

Add whitespace? That can be trivially normalized. Change variable names? You mean obfuscate the source code in your own source tree?
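To make the whitespace point concrete, here is a minimal sketch in Python. The scheme is purely an assumption for illustration (one bit of a developer ID per line, where a trailing space means 1), and the names `embed_watermark`, `extract_watermark` and `normalize` are hypothetical, not taken from any real tool. It shows that such a mark is easy to embed and read back, and that a one-line formatter pass erases it.

```python
def embed_watermark(source: str, developer_id: int, bits: int = 8) -> str:
    """Hide developer_id in trailing whitespace (hypothetical scheme).

    Line i of the first `bits` lines gets a trailing space if bit i of
    developer_id is set. Any existing trailing whitespace is stripped first.
    """
    lines = source.split("\n")
    if len(lines) < bits:
        raise ValueError("source has too few lines to carry the watermark")
    if not 0 <= developer_id < (1 << bits):
        raise ValueError("developer_id does not fit in the requested bits")
    marked = []
    for i, line in enumerate(lines):
        line = line.rstrip()
        if i < bits and (developer_id >> i) & 1:
            line += " "
        marked.append(line)
    return "\n".join(marked)


def extract_watermark(source: str, bits: int = 8) -> int:
    """Read the ID back: a trailing space on line i means bit i is set."""
    value = 0
    for i, line in enumerate(source.split("\n")[:bits]):
        if line.endswith(" "):
            value |= 1 << i
    return value


def normalize(source: str) -> str:
    """What any formatter or editor setting does: strip trailing whitespace."""
    return "\n".join(line.rstrip() for line in source.split("\n"))
```

After `normalize`, two developers' differently marked checkouts become byte-identical, so a leaker only needs to run the code through a formatter before uploading it.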

And if the watermark ever fails: “Hey, you broke the build!” “On your checkout, maybe. Works for me.”

Next, think about it: you need to have different files for every developer. Collaborate on hunting down a bug? “I tracked the problem in the debugger, x isn't initialized.” “What x? I only have y and z in my copy.”

Even if you somehow found a way to watermark the source code, this can only be useful if developers have no alternate way of obtaining the source code. So having the build server produce a source archive is right out. That's not a show-stopper, but you are walking a thin line.

Now let's think for a minute about the wisdom of this approach. You're telling the developers that you trust them to write code (and not to do a shoddy job, or plant backdoors), but you don't trust them not to leak the code. First, this position looks pretty inconsistent, so it's not going to go down well with your typical developer. Second, you're making their job more difficult (see above) for no tangible benefit. Again, that's not going to be popular. The predictable end result is that your developers will take one look at your coding methods, and go and contribute their ideas to the competition, and leave you with your shoddy code.

Even if by some miracle you were able to implement a watermark system, how useful do you think it could be? If Alice's version of the source code is leaked, how do you know Alice leaked it? If Eve wants to leak the code, assuming she cannot get rid of her watermark or change it to Alice's, she can disguise her tracks by leaking Alice's code. She might access Alice's machine surreptitiously — so you need to encrypt all drives, train your developers against evil maid attacks, have stringent policies of not sharing editor macros or uncommitted patches… While some of these are good high-security measures, to block all attack paths, you also have to severely limit team efficiency.

If you have a significant body of code, what you can do is limit the access to the source code, so that each team only sees the modules it's working on. For large enough code bases, this is in fact good hygiene as it maintains independence between components. If restricting access to parts of the source code isn't working for you, you don't have nearly enough code or you don't need nearly enough confidentiality to even consider this watermarking scheme.