Re: Transparently encrypt repository contents with GPG

Heya,

On Thu, Mar 12, 2009 at 22:19, Matthias Nothhaft <[hidden email]> wrote:
> What I need is a way to automatically modify each file
>
> a) before it is written in the repository
> b) after it is read from the repository

Have a look at smudging, you might not need to touch the git source
code at all ;).

Re: Transparently encrypt repository contents with GPG

Sverre Rabbelier venit, vidit, dixit 12.03.2009 22:34:

> Heya,
>
> On Thu, Mar 12, 2009 at 22:19, Matthias Nothhaft <[hidden email]> wrote:
>> What I need is a way to automatically modify each file
>>
>> a) before it is written in the repository
>> b) after it is read from the repository
>
> Have a look at smudging, you might not need to touch the git source
> code at all ;).
>

And people asked me not to be cryptic... even though the OP explicitly
asked for encryption, of course ;)

"git help attributes" may help: look for filter and set attributes and
config (filter.$name.{clean,smudge}) accordingly. smudge should probably
decrypt, clean should encrypt.

BTW: Why not use an encrypted file system? That way your work tree would
be encrypted also.

Wouldn't this trip over the randomness included in all encryption [to
avoid generating the same cyphertext for two separate identical
messages, which gives away some information], which would let git
think the file has been changed as soon as its stat info has changed
(or is just racy)?

Not to mention that this makes most source-oriented features such as
diff, blame, merge, etc., rather useless.

This gives you textual diffs even in log! You want to use gpg-agent here.

Now for Sverre's prophecy and the helper I haven't shown you yet: It
turns out that blobs are not smudged before they are fed to textconv!
[Also, it seems that the textconv config does allow parameters, but I
haven't checked thoroughly.]

This means that e.g. when diffing work tree with HEAD textconv is called
twice: once is with a smudged file (from the work tree) and once with a
cleaned file (from HEAD). That's why I needed a small helper script
"decrypt" which does nothing but

#!/bin/sh
# Decrypt "$1" with gpg; if gpg fails (e.g. the file is plaintext), print it unchanged.
gpg -d -q --batch --no-tty "$1" || cat "$1"

Yeah, this assumes gpg errors out because it's fed something unencrypted
(and not encrypted with the wrong key) etc. It's only proof of concept
quality.

Methinks it's not right that diff fails to call smudge here, is it?

This is not going to work very well in general. Smudging and cleaning
is about putting the canonical version of a file in the git repo, and
munging it for the working tree. Trying to go backwards is going to lead
to problems, including:

1. Git sometimes wants to look at content of special files inside
trees, like .gitignore. Now it can't.

2. Git uses timestamps and inodes to decide whether files need to be
looked at all to determine if they are different. So when you do
a checkout and "git diff", everything will look OK. But when it
does actually look at file contents, it compares canonical
versions. And your canonical versions are going to be _different_
every time you encrypt, even if the content is the same.

So you will probably end up with extra cruft in your commits if you
ever touch files.
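
The effect is easy to reproduce. What follows is a minimal sketch, not the
example from the original message: it assumes openssl is installed and uses
its password-based enc mode as a stand-in for gpg, with a made-up passphrase.
Each run picks a fresh random salt, so the same plaintext yields different
ciphertext, which is what git would then treat as a changed canonical version.

```shell
#!/bin/sh
# Encrypt the same plaintext twice with the same passphrase.
plain='same content'
pass='example-passphrase'   # hypothetical, for illustration only

# 2>/dev/null hides the key-derivation warning that newer OpenSSL prints.
first=$(printf '%s\n' "$plain" | openssl enc -aes-256-cbc -base64 -k "$pass" 2>/dev/null)
second=$(printf '%s\n' "$plain" | openssl enc -aes-256-cbc -base64 -k "$pass" 2>/dev/null)

if [ "$first" = "$second" ]; then
    echo "identical ciphertext"
else
    echo "ciphertext differs although the plaintext is identical"
fi
```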

> Now for Sverre's prophecy and the helper I haven't shown you yet: It
> turns out that blobs are not smudged before they are fed to textconv!
> [Also, it seems that the textconv config does allow parameters, but I
> haven't checked thoroughly.]

I don't think they should be smudged. Smudging is about converting for
the working tree, and the diff is operating on canonical formats. If
anything, I think the error is that we feed smudged data from the
working tree to textconv; we should always be handing it clean data (and
this goes for external diff, too, which I suspect behaves the same way).

I haven't looked, but it probably is a result of the optimization to
reuse worktree files.

-Peff

PS If it isn't obvious, I don't think this smudge/filter technique is
the right way to go about this. But one final comment if you did want to
pursue this: you are using asymmetric encryption in your GPG invocation,
which is going to be a lot slower and the result will take up more
space. Try using a symmetric cipher.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at http://vger.kernel.org/majordomo-info.html

The version controlled data, the contents, may not be suitable for
consumption in the work tree in its verbatim form. For example, a cross
platform project would want to consistently use LF line termination inside
a repository, but on a platform whose tools expect CRLF line endings, the
contents cannot be used verbatim. We "smudge" the contents running
unix2dos when checking things out on such platforms, and "clean" the
platform specific CRLF line endings by running dos2unix when checking
things in. By doing so, you can see what really got changed between
versions without getting distracted, and more importantly, "you" in this
sentence is not limited to the human end users alone.
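
As a rough illustration of such a pair (a sketch only: the function names are
made up, and awk and tr stand in for unix2dos and dos2unix), smudging and then
cleaning gives back exactly what the repository holds:

```shell
#!/bin/sh
# smudge: repository form (LF) -> work tree form (CRLF), like unix2dos.
smudge() { awk '{ printf "%s\r\n", $0 }'; }

# clean: work tree form (CRLF) -> repository form (LF), like dos2unix.
# Simplification: this drops every CR, not only those at line ends.
clean() { tr -d '\r'; }

repo_form=$(printf 'one\ntwo\n')
round_trip=$(printf 'one\ntwo\n' | smudge | clean)

if [ "$round_trip" = "$repo_form" ]; then
    echo "clean(smudge(x)) == x"
else
    echo "round trip changed the content"
fi
```

Because clean undoes smudge exactly, two similar work tree files stay similar
in the repository, which is the property an encrypting clean filter gives up.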

git internally runs diff and xdelta to see what was changed, so that:

* it can reduce storage requirement when it runs pack-objects;

* it can check what path in the preimage was similar to what other path
in the postimage, to deduce a rename;

* it can check what blocks of lines in the postimage came from what other
blocks of lines in the preimage, to pass blames across file boundaries.

If your "clean" encrypts and "smudge" decrypts, it means you are refusing
all the benefits git offers. You are making a pair of similar "smudged"
contents totally dissimilar in their "clean" counterparts. That is simply
backwards.

As the sole raison d'etre of diff.textconv is to allow potentially lossy
conversion (e.g. msword-to-text) applied to the preimage and postimage
pair of contents (that are supposed to be "clean") before giving a textual
diff to human consumption, the above config may appear to work, but if you
really want an encrypted repository, you should be using an encrypting
filesystem. That would give an added benefit that the work tree
associated with your repository would also be encrypted.

Re: Transparently encrypt repository contents with GPG

Junio C Hamano venit, vidit, dixit 13.03.2009 21:23:

> Michael J Gruber <[hidden email]> writes:
>
>> In .gitattributes (or .git/info/a..) use
>>
>> * filter=gpg diff=gpg
>>
>> In your config:
>>
>> [filter "gpg"]
>>         smudge = gpg -d -q --batch --no-tty
>>         clean = gpg -ea -q --batch --no-tty -r C920A124
>> [diff "gpg"]
>>         textconv = decrypt
>>
>> This gives you textual diffs even in log! You want to use gpg-agent here.
>
> Don't do this.
>
> Think why the smudge/clean pair exists.
>
> The version controlled data, the contents, may not be suitable for
> consumption in the work tree in its verbatim form. For example, a cross
> platform project would want to consistently use LF line termination inside
> a repository, but on a platform whose tools expect CRLF line endings, the
> contents cannot be used verbatim. We "smudge" the contents running
> unix2dos when checking things out on such platforms, and "clean" the
> platform specific CRLF line endings by running dos2unix when checking
> things in. By doing so, you can see what really got changed between
> versions without getting distracted, and more importantly, "you" in this
> sentence is not limited to the human end users alone.
>
> git internally runs diff and xdelta to see what was changed, so that:
>
> * it can reduce storage requirement when it runs pack-objects;
>
> * it can check what path in the preimage was similar to what other path
> in the postimage, to deduce a rename;
>
> * it can check what blocks of lines in the postimage came from what other
> blocks of lines in the preimage, to pass blames across file boundaries.
>
> If your "clean" encrypts and "smudge" decrypts, it means you are refusing
> all the benefits git offers. You are making a pair of similar "smudged"
> contents totally dissimilar in their "clean" counterparts. That is simply
> backwards.
>
> As the sole raison d'etre of diff.textconv is to allow potentially lossy
> conversion (e.g. msword-to-text) applied to the preimage and postimage
> pair of contents (that are supposed to be "clean") before giving a textual
> diff to human consumption, the above config may appear to work, but if you
> really want an encrypted repository, you should be using an encrypting
> filesystem. That would give an added benefit that the work tree
> associated with your repository would also be encrypted.

Exactly. This is why I suggested using cryptfs/luks in my first response
already.

But I don't know the OP's requirements, which is why I also told him how
to do what he wanted, even though it has the drawbacks you and Jeff (and
maybe I) mentioned. Maybe it's an attempt at hosting a semi-private repo
on a public (free) server?

Besides the non-text nature of encrypted content, the problem here is
that d(e(x))=x for all x but e(d(x)) differs from x most probably, and
hopefully randomly, unless you use the right version of debian's openssl
of course ;)

That being said:
git diff calls textconv filters with smudged as well as cleaned files
(when diffing work tree files to blobs), and this does not seem right. I
hope this is not happening with the internal diff, nor with crlf!

Since both the cleaned and the smudged version are supposed to be
"authoritative" (as opposed to the textconv'ed one) one may argue either
way what's the right approach. For internal use, comparing the cleaned
versions may make more sense; for displaying diffs, the checked-out
form, i.e. the smudged versions, makes more sense.

But that is another topic which would need to be substantiated with
tests. It's not completely unlikely I may come up with some, but don't
count on it...

Re: Transparently encrypt repository contents with GPG

> Since both the cleaned and the smudged version are supposed to be
> "authoritative" (as opposed to the textconv'ed one) one may argue either
> way what's the right approach.

Smudged one can never be authoritative. That is the whole point of smudge
filter and in general the whole convert_to_working_tree() infrastructure.
It changes depending on who you are (e.g. on what platform you are on).
So running comparison between two clean versions is the only sane thing to
do.

You could argue textconv should work on smudged contents or on clean
contents before smudging. As long as it is done consistently, I do not
care either way too deeply, as its output is not supposed to be used for
anything but human consumption. Two equally sane arrangements would be:

(1) Start from two clean contents (run convert_to_git() if contents were
obtained from the work tree), run textconv, run diff, and output the
result literally; or

(2) Start from two smudged contents (run convert_to_working_tree() for
contents taken from the repository), run textconv, run diff, and
run clean before sending the result to the output.

The former assumes a textconv filter that wants to work on clean
contents, the latter one that expects smudged input. I probably
would suggest going the former approach, as it is consistent with the
general principle in other parts of the system (the internal processing
happens on clean contents).

Both of the above assume that the output should come in clean form;
it is consistent with the way normal diff is generated for consumption by
git-apply. You can certainly argue that the final output should be in
smudged form when textconv is used, as it is purely for human consumption,
and is not even supposed to be fed to apply.

Re: Transparently encrypt repository contents with GPG

Junio C Hamano venit, vidit, dixit 14.03.2009 19:45:

> Michael J Gruber <[hidden email]> writes:
>
>> Since both the cleaned and the smudged version are supposed to be
>> "authoritative" (as opposed to the textconv'ed one) one may argue either
>> way what's the right approach.
>
> Smudged one can never be authoritative. That is the whole point of smudge
> filter and in general the whole convert_to_working_tree() infrastructure.
> It changes depending on who you are (e.g. on what platform you are on).
> So running comparison between two clean versions is the only sane thing to
> do.

Yes. I guess I'm being too much of a mathematician here: if clean is a
well-defined function, then clean(x) is well defined by specifying x. In
that sense x is equally authoritative.
Again, if smudge is the inverse of clean, i.e. smudge and clean are
bijective, then x differs from y iff clean(x) differs from clean(y).

> You could argue textconv should work on smudged contents or on clean
> contents before smudging. As long as it is done consistently, I do not
> care either way too deeply, as its output is not supposed to be used for
> anything but human consumption. Two equally sane arrangements would be:
>
> (1) Start from two clean contents (run convert_to_git() if contents were
> obtained from the work tree), run textconv, run diff, and output the
> result literally; or
>
> (2) Start from two smudged contents (run convert_to_working_tree() for
> contents taken from the repository), run textconv, run diff, and
> run clean before sending the result to the output.
>
> The former assumes a textconv filter that wants to work on clean
> contents, the latter one that expects smudged input. I probably
> would suggest going the former approach, as it is consistent with the
> general principle in other parts of the system (the internal processing
> happens on clean contents).
>
> Both of the above assume that the output should come in clean form;
> it is consistent with the way normal diff is generated for consumption by
> git-apply. You can certainly argue that the final output should be in
> smudged form when textconv is used, as it is purely for human consumption,
> and is not even supposed to be fed to apply.

Also, I don't expect clean to be necessarily meaningful when applied to
the result of textconv, and even less so to the output of diff.

Now, a simple test shows that git diff obviously does this when diffing
HEAD to worktree:

diff between HEAD and clean(worktree)

Which is the right thing. It just seems that textconv is not even
called "in the wrong place of the chain", but messes the diff up in this
way:

diff between textconv(HEAD) and textconv(worktree)

(I expected clean(textconv(worktree)) first, which would be wrong, too).
I.e., the clean filter is ignored completely in the presence of textconv.

OK, I'll stop bugging you until I have checked the existing tests and the
code...

Re: Transparently encrypt repository contents with GPG

On Mon, Mar 16, 2009 at 05:01:33PM +0100, Michael J Gruber wrote:

> Now, a simple test shows that git diff obviously does this when diffing
> HEAD to worktree:
>
> diff between HEAD and clean(worktree)
>
> Which is the right thing. It just seems that textconv is not even
> called "in the wrong place of the chain", but messes the diff up in this
> way:
>
> diff between textconv(HEAD) and textconv(worktree)
>
> (I expected clean(textconv(worktree)) first, which would be wrong, too).
> I.e., the clean filter is ignored completely in the presence of textconv.

Yeah, I think this should probably be textconv(clean(worktree)) to match
the regular HEAD/worktree diff (if it isn't already). Can you put
together a test that shows the breakage?
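
In the meantime, here is a toy model of the two orderings. It is only a sketch
and does not exercise git itself: clean is a stand-in CRLF-to-LF filter,
textconv is a hypothetical upper-casing converter, and the file names are
invented.

```shell
#!/bin/sh
clean() { tr -d '\r'; }          # stand-in clean filter: CRLF -> LF
textconv() { tr 'a-z' 'A-Z'; }   # hypothetical textconv: upper-case the text

tmp=$(mktemp -d)
printf 'one\ntwo\n'     > "$tmp/head"       # blob as stored in HEAD (clean)
printf 'one\r\ntwo\r\n' > "$tmp/worktree"   # same content, checked out (smudged)

textconv < "$tmp/head" > "$tmp/head.txt"
textconv < "$tmp/worktree" > "$tmp/raw.txt"
clean < "$tmp/worktree" | textconv > "$tmp/cleaned.txt"

# textconv(HEAD) vs textconv(worktree): the smudged bytes leak into the diff.
if cmp -s "$tmp/head.txt" "$tmp/raw.txt"; then raw=same; else raw=differs; fi

# textconv(HEAD) vs textconv(clean(worktree)): matches the regular diff.
if cmp -s "$tmp/head.txt" "$tmp/cleaned.txt"; then cleaned=same; else cleaned=differs; fi

echo "without clean: $raw, with clean: $cleaned"
rm -rf "$tmp"
```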

Re: Transparently encrypt repository contents with GPG

> As the sole raison d'etre of diff.textconv is to allow potentially lossy
> conversion (e.g. msword-to-text) applied to the preimage and postimage
> pair of contents (that are supposed to be "clean") before giving a textual
> diff to human consumption, the above config may appear to work, but if you
> really want an encrypted repository, you should be using an encrypting
> filesystem. That would give an added benefit that the work tree
> associated with your repository would also be encrypted.

I can think of one reason that having git do the encryption might be
beneficial: pushing to an untrusted source.

If you encrypted all blobs but kept trees and commits in plaintext, you
could retain (some of) the benefits of git's incremental push. The
downsides, though, are:

1. You are revealing the hashes of your blobs' plaintext. Which means
I can try brute-forcing your blobs by checking against a hash
function.

2. The remote can't actually look at the blobs. The most obvious
problem with this is that you can't send it thin packs, since it
can't actually resolve deltas.
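
To make the first point concrete, here is a sketch that assumes a SHA-1
repository and GNU sha1sum. A blob's name is the SHA-1 of the header
"blob <size>" plus a NUL byte, followed by the content, so anyone who can guess
the plaintext can confirm the guess from the hash alone. The candidate strings
and the helper name blob_id are invented for the example.

```shell
#!/bin/sh
# blob_id: read content on stdin, print the object name git gives that blob.
blob_id() {
    tmp=$(mktemp)
    cat > "$tmp"
    size=$(wc -c < "$tmp" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1sum | cut -d' ' -f1
    rm -f "$tmp"
}

# The attacker sees only this hash, e.g. in a plaintext tree object.
leaked=$(printf 'password=tiger\n' | blob_id)

found=
for guess in admin letmein tiger secret; do
    if [ "$(printf 'password=%s\n' "$guess" | blob_id)" = "$leaked" ]; then
        found=$guess
        break
    fi
done
echo "recovered guess: ${found:-none}"
```

For a SHA-1 repository, "git hash-object --stdin" computes the same value as
blob_id, so the helper is only there to show what goes into the hash.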

And given the ensuing mess that it would make of the code to
conditionally say "Oh, we have this object, but you're not allowed to
read it", it is almost certainly not worth it.

But maybe somebody can prove me wrong and design a system that allows
efficient encrypted pushing to a non-trusted remote and also doesn't
suck.

Re: Transparently encrypt repository contents with GPG

I would like to have a repository that transparently encrypts and
decrypts all files using GPG.

What I need is a way to automatically modify each file

a) before it is written in the repository
b) after it is read from the repository

Is there a way to get this to work somehow? Can someone give me some
hints on where I need to begin?

regards,
Matthias

I came across this thread during my own search for an encrypted git repo. Matthias, it looks as if somebody has come up with a "working" system that uses the 'smudge & clean' filter features of git.
It seems to me a workable solution for storing a repo with some private content in its files on a non-trusted or possibly public git host.

Re: Transparently encrypt repository contents with GPG

Hi,
On your first question: So, does it work or not, or partially?
And if partially, what does not work?

As Junio C Hamano indicated in his message,
> git internally runs diff and xdelta to see what was changed, so that:
> * it can reduce storage requirement when it runs pack-objects;
> * it can check what path in the preimage was similar to what other path
>   in the postimage, to deduce a rename;
> * it can check what blocks of lines in the postimage came from what other
>   blocks of lines in the preimage, to pass blames across file boundaries.
you will lose the benefits offered through these git features. I have
not tested it, but I believe what Junio said is true. "Git encryption
via smudge/clean filters" is a hack to the existing git system,
meaning it is not "by design" of git. The design goal of
"transparent git encryption" is to provide confidentiality of git data
outsourced to an external server (or "the Cloud"). This is achieved by
having you manage your own passwords and keys. The integrity of
git data is partially protected by the git system itself through
chained hashing. If the features Junio mentioned aren't important to
you, then the method works. As also mentioned in Junio's message,
using an encrypted filesystem (with tools such as "truecrypt") is an
alternative way of achieving outsourced data confidentiality.

On your second question: As the hashes are identical from one run to
another, I don't understand why we should stick to the ECB cypher.

You certainly do not have to stick to the ECB mode as long as your
encryption method is deterministic. In the example you have shown

$ openssl enc -base64 -aes-256-cbc -S 1762851 -k a5G4juy64VVBgfq4 <Wiley.pdf

you are explicitly providing a fixed-valued "salt" (the -S option) so
that it, together with the password, is used to *deterministically*
derive an IV and an encryption key for AES CBC
encryption. Note that using ECB mode is generally regarded as a bad
crypto practice; so is using a fixed-valued salt for CBC. (The latter
may be slightly better than the former, depending on what you
believe.) If we can manage to find a way of changing the salt value
based on the file name, I think it will be a better way. In fact, I
thought about the same thing some time ago, but have not found time to
look deeper into it. I may update my document in the near future once
I find out more.
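
A rough sketch of the per-file salt idea, not something taken from the
document mentioned above: derive the 16-hex-digit salt from a hash of the
file's path, so each path gets its own salt while repeated encryption of the
same file stays deterministic. The helper names, the paths, and the passphrase
are all made up, and it assumes openssl and sha256sum are available.

```shell
#!/bin/sh
# salt_for PATH: first 8 bytes (16 hex digits) of SHA-256 over the path.
salt_for() {
    printf '%s' "$1" | sha256sum | cut -c1-16
}

# encrypt_as PATH PASSPHRASE: encrypt stdin with a salt tied to PATH.
# 2>/dev/null hides the key-derivation warning that newer OpenSSL prints.
encrypt_as() {
    openssl enc -base64 -aes-256-cbc -S "$(salt_for "$1")" -k "$2" 2>/dev/null
}

pass='example-passphrase'   # hypothetical, for illustration only

a1=$(printf 'same content\n' | encrypt_as docs/a.txt "$pass")
a2=$(printf 'same content\n' | encrypt_as docs/a.txt "$pass")
b1=$(printf 'same content\n' | encrypt_as docs/b.txt "$pass")

[ "$a1" = "$a2" ] && echo "same path: ciphertext is stable across runs"
[ "$a1" != "$b1" ] && echo "different path: ciphertext differs"
```

The output for a given path is still deterministic, so this only avoids
sharing one salt across every file; it does not make the scheme equivalent to
encryption with a random salt.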

If you have high-value, high-impact data to protect on an external
server, do not use this method, and use an encrypted filesystem.

Re: Transparently encrypt repository contents with GPG

Thanks for your clarifications! Starred items 2 and 3 are still not clear to me, probably because I am new to git.

Do you think that a solution, assuming it respects both git and strong cryptography, would be successful? My analysis is that small enterprises that do not have many servers or premises may need git hosting. Even big companies with their own networks might, if they want more security.

TrueCrypt or an encrypted file system on the host is not feasible off the shelf. One has to set up one's own dedicated server at the host.

For my part, I am afraid to push my projects in the clear to a host. But possibly I am too paranoid. Do you have an idea of the risk?