Why this page?

I wanted to understand what GIT was all about, but all I could find
was what-it-does level documentation.

So I've sketched a description based on what I know about SCMs
already. If you think it's useful and the inevitable errors are
fixable, please go on and add to it.

Repositories and "objects"

Like CVS and more modern SCMs, GIT provides a repository in which
you can lodge a 'project' (a hierarchical directory structure
and its file data), update it incrementally, and subsequently
extract any past version of the whole project, of subsets, or of
individual files.

The repository, of course, is itself a hierarchically-structured set
of Linux files.

GIT calls the stored images of project files and directories
"objects", which I find... well, objectionable: The excessively
abstract word "object" is already widely used as in "object file" and
"object-orientated". The latter meaning of 'object' is loose, vague
and poorly understood too - another good reason to avoid it. Sigh.

I'm going to refer to these things as 'GIT files'; for many purposes
GIT looks like a filesystem in its own right. It happens to be a
filesystem which automagically stores old versions, and which
internally uses hash-indexed data. Filesystems which do this already
exist.

GIT's not a filesystem in the full Linux sense, because you don't
access it strictly through open/read/write/unlink etc; its interface
looks more like CVS (etc).

The repository is a tree of files on a Linux filesystem: but you are
not entitled to believe that GIT files are one-to-one with files
in the repository. On the other hand, experience shows that a
repository system is a lot more reliable if you have unchanged GIT
files represented by unchanged Linux files...
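Concretely, a freshly written ("loose") GIT file does live at a path
derived from its hash. A sketch of that mapping, assuming the
standard layout where the first two hex digits of the hash name a
subdirectory under .git/objects:

```python
# Sketch: how a hash-identified GIT file maps to a Linux file in the
# repository, assuming the standard loose-object layout (first two
# hex digits name a subdirectory, the rest name the file).

def object_path(sha1_hex):
    """Map a 40-character hex hash to its loose-object path."""
    return ".git/objects/{}/{}".format(sha1_hex[:2], sha1_hex[2:])

print(object_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
# .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

(The caveat above still applies: packed and cross-compressed GIT
files do not map one-to-one like this.)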

GIT is distributed

Unlike CVS but like some modern SCMs, GIT is "distributed": that is,
everyday developer interactions terminate at a local copy of the
repository, and the system works well even without a full-time,
low-latency or high-bandwidth connection to peer repositories.

That poses an interesting problem: you want to be able to
synchronise a pair of repositories without user intervention, and be
confident that a set of peer copies which synchronise with each
other will evolve (fairly rapidly) towards being identical so long
as the graph of pairings is connected.

That requires that a single pair-synchronisation reliably ends up in a
common state which captures all changes from both ends. That sets
limits on the kinds of repository evolution which are permitted. You
cannot make unsynchronisable changes to the repository: the easiest
solution is that the repository only ever grows, and no data is ever
discarded.
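A toy model makes the convergence argument concrete. If each
repository is just a grow-only set of objects, then pairwise
synchronisation is set union, and repeated syncs over any connected
graph of pairings drive all peers to the same state. (This is a
sketch of the reasoning, not GIT's actual protocol.)

```python
# Toy model: repositories as grow-only sets of objects. A pairwise
# sync is set union, so it always ends in a common state capturing
# both ends' changes; repeated syncs over a connected graph of
# pairings make all peers identical.

def sync(repo_a, repo_b):
    """Synchronise two peers: both end up with the union."""
    merged = repo_a | repo_b
    return merged, merged

a = {"obj1", "obj2"}
b = {"obj2", "obj3"}
c = {"obj4"}

# Pairings a<->b and b<->c connect all three peers.
a, b = sync(a, b)
b, c = sync(b, c)
a, b = sync(a, b)   # a second round propagates c's data back to a

assert a == b == c == {"obj1", "obj2", "obj3", "obj4"}
```

Because union is commutative, associative and idempotent, the order
of pairwise syncs doesn't matter; that is exactly why "never discard
data" makes unattended synchronisation safe.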

All merge tools (even helpfully automated ones) operate under user
control and work locally.

GIT relies on hashes

Like a few other proposed systems, GIT relies on hashes to uniquely
identify GIT files based on their data. A 160-bit SHA1 hash is big
enough that the chance of two different files in a repository having
the same hash is vanishingly remote. (Some systems use SHA1 hashes
truncated to 128 bits; GIT uses the full 160 bits.)
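Concretely, the name of a GIT file is the SHA1 of a short header
(giving the type and length) followed by the raw data. A sketch of
that naming scheme for a plain file ("blob") - I believe this matches
what 'git hash-object' computes:

```python
# Sketch of how GIT derives an object name from file data: SHA1 over
# a "blob <length>\0" header followed by the raw bytes.

import hashlib

def blob_hash(data):
    """SHA1 name of a blob holding 'data' (bytes)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(blob_hash(b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 -- the well-known name
# of the empty blob
```

Note that the name depends only on the data, not on the file's path
or history: identical file contents always get the same GIT file.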

You need to be careful: remember that old thing about "how big does
a party have to be before you have a 50% chance that two people have
the same birthday?" The answer is 23 or so, which surprises people
who haven't heard it before... It turns out the group size where
you get a 50% chance is a bit bigger than the square root of the
number of possible birthdays (23 is indeed a bit bigger than the
square root of 365).

So you have a 50% chance of a false "alias" between SHA-identified
files when you have somewhere around 2^80 files in the
archive: so far so good, we really don't want a repository as big as
that. The chance of an alias in the archive varies as the square of
the number of files in the archive, too: so a very large archive of
100M files (that's around 2^27) has about a one in
2^((80-27)*2) or one in 2^106 chance of an
alias. I really did mean 'vanishingly remote'...
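The arithmetic above can be checked directly with the standard
birthday approximation, P(collision) ~ n^2 / (2N) for n items drawn
from N possible values:

```python
# Check the birthday arithmetic with the standard approximation
# P(collision) ~= n^2 / (2 * N), for n files and N = 2^160 possible
# hash values.

import math

def collision_probability(n_files, hash_bits=160):
    return n_files**2 / (2 * 2**hash_bits)

# The 50% point sits near sqrt(N): about 2^80 files for 160 bits.
print(collision_probability(2**80))                       # 0.5

# 100M files (about 2^27): one chance in roughly 2^108, the same
# ballpark as the "one in 2^106" rough estimate above.
print(math.log2(1 / collision_probability(100_000_000)))
```

(The small gap between 2^106 and 2^108 is just rounding: 100M is
nearer 2^26.6 than 2^27, and the approximation carries a factor of
one half.)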

It is not clear to me whether GIT's repository integrity depends on
this very unlikely event never happening, or whether it would detect
it and refuse a commit.

Indexes, trees and commits - a user view of a project

If you're prepared to keep files 'forever', version management with
hash-identified files is just a matter of maintaining appropriate
index information to locate the right set of GIT files: and GIT
defines special GIT files called 'trees' and 'commits', for that
purpose. A "tree" records the directory structure of the project,
while a "commit" snapshots the version seen by a user who's just
committed some changes. See #More on trees and commits below.

The "commit" GIT file is analogous to a thing called a "view" in
other systems.
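A toy model shows why hash-identified files plus a little index
information is enough. The structures below are hypothetical (not
GIT's real on-disk formats): a content-addressed store, "tree"
entries mapping names to hashes, and "commit" entries pointing at a
tree and a parent commit.

```python
# Toy model of hash-indexed version management: every object is
# stored under its content hash; trees map filenames to hashes;
# commits point at a tree and a parent commit. Hypothetical
# structures, not GIT's real formats.

import hashlib
import json

store = {}

def put(obj):
    """Store a JSON-serialisable object under its content hash."""
    data = json.dumps(obj, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = obj
    return key

# Two versions of a one-file project.
v1_file = put("hello\n")
v1_commit = put({"tree": put({"README": v1_file}), "parent": None})

v2_file = put("hello, world\n")
v2_commit = put({"tree": put({"README": v2_file}), "parent": v1_commit})

# Any past version is reachable by following hashes from a commit.
old_tree = store[store[store[v2_commit]["parent"]]["tree"]]
assert store[old_tree["README"]] == "hello\n"
```

Nothing is ever overwritten: committing the second version adds new
objects but leaves every object of the first version in place, which
is exactly the keep-files-forever property discussed above.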

Storing data

Since GIT is completely in charge of its own data, it can (and does)
compress data behind your back (currently with zlib). Further, it can
(and already attempts to) "cross-compress" related files - in
particular, you can store one GIT file's data in the form of a patch
to apply to some other hash-identified GIT file.

That use of patch is wholly private to GIT, and has no logical
connection with a user's experience of patch/merge as a way of
incorporating others' changes or porting your changes to a different
root version.

But when a user commits changes, the system usually knows the
differences between the new and previous versions; that makes
cross-compression practicable, because it identifies a pair of GIT
files which can be cross-compressed.
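A sketch of the idea, using Python's difflib as a stand-in for GIT's
real delta encoding: store the target version as copy/insert
operations against a base version, and reconstruct it on demand.

```python
# Sketch of cross-compression: one version is stored as a delta
# (copy/insert operations) against a related base version. difflib
# stands in for GIT's actual delta format.

import difflib

def make_delta(base, target):
    """Encode 'target' as copy/insert operations against 'base'."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=target)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse base[i1:i2]
        else:
            ops.append(("insert", target[j1:j2]))
    return ops

def apply_delta(base, ops):
    """Rebuild the target from the base plus the delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

base = "line one\nline two\nline three\n"
target = "line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
assert apply_delta(base, delta) == target
```

For similar versions the delta is mostly small "copy" records, which
is why storing a patch beats storing the whole file; and since both
ends are hash-identified GIT files, the reconstruction is easy to
verify.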