Git - Revision Control Perfected

Commits

A commit is meant to record a set of changes introduced to a project.
What it really does is associate a tree object—representing a complete
snapshot of a directory structure at a moment in time—with contextual
information about it, such as who made the change and when, a description,
and its parent commit(s).

A commit doesn't actually store a list of changes (a "diff") directly,
but it doesn't need to. What changed can be calculated on demand by
comparing the current commit's tree to that of its parent. Comparing
two trees is a lightweight operation, so there is no need to store this
information. Because there actually is nothing special about the parent
commit other than chronology, one commit can be compared to any other
just as easily regardless of how many commits are in between.
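
To make this concrete, here is a minimal sketch of comparing snapshots on demand. It assumes a throwaway repository created with mktemp; the identity, the file name notes.txt and the commit messages are placeholders I made up for illustration.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits, each appending one line to the same file.
echo "one" > notes.txt
git add notes.txt
git commit -q -m "first"
first=$(git rev-parse HEAD)
echo "two" >> notes.txt
git commit -q -a -m "second"
echo "three" >> notes.txt
git commit -q -a -m "third"

# Compare the latest commit with its parent...
git diff HEAD~1 HEAD
# ...or with a commit two steps back; either way Git compares two trees.
git diff "$first" HEAD

# List only the paths that differ between the two snapshots.
changed=$(git diff --name-only "$first" HEAD)
echo "$changed"
```

The second git diff skips over the middle commit entirely, which is the point: no stored diff is needed for either comparison.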

All commits should have a parent except the first one. Commits usually
have a single parent, but they will have more if they are the result of a
merge (I explain branching and merging later in this article). A commit
from a merge still is just a snapshot in time like any other, but its
history has more than one lineage.

By following the chain of parent references backward from the current
commit, the entire history of a project can be reconstructed and browsed
all the way back to the first commit.
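
As a rough illustration of walking that chain, the sketch below builds three commits in a throwaway repository and then follows the parent references back to the root. The repository location, identity, file name and messages are all assumptions for the example.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Build a short linear history of three commits.
for n in 1 2 3; do
    echo "line $n" >> notes.txt
    git add notes.txt
    git commit -q -m "commit $n"
done

# Walk the parent chain from HEAD, newest commit first.
git log --oneline

# Count every commit reachable from HEAD.
count=$(git rev-list --count HEAD)

# Find the commit with no parent, that is, the first commit.
root=$(git rev-list --max-parents=0 HEAD)
git log -1 --format=%s "$root"
```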

A commit is expanded recursively into a project history in exactly the
same manner as a tree is expanded into a directory structure. More
important, just as the SHA1 of a tree is a fingerprint of all the
data in all the trees and blobs below it, the SHA1 of a commit is a
fingerprint of all the data in its tree, as well as all of the data in
all the commits that preceded it.

This happens automatically because references are part of an object's
overall content. The SHA1 of each object is computed, in part, from the
SHA1s of any objects it references, which in turn were computed from the
SHA1s they referenced and so on.
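
You can see this for yourself by looking inside a commit object. The following is a small sketch under the usual assumptions (a throwaway repository, a placeholder identity and a made-up file name): it prints the raw commit, which names its tree and parent by hash, and then recomputes the commit's own name from that content.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "first"
echo "two" >> notes.txt
git commit -q -a -m "second"

# The commit's content: tree, parent, author and committer lines, then the message.
git cat-file -p HEAD

# The parent line holds the name of the previous commit.
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')

# Hashing that same content as a commit object reproduces the commit's name.
recomputed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)
actual=$(git rev-parse HEAD)
echo "$recomputed"
echo "$actual"
```

Because the parent's name is part of the hashed content, changing anything in an earlier commit would change the name of every commit after it.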

Tags

A tag is just a named reference to an object—usually a commit. Tags
typically are used to associate a particular version number with a
commit. The 40-character SHA1 names are many things, but human-friendly
isn't one of them. Tags solve this problem by letting you give an object
an additional name.

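There are two types of tags: object tags and lightweight tags. Object
tags (which Git's own documentation calls annotated tags) are stored as
objects in the repository, with their own tagger, date and message.
Lightweight tags are not objects in the repository, but instead are
simple refs like branches, except that they don't change. (I explain
branches in more detail in the Branching and Merging section below.)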
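
Here is a short sketch of creating one tag of each kind and checking what each name resolves to. The tag names v0.1 and v1.0, the message and the throwaway repository are assumptions for illustration; the -a option is how git tag creates a tag object.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "one" > notes.txt
git add notes.txt
git commit -q -m "first"

# Lightweight tag: just a ref that points straight at the commit.
git tag v0.1

# Object tag: a tag object is written to the repository, and the ref points at it.
git tag -a v1.0 -m "Release 1.0"

# Ask Git what type of object each name refers to.
light=$(git cat-file -t v0.1)
object=$(git cat-file -t v1.0)
echo "v0.1 is a $light, v1.0 is a $object"
```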

Setting Up Git

If you don't already have Git on your system, install it with your
package manager. Because Git is primarily a simple command-line tool,
installing it is quick and easy under any modern distro.

You'll want to set the name and e-mail address that will be recorded in
new commits:

git config --global user.name "Your Name"
git config --global user.email you@example.com

This just sets these parameters in the config file ~/.gitconfig. The
config has a simple syntax and could be edited by hand just as easily.
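
To show what editing by hand might look like, here is a minimal sketch that writes the file directly and then reads the values back through Git. It points HOME at a temporary directory so your real configuration is left alone; that redirection and the name and address are assumptions for the example.

```shell
# Use a throwaway HOME so the real ~/.gitconfig is untouched (an assumption
# made for this sketch), and clear variables that could redirect the lookup.
export HOME=$(mktemp -d)
unset XDG_CONFIG_HOME GIT_CONFIG_GLOBAL

# Write the config by hand: a [section] header followed by key = value lines.
cat > "$HOME/.gitconfig" <<'EOF'
[user]
    name = Example User
    email = user@example.com
EOF

# Git reads the hand-written values like any others.
name=$(git config --global --get user.name)
email=$(git config --global --get user.email)
echo "$name <$email>"
```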

User Interface

Git's interface consists of the "working copy" (the files you directly
interact with when working on the project), a local repository stored in
a hidden .git subdirectory at the root of the working copy, and commands
to move data back and forth between them, or to and from remote repositories.

The advantages of this design are many, but right away you'll notice that
there aren't pesky version control files scattered throughout the working
copy, and that you can work off-line without any loss of features. In fact,
Git doesn't have any concept of a central authority, so you always
are "working off-line" unless you specifically ask Git to exchange commits
with your peers.

The repository is made up of files that are manipulated by invoking
the git command from within the working copy. There is no special server
process or extra overhead, and you can have as many repositories on your
system as you like.

You can turn any directory into a working copy/repository just by running
this command from within it:

git init

Next, add all the files within the working copy to be tracked and
commit them:

git add .
git commit -m "My first commit"

You can commit additional changes as frequently or infrequently as you
like by calling git add followed by git commit after each modification
you want to record.

If you're new to Git, you may be wondering why you need to call git add
each time. It has to do with the process of "staging" a set of changes
before committing them, and it's one of the most common sources of
confusion. When you call git add on one or more files, they are added
to the Index. The files in the Index—not the working copy—are what
get committed when you call git commit.

Think of the Index as what will become the next commit. It simply provides
an extra layer of granularity and control in the commit process. It
allows you to commit some of the differences in your working copy,
but not others, which is useful in many situations.
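
For instance, here is a sketch of committing one modified file while leaving another for later. The file names a.txt and b.txt, the messages and the throwaway repository are assumptions for illustration.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "a" > a.txt
echo "b" > b.txt
git add .
git commit -q -m "base"

# Modify both files in the working copy.
echo "change" >> a.txt
echo "change" >> b.txt

# Stage only a.txt, then commit what is in the Index.
git add a.txt
git commit -q -m "Update a.txt only"

# The new commit contains the change to a.txt...
committed=$(git diff --name-only HEAD~1 HEAD)
# ...while the change to b.txt is still pending in the working copy.
pending=$(git diff --name-only)
echo "committed: $committed, pending: $pending"
```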

You don't have to take advantage of the Index if you don't want to, and
you're not doing anything "wrong" if you don't. If you want to pretend
it doesn't exist, just remember to call git add . from the root of
the working copy (which will update the Index to match) each time,
immediately before git commit. You also can use the -a option with
git commit to stage changes automatically; however, it picks up only
changes to files Git already tracks, not new files. Running git add .
always will add everything that isn't ignored.
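
The difference is easy to check. In this sketch (throwaway repository, placeholder identity and made-up file names, all assumptions), a commit made with -a records the change to a tracked file but leaves a brand-new file behind.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "a" > tracked.txt
git add .
git commit -q -m "base"

# Change a tracked file and create a new, untracked one.
echo "more" >> tracked.txt
echo "new" > untracked.txt

# -a stages changes to tracked files only, then commits.
git commit -q -a -m "Commit with -a"

# The new file was not included; status still lists it as untracked (??).
left=$(git status --porcelain)
echo "$left"
```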

The exact work flow and specific style of commands largely are left up
to you as long as you follow the basic rules.

The git status command shows you all the differences between your
working copy and the Index, and between the Index and the most recent
commit (the current HEAD):

git status

This lets you see pending changes easily at any given time, and it even
reminds you of relevant commands like git add to stage pending changes
into the Index, or git reset HEAD <file> to remove (unstage) changes
that were added previously.
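
Here is a small sketch of staging a change, inspecting it and then unstaging it again. It uses the --porcelain form of git status because its two-column output is compact: the first column describes the Index and the second the working copy. The repository, identity and file name are assumptions for the example.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "base"

# Modify the file and stage the change.
echo "change" >> file.txt
git add file.txt
staged=$(git status --porcelain)     # "M" in the first (Index) column

# Unstage it; the modification stays in the working copy.
git reset -q HEAD file.txt
unstaged=$(git status --porcelain)   # "M" in the second (working copy) column

echo "after add:   [$staged]"
echo "after reset: [$unstaged]"
```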

The limitation that I immediately ran into when I considered migrating to git is that I can't check out just a (rather randomly selected) subset of files on a small/portable computing device.

Say I have a big repository of files and I only need a very small subset of them while on the go, both to refer to and to edit.

It was originally a small netbook computer where I could check out a few directories from a big repository and be able to edit files on the netbook computer while on the bus.

Netbooks may have grown larger with regard to disk storage, but now I want to do the same on an Android phone.

git's sparse checkout feature still pulls the entire repository to the device. It only checks out a subset of files to give the appearance of a sparse checkout, but it doesn't resolve the storage issue.

I don't think git submodules help, as, I think, one can't easily move selected files across repositories with all history intact (i.e., every now and then, add some additional directories to the list available to small devices by moving them to a submodule, when it becomes necessary), as one can easily do with CVS.

The only solution that I can think of is to remotely mount the .git/objects/ directory and deal with its limitations.

Is there any creative brain power out there that could find a solution to lift this limitation?

Could splitting a git repository be implemented by splitting one of git's Tree objects into 2 (sub-) Tree objects on a personal workstation (perhaps with new Commit objects to keep track of the split), allowing a smaller tree to be checked out to a small device?

Remote changes (done by others) could then be merged to the personal workstation (as staging) before being merged to the split Tree branches for the small devices, if necessary.

Changes on the small devices can be merged to the personal workstation (as staging), before being pulled by others?

Would that solve the disk space problem by limiting checkout to a small (sub-) Tree?

If this idea works, would some able developer turn it into an implementation?

First of all I want to thank the author for this clear and concise article.

However, I want to point out an inaccuracy in the paragraph on SHA1. The author states that SHA1 guarantees that the data in the blobs is different, and that the chance that two pieces of data have the same SHA1 is infinitesimally small. I disagree on this point.

The 40-character string that SHA1 outputs gives us 16^40 = 2^160, or about 1.46 × 10^48, different checksums. Although this is big enough to assume the 'guarantee' described above, the claim about the infinitesimal chance is just wrong.

Consider for example 2^160 + 1 pairwise distinct files (this is data, be it hypothetical). The chance that there will be two different pieces of data in this set having the same checksum is 1. And 1 is very very different from infinitesimal.

I agree that it is highly unlikely that two such files will occur in practice, let alone in one project. (For example, each person on earth would have to create on the order of 10^38 distinct files to come close to the 2^160 files.) Still, I wanted to point this out about the cryptographic features of SHA1.