Instead of reading the index from disk and worrying about disk
corruption, the index is cached in memory (memory bit-flips happen
too, but hopefully less often). The result is faster reads: read time
is reduced by 70%.

The biggest gain is not having to verify the trailing SHA-1, which
takes lots of time especially on large index files. But this also
opens doors for further optimizations:

- we could create an in-memory format that's essentially a memory
dump of the index, eliminating most of the parsing/allocation
overhead. The mmap'd memory can be used straight away. Experiment
[1] shows we could reduce read time by 88%.

- we could cache non-index info such as the name hash

Shared memory is implemented by storing files in a per-repository temporary
directory. This is more portable than shm (which requires
posix-realtime and has various quirks on OS X). It might even work on
Windows, although this has not been tested. The shared memory file's
name follows the template "shm-<object>-<SHA1>" where <SHA1> is the
trailing SHA-1 of the index file. <object> is "index" for cached index
files (and might later be "name-hash" for a name-hash cache). If such
a shared memory file exists, it contains the same index content as on
disk. The content is already validated by the daemon and git won't
validate it again (except comparing the trailing SHA-1s).
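
As a rough illustration of the naming scheme and of the two kinds of
check mentioned above, here is a minimal Python sketch. It is not the
code from this series. The helper names are hypothetical, and
hex-encoding the SHA-1 in the file name is an assumption.

```python
import hashlib

SHA1_LEN = 20  # the index file ends with a 20-byte SHA-1 of all preceding bytes


def trailing_sha1(index_bytes: bytes) -> bytes:
    """Return the trailing SHA-1 stored at the end of the index content."""
    if len(index_bytes) < SHA1_LEN:
        raise ValueError("index content too short to hold a trailing SHA-1")
    return index_bytes[-SHA1_LEN:]


def shm_name(obj: str, sha1: bytes) -> str:
    """Build a name following the "shm-<object>-<SHA1>" template.

    Hex encoding of <SHA1> is an assumption made for this sketch.
    """
    return "shm-%s-%s" % (obj, sha1.hex())


def full_verify(index_bytes: bytes) -> bool:
    """The expensive check: rehash the whole body and compare."""
    body = index_bytes[:-SHA1_LEN]
    return hashlib.sha1(body).digest() == trailing_sha1(index_bytes)


def cheap_match(disk_index: bytes, shm_index: bytes) -> bool:
    """The cheap check: only compare the two trailing SHA-1s."""
    return trailing_sha1(disk_index) == trailing_sha1(shm_index)
```

In this sketch full_verify() stands for the work done once by the
daemon, while cheap_match() stands for the comparison git still does
on every read.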

We also add some bits to the index (to_shm and from_shm) to track
when an index came from shared memory or is going to shared memory.

We keep this daemon's logic as thin as possible. The "brain" stays in
git. So the daemon can read and validate stuff, but that's all it's
allowed to do. It does not add/create new information. It doesn't even
accept direct updates from git.

Git can poke the daemon via unix domain sockets to tell it to refresh
the index cache, or to keep it alive for a few more minutes. It can't give
any real index data directly to the daemon. Real data goes to disk
first, then the daemon reads and verifies it from there. The daemon
only handles $GIT_DIR/index, not temporary index files; it only gets
poked for the former.

$GIT_DIR/index-helper.sock is the socket for the daemon process. The
daemon reads from the socket and executes commands.

Named pipes were considered for portability reasons, but then commands
that need replies from the daemon would have to open their own pipes,
since a named pipe should only have one reader. Unix domain sockets
don't have this problem.
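
To make the wire format concrete, here is a small Python sketch of
NUL-terminated messages over a Unix domain socket. It is only an
illustration, not the implementation in this series. It uses
socket.socketpair() as a stand-in for connecting to
$GIT_DIR/index-helper.sock, and the function names are hypothetical.

```python
import socket


def send_command(sock: socket.socket, command: str) -> None:
    """Send one command, terminated by a NUL byte."""
    sock.sendall(command.encode() + b"\0")


def read_message(sock: socket.socket) -> str:
    """Read bytes up to (not including) the NUL terminator."""
    buf = bytearray()
    while True:
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed before the NUL terminator")
        if chunk == b"\0":
            return buf.decode()
        buf += chunk


def demo() -> str:
    """Send "poke" from the git side and return what the daemon side reads."""
    git_side, daemon_side = socket.socketpair()
    with git_side, daemon_side:
        send_command(git_side, "poke")
        return read_message(daemon_side)
```

Because a socket is bidirectional, a reply can travel back over the
same connection, which is what a named pipe cannot offer without a
second pipe per client.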

On webkit.git with index format v2, duplicated 8 times to 1.5m
entries and 236MB in size:

diff --git a/.gitignore b/.gitignore
index 5087ce1..b92f122 100644
--- a/.gitignore
+++ b/.gitignore
@@ -71,6 +71,7 @@
/git-http-fetch
/git-http-push
/git-imap-send
+/git-index-helper
/git-index-pack
/git-init
/git-init-db
diff --git a/Documentation/git-index-helper.txt b/Documentation/git-index-helper.txt
new file mode 100644
index 0000000..f892184
--- /dev/null
+++ b/Documentation/git-index-helper.txt
@@ -0,0 +1,50 @@
+git-index-helper(1)
+===================
+
+NAME
+----
+git-index-helper - A simple cache daemon for speeding up index file access
+
+SYNOPSIS
+--------
+[verse]
+'git index-helper' [options]
+
+DESCRIPTION
+-----------
+Keep the index file in memory for faster access. This daemon is per
+repository and per working tree. So if you have two working trees
+each with a submodule, you might need four index-helpers. (In practice,
+this is only worthwhile for large indexes, so only use it if you notice
+that git status is slow).
+
+OPTIONS
+-------
+
+--exit-after=<n>::
+ Exit if the cached index is not accessed for `<n>`
+ seconds. Specify 0 to wait forever. Default is 600.
+
+NOTES
+-----
+
+$GIT_DIR/index-helper.sock is a Unix domain socket that the daemon reads
+commands from. The directory will also contain files named
+"shm-index-<SHA1>". These are used as backing stores for shared
+memory. Normally the daemon will clean up these files when it exits
+or when they are no longer relevant. But if it crashes, some objects
+could remain there and they can be safely deleted with the "rm"
+command. The following commands are used to control the daemon:
+
+"refresh"::
+ Reread the index.
+
+"poke"::
+ Let the daemon know the index is to be read. It keeps the
+ daemon alive longer, unless `--exit-after=0` is used.
+
+All commands and replies are terminated by a NUL byte.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 2742a69..c8be0e7 100644
--- a/Makefile
+++ b/Makefile
@@ -1433,6 +1433,10 @@ ifdef HAVE_DEV_TTY
BASIC_CFLAGS += -DHAVE_DEV_TTY
endif

[PATCH v9 04/19] index-helper: add --strict

There are "holes" in the index-helper approach because the shared
memory is not verified again by git. If $USER is compromised, shared
memory could be modified. But anyone who could do this could already
modify $GIT_DIR/index. A more realistic risk is a bug in
index-helper that produces corrupt shared memory. --strict is added to
guard against that.

Strictly speaking there's still a very small window in which corrupt shared
memory could be read by git: after we write the trailing SHA-1 in
the shared memory (thus signaling "this shm is ready") and before
verify_shm() detects an error.
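
The ordering behind that window can be sketched in Python as follows.
This is only an illustration under stated assumptions: a dict stands in
for the shared memory directory, publish_strict() and its `copy` hook
are hypothetical, and the byte comparison stands in for verify_shm().

```python
def publish_strict(disk_index: bytes, shm: dict, key: str, copy=bytes) -> bool:
    """Copy the index into `shm`, then verify the copy against the disk content.

    `copy` is a hook so that a buggy copy can be simulated. Returns True if
    the published copy survived verification.
    """
    shm[key] = copy(disk_index)  # from here on a reader could see the copy
    if shm[key] != disk_index:   # stand-in for verify_shm()
        del shm[key]             # destroy the corrupt shared memory
        return False
    return True
```

Between the assignment and the comparison a reader could pick up a bad
copy, which is the small window described above.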

+--strict::
+--no-strict::
+ Strict mode makes index-helper verify the shared memory after
+ it's created. If the result does not match what's read from
+ $GIT_DIR/index, the shared memory is destroyed. This makes
+ index-helper take more than double the amount of time required
+ for reading an index, but because it will happen in the
+ background, it's not noticeable. `--strict` is enabled by default.
+
NOTES
-----

[PATCH v9 08/19] read-cache: add watchman 'WAMA' extension

The extension contains a bitmap, one bit for each entry in the
index. If the n-th bit is zero, the n-th entry is considered
unchanged and we can ce_mark_uptodate() it without refreshing. If the bit
is non-zero and we find out after a refresh that the corresponding file
is clean, we can clear the bit.
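
The per-entry bookkeeping can be pictured with the following Python
sketch. It uses a plain uncompressed bitmap with an assumed bit order,
not the on-disk encoding of the extension, and the class and function
names are hypothetical.

```python
class DirtyBitmap:
    """One bit per index entry; a set bit means "may have changed"."""

    def __init__(self, n_entries: int) -> None:
        self.n = n_entries
        self.bits = bytearray((n_entries + 7) // 8)

    def mark(self, n: int) -> None:
        self.bits[n // 8] |= 1 << (n % 8)

    def clear(self, n: int) -> None:
        self.bits[n // 8] &= ~(1 << (n % 8)) & 0xFF

    def is_marked(self, n: int) -> bool:
        return bool(self.bits[n // 8] & (1 << (n % 8)))


def entries_to_refresh(bitmap: DirtyBitmap) -> list:
    """Only marked entries need a refresh; the rest are taken as up to date."""
    return [i for i in range(bitmap.n) if bitmap.is_marked(i)]
```

For example, with 10 entries and bits 3 and 9 set, only those two
entries would be refreshed, and clearing bit 3 afterwards records that
entry 3 turned out to be clean.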

In addition, there's a list of directories in the untracked-cache
to invalidate (because they have new or modified entries).

The 'skipping refresh' part is not in this patch yet, as it requires
watchman. More details in later patches.

Watchman is hidden behind index-helper. Before git tries to read the
index from shm, it notifies index-helper through the socket and waits
for index-helper to prepare a file for sharing memory (with
MAP_SHARED). index-helper then contacts watchman, updates the 'WAMA'
extension, puts it in a separate file, and wakes git up with a reply
on git's socket.

Git uses this extension to not lstat unchanged entries. Git only
trusts the 'WAMA' extension when it's received from the separate file,
not from disk. Unmarked entries are "clean". Marked entries are dirty
from watchman's point of view. If Git finds that some entries are
'watchman-dirty' but are really unchanged (e.g. the file was changed,
then reverted back), then Git will clear the marking in 'WAMA' before
writing it out.

Hiding watchman behind index-helper means you need both daemons. You
can't run watchman alone. Not so good. But on the other hand, the 'git'
binary is not linked to watchman/json libraries, which is good for
packaging. The core git package will run fine without watchman-related
packages. Users who need watchman can install git-index-helper and its
dependencies.

This also lets us trust anything in the untracked cache that we haven't
marked invalid, saving those stat() calls.

Another reason for tying watchman to index-helper is, when used with
untracked cache, we need to keep track of the $GIT_WORK_TREE file
listing. That kind of list can be kept in index-helper.

diff --git a/Documentation/git-index-helper.txt b/Documentation/git-index-helper.txt
index e144752..cce00cb 100644
--- a/Documentation/git-index-helper.txt
+++ b/Documentation/git-index-helper.txt
@@ -55,6 +55,12 @@ command. The following commands are used to control the daemon:
Let the daemon know the index is to be read. It keeps the
daemon alive longer, unless `--exit-after=0` is used.

+"poke <path>"::
+ Like "poke", but replies with "OK". If the index has the
+ watchman extension, index-helper queries watchman, then
+ prepares a shared memory object with the watchman index
+ extension before replying.
+
All commands and replies are terminated by a NUL byte.

diff --git a/Documentation/git-index-helper.txt b/Documentation/git-index-helper.txt
index cce00cb..55a8a5a 100644
--- a/Documentation/git-index-helper.txt
+++ b/Documentation/git-index-helper.txt
@@ -18,6 +18,9 @@ each with a submodule, you might need four index-helpers. (In practice,
this is only worthwhile for large indexes, so only use it if you notice
that git status is slow).

[PATCH v9 17/19] index-helper: optionally automatically run

Introduce a new config option, indexhelper.autorun, to automatically
run git index-helper before starting up a builtin git command. This
enables users to keep index-helper running without manual
intervention.

[PATCH v9 18/19] trace: measure where the time is spent in the index-heavy operations

All the known heavy code blocks are measured (except object database
access). This should help identify if an optimization is effective or
not. An unoptimized git-status would give something like below (92% of
time is accounted for). To sum up the effort of making git scale better: