*[PATCH v2 00/14] Serialized Git Commit Graph@ 2018-01-30 21:39 Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
` (15 more replies)0 siblings, 16 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
To: git; +Cc: gitster, peff, git, sbeller, dstolee
Thanks to everyone who gave comments on v1. I tried my best to respond to
all of the feedback, but may have missed some while I was doing several
renames, including:
* builtin/graph.c -> builtin/commit-graph.c
* packed-graph.[c|h] -> commit-graph.[c|h]
* t/t5319-graph.sh -> t/t5318-commit-graph.sh
Because of these renames (and several type/function renames) the diff
is too large to conveniently share here.
Some issues that came up and are addressed:
* Use <hash> instead of <oid> when referring to the graph-<hash>.graph
filenames and the contents of graph-head.
* 32-bit timestamps will not cause undefined behavior.
* timestamp_t is unsigned, so they are never negative.
* The config setting "core.commitgraph" now only controls consuming the
graph during normal operations and will not block the commit-graph
plumbing command.
* The --stdin-commits is better about sanitizing the input for strings
that do not parse to OIDs or are OIDs for non-commit objects.
One unresolved comment that I would like consensus on is the use of
globals to store the config setting and the graph state. I'm currently
using the pattern from packed_git instead of putting these values in
the_repository. However, we want to eventually remove globals like
packed_git. Should I deviate from the pattern _now_ in order to keep
the problem from growing, or should I keep to the known pattern?
Finally, I tried to clean up my incorrect style as I was recreating
these commits. Feel free to be merciless in style feedback now that the
architecture is more stable.
Thanks,
-Stolee
-- >8 --
As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:
* 'git log --topo-order -1000' walks all reachable commits to avoid
incorrect topological orders, but only needs the commit message for
the top 1000 commits.
* 'git merge-base <A> <B>' may walk many commits to find the correct
boundary between the commits reachable from A and those reachable
from B. No commit messages are needed.
* 'git branch -vv' checks ahead/behind status for all local branches
compared to their upstream remote branches. This is essentially as
hard as computing merge bases for each.
The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.
The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.
Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.
| Command | Before | After | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% |
| branch -vv | 0.42s | 0.27s | -35% |
| rev-list --all | 6.4s | 1.0s | -84% |
| rev-list --all --objects | 32.6s | 27.6s | -15% |
To test this yourself, run the following on your repo:
git config core.commitgraph true
git show-ref -s | git graph --write --update-head --stdin-commits
The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisions by toggling the 'core.commitgraph' setting.
[1] https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a2ce@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)
[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.
Derrick Stolee (14):
commit-graph: add format document
graph: add commit graph design document
commit-graph: create git-commit-graph builtin
commit-graph: implement construct_commit_graph()
commit-graph: implement git-commit-graph --write
commit-graph: implement git-commit-graph --read
commit-graph: implement git-commit-graph --update-head
commit-graph: implement git-commit-graph --clear
commit-graph: teach git-commit-graph --delete-expired
commit-graph: add core.commitgraph setting
commit: integrate commit graph with commit parsing
commit-graph: read only from specific pack-indexes
commit-graph: close under reachability
commit-graph: build graph from starting commits
.gitignore | 1 +
Documentation/config.txt | 3 +
Documentation/git-commit-graph.txt | 100 +++
Documentation/technical/commit-graph-format.txt | 89 +++
Documentation/technical/commit-graph.txt | 189 ++++++
Makefile | 2 +
alloc.c | 1 +
builtin.h | 1 +
builtin/commit-graph.c | 229 +++++++
cache.h | 1 +
command-list.txt | 1 +
commit-graph.c | 841 ++++++++++++++++++++++++
commit-graph.h | 69 ++
commit.c | 10 +-
commit.h | 4 +
config.c | 5 +
environment.c | 1 +
git.c | 1 +
log-tree.c | 3 +-
packfile.c | 4 +-
packfile.h | 2 +
t/t5318-commit-graph.sh | 272 ++++++++
22 files changed, 1824 insertions(+), 5 deletions(-)
create mode 100644 Documentation/git-commit-graph.txt
create mode 100644 Documentation/technical/commit-graph-format.txt
create mode 100644 Documentation/technical/commit-graph.txt
create mode 100644 builtin/commit-graph.c
create mode 100644 commit-graph.c
create mode 100644 commit-graph.h
create mode 100755 t/t5318-commit-graph.sh
--
2.16.0
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v2 01/14] commit-graph: add format document
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
2018-02-01 21:44 ` Jonathan Tan
2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
` (14 subsequent siblings)15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
To: git; +Cc: gitster, peff, git, sbeller, dstolee
Add document specifying the binary format for commit graphs. This
format allows for:
* New versions.
* New hash functions and hash lengths.
* Optional extensions.
Basic header information is followed by a binary table of contents
into "chunks" that include:
* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.
The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph-format.txt | 89 +++++++++++++++++++++++++
1 file changed, 89 insertions(+)
create mode 100644 Documentation/technical/commit-graph-format.txt
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000000..8a987c7aa9
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,89 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+ generation number 1; commits with parents have generation number
+ one more than the maximum generation number of its parents. We
+ reserve zero as special, and can be used to mark a generation
+ number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+ the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+ 4-byte signature:
+ The signature is: {'C', 'G', 'P', 'H'}
+
+ 1-byte version number:
+ Currently, the only valid version is 1.
+
+ 1-byte Object Id Version (1 = SHA-1)
+
+ 1-byte Object Id Length (H)
+
+ 1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+ (C + 1) * 12 bytes listing the table of contents for the chunks:
+ First 4 bytes describe chunk id. Value 0 is a terminating label.
+ Other 8 bytes provide offset in current file for chunk to start.
+ (Chunks are ordered contiguously in the file, so you can infer
+ the length using the next chunk position if necessary.)
+
+ The remaining data in the body is described one chunk at a time, and
+ these chunks may be given in any order. Chunks are required unless
+ otherwise specified.
+
+CHUNK DATA:
+
+ OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+ The ith entry, F[i], stores the number of OIDs with first
+ byte at most i. Thus F[255] stores the total
+ number of commits (N).
+
+ OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+ The OIDs for all commits in the graph, sorted in ascending order.
+
+ Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+ * The first H bytes are for the OID of the root tree.
+ * The next 8 bytes are for the int-ids of the first two parents
+ of the ith commit. Stores value 0xffffffff if no parent in that
+ position. If there are more than two parents, the second value
+ has its most-significant bit on and the other bits store an array
+ position into the Large Edge List chunk.
+ * The next 8 bytes store the generation number of the commit and
+ the commit time in seconds since EPOCH. The generation number
+ uses the higher 30 bits of the first 4 bytes, while the commit
+ time uses the 32 bits of the second 4 bytes, along with the lowest
+ 2 bits of the lowest byte, storing the 33rd and 34th bit of the
+ commit time.
+
+ Large Edge List (ID: {'E', 'D', 'G', 'E'})
+ This list of 4-byte values store the second through nth parents for
+ all octopus merges. The second parent value in the commit data is a
+ negative number pointing into this list. Then iterate through this
+ list starting at that position until reaching a value with the most-
+ significant bit on. The other bits correspond to the int-id of the
+ last parent. This chunk should always be present, but may be empty.
+
+TRAILER:
+
+ H-byte HASH-checksum of all of the above.
--
2.16.0.15.g9c3cf44.dirty
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v2 02/14] graph: add commit graph design document
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
@ 2018-01-30 21:39 ` Derrick Stolee
2018-01-31 2:19 ` Stefan Beller
2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
` (13 subsequent siblings)15 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-01-30 21:39 UTC (permalink / raw)
To: git; +Cc: gitster, peff, git, sbeller, dstolee
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
1 file changed, 189 insertions(+)
create mode 100644 Documentation/technical/commit-graph.txt
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000000..cbf88f7264
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,189 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'git show --remerge-diff' and can take minutes to compute depending on
+history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitgraph'
+config setting, then the existing ODB is sufficient. The file is stored
+next to packfiles either in the .git/objects/pack directory or in the pack
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+ the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N <= M, then A cannot reach B. That is, we know without searching
+ that B is not an ancestor of A because it is further from a root commit
+ than A.
+
+ Conversely, when checking if A is an ancestor of B, then we only need
+ to walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue seeded by
+ generation numbers, then we always expand the boundary commit with highest
+ generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit time X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
+ directory. This could be stored in an alternate.
+
+- The most-recent graph file hash is stored in a 'graph-head' file for
+ immediate access and storing backup graphs. This could be stored in an
+ alternate, and refers to a 'graph-<hash>.graph' file in the same pack
+ directory.
+
+- The core.commitgraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object id length and hash
+ algorithm, so a future change of hash algorithm does not require a change
+ in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+ to seek into the single graph file. It is possible to extend the model
+ for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+ updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no '--verify' option for the 'commit-graph' builtin to verify the
+ contents of the graph file agree with the contents in the ODB.
+
+- When rewriting the graph, we do not check for a commit still existing
+ in the ODB, so garbage collection may remove commits.
+
+- Generation numbers are not computed in the current version. The file
+ format supports storing them, along with a mechanism to upgrade from
+ a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+ are not currently computed, so all generation numbers will be marked as
+ 0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+ remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+ walks aware of generation numbers to gain the performance benefits they
+ enable. This will mostly be accomplished by swapping a commit-date-ordered
+ priority queue with one ordered by generation number. The following
+ operations are important candidates:
+
+ - paint_down_to_common()
+ - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+ When writing a new graph, we could check that the ODB still contains
+ the commits and choose to remove the commits that are deleted from the
+ ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+ object for a commit. This passes through lookup_tree() and consequently
+ lookup_object(). Also, it calls lookup_commit() when loading the parents.
+ These method calls check the ODB for object existence, even if the
+ consumer does not need the content. For example, we do not need the
+ tree contents when computing merge bases. Now that commit parsing is
+ removed from the computation time, these lookup operations are the
+ slowest operations keeping graph walks from being fast. Consider
+ loading these objects without verifying their existence in the ODB and
+ only loading them fully when consumers need them. Consider a method
+ such as "ensure_tree_loaded(commit)" that fully loads a tree before
+ using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+ When this feature stabilizes enough to recommend to most users, we should
+ add automatic graph writes to common operations that create many commits.
+ For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
+ commands.
+
+- A server could provide a commit graph file as part of the network protocol
+ to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
--
2.16.0.15.g9c3cf44.dirty
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 00/14] Serialized Git Commit Graph
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
` (13 preceding siblings ...)
2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-01-30 21:47 ` Stefan Beller
2018-02-01 2:34 ` Stefan Beller
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
15 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-01-30 21:47 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee
On Tue, Jan 30, 2018 at 1:39 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Thanks to everyone who gave comments on v1. I tried my best to respond to
> all of the feedback, but may have missed some while I was doing several
> renames, including:
>
> * builtin/graph.c -> builtin/commit-graph.c
> * packed-graph.[c|h] -> commit-graph.[c|h]
> * t/t5319-graph.sh -> t/t5318-commit-graph.sh
>
> Because of these renames (and several type/function renames) the diff
> is too large to conveniently share here.
>
> Some issues that came up and are addressed:
>
> * Use <hash> instead of <oid> when referring to the graph-<hash>.graph
> filenames and the contents of graph-head.
> * 32-bit timestamps will not cause undefined behavior.
> * timestamp_t is unsigned, so they are never negative.
> * The config setting "core.commitgraph" now only controls consuming the
> graph during normal operations and will not block the commit-graph
> plumbing command.
> * The --stdin-commits is better about sanitizing the input for strings
> that do not parse to OIDs or are OIDs for non-commit objects.
>
> One unresolved comment that I would like consensus on is the use of
> globals to store the config setting and the graph state. I'm currently
> using the pattern from packed_git instead of putting these values in
> the_repository. However, we want to eventually remove globals like
> packed_git. Should I deviate from the pattern _now_ in order to keep
> the problem from growing, or should I keep to the known pattern?
I have a series doing the conversion in
https://github.com/stefanbeller/git/tree/object-store
that is based on 2.16.
While the commits are structured for easy review (to not miss any of
the globals that that series is based upon), I did not come up with a
good strategy how to take care of series in flight that add more globals.
So I think for now you'd want to keep it as global vars, such that
it is consistent with the code base and then we'll figure out how to
do the conversion one step at a time.
Please do not feel stopped or hindered by my slow pace of working
through that series, maybe I'll have to come up with another approach
that is better for upstream (rebasing that series is a pain, as upstream
moves rather quickly. Maybe I'll have to send that series in smaller chunks).
> Finally, I tried to clean up my incorrect style as I was recreating
> these commits. Feel free to be merciless in style feedback now that the
> architecture is more stable.
ok, will do.
Thanks,
Stefan
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 02/14] graph: add commit graph design document
2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
@ 2018-01-31 2:19 ` Stefan Beller0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-01-31 2:19 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, Junio C Hamano, Jeff King, Jeff Hostetler, Derrick Stolee
On Tue, Jan 30, 2018 at 1:39 PM, Derrick Stolee <stolee@gmail.com> wrote:
> Add Documentation/technical/commit-graph.txt with details of the planned
> commit graph feature, including future plans.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
> 1 file changed, 189 insertions(+)
> create mode 100644 Documentation/technical/commit-graph.txt
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> new file mode 100644
> index 0000000000..cbf88f7264
> --- /dev/null
> +++ b/Documentation/technical/commit-graph.txt
> @@ -0,0 +1,189 @@
> +Git Commit Graph Design Notes
> +=============================
> +
> +Git walks the commit graph for many reasons, including:
> +
> +1. Listing and filtering commit history.
> +2. Computing merge bases.
> +
> +These operations can become slow as the commit count grows. The merge
> +base calculation shows up in many user-facing commands, such as 'merge-base'
> +or 'git show --remerge-diff' and can take minutes to compute depending on
> +history shape.
Sorry for appearing more authoritative than I am here. The --remerge-diff flag
is just floating around the mailing list, and was never merged. (It is
such a cool
feature though, but it would actually confuse users looking for it,
not finding it)
> +There are two main costs here:
> +
> +1. Decompressing and parsing commits.
> +2. Walking the entire graph to avoid topological order mistakes.
> +
> +The commit graph file is a supplemental data structure that accelerates
> +commit graph walks. If a user downgrades or disables the 'core.commitgraph'
> +config setting, then the existing ODB is sufficient. The file is stored
> +next to packfiles either in the .git/objects/pack directory or in the pack
> +directory of an alternate.
> +
> +The commit graph file stores the commit graph structure along with some
> +extra metadata to speed up graph walks. By listing commit OIDs in lexi-
> +cographic order, we can identify an integer position for each commit and
> +refer to the parents of a commit using those integer positions. We use
> +binary search to find initial commits and then use the integer positions
> +for fast lookups during the walk.
> +
> +A consumer may load the following info for a commit from the graph:
> +
> +1. The commit OID.
> +2. The list of parents, along with their integer position.
> +3. The commit date.
> +4. The root tree OID.
> +5. The generation number (see definition below).
> +
> +Values 1-4 satisfy the requirements of parse_commit_gently().
> +
> +Define the "generation number" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has generation number one.
> +
> + * A commit with at least one parent has generation number one more than
> + the largest generation number among its parents.
> +
> +Equivalently, the generation number of a commit A is one more than the
> +length of a longest path from A to a root commit. The recursive definition
> +is easier to use for computation and observing the following property:
> +
> + If A and B are commits with generation numbers N and M, respectively,
> + and N <= M, then A cannot reach B. That is, we know without searching
> + that B is not an ancestor of A because it is further from a root commit
> + than A.
> +
> + Conversely, when checking if A is an ancestor of B, then we only need
> + to walk commits until all commits on the walk boundary have generation
> + number at most N. If we walk commits using a priority queue seeded by
> + generation numbers, then we always expand the boundary commit with highest
> + generation number and can easily detect the stopping condition.
> +
> +This property can be used to significantly reduce the time it takes to
> +walk commits and determine topological relationships. Without generation
> +numbers, the general heuristic is the following:
> +
> + If A and B are commits with commit time X and Y, respectively, and
> + X < Y, then A _probably_ cannot reach B.
> +
> +This heuristic is currently used whenever the computation can make
> +mistakes with topological orders (such as "git log" with default order),
> +but is not used when the topological order is required (such as merge
> +base calculations, "git log --graph").
> +
> +In practice, we expect some commits to be created recently and not stored
> +in the commit graph. We can treat these commits as having "infinite"
> +generation number and walk until reaching commits with known generation
> +number.
> +
> +Design Details
> +--------------
> +
> +- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
> + directory. This could be stored in an alternate.
> +
> +- The most-recent graph file hash is stored in a 'graph-head' file for
> + immediate access and storing backup graphs. This could be stored in an
> + alternate, and refers to a 'graph-<hash>.graph' file in the same pack
> + directory.
> +
> +- The core.commitgraph config setting must be on to consume graph files.
> +
> +- The file format includes parameters for the object id length and hash
> + algorithm, so a future change of hash algorithm does not require a change
> + in format.
> +
> +Current Limitations
> +-------------------
> +
> +- Only one graph file is used at one time. This allows the integer position
> + to seek into the single graph file. It is possible to extend the model
> + for multiple graph files, but that is currently not part of the design.
> +
> +- .graph files are managed only by the 'commit-graph' builtin. These are not
> + updated automatically during clone, fetch, repack, or creating new commits.
> +
> +- There is no '--verify' option for the 'commit-graph' builtin to verify the
> + contents of the graph file agree with the contents in the ODB.
> +
> +- When rewriting the graph, we do not check for a commit still existing
> + in the ODB, so garbage collection may remove commits.
> +
> +- Generation numbers are not computed in the current version. The file
> + format supports storing them, along with a mechanism to upgrade from
> + a file without generation numbers to one that uses them.
> +
> +Future Work
> +-----------
> +
> +- The file format includes room for precomputed generation numbers. These
> + are not currently computed, so all generation numbers will be marked as
> + 0 (or "uncomputed"). A later patch will include this calculation.
> +
> +- The commit graph is currently incompatible with commit grafts. This can be
> + remedied by duplicating or refactoring the current graft logic.
> +
> +- After computing and storing generation numbers, we must make graph
> + walks aware of generation numbers to gain the performance benefits they
> + enable. This will mostly be accomplished by swapping a commit-date-ordered
> + priority queue with one ordered by generation number. The following
> + operations are important candidates:
> +
> + - paint_down_to_common()
> + - 'log --topo-order'
> +
> +- The graph currently only adds commits to a previously existing graph.
> + When writing a new graph, we could check that the ODB still contains
> + the commits and choose to remove the commits that are deleted from the
> + ODB. For performance reasons, this check should remain optional.
> +
> +- Currently, parse_commit_gently() requires filling in the root tree
> + object for a commit. This passes through lookup_tree() and consequently
> + lookup_object(). Also, it calls lookup_commit() when loading the parents.
> + These method calls check the ODB for object existence, even if the
> + consumer does not need the content. For example, we do not need the
> + tree contents when computing merge bases. Now that commit parsing is
> + removed from the computation time, these lookup operations are the
> + slowest operations keeping graph walks from being fast. Consider
> + loading these objects without verifying their existence in the ODB and
> + only loading them fully when consumers need them. Consider a method
> + such as "ensure_tree_loaded(commit)" that fully loads a tree before
> + using commit->tree.
> +
> +- The current design uses the 'commit-graph' builtin to generate the graph.
> + When this feature stabilizes enough to recommend to most users, we should
> + add automatic graph writes to common operations that create many commits.
> + For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
> + commands.
> +
> +- A server could provide a commit graph file as part of the network protocol
> + to avoid extra calculations by clients.
> +
> +Related Links
> +-------------
> +[0] https://bugs.chromium.org/p/git/issues/detail?id=8
> + Chromium work item for: Serialized Commit Graph
> +
> +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
> + An abandoned patch that introduced generation numbers.
> +
> +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
> + Discussion about generation numbers on commits and how they interact
> + with fsck.
> +
> +[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
> + More discussion about generation numbers and not storing them inside
> + commit objects. A valuable quote:
> +
> + "I think we should be moving more in the direction of keeping
> + repo-local caches for optimizations. Reachability bitmaps have been
> + a big performance win. I think we should be doing the same with our
> + properties of commits. Not just generation numbers, but making it
> + cheap to access the graph structure without zlib-inflating whole
> + commit objects (i.e., packv4 or something like the "metapacks" I
> + proposed a few years ago)."
> +
> +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
> + A patch to remove the ahead-behind calculation from 'status'.
> --
> 2.16.0.15.g9c3cf44.dirty
>
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
2018-02-02 22:48 ` Junio C Hamano@ 2018-02-03 1:58 ` Derrick Stolee
2018-02-03 9:28 ` Jeff King0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-03 1:58 UTC (permalink / raw)
To: Junio C Hamano, Stefan Beller
Cc: Jonathan Tan, git, Jeff King, Jeff Hostetler, Derrick Stolee, szeder.dev
On 2/2/2018 5:48 PM, Junio C Hamano wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> It is true for git-submodule and a few others (the minority of commands IIRC)
>> git-tag for example takes subcommands such as --list or --verify.
>> https://public-inbox.org/git/xmqqiomodkt9.fsf@gitster.dls.corp.google.com/
>
> Thanks. It refers to an article at gmane, which is not readily
> accessible unless you use newsreader. The original discussion it
> refers to appears at:
>
> https://public-inbox.org/git/7vbo5itjfl.fsf@alter.siamese.dyndns.org/
>
> for those who are interested.
Thanks for the links.
> I am still not sure if it is a good design to add a new command like
> this series does, though. I would naively have expected that this
> would be a new pack index format that is produced by pack-objects
> and index-pack, for example, in which case its maintenance would
> almost be invisible to end users (i.e. just like how the pack bitmap
> feature was added to the system).
I agree that the medium-term goal is to have this happen without user
intervention. Something like a "core.autoCommitGraph" setting to trigger
commit-graph writes during other cleanup activities, such as a repack or
a gc.
I don't think pairing this with pack-objects or index-pack is a good
direction, because the commit graph is not locked into a packfile the
way the bitmap is. In fact, the entire ODB could be replaced
independently and the graph is still valid (the commits in the graph may
no longer have their paired commits in the ODB due to a GC; you should
never navigate to those commits without having a ref pointing to them,
so this is not immediately a problem).
This sort of interaction with GC is one reason why I did not include the
automatic updates in this patch. The integration with existing
maintenance tasks will be worth discussion in its own right. I'd rather
demonstrate the value of having a graph (even if it is currently
maintained manually) and then follow up with a focus to integrate with
repack, gc, etc.
I plan to clean up this patch on Monday given the feedback I received
the last two days (Thanks Jonathan and Szeder!). However, if the current
builtin design will block merging, then I'll wait until we can find one
that works.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
2018-02-03 1:58 ` Derrick Stolee@ 2018-02-03 9:28 ` Jeff King
2018-02-05 18:48 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Jeff King @ 2018-02-03 9:28 UTC (permalink / raw)
To: Derrick Stolee
Cc: Junio C Hamano, Stefan Beller, Jonathan Tan, git, Jeff Hostetler,
Derrick Stolee, szeder.dev
On Fri, Feb 02, 2018 at 08:58:52PM -0500, Derrick Stolee wrote:
> I don't think pairing this with pack-objects or index-pack is a good
> direction, because the commit graph is not locked into a packfile the way
> the bitmap is. In fact, the entire ODB could be replaced independently and
> the graph is still valid (the commits in the graph may no longer have their
> paired commits in the ODB due to a GC; you should never navigate to those
> commits without having a ref pointing to them, so this is not immediately a
> problem).
One advantage of tying this to packs is that you can piggy-back on the
.idx to avoid storing object ids a second time. If we imagine that you
use a 32-bit index into the .idx instead, that's a savings of 16 bytes
per object (or more when we switch to a longer hash). You only need to
refer to commits and their root trees, though. So on something like
linux.git, you're talking about 2 * 700k * 16 = 21 megabytes you could
save.
That may not be worth worrying about too much, compared to the size of
the rest of the data. Disk space is obviously cheap, but I'm more
concerned about working-set size. However, 21 megabytes probably isn't
breaking the bank there these days (and it may even be faster, since the
commit-graph lookups can use the more compact index that contains only
commits, not other objects).
The big advantage of your scheme is that you can update the graph index
without repacking. The traditional advice has been that you should
always do a full repack during a gc (since it gives the most delta
opportunities). So metadata like reachability bitmaps were happy to tied
to packs, since you're repacking anyway during a gc. But my
understanding is that this doesn't really fly with the Windows
repository, where it's simply so big that you never obtain a single
pack, and just pass around slices of history in pack format.
So I think I'm OK with the direction here of keeping metadata caches
separate from the pack storage.
> This sort of interaction with GC is one reason why I did not include the
> automatic updates in this patch. The integration with existing maintenance
> tasks will be worth discussion in its own right. I'd rather demonstrate the
> value of having a graph (even if it is currently maintained manually) and
> then follow up with a focus to integrate with repack, gc, etc.
>
> I plan to clean up this patch on Monday given the feedback I received the
> last two days (Thanks Jonathan and Szeder!). However, if the current builtin
> design will block merging, then I'll wait until we can find one that works.
If they're not tied to packs, then I think having a separate builtin
like this is the best approach. It gives you a plumbing command to
experiment with, and it can easily be called from git-gc.
-Peff
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
2018-02-02 15:32 ` SZEDER Gábor@ 2018-02-05 16:06 ` Derrick Stolee
2018-02-07 15:08 ` SZEDER Gábor0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 16:06 UTC (permalink / raw)
To: SZEDER Gábor; +Cc: git, gitster, peff, git, sbeller, dstolee
On 2/2/2018 10:32 AM, SZEDER Gábor wrote:
>> Teach Git to write a commit graph file by checking all packed objects
>> to see if they are commits, then store the file in the given pack
>> directory.
> I'm afraid that scanning all packed objects is a bit of a roundabout
> way to approach this.
>
> In my git repo, with 9 pack files at the moment, i.e. not that big a
> repo and not that many pack files:
>
> $ time ./git commit-graph --write --update-head
> 4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>
> real 0m27.550s
> user 0m27.113s
> sys 0m0.376s
>
> In comparison, performing a good old revision walk to gather all the
> info that is written into the graph file:
>
> $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
> 52954
>
> real 0m0.903s
> user 0m0.972s
> sys 0m0.058s
Two reasons this is in here:
(1) It's easier to get the write implemented this way and add the
reachable closure later (which I do).
(2) For GVFS, we want to add all commits that arrived in a "prefetch
pack" to the graph even if we do not have a ref that points to the
commit yet. We expect many commits to become reachable soon and having
them in the graph saves a lot of time in merge-base calculations.
So, (1) is for patch simplicity, and (2) is why I want it to be an
option in the final version. See the --stdin-packs argument later for a
way to do this incrementally.
I expect almost all users to use the reachable closure method with
--stdin-commits (and that's how I will integrate automatic updates with
'fetch', 'repack', and 'gc' in a later patch).
>
>> +char* get_commit_graph_filename_hash(const char *pack_dir,
>> + struct object_id *hash)
>> +{
>> + size_t len;
>> + struct strbuf head_path = STRBUF_INIT;
>> + strbuf_addstr(&head_path, pack_dir);
>> + strbuf_addstr(&head_path, "/graph-");
>> + strbuf_addstr(&head_path, oid_to_hex(hash));
>> + strbuf_addstr(&head_path, ".graph");
> Nit: this is assembling the path of a graph file, not that of a
> graph-head, so the strbuf should be renamed accordingly.
>
>> +
>> + return strbuf_detach(&head_path, &len);
>> +}
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
2018-02-03 9:28 ` Jeff King@ 2018-02-05 18:48 ` Junio C Hamano
2018-02-06 18:55 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-05 18:48 UTC (permalink / raw)
To: Jeff King
Cc: Derrick Stolee, Stefan Beller, Jonathan Tan, git, Jeff Hostetler,
Derrick Stolee, szeder.dev
Jeff King <peff@peff.net> writes:
> The big advantage of your scheme is that you can update the graph index
> without repacking. The traditional advice has been that you should
> always do a full repack during a gc (since it gives the most delta
> opportunities). So metadata like reachability bitmaps were happy to tied
> to packs, since you're repacking anyway during a gc. But my
> understanding is that this doesn't really fly with the Windows
> repository, where it's simply so big that you never obtain a single
> pack, and just pass around slices of history in pack format.
>
> So I think I'm OK with the direction here of keeping metadata caches
> separate from the pack storage.
OK. I guess that the topology information surviving repacking is a
reason good enough to keep this separate from pack files, and I
agree with your "If they're not tied to packs,..." below, too.
Thanks.
> If they're not tied to packs, then I think having a separate builtin
> like this is the best approach. It gives you a plumbing command to
> experiment with, and it can easily be called from git-gc.
>
> -Peff
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 06/14] commit-graph: implement git-commit-graph --read
2018-02-02 0:23 ` Jonathan Tan@ 2018-02-05 19:29 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-05 19:29 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, gitster, peff, git, sbeller, dstolee
On 2/1/2018 7:23 PM, Jonathan Tan wrote:
> On Tue, 30 Jan 2018 16:39:35 -0500
> Derrick Stolee <stolee@gmail.com> wrote:
>
>> Teach git-commit-graph to read commit graph files and summarize their contents.
> One overall question - is the "read" command meant to be a command used
> by the end user, or is it here just to test that some aspects of reading
> works? If the former, I'm not sure how useful it is. And if the latter,
> I think that it is more useful to just implementing something that reads
> it, then make the 11/14 change (modifying parse_commit_gently) and
> include a perf test to show that your commit graph reading is both
> correct and (performance-)effective.
The "read" command is intended for use with the tests to verify that the
different --write commands write the correct number of commits at a
time. For example, we can verify that the closure under reachability
works when using --stdin-commits, that we get every commit in a pack
when using --stdin-packs, and that we don't get more commits than we should.
It doesn't serve much purpose on the user-facing side, but this is
intended to be a plumbing command that is called by other porcelain
commands.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write
2018-02-05 18:48 ` Junio C Hamano@ 2018-02-06 18:55 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-06 18:55 UTC (permalink / raw)
To: Junio C Hamano, Jeff King
Cc: Stefan Beller, Jonathan Tan, git, Jeff Hostetler, Derrick Stolee,
szeder.dev
On 2/5/2018 1:48 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
>> The big advantage of your scheme is that you can update the graph index
>> without repacking. The traditional advice has been that you should
>> always do a full repack during a gc (since it gives the most delta
>> opportunities). So metadata like reachability bitmaps were happy to tied
>> to packs, since you're repacking anyway during a gc. But my
>> understanding is that this doesn't really fly with the Windows
>> repository, where it's simply so big that you never obtain a single
>> pack, and just pass around slices of history in pack format.
>>
>> So I think I'm OK with the direction here of keeping metadata caches
>> separate from the pack storage.
> OK. I guess that the topology information surviving repacking is a
> reason good enough to keep this separate from pack files, and I
> agree with your "If they're not tied to packs,..." below, too.
>
> Thanks.
>
>> If they're not tied to packs, then I think having a separate builtin
>> like this is the best approach. It gives you a plumbing command to
>> experiment with, and it can easily be called from git-gc.
>>
>> -Peff
Thanks for all the advice here. In addition to the many cleanups that
were suggested, I'm going to take a try at the "subcommand" approach.
I'll use git-submodule--helper and git-remote as models for my
implementation.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
2018-02-05 16:06 ` Derrick Stolee@ 2018-02-07 15:08 ` SZEDER Gábor
2018-02-07 15:10 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: SZEDER Gábor @ 2018-02-07 15:08 UTC (permalink / raw)
To: Derrick Stolee
Cc: Git mailing list, Junio C Hamano, Jeff King, git, Stefan Beller,
Derrick Stolee
On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee <stolee@gmail.com> wrote:
> On 2/2/2018 10:32 AM, SZEDER Gábor wrote:
>> In my git repo, with 9 pack files at the moment, i.e. not that big a
>> repo and not that many pack files:
>>
>> $ time ./git commit-graph --write --update-head
>> 4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>>
>> real 0m27.550s
>> user 0m27.113s
>> sys 0m0.376s
>>
>> In comparison, performing a good old revision walk to gather all the
>> info that is written into the graph file:
>>
>> $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>> 52954
>>
>> real 0m0.903s
>> user 0m0.972s
>> sys 0m0.058s
>
>
> Two reasons this is in here:
>
> (1) It's easier to get the write implemented this way and add the reachable
> closure later (which I do).
>
> (2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
> to the graph even if we do not have a ref that points to the commit yet. We
> expect many commits to become reachable soon and having them in the graph
> saves a lot of time in merge-base calculations.
>
> So, (1) is for patch simplicity, and (2) is why I want it to be an option in
> the final version. See the --stdin-packs argument later for a way to do this
> incrementally.
>
> I expect almost all users to use the reachable closure method with
> --stdin-commits (and that's how I will integrate automatic updates with
> 'fetch', 'repack', and 'gc' in a later patch).
I see. I was about to ask about the expected use-cases of the
'--stdin-packs' option, considering how much slower it is to enumerate
all objects in pack files, but run out of time after patch 10.
The run-time using '--stdin-commits' is indeed great:
$ time { git for-each-ref --format='%(objectname)' refs/heads/ | ./git
commit-graph --write --update-head --stdin-commits ; }
82fe9a5cd715ff578f01f7f44e0611d7902d20c8
real 0m0.985s
user 0m0.916s
sys 0m0.024s
Considering the run-time difference, I think in the end it would be a
better default for a plain 'git commit-graph --write' to traverse
history from all refs, and it should enumerate pack files only if
explicitly told so via '--stdin-packs'.
To be clear: I'm not saying that traversing history should already be
the default when introducing construct_commit_graph() and '--write'. If
enumerating pack files keeps the earlier patches simpler and easier to
review, then by all means stick with it, and only change the
'--stdin-*'-less behavior near the end of the series, when all the
building blocks are already in place (but then mention this in the early
commit messages).
I have also noticed a segfault when feeding non-commit object names to
'--stdin-commits', i.e. when I run the above command without restricting
'git for-each-ref' to branches and it listed object names of tags as
well.
$ git rev-parse v2.16.1 |./git commit-graph --write --update-head
--stdin-commits
error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
Segmentation fault
(gdb) bt
#0 __memcpy_avx_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
#1 0x00000000004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
at csum-file.c:104
#2 0x00000000004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
#3 0x00000000004da9ca in construct_commit_graph (
pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
#4 0x000000000044184e in graph_write () at builtin/commit-graph.c:149
#5 0x0000000000441a8c in cmd_commit_graph (argc=0, argv=0x7fffffffe310,
prefix=0x0) at builtin/commit-graph.c:224
#6 0x0000000000405a0a in run_builtin (p=0x8ad950 <commands+528>, argc=4,
argv=0x7fffffffe310) at git.c:346
#7 0x0000000000405ce4 in handle_builtin (argc=4, argv=0x7fffffffe310)
at git.c:555
#8 0x0000000000405ec8 in run_argv (argcp=0x7fffffffe1cc, argv=0x7fffffffe1c0)
at git.c:607
#9 0x0000000000406079 in cmd_main (argc=4, argv=0x7fffffffe310) at git.c:684
#10 0x00000000004a85c8 in main (argc=5, argv=0x7fffffffe308)
at common-main.c:43
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()
2018-02-07 15:08 ` SZEDER Gábor@ 2018-02-07 15:10 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-07 15:10 UTC (permalink / raw)
To: SZEDER Gábor
Cc: Git mailing list, Junio C Hamano, Jeff King, git, Stefan Beller,
Derrick Stolee
On 2/7/2018 10:08 AM, SZEDER Gábor wrote:
> On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee <stolee@gmail.com> wrote:
>> On 2/2/2018 10:32 AM, SZEDER Gábor wrote:
>>> In my git repo, with 9 pack files at the moment, i.e. not that big a
>>> repo and not that many pack files:
>>>
>>> $ time ./git commit-graph --write --update-head
>>> 4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>>>
>>> real 0m27.550s
>>> user 0m27.113s
>>> sys 0m0.376s
>>>
>>> In comparison, performing a good old revision walk to gather all the
>>> info that is written into the graph file:
>>>
>>> $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>>> 52954
>>>
>>> real 0m0.903s
>>> user 0m0.972s
>>> sys 0m0.058s
>>
>> Two reasons this is in here:
>>
>> (1) It's easier to get the write implemented this way and add the reachable
>> closure later (which I do).
>>
>> (2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
>> to the graph even if we do not have a ref that points to the commit yet. We
>> expect many commits to become reachable soon and having them in the graph
>> saves a lot of time in merge-base calculations.
>>
>> So, (1) is for patch simplicity, and (2) is why I want it to be an option in
>> the final version. See the --stdin-packs argument later for a way to do this
>> incrementally.
>>
>> I expect almost all users to use the reachable closure method with
>> --stdin-commits (and that's how I will integrate automatic updates with
>> 'fetch', 'repack', and 'gc' in a later patch).
> I see. I was about to ask about the expected use-cases of the
> '--stdin-packs' option, considering how much slower it is to enumerate
> all objects in pack files, but run out of time after patch 10.
>
> The run-time using '--stdin-commits' is indeed great:
>
> $ time { git for-each-ref --format='%(objectname)' refs/heads/ | ./git
> commit-graph --write --update-head --stdin-commits ; }
> 82fe9a5cd715ff578f01f7f44e0611d7902d20c8
>
> real 0m0.985s
> user 0m0.916s
> sys 0m0.024s
>
> Considering the run-time difference, I think in the end it would be a
> better default for a plain 'git commit-graph --write' to traverse
> history from all refs, and it should enumerate pack files only if
> explicitly told so via '--stdin-packs'.
>
> To be clear: I'm not saying that traversing history should already be
> the default when introducing construct_commit_graph() and '--write'. If
> enumerating pack files keeps the earlier patches simpler and easier to
> review, then by all means stick with it, and only change the
> '--stdin-*'-less behavior near the end of the series, when all the
> building blocks are already in place (but then mention this in the early
> commit messages).
I will consider this.
> I have also noticed a segfault when feeding non-commit object names to
> '--stdin-commits', i.e. when I run the above command without restricting
> 'git for-each-ref' to branches and it listed object names of tags as
> well.
>
> $ git rev-parse v2.16.1 |./git commit-graph --write --update-head
> --stdin-commits
> error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
> error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
> error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
> Segmentation fault
>
> (gdb) bt
> #0 __memcpy_avx_unaligned ()
> at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
> #1 0x00000000004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
> at csum-file.c:104
> #2 0x00000000004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
> commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
> #3 0x00000000004da9ca in construct_commit_graph (
> pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
> commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
> #4 0x000000000044184e in graph_write () at builtin/commit-graph.c:149
> #5 0x0000000000441a8c in cmd_commit_graph (argc=0, argv=0x7fffffffe310,
> prefix=0x0) at builtin/commit-graph.c:224
> #6 0x0000000000405a0a in run_builtin (p=0x8ad950 <commands+528>, argc=4,
> argv=0x7fffffffe310) at git.c:346
> #7 0x0000000000405ce4 in handle_builtin (argc=4, argv=0x7fffffffe310)
> at git.c:555
> #8 0x0000000000405ec8 in run_argv (argcp=0x7fffffffe1cc, argv=0x7fffffffe1c0)
> at git.c:607
> #9 0x0000000000406079 in cmd_main (argc=4, argv=0x7fffffffe310) at git.c:684
> #10 0x00000000004a85c8 in main (argc=5, argv=0x7fffffffe308)
> at common-main.c:43
I noticed this while preparing v3. I have a fix, but you now remind me
that I need to add tags to the test script.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v3 01/14] commit-graph: add format document
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
@ 2018-02-08 20:37 ` Derrick Stolee
2018-02-08 21:21 ` Junio C Hamano
2018-02-08 20:37 ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
` (13 subsequent siblings)14 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev
Add document specifying the binary format for commit graphs. This
format allows for:
* New versions.
* New hash functions and hash lengths.
* Optional extensions.
Basic header information is followed by a binary table of contents
into "chunks" that include:
* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.
The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph-format.txt | 91 +++++++++++++++++++++++++
1 file changed, 91 insertions(+)
create mode 100644 Documentation/technical/commit-graph-format.txt
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000000..349fa0c14c
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,91 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+ generation number 1; commits with parents have generation number
+ one more than the maximum generation number of its parents. We
+ reserve zero as special, and can be used to mark a generation
+ number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+ the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+ 4-byte signature:
+ The signature is: {'C', 'G', 'P', 'H'}
+
+ 1-byte version number:
+ Currently, the only valid version is 1.
+
+ 1-byte Object Id Version (1 = SHA-1)
+
+ 1-byte Object Id Length (H)
+
+ 1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+ (C + 1) * 12 bytes listing the table of contents for the chunks:
+ First 4 bytes describe chunk id. Value 0 is a terminating label.
+ Other 8 bytes provide offset in current file for chunk to start.
+ (Chunks are ordered contiguously in the file, so you can infer
+ the length using the next chunk position if necessary.)
+
+ The remaining data in the body is described one chunk at a time, and
+ these chunks may be given in any order. Chunks are required unless
+ otherwise specified.
+
+CHUNK DATA:
+
+ OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+ The ith entry, F[i], stores the number of OIDs with first
+ byte at most i. Thus F[255] stores the total
+ number of commits (N).
+
+ OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+ The OIDs for all commits in the graph, sorted in ascending order.
+
+ Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+ * The first H bytes are for the OID of the root tree.
+ * The next 8 bytes are for the int-ids of the first two parents
+ of the ith commit. Stores value 0xffffffff if no parent in that
+ position. If there are more than two parents, the second value
+ has its most-significant bit on and the other bits store an array
+ position into the Large Edge List chunk.
+ * The next 8 bytes store the generation number of the commit and
+ the commit time in seconds since EPOCH. The generation number
+ uses the higher 30 bits of the first 4 bytes, while the commit
+ time uses the 32 bits of the second 4 bytes, along with the lowest
+ 2 bits of the lowest byte, storing the 33rd and 34th bit of the
+ commit time.
+
+ Large Edge List (ID: {'E', 'D', 'G', 'E'})
+ This list of 4-byte values store the second through nth parents for
+ all octopus merges. The second parent value in the commit data stores
+ an array position within this list along with the most-significant bit
+ on. Starting at that array position, iterate through this list of int-ids
+ for the parents until reaching a value with the most-significant bit on.
+ The other bits correspond to the int-id of the last parent. This chunk
+ should always be present, but may be empty.
+
+TRAILER:
+
+ H-byte HASH-checksum of all of the above.
+
--
2.15.1.45.g9b7079f
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v3 02/14] graph: add commit graph design document
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
2018-02-08 20:37 ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
@ 2018-02-08 20:37 ` Derrick Stolee
2018-02-08 20:37 ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
` (12 subsequent siblings)14 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 20:37 UTC (permalink / raw)
To: git; +Cc: dstolee, git, gitster, peff, jonathantanmy, sbeller, szeder.dev
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph.txt | 189 +++++++++++++++++++++++++++++++
1 file changed, 189 insertions(+)
create mode 100644 Documentation/technical/commit-graph.txt
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000000..fc86b06041
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,189 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+next to packfiles either in the .git/objects/pack directory or in the pack
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+ the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N <= M, then A cannot reach B. That is, we know without searching
+ that B is not an ancestor of A because it is further from a root commit
+ than A.
+
+ Conversely, when checking if A is an ancestor of B, then we only need
+ to walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue seeded by
+ generation numbers, then we always expand the boundary commit with highest
+ generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit time X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the pack
+ directory. This could be stored in an alternate.
+
+- The most-recent graph file hash is stored in a 'graph-head' file for
+ immediate access and storing backup graphs. This could be stored in an
+ alternate, and refers to a 'graph-<hash>.graph' file in the same pack
+ directory.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object id length and hash
+ algorithm, so a future change of hash algorithm does not require a change
+ in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+ to seek into the single graph file. It is possible to extend the model
+ for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+ updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no '--verify' option for the 'commit-graph' builtin to verify the
+ contents of the graph file agree with the contents in the ODB.
+
+- When rewriting the graph, we do not check for a commit still existing
+ in the ODB, so garbage collection may remove commits.
+
+- Generation numbers are not computed in the current version. The file
+ format supports storing them, along with a mechanism to upgrade from
+ a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+ are not currently computed, so all generation numbers will be marked as
+ 0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+ remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+ walks aware of generation numbers to gain the performance benefits they
+ enable. This will mostly be accomplished by swapping a commit-date-ordered
+ priority queue with one ordered by generation number. The following
+ operations are important candidates:
+
+ - paint_down_to_common()
+ - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+ When writing a new graph, we could check that the ODB still contains
+ the commits and choose to remove the commits that are deleted from the
+ ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+ object for a commit. This passes through lookup_tree() and consequently
+ lookup_object(). Also, it calls lookup_commit() when loading the parents.
+ These method calls check the ODB for object existence, even if the
+ consumer does not need the content. For example, we do not need the
+ tree contents when computing merge bases. Now that commit parsing is
+ removed from the computation time, these lookup operations are the
+ slowest operations keeping graph walks from being fast. Consider
+ loading these objects without verifying their existence in the ODB and
+ only loading them fully when consumers need them. Consider a method
+ such as "ensure_tree_loaded(commit)" that fully loads a tree before
+ using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+ When this feature stabilizes enough to recommend to most users, we should
+ add automatic graph writes to common operations that create many commits.
+ For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
+ commands.
+
+- A server could provide a commit graph file as part of the network protocol
+ to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
+
--
2.15.1.45.g9b7079f
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 01/14] commit-graph: add format document
2018-02-08 20:37 ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
@ 2018-02-08 21:21 ` Junio C Hamano
2018-02-08 21:33 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 21:21 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
Derrick Stolee <stolee@gmail.com> writes:
> Add document specifying the binary format for commit graphs. This
> format allows for:
>
> * New versions.
> * New hash functions and hash lengths.
It still is unclear, at least to me, why OID and OID length are
stored as if they can be independent. If a reader does not
understand a new Object Id hash, is there anything the reader can
still do by knowing how long the hash (which it cannot recompute to
validate) is? And if a reader does know what OID hashing scheme is
used to refer to the objects, it certainly would know how long the
OIDs are.
Giving length may make sense only when a reader can treat these OIDs
as completely opaque identifiers, without having to (re)hash from
the contents, but if that is the case, then there is no point saying
what exact hash function is used to compute OID.
So I'd understand storing only either one or the other, but not
both. Am I missing something?
> +The Git commit graph stores a list of commit OIDs and some associated
> +metadata, including:
> +
> +- The generation number of the commit. Commits with no parents have
> + generation number 1; commits with parents have generation number
> + one more than the maximum generation number of its parents. We
> + reserve zero as special, and can be used to mark a generation
> + number invalid or as "not computed".
This "most natural" definition of generation number is stricter than
absolutely necessary (a looser definition that is sufficient is
"gennum of a child is larger than all of its parents'"). While I
personally think that is OK, some people who floated different ideas
in previous discussions on generation numbers may want to articulate
their ideas again. One idea that I found clever was to use the
total number of commits that are ancestors of a commit instead (it
is far more expensive to compute than the most natural gennum, but
doing so may help other topology math, like "describe").
> +CHUNK LOOKUP:
> +
> + (C + 1) * 12 bytes listing the table of contents for the chunks:
> + First 4 bytes describe chunk id. Value 0 is a terminating label.
> + Other 8 bytes provide offset in current file for chunk to start.
> + (Chunks are ordered contiguously in the file, so you can infer
> + the length using the next chunk position if necessary.)
Aren't chunks numbered contiguously, starting from #1, thereby
making it unnecessary to store the 4-byte?
How does a reader obtain the length of the last chunk? Ahh, that is
why there are C+1 entries in this table, not just C, so that the
reader knows where to stop while reading the last one. Does that
mean that this table looks like this?
{ 1, offset_1 },
{ 2, offset_2 },
...
{ C, offset_C },
{ 0, offset_C+1 },
where where (offset_N+1 - offset_N) gives the length of chunk #N?
> + The remaining data in the body is described one chunk at a time, and
> + these chunks may be given in any order. Chunks are required unless
> + otherwise specified.
> +
> +CHUNK DATA:
> +
> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> + The ith entry, F[i], stores the number of OIDs with first
> + byte at most i. Thus F[255] stores the total
> + number of commits (N).
> +
> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> + The OIDs for all commits in the graph, sorted in ascending order.
> +
> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> + * The first H bytes are for the OID of the root tree.
> + * The next 8 bytes are for the int-ids of the first two parents
> + of the ith commit. Stores value 0xffffffff if no parent in that
> + position. If there are more than two parents, the second value
> + has its most-significant bit on and the other bits store an array
> + position into the Large Edge List chunk.
> + * The next 8 bytes store the generation number of the commit and
> + the commit time in seconds since EPOCH. The generation number
> + uses the higher 30 bits of the first 4 bytes, while the commit
> + time uses the 32 bits of the second 4 bytes, along with the lowest
> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> + commit time.
> +
> + Large Edge List (ID: {'E', 'D', 'G', 'E'})
> + This list of 4-byte values store the second through nth parents for
> + all octopus merges. The second parent value in the commit data stores
> + an array position within this list along with the most-significant bit
> + on. Starting at that array position, iterate through this list of int-ids
> + for the parents until reaching a value with the most-significant bit on.
> + The other bits correspond to the int-id of the last parent. This chunk
> + should always be present, but may be empty.
I am not convinced about the value of these 4-byte section IDs.
They are useless unless you define what should happen when a reader
sees a block of data with an unknown ID; presence of these IDs given
an impression that you allow a reimplementation to reorder the
sections unnecessarily, especially when all of these are required
anyway, making the canonical reader implementation unnecessarily
complex, no?
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 01/14] commit-graph: add format document
2018-02-08 21:21 ` Junio C Hamano@ 2018-02-08 21:33 ` Derrick Stolee
2018-02-08 23:16 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 21:33 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
On 2/8/2018 4:21 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Add document specifying the binary format for commit graphs. This
>> format allows for:
>>
>> * New versions.
>> * New hash functions and hash lengths.
> It still is unclear, at least to me, why OID and OID length are
> stored as if they can be independent. If a reader does not
> understand a new Object Id hash, is there anything the reader can
> still do by knowing how long the hash (which it cannot recompute to
> validate) is? And if a reader does know what OID hashing scheme is
> used to refer to the objects, it certainly would know how long the
> OIDs are.
>
> Giving length may make sense only when a reader can treat these OIDs
> as completely opaque identifiers, without having to (re)hash from
> the contents, but if that is the case, then there is no point saying
> what exact hash function is used to compute OID.
>
> So I'd understand storing only either one or the other, but not
> both. Am I missing something?
You're right that this data is redundant. It is easy to describe the
width of the tables using the OID length, so it is convenient to have
that part of the format. Also, it is good to have 4-byte alignment here,
so we are not wasting space.
There isn't a strong reason to put that here, but I don't have a great
reason to remove it, either.
Perhaps leave a byte blank for possible future use?
>
>> +The Git commit graph stores a list of commit OIDs and some associated
>> +metadata, including:
>> +
>> +- The generation number of the commit. Commits with no parents have
>> + generation number 1; commits with parents have generation number
>> + one more than the maximum generation number of its parents. We
>> + reserve zero as special, and can be used to mark a generation
>> + number invalid or as "not computed".
> This "most natural" definition of generation number is stricter than
> absolutely necessary (a looser definition that is sufficient is
> "gennum of a child is larger than all of its parents'"). While I
> personally think that is OK, some people who floated different ideas
> in previous discussions on generation numbers may want to articulate
> their ideas again. One idea that I found clever was to use the
> total number of commits that are ancestors of a commit instead (it
> is far more expensive to compute than the most natural gennum, but
> doing so may help other topology math, like "describe").
It is more difficult to compute the number of reachable commits, since
you cannot learn that only by looking at the parents (you need to know
how many commits are in the intersection of their reachable sets for a
two-parent merge, or just walk all of the commits). This leads to a
quadratic computation to discover the value for N commits.
I define it this rigidly now because I will submit a patch soon after
this one lands that computes generation numbers and consumes them in
paint_down_to_common(). I've got it sitting in my local repo ready for a
rebase.
>
>> +CHUNK LOOKUP:
>> +
>> + (C + 1) * 12 bytes listing the table of contents for the chunks:
>> + First 4 bytes describe chunk id. Value 0 is a terminating label.
>> + Other 8 bytes provide offset in current file for chunk to start.
>> + (Chunks are ordered contiguously in the file, so you can infer
>> + the length using the next chunk position if necessary.)
> Aren't chunks numbered contiguously, starting from #1, thereby
> making it unnecessary to store the 4-byte?
>
> How does a reader obtain the length of the last chunk? Ahh, that is
> why there are C+1 entries in this table, not just C, so that the
> reader knows where to stop while reading the last one. Does that
> mean that this table looks like this?
>
> { 1, offset_1 },
> { 2, offset_2 },
> ...
> { C, offset_C },
> { 0, offset_C+1 },
>
> where where (offset_N+1 - offset_N) gives the length of chunk #N?
This is correct.
>
>> + The remaining data in the body is described one chunk at a time, and
>> + these chunks may be given in any order. Chunks are required unless
>> + otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> + The ith entry, F[i], stores the number of OIDs with first
>> + byte at most i. Thus F[255] stores the total
>> + number of commits (N).
>> +
>> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> + The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> + * The first H bytes are for the OID of the root tree.
>> + * The next 8 bytes are for the int-ids of the first two parents
>> + of the ith commit. Stores value 0xffffffff if no parent in that
>> + position. If there are more than two parents, the second value
>> + has its most-significant bit on and the other bits store an array
>> + position into the Large Edge List chunk.
>> + * The next 8 bytes store the generation number of the commit and
>> + the commit time in seconds since EPOCH. The generation number
>> + uses the higher 30 bits of the first 4 bytes, while the commit
>> + time uses the 32 bits of the second 4 bytes, along with the lowest
>> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> + commit time.
>> +
>> + Large Edge List (ID: {'E', 'D', 'G', 'E'})
>> + This list of 4-byte values store the second through nth parents for
>> + all octopus merges. The second parent value in the commit data stores
>> + an array position within this list along with the most-significant bit
>> + on. Starting at that array position, iterate through this list of int-ids
>> + for the parents until reaching a value with the most-significant bit on.
>> + The other bits correspond to the int-id of the last parent. This chunk
>> + should always be present, but may be empty.
> I am not convinced about the value of these 4-byte section IDs.
>
> They are useless unless you define what should happen when a reader
> sees a block of data with an unknown ID; presence of these IDs given
> an impression that you allow a reimplementation to reorder the
> sections unnecessarily, especially when all of these are required
> anyway, making the canonical reader implementation unnecessarily
> complex, no?
>
One reason to have chunks is to be simple: we have a clear table of
contents at the beginning of the file and can immediately navigate to
portions of the file we care about. The chunk order also doesn't matter
for the purpose of the format.
The true value they present is flexibility: We can extend the v1 format
to include extra metadata in a chunk and insert it anywhere in the
format. If there is some extra information that would be beneficial for
graphs (say, a de-duplicated author list) then we could extend the
format to be v1-compatible.
BUT: I do notice now that load_commit_graph_one() will die() if seeing a
chunk id it doesn't recognize, so that should be fixed if we keep this
chunk model. We should be able to read "v1.1" files if they have extra
chunks (but ignore that data).
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
2018-02-08 21:27 ` Junio C Hamano@ 2018-02-08 21:36 ` Derrick Stolee
2018-02-08 23:21 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-08 21:36 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
On 2/8/2018 4:27 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Teach git the 'commit-graph' builtin that will be used for writing and
>> reading packed graph files. The current implementation is mostly
>> empty, except for a '--pack-dir' option.
> Why do we want to use "pack" dir, when this is specifically designed
> not tied to packfile? .git/objects/pack/ certainly is a possibility
> in the sense that anywhere inside .git/objects/ would make sense,
> but using the "pack" dir smells like signalling a wrong message to
> users.
>
I wanted to have the smallest footprint as possible in the objects
directory, and the .git/objects directory currently only holds folders.
I suppose this feature, along with the multi-pack-index (MIDX), extends
the concept of the pack directory to be a "compressed data" directory,
but keeps the "pack" name to be compatible with earlier versions.
Another option is to create a .git/objects/graph directory instead, but
then we need to worry about that directory being present.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 01/14] commit-graph: add format document
2018-02-08 21:33 ` Derrick Stolee@ 2018-02-08 23:16 ` Junio C Hamano0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:16 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
Derrick Stolee <stolee@gmail.com> writes:
> You're right that this data is redundant. It is easy to describe the
> width of the tables using the OID length, so it is convenient to have
> that part of the format. Also, it is good to have 4-byte alignment
> here, so we are not wasting space.
>
> There isn't a strong reason to put that here, but I don't have a great
> reason to remove it, either.
Redundant information that can go out of sync is a great enough
reason not to have it in the first place.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin
2018-02-08 21:36 ` Derrick Stolee@ 2018-02-08 23:21 ` Junio C Hamano0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:21 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
Derrick Stolee <stolee@gmail.com> writes:
> I wanted to have the smallest footprint as possible in the objects
> directory, and the .git/objects directory currently only holds
> folders.
When we cull stale files from pack directory, we rely on the related
files to share pack-<hash>.* pattern. It is better not to contaminate
the directory with unrelated cruft.
As this is purely optional auxiliary information used for optimization,
perhaps .git/objects/info is a better place? I dunno.
In any case, even if its default position ends up in .git/objects/pack/,
if this thing conceptually does not have any ties with packs
(i.e. it is not a corruption if the graph file also described
topologies including loose objects, and it is not a corruption if
the graph file did not cover objects in all packs), then the end
user visible option name shouldn't say "--pack-dir". "--graph-dir"
that defaults to .git/objects/pack/ might be acceptable but it still
feels wrong.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read'
2018-02-08 20:37 ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
@ 2018-02-08 23:38 ` Junio C Hamano0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-08 23:38 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
Derrick Stolee <stolee@gmail.com> writes:
> +'read'::
> +
> +Read a graph file given by the graph-head file and output basic
> +details about the graph file.
"a graph file", assuming that there must be only one in the
specified place? Or if there are more than one, read all of them?
Or is it an error to have more than one?
Do not answer questions in a message that is a response to _me_;
the purpose of a review is not to educate reviewers---it is to
improve the patch.
> +With `--graph-hash=<hash>` option, consider the graph file
> +graph-<hash>.graph in the pack directory.
I think it is more in line with how plumbing works to just let the
full pathname be specifiable (e.g. learn from how pack-objects takes
"pack-" prefix from the command line, even though in practice names
of all packs you see in any repos start from "pack-").
> +struct commit_graph *load_commit_graph_one(const char *graph_file, const char *pack_dir)
This somehow smells like a screwed up API. It gets a filename to
read from that is directly passed to git_open(). Why does an
instance of graph has to know and remember the path to the directory
(i.e. pack_dir) that was given when it was constructed? "I am an
instance that holds commit topology learned from this object
database" is something it might want to know and remember, but "I am
told that I'll eventually be written back to there when I was
created" does not sound like a useful thing to have.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 07/14] commit-graph: update graph-head during write
2018-02-12 18:56 ` Junio C Hamano@ 2018-02-12 20:37 ` Junio C Hamano
2018-02-12 21:24 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-12 20:37 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
Junio C Hamano <gitster@pobox.com> writes:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> It is possible to have multiple commit graph files in a pack directory,
>> but only one is important at a time. Use a 'graph_head' file to point
>> to the important file. Teach git-commit-graph to write 'graph_head' upon
>> writing a new commit graph file.
>
> Why this design, instead of what "repack -a" would do, iow, if there
> always is a singleton that is the only one that matters, shouldn't
> the creation of that latest singleton just clear the older ones
> before it returns control?
Note that I am not complaining---I am just curious why we want to
expose this "there is one relevant one but we keep irrelevant ones
we usually do not look at and need to be garbage collected" to end
users, and also expect readers of the series, resulting code and
docs would have the same puzzled feeling.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v3 07/14] commit-graph: update graph-head during write
2018-02-12 20:37 ` Junio C Hamano@ 2018-02-12 21:24 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-12 21:24 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, dstolee, git, peff, jonathantanmy, sbeller, szeder.dev
On 2/12/2018 3:37 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> It is possible to have multiple commit graph files in a pack directory,
>>> but only one is important at a time. Use a 'graph_head' file to point
>>> to the important file. Teach git-commit-graph to write 'graph_head' upon
>>> writing a new commit graph file.
>> Why this design, instead of what "repack -a" would do, iow, if there
>> always is a singleton that is the only one that matters, shouldn't
>> the creation of that latest singleton just clear the older ones
>> before it returns control?
> Note that I am not complaining---I am just curious why we want to
> expose this "there is one relevant one but we keep irrelevant ones
> we usually do not look at and need to be garbage collected" to end
> users, and also expect readers of the series, resulting code and
> docs would have the same puzzled feeling.
>
Aside: I forgot to mention in my cover letter that the experience around
the "--delete-expired" flag for "git commit-graph write" is different
than v2. If specified, we delete all ".graph" files in the pack
directory other than the one referenced by "graph_head" at the beginning
of the process or the one written by the process. If these deletes fail,
then we ignore the failure (assuming that they are being used by another
Git process). In usual cases, we will delete these expired files in the
next instance. I believe this matches similar behavior in gc and repack.
-- Back to discussion about the value of "graph_head" --
The current design of using a pointer file (graph_head) is intended to
have these benefits:
1. We do not need to rely on a directory listing and mtimes to determine
which graph file to use.
2. If we write a new graph file while another git process is reading the
existing graph file, we can update the graph_head pointer without
deleting the file that is currently memory-mapped. (This is why we
cannot just rely on a canonical file name, such as "the_graph", to store
the data.)
3. We can atomically change the 'graph_head' file without interrupting
concurrent git processes. I think this is different from the "repack"
situation because a concurrent process would load all packfiles in the
pack directory and possibly have open handles when the repack is trying
to delete them.
4. We remain open to making the graph file incremental (as the MIDX
feature is designed to do; see [1]). It is less crucial to have an
incremental graph file structure (the graph file for the Windows
repository is currently ~120MB versus a MIDX file of 1.25 GB), but the
graph_head pattern makes this a possibility.
I tried to avoid item 1 due to personal taste, and since I am storing
the files in the objects/pack directory (so that listing may be very
large with a lot of wasted entries). This is less important with our
pending change of moving the graph files to a different directory. This
also satisfies items 2 and 3, as long as we never write graph files so
quickly that we have a collision on mtime.
I cannot think of another design that satisfies item 4.
As for your end user concerns: My philosophy with this feature is that
end users will never interact with the commit-graph builtin. 99% of
users will benefit from a repack or GC automatically computing a commit
graph (when we add that integration point). The other uses for the
builtin are for users who want extreme control over their data, such as
code servers and build agents.
Perhaps someone with experience managing large repositories with git in
a server environment could chime in with some design requirements they
would need.
Thanks,
-Stolee
[1]
https://public-inbox.org/git/20180107181459.222909-2-dstolee@microsoft.com/
[RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v4 01/13] commit-graph: add format document
2018-02-19 18:53 ` [PATCH v4 00/13] " Derrick Stolee
@ 2018-02-19 18:53 ` Derrick Stolee
2018-02-20 20:49 ` Junio C Hamano
` (2 more replies)
2018-02-19 18:53 ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
` (12 subsequent siblings)13 siblings, 3 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
To: git, git
Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee
Add document specifying the binary format for commit graphs. This
format allows for:
* New versions.
* New hash functions and hash lengths.
* Optional extensions.
Basic header information is followed by a binary table of contents
into "chunks" that include:
* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.
The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
1 file changed, 90 insertions(+)
create mode 100644 Documentation/technical/commit-graph-format.txt
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000..11b18b5
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,90 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+ generation number 1; commits with parents have generation number
+ one more than the maximum generation number of its parents. We
+ reserve zero as special, and can be used to mark a generation
+ number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+ the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+ 4-byte signature:
+ The signature is: {'C', 'G', 'P', 'H'}
+
+ 1-byte version number:
+ Currently, the only valid version is 1.
+
+ 1-byte Object Id Version (1 = SHA-1)
+
+ 1-byte number (C) of "chunks"
+
+ 1-byte (reserved for later use)
+
+CHUNK LOOKUP:
+
+ (C + 1) * 12 bytes listing the table of contents for the chunks:
+ First 4 bytes describe chunk id. Value 0 is a terminating label.
+ Other 8 bytes provide offset in current file for chunk to start.
+ (Chunks are ordered contiguously in the file, so you can infer
+ the length using the next chunk position if necessary.)
+
+ The remaining data in the body is described one chunk at a time, and
+ these chunks may be given in any order. Chunks are required unless
+ otherwise specified.
+
+CHUNK DATA:
+
+ OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+ The ith entry, F[i], stores the number of OIDs with first
+ byte at most i. Thus F[255] stores the total
+ number of commits (N).
+
+ OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+ The OIDs for all commits in the graph, sorted in ascending order.
+
+ Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+ * The first H bytes are for the OID of the root tree.
+ * The next 8 bytes are for the int-ids of the first two parents
+ of the ith commit. Stores value 0xffffffff if no parent in that
+ position. If there are more than two parents, the second value
+ has its most-significant bit on and the other bits store an array
+ position into the Large Edge List chunk.
+ * The next 8 bytes store the generation number of the commit and
+ the commit time in seconds since EPOCH. The generation number
+ uses the higher 30 bits of the first 4 bytes, while the commit
+ time uses the 32 bits of the second 4 bytes, along with the lowest
+ 2 bits of the lowest byte, storing the 33rd and 34th bit of the
+ commit time.
+
+ Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+ This list of 4-byte values store the second through nth parents for
+ all octopus merges. The second parent value in the commit data stores
+ an array position within this list along with the most-significant bit
+ on. Starting at that array position, iterate through this list of int-ids
+ for the parents until reaching a value with the most-significant bit on.
+ The other bits correspond to the int-id of the last parent.
+
+TRAILER:
+
+ H-byte HASH-checksum of all of the above.
+
--
2.7.4
^permalinkrawreply [flat|nested] 146+ messages in thread

*[PATCH v4 02/13] graph: add commit graph design document
2018-02-19 18:53 ` [PATCH v4 00/13] " Derrick Stolee
2018-02-19 18:53 ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
@ 2018-02-19 18:53 ` Derrick Stolee
2018-02-20 21:42 ` Junio C Hamano
2018-02-21 19:34 ` Stefan Beller
2018-02-19 18:53 ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
` (11 subsequent siblings)13 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-19 18:53 UTC (permalink / raw)
To: git, git
Cc: peff, jonathantanmy, szeder.dev, sbeller, gitster, Derrick Stolee
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph.txt | 185 +++++++++++++++++++++++++++++++
1 file changed, 185 insertions(+)
create mode 100644 Documentation/technical/commit-graph.txt
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000..e52ab23
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,185 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+either in the .git/objects/info directory or in the info directory of an
+alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+ the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N <= M, then A cannot reach B. That is, we know without searching
+ that B is not an ancestor of A because it is further from a root commit
+ than A.
+
+ Conversely, when checking if A is an ancestor of B, then we only need
+ to walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue seeded by
+ generation numbers, then we always expand the boundary commit with highest
+ generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit time X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<hash>.graph' in the
+ .git/objects/info directory. This could be stored in the info directory
+ of an alternate.
+
+- The latest graph file name is stored in a 'graph-latest' file next to
+ the graph files. This allows atomic swaps of latest graph files without
+ race conditions with concurrent processes.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+ so a future change of hash algorithm does not require a change in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+ to seek into the single graph file. It is possible to extend the model
+ for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'commit-graph' builtin. These are not
+ updated automatically during clone, fetch, repack, or creating new commits.
+
+- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
+ the contents of the graph file agree with the contents in the ODB.
+
+- Generation numbers are not computed in the current version. The file
+ format supports storing them, along with a mechanism to upgrade from
+ a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+ are not currently computed, so all generation numbers will be marked as
+ 0 (or "uncomputed"). A later patch will include this calculation.
+
+- The commit graph is currently incompatible with commit grafts. This can be
+ remedied by duplicating or refactoring the current graft logic.
+
+- After computing and storing generation numbers, we must make graph
+ walks aware of generation numbers to gain the performance benefits they
+ enable. This will mostly be accomplished by swapping a commit-date-ordered
+ priority queue with one ordered by generation number. The following
+ operations are important candidates:
+
+ - paint_down_to_common()
+ - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+ When writing a new graph, we could check that the ODB still contains
+ the commits and choose to remove the commits that are deleted from the
+ ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+ object for a commit. This passes through lookup_tree() and consequently
+ lookup_object(). Also, it calls lookup_commit() when loading the parents.
+ These method calls check the ODB for object existence, even if the
+ consumer does not need the content. For example, we do not need the
+ tree contents when computing merge bases. Now that commit parsing is
+ removed from the computation time, these lookup operations are the
+ slowest operations keeping graph walks from being fast. Consider
+ loading these objects without verifying their existence in the ODB and
+ only loading them fully when consumers need them. Consider a method
+ such as "ensure_tree_loaded(commit)" that fully loads a tree before
+ using commit->tree.
+
+- The current design uses the 'commit-graph' builtin to generate the graph.
+ When this feature stabilizes enough to recommend to most users, we should
+ add automatic graph writes to common operations that create many commits.
+ For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
+ commands.
+
+- A server could provide a commit graph file as part of the network protocol
+ to avoid extra calculations by clients.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
+
--
2.7.4
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-02-19 18:53 ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
@ 2018-02-20 20:49 ` Junio C Hamano
2018-02-21 19:23 ` Stefan Beller
2018-03-30 13:25 ` Jakub Narebski2 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 20:49 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
> 1 file changed, 90 insertions(+)
> create mode 100644 Documentation/technical/commit-graph-format.txt
Hopefully just a few remaining nits. Overall I find this written
really clearly.
> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.
We no longer have lengths stored.
> + ...
> + The remaining data in the body is described one chunk at a time, and
> + these chunks may be given in any order. Chunks are required unless
> + otherwise specified.
It is good that this explicitly says chunks can come in any order,
and which ones are required. It should also say which chunk can
appear multiple times. I think all four chunk types we currently
define can have at most one instance in a file.
> +CHUNK DATA:
> +
> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> + The ith entry, F[i], stores the number of OIDs with first
> + byte at most i. Thus F[255] stores the total
> + number of commits (N).
> +
> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> + The OIDs for all commits in the graph, sorted in ascending order.
Somewhere in this document, we probably would want to say that this
format allows at most (1<<31)-1 commits recorded in the file (as
CGET and EDGE uses 31-bit uint to index into this table, using MSB
for other purposes, and the all-1-bit pattern is also special), and
when we refer to "int-ids" of a commit, it is this 31-bit number
that is an index into this table.
> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> + * The first H bytes are for the OID of the root tree.
> + * The next 8 bytes are for the int-ids of the first two parents
> + of the ith commit. Stores value 0xffffffff if no parent in that
> + position. If there are more than two parents, the second value
> + has its most-significant bit on and the other bits store an array
> + position into the Large Edge List chunk.
> + * The next 8 bytes store the generation number of the commit and
> + the commit time in seconds since EPOCH. The generation number
> + uses the higher 30 bits of the first 4 bytes, while the commit
> + time uses the 32 bits of the second 4 bytes, along with the lowest
> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> + commit time.
> +
> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> + This list of 4-byte values store the second through nth parents for
> + all octopus merges. The second parent value in the commit data stores
> + an array position within this list along with the most-significant bit
> + on. Starting at that array position, iterate through this list of int-ids
> + for the parents until reaching a value with the most-significant bit on.
> + The other bits correspond to the int-id of the last parent.
> +
> +TRAILER:
> +
> + H-byte HASH-checksum of all of the above.
> +
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 02/13] graph: add commit graph design document
2018-02-19 18:53 ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
@ 2018-02-20 21:42 ` Junio C Hamano
2018-02-23 15:44 ` Derrick Stolee
2018-02-21 19:34 ` Stefan Beller1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-20 21:42 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> +2. Walking the entire graph to avoid topological order mistakes.
You have at least one more mention of "topological order mistakes"
below, but we commonly refer to this issue and blame it for "clock
skew". Using the word highlights that there is no "mistake" in topo
order algorithm and mistakes are in the commit timestamps.
> +In practice, we expect some commits to be created recently and not stored
> +in the commit graph. We can treat these commits as having "infinite"
> +generation number and walk until reaching commits with known generation
> +number.
Hmm, "pretend infinity" is an interesting approach---I need to think
about it a bit more if it is sufficient.
> +- .graph files are managed only by the 'commit-graph' builtin. These are not
> + updated automatically during clone, fetch, repack, or creating new commits.
OK. s/builtin/subcommand/; it does not make much difference if it
is a built-in or standalone command.
> +- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
> + the contents of the graph file agree with the contents in the ODB.
I am not entirely sure about the merit of going into this level of
detail. Being able to use only a single file looks like a more
fundamental design limitation, which deserves to be decribed in this
section, and we could ship the subsystem with that limitation.
But the lack of verify that can be called from fsck is merely the
matter of not the subsystem being mature enough (to be written,
reviewed and tested) and not a fundamental one, and we will not be
shipping the subsystem until that limitation is lifted.
So I'd guess that we prefer this bullet item to be in the commit log
message, not here, that describes the current status of the
development (as opposed to the state of the subsystem).
> +- Generation numbers are not computed in the current version. The file
> + format supports storing them, along with a mechanism to upgrade from
> + a file without generation numbers to one that uses them.
Exactly the same comment as above applies to this item.
> +- The commit graph is currently incompatible with commit grafts. This can be
> + remedied by duplicating or refactoring the current graft logic.
Hmm. Can it be lifted without first allowing us to use more than
one commit graph file (i.e. one for "traverse while honoring the
grafts", the other for "traverse while ignoring the grafts")?
> +- After computing and storing generation numbers, we must make graph
> + walks aware of generation numbers to gain the performance benefits they
> + enable. This will mostly be accomplished by swapping a commit-date-ordered
> + priority queue with one ordered by generation number. The following
> + operations are important candidates:
> +
> + - paint_down_to_common()
> + - 'log --topo-order'
Yes.
> +- The graph currently only adds commits to a previously existing graph.
> + When writing a new graph, we could check that the ODB still contains
> + the commits and choose to remove the commits that are deleted from the
> + ODB. For performance reasons, this check should remain optional.
The last sentence is somehow unconvincing. It probably is not
appropriate for the "Future Work" section to be making a hurried
design decision before having any working verification code to run
benchmark on.
> +- Currently, parse_commit_gently() requires filling in the root tree
> + object for a commit. This passes through lookup_tree() and consequently
> + lookup_object(). Also, it calls lookup_commit() when loading the parents.
> + These method calls check the ODB for object existence, even if the
> + consumer does not need the content. For example, we do not need the
> + tree contents when computing merge bases. Now that commit parsing is
> + removed from the computation time, these lookup operations are the
> + slowest operations keeping graph walks from being fast. Consider
> + loading these objects without verifying their existence in the ODB and
> + only loading them fully when consumers need them. Consider a method
> + such as "ensure_tree_loaded(commit)" that fully loads a tree before
> + using commit->tree.
Very good idea.
> +- The current design uses the 'commit-graph' builtin to generate the graph.
> + When this feature stabilizes enough to recommend to most users, we should
> + add automatic graph writes to common operations that create many commits.
> + For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
> + commands.
s/coulde/could/.
Also do not forget "fsck" that calls "verify". That is more urgent
than intergration with any other subcommand.
> +- A server could provide a commit graph file as part of the network protocol
> + to avoid extra calculations by clients.
We need to assess the riskiness and threat models regarding this, if
we really want to follow this "could" through. I would imagine that
the cost for verification is comparable to the cost for regenerating,
in which case it may not be worth doing this _unless_ the user opts
into it saying that the other side over the wire is trusted without
any reservation.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-02-19 18:53 ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
2018-02-20 20:49 ` Junio C Hamano@ 2018-02-21 19:23 ` Stefan Beller
2018-02-21 19:45 ` Derrick Stolee
2018-03-30 13:25 ` Jakub Narebski2 siblings, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-02-21 19:23 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
Junio C Hamano, Derrick Stolee
On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
> Add document specifying the binary format for commit graphs. This
> format allows for:
>
> * New versions.
> * New hash functions and hash lengths.
> * Optional extensions.
>
> Basic header information is followed by a binary table of contents
> into "chunks" that include:
>
> * An ordered list of commit object IDs.
> * A 256-entry fanout into that list of OIDs.
> * A list of metadata for the commits.
> * A list of "large edges" to enable octopus merges.
>
> The format automatically includes two parent positions for every
> commit. This favors speed over space, since using only one position
> per commit would cause an extra level of indirection for every merge
> commit. (Octopus merges suffer from this indirection, but they are
> very rare.)
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/technical/commit-graph-format.txt | 90 +++++++++++++++++++++++++
> 1 file changed, 90 insertions(+)
> create mode 100644 Documentation/technical/commit-graph-format.txt
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> new file mode 100644
> index 0000000..11b18b5
> --- /dev/null
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -0,0 +1,90 @@
> +Git commit graph format
> +=======================
> +
> +The Git commit graph stores a list of commit OIDs and some associated
> +metadata, including:
> +
> +- The generation number of the commit. Commits with no parents have
> + generation number 1; commits with parents have generation number
> + one more than the maximum generation number of its parents. We
> + reserve zero as special, and can be used to mark a generation
> + number invalid or as "not computed".
> +
> +- The root tree OID.
> +
> +- The commit date.
> +
> +- The parents of the commit, stored using positional references within
> + the graph file.
> +
> +== graph-*.graph files have the following format:
> +
> +In order to allow extensions that add extra data to the graph, we organize
> +the body into "chunks" and provide a binary lookup table at the beginning
> +of the body. The header includes certain values, such as number of chunks,
> +hash lengths and types.
> +
> +All 4-byte numbers are in network order.
> +
> +HEADER:
> +
> + 4-byte signature:
> + The signature is: {'C', 'G', 'P', 'H'}
> +
> + 1-byte version number:
> + Currently, the only valid version is 1.
> +
> + 1-byte Object Id Version (1 = SHA-1)
> +
> + 1-byte number (C) of "chunks"
> +
> + 1-byte (reserved for later use)
What should clients of today do with it?
* ignore it completely [as they have no idea what it is] or
* throw hands up in the air if it is anything other than 0 ?
[because clearly we will increment the version
or have new information in a new chunk instead of just sneaking
in information here?]
> +CHUNK LOOKUP:
> +
> + (C + 1) * 12 bytes listing the table of contents for the chunks:
> + First 4 bytes describe chunk id. Value 0 is a terminating label.
> + Other 8 bytes provide offset in current file for chunk to start.
offset [in bytes? I could imagine having a larger granularity here,
because chunks don't sound small.]
> + (Chunks are ordered contiguously in the file, so you can infer
> + the length using the next chunk position if necessary.)
> +
> + The remaining data in the body is described one chunk at a time, and
> + these chunks may be given in any order. Chunks are required unless
> + otherwise specified.
> +
> +CHUNK DATA:
> +
> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> + The ith entry, F[i], stores the number of OIDs with first
> + byte at most i. Thus F[255] stores the total
> + number of commits (N).
[ so in small repos, where there are fewer than 256 objects,
F[i] == F[i+1], for all i'th where there is no object starting with i byte]
> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> + The OIDs for all commits in the graph, sorted in ascending order.
> +
> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> + * The first H bytes are for the OID of the root tree.
> + * The next 8 bytes are for the int-ids of the first two parents
> + of the ith commit. Stores value 0xffffffff if no parent in that
> + position. If there are more than two parents, the second value
> + has its most-significant bit on and the other bits store an array
> + position into the Large Edge List chunk.
> + * The next 8 bytes store the generation number of the commit and
> + the commit time in seconds since EPOCH. The generation number
> + uses the higher 30 bits of the first 4 bytes, while the commit
> + time uses the 32 bits of the second 4 bytes, along with the lowest
> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> + commit time.
> +
> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> + This list of 4-byte values store the second through nth parents for
> + all octopus merges. The second parent value in the commit data stores
> + an array position within this list along with the most-significant bit
> + on. Starting at that array position, iterate through this list of int-ids
> + for the parents until reaching a value with the most-significant bit on.
> + The other bits correspond to the int-id of the last parent.
> +
> +TRAILER:
> +
> + H-byte HASH-checksum of all of the above.
> +
> --
> 2.7.4
Makes sense so far, I'll read on.
I agree with Junio, that I could read this documentation without
the urge to point out nits. :)
Thanks,
Stefan
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write'
2018-02-19 18:53 ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
@ 2018-02-21 19:25 ` Junio C Hamano0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-21 19:25 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> +static int graph_write(int argc, const char **argv)
> +{
> + ...
> + graph_name = write_commit_graph(opts.obj_dir);
> +
> + if (graph_name) {
> + printf("%s\n", graph_name);
> + FREE_AND_NULL(graph_name);
> + }
> +
> + return 0;
> +}
After successfully writing a graph file out, write_commit_graph()
signals that fact by returning a non-NULL pointer, so that this
caller can report the filename to the end user. This caller
protects itself from a NULL return, presumably because the callee
uses it to signal an error when writing the graph file out?
Is it OK to lose that 1-bit of information, or should we have more like
if (graph_name) {
printf;
return 0;
} else {
return -1;
}
> int cmd_commit_graph(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_commit_graph_options[] = {
> - { OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
> + { OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
> N_("dir"),
> N_("The object directory to store the graph") },
> OPT_END(),
The same comment for a no-op patch from an earlier step applies
here, and we have another one that we saw above in graph_write().
> @@ -31,6 +67,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
> builtin_commit_graph_usage,
> PARSE_OPT_STOP_AT_NON_OPTION);
>
> + if (argc > 0) {
> + if (!strcmp(argv[0], "write"))
> + return graph_write(argc, argv);
And if we fix "graph_write" to report an error with negative return,
this needs to become something like
return !!graph_write(argc, argv);
as we do not want to return a negative value to be passed via
run_builtin() to exit(3) in handle_builtin().
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> new file mode 100755
> index 0000000..6a5e93c
> --- /dev/null
> +++ b/t/t5318-commit-graph.sh
> @@ -0,0 +1,119 @@
> +#!/bin/sh
> +
> +test_description='commit graph'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup full repo' '
> + rm -rf .git &&
I am perfectly OK with creating a separate subdirectory called
'full' in the trash directory given by the test framework, but
unless absolutely necessary I'd rather not to see "rm -rf",
especially on ".git", in our test scripts. People can screw up
doing various things (like copying and pasting).
> + mkdir full &&
> + cd full &&
> + git init &&
> + objdir=".git/objects"
> +'
And I absolutely do not want to see "cd full" that leaves and stays
in the subdirectory after this step is done.
Imagine what happens if any earlier step fails before doing "cd
full", causing this "setup full" step to report failure, and then
the test goes on to the next step? We will not be in "full" and
worse yet because we do not have "$TRASH_DIRECTORY/.git" (you
removed it), the "git commit-graph write --object-dir" command we
end up doing next will see the git source repository as the
repository it is working on. Never risk trashing our source
repository with your test. That is why we give you $TRASH_DIRECTORY
to play in. Make use of it when you can.
I'd make this step just a single
git init full
and then the next one
git -C full commit-graph write --object-dir .
In later tests that have multi-step things, I'd instead make them
(
cd full &&
... whatever you do &&
... in that separate &&
... 'full' repository
)
if I were writing this test *and* if I really wanted to do things
inside $TRASH_DIRECTORY/full/.git repository. I am not convinced
yet about the latter. I know that it will make certain things
simpler to use a separate /full hierarchy (e.g. cleaning up, having
another unrelated test repository, etc.) while making other things
more cumbersome (e.g. you need to be careful when you "cd" and the
easiest way to do so is to ( do things in a subshell )). I just do
not know what the trade-off would look like in this particular case.
A simple rule of thumb I try to follow is not to change $(pwd) for
the process that runs these test_expect_success shell functions.
> +
> +test_expect_success 'write graph with no packs' '
> + git commit-graph write --object-dir .
> +'
> +
> +test_expect_success 'create commits and repack' '
> + for i in $(test_seq 3)
> + do
> + test_commit $i &&
> + git branch commits/$i
> + done &&
> + git repack
> +'
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-02-21 19:23 ` Stefan Beller@ 2018-02-21 19:45 ` Derrick Stolee
2018-02-21 19:48 ` Stefan Beller0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-21 19:45 UTC (permalink / raw)
To: Stefan Beller
Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
Junio C Hamano, Derrick Stolee
On 2/21/2018 2:23 PM, Stefan Beller wrote:
> On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>> +In order to allow extensions that add extra data to the graph, we organize
>> +the body into "chunks" and provide a binary lookup table at the beginning
>> +of the body. The header includes certain values, such as number of chunks,
>> +hash lengths and types.
>> +
>> +All 4-byte numbers are in network order.
>> +
>> +HEADER:
>> +
>> + 4-byte signature:
>> + The signature is: {'C', 'G', 'P', 'H'}
>> +
>> + 1-byte version number:
>> + Currently, the only valid version is 1.
>> +
>> + 1-byte Object Id Version (1 = SHA-1)
>> +
>> + 1-byte number (C) of "chunks"
>> +
>> + 1-byte (reserved for later use)
> What should clients of today do with it?
> * ignore it completely [as they have no idea what it is] or
> * throw hands up in the air if it is anything other than 0 ?
> [because clearly we will increment the version
> or have new information in a new chunk instead of just sneaking
> in information here?]
They should ignore it completely, which will allow using the value for
something meaningful later without causing a version change (which we DO
die() for). A user could downgrade from a version that uses this byte
for something meaningful and not require a new commit-graph file.
The "commit-graph read" subcommand does output this byte, so we can
verify that the "write" subcommand places a 0 in this position.
>
>> +CHUNK LOOKUP:
>> +
>> + (C + 1) * 12 bytes listing the table of contents for the chunks:
>> + First 4 bytes describe chunk id. Value 0 is a terminating label.
>> + Other 8 bytes provide offset in current file for chunk to start.
> offset [in bytes? I could imagine having a larger granularity here,
> because chunks don't sound small.]
It is good to specify "offset in bytes".
>
>> + (Chunks are ordered contiguously in the file, so you can infer
>> + the length using the next chunk position if necessary.)
>> +
>> + The remaining data in the body is described one chunk at a time, and
>> + these chunks may be given in any order. Chunks are required unless
>> + otherwise specified.
>> +
>> +CHUNK DATA:
>> +
>> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> + The ith entry, F[i], stores the number of OIDs with first
>> + byte at most i. Thus F[255] stores the total
>> + number of commits (N).
> [ so in small repos, where there are fewer than 256 objects,
> F[i] == F[i+1], for all i'th where there is no object starting with i byte]
Correct. I'm not sure this additional information is valuable for the
document, though.
>
>> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> + The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> + * The first H bytes are for the OID of the root tree.
>> + * The next 8 bytes are for the int-ids of the first two parents
>> + of the ith commit. Stores value 0xffffffff if no parent in that
>> + position. If there are more than two parents, the second value
>> + has its most-significant bit on and the other bits store an array
>> + position into the Large Edge List chunk.
>> + * The next 8 bytes store the generation number of the commit and
>> + the commit time in seconds since EPOCH. The generation number
>> + uses the higher 30 bits of the first 4 bytes, while the commit
>> + time uses the 32 bits of the second 4 bytes, along with the lowest
>> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> + commit time.
>> +
>> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>> + This list of 4-byte values store the second through nth parents for
>> + all octopus merges. The second parent value in the commit data stores
>> + an array position within this list along with the most-significant bit
>> + on. Starting at that array position, iterate through this list of int-ids
>> + for the parents until reaching a value with the most-significant bit on.
>> + The other bits correspond to the int-id of the last parent.
>> +
>> +TRAILER:
>> +
>> + H-byte HASH-checksum of all of the above.
>> +
>> --
>> 2.7.4
> Makes sense so far, I'll read on.
> I agree with Junio, that I could read this documentation without
> the urge to point out nits. :)
>
> Thanks,
> Stefan
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 06/13] commit-graph: implement git commit-graph read
2018-02-19 18:53 ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
@ 2018-02-21 20:11 ` Junio C Hamano
2018-02-22 18:25 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-21 20:11 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> +'read'::
> +
> +Read a graph file given by the graph-head file and output basic
> +details about the graph file.
> ++
> +With `--file=<name>` option, consider the graph stored in the file at
> +the path <object-dir>/info/<name>.
> +
A sample reader confusion after reading the above twice:
What is "the graph-head file" and how does the user specify it? Is
it given by the value for the "--file=<name>" command line option?
Another sample reader reaction after reading the above:
What are the kind of "basic details" we can learn from this
command is unclear, but perhaps there is an example to help me
decide if this command is worth studying.
> @@ -44,6 +53,12 @@ EXAMPLES
> $ git commit-graph write
> ------------------------------------------------
>
> +* Read basic information from a graph file.
> ++
> +------------------------------------------------
> +$ git commit-graph read --file=<name>
> +------------------------------------------------
> +
And the sample reader is utterly disappointed at this point.
> +static int graph_read(int argc, const char **argv)
> +{
> + struct commit_graph *graph = 0;
> + struct strbuf full_path = STRBUF_INIT;
> +
> + static struct option builtin_commit_graph_read_options[] = {
> + { OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
> + N_("dir"),
> + N_("The object directory to store the graph") },
> + { OPTION_STRING, 'H', "file", &opts.graph_file,
> + N_("file"),
> + N_("The filename for a specific commit graph file in the object directory."),
> + PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
> + OPT_END(),
> + };
The same comment as all the previous ones apply, wrt short options
and non-use of OPT_STRING().
Also, I suspect that these two would want to use OPT_FILENAME
instead, if we anticipate that the command might want to be
sometimes run from a subdirectory. Otherwise wouldn't
cd t && git commit-graph read --file=../.git/object/info/$whatever
end up referring to a wrong place because the code that uses the
value obtained from OPTION_STRING does not do the equivalent of
parse-options.c::fix_filename()? The same applies to object-dir
handling.
> + argc = parse_options(argc, argv, NULL,
> + builtin_commit_graph_read_options,
> + builtin_commit_graph_read_usage, 0);
> +
> + if (!opts.obj_dir)
> + opts.obj_dir = get_object_directory();
> +
> + if (!opts.graph_file)
> + die("no graph hash specified");
> +
> + strbuf_addf(&full_path, "%s/info/%s", opts.obj_dir, opts.graph_file);
Ahh, I was fooled by a misnamed option. --file does *not* name the
file. It is a filename in a fixed place that is determined by other
things.
So it would be a mistake to use OPT_FILENAME() in the parser for
that misnamed "--file" option. The parser for --object-dir still
would want to be OPT_FILENAME(), but quite honestly, I do not see
the point of having --object-dir option in the first place. The
graph file is not relative to it but is forced to have /info/ in
between that directory and the filename, so it is not like the user
gets useful flexibility out of being able to specify two different
places using --object-dir= option and $GIT_OBJECT_DIRECTORY
environment (iow, a caller that wants to work on a specific object
directory can use the environment, which is how it would tell any
other Git subcommand which object store it wants to work with).
But stepping back a bit, I think the way --file argument is defined
is halfway off from two possible more useful ways to define it. If
it were just "path to the file" (iow, what OPT_FILENAME() is suited
for parsing it), then a user could say "I have this graph file that
I created for testing, it is not installed in its usual place in
$GIT_OBJECT_DIRECTORY/info/ at all, but I want you to read it
because I am debugging". That is one possible useful extreme. The
other possibility would be to allow *only* the hash part to be
specified, iow, not just forcing /info/ relative to object
directory, you would force the "graph-" prefix and ".graph" suffix.
That would be the other extreme that is useful (less typing and less
error prone).
For a low-level command line this, my gut feeling is that it would
be better to allow paths to the object directory and the graph file
to be totally independently specified.
> + if (graph_signature != GRAPH_SIGNATURE) {
> + munmap(graph_map, graph_size);
> + close(fd);
> + die("graph signature %X does not match signature %X",
> + graph_signature, GRAPH_SIGNATURE);
> + }
> +
> + graph_version = *(unsigned char*)(data + 4);
> + if (graph_version != GRAPH_VERSION) {
> + munmap(graph_map, graph_size);
> + close(fd);
> + die("graph version %X does not match version %X",
> + graph_version, GRAPH_VERSION);
> + }
> +
> + hash_version = *(unsigned char*)(data + 5);
> + if (hash_version != GRAPH_OID_VERSION) {
> + munmap(graph_map, graph_size);
> + close(fd);
> + die("hash version %X does not match version %X",
> + hash_version, GRAPH_OID_VERSION);
It becomes a bit tiring to see munmap/close/die pattern repreated
over and over again, doesn't it? Can we make it simpler, perhaps by
letting die() take care of the clean-up? After all, if the very
next step dies because alloc_commit_graph() got NULL in xmalloc(),
we are letting die() there take care of the clean-up anyway already,
and die() in the chunk parsing look has no such cleanup, either.
Of course, when we later want to libify this part of the code, then
we wouldn't be calling die() from this codepath, but the change
required to do so will not be just s/die/error/; it would be more
like
if (x_version != X_VERSION) {
error("X version %X does not match",...);
goto cleanup_fail;
}
with munmap/close done at the jumped-to label.
> + }
> +
> + graph = alloc_commit_graph();
> +
> + graph->hash_len = GRAPH_OID_LEN;
> + ...
> + if (chunk_offset > graph_size - GIT_MAX_RAWSZ)
> + die("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32),
> + (uint32_t)chunk_offset);
> +
> + switch (chunk_id) {
> + case GRAPH_CHUNKID_OIDFANOUT:
> + graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
> + break;
This is over-indented from our point of view. In our codebase, case
arms aling with switch, i.e.
switch (chunk_id) {
case GRAPH_CHUNKID_OIDFANOUT:
graph->chunk_oid_fanout = ...;
break;
When the input file has GRAPH_CHUNKID_OIDFANOUT twice, I think it
should be flagged as a corrupt/malformed input file, causing the
reader to reject it. It is plausible that you wanted to make it
"the last one wins", but even if that is the case, I think the user
should at least get a warning, as (I'd imagine) it is an unusual
condition.
The same applies to multiple instances of any currently-defined
chunk types.
> +graph_read_expect() {
> + OPTIONAL=""
> + NUM_CHUNKS=3
> + if [ ! -z $2 ]
We use "test" and do not use "[ ... ]" or "[[ ... ]]".
I'll stop here.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-19 18:53 ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
2018-02-21 21:34 ` Stefan Beller@ 2018-02-22 18:48 ` Junio C Hamano
2018-02-23 17:59 ` Derrick Stolee1 sibling, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-22 18:48 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> Teach git-commit-graph to delete the .graph files that are siblings of a
> newly-written graph file, except for the file referenced by 'graph-latest'
> at the beginning of the process and the newly-written file. If we fail to
> delete a graph file, only report a warning because another git process may
> be using that file. In a multi-process environment, we expect the previoius
> graph file to be used by a concurrent process, so we do not delete it to
> avoid race conditions.
I do not understand the later part of the above. On some operating
systems, you actually can remove a file that is open by another
process without any ill effect. There are systems that do not allow
removing a file that is in use, and an attempt to unlink it may
fail. The need to handle such a failure gracefully is not limited
to the case of removing a commit graph file---we need to deal with
it when removing file of _any_ type.
Especially the last sentence "we do not delete it to avoid race
conditions" I find problematic. If a system does not allow removing
a file in use and we detect a failure after an attempt to do so, it
is not "we do not delete it" --- even if you do, you won't succeed
anyway, so there is no point saying that. And on systems that do
allow safe removal of a file in use (i.e. they allow an open file to
be used by processes that have open filehandles to it after its
removal), there is no point refraining to delete it "to avoid race
conditions", either---in fact it is unlikely that you would even know
somebody else had it open and was using it.
In any case, I do not think '--delete-expired' option that can be
given only when you are writing makes much sense as an API. An
'expire' command, just like 'set-latest' command, that is a separate
command from 'write', may make sense, though.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 02/13] graph: add commit graph design document
2018-02-20 21:42 ` Junio C Hamano@ 2018-02-23 15:44 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 15:44 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/20/2018 4:42 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +2. Walking the entire graph to avoid topological order mistakes.
> You have at least one more mention of "topological order mistakes"
> below, but we commonly refer to this issue and blame it for "clock
> skew". Using the word highlights that there is no "mistake" in topo
> order algorithm and mistakes are in the commit timestamps.
I'll drop the word "mistakes" and instead here say:
2. Walking the entire graph to satisfy topological order constraints.
and later say
This heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
>
>> +In practice, we expect some commits to be created recently and not stored
>> +in the commit graph. We can treat these commits as having "infinite"
>> +generation number and walk until reaching commits with known generation
>> +number.
> Hmm, "pretend infinity" is an interesting approach---I need to think
> about it a bit more if it is sufficient.
Since we require the commit graph file to be closed under reachability,
the commits reachable from the file all have "finite" generation number.
>
>> +- .graph files are managed only by the 'commit-graph' builtin. These are not
>> + updated automatically during clone, fetch, repack, or creating new commits.
> OK. s/builtin/subcommand/; it does not make much difference if it
> is a built-in or standalone command.
>
>> +- There is no 'verify' subcommand for the 'commit-graph' builtin to verify
>> + the contents of the graph file agree with the contents in the ODB.
> I am not entirely sure about the merit of going into this level of
> detail. Being able to use only a single file looks like a more
> fundamental design limitation, which deserves to be decribed in this
> section, and we could ship the subsystem with that limitation.
>
> But the lack of verify that can be called from fsck is merely the
> matter of not the subsystem being mature enough (to be written,
> reviewed and tested) and not a fundamental one, and we will not be
> shipping the subsystem until that limitation is lifted.
>
> So I'd guess that we prefer this bullet item to be in the commit log
> message, not here, that describes the current status of the
> development (as opposed to the state of the subsystem).
I was treating this design document as a living document that will be
updated as the feature matures. It is difficult to time when to discuss
these limitations, since in this commit the graph feature is not
implemented at all. But, it is important to have _some_ design document
before continuing to implement the feature.
I can remove this bullet, but I'm not sure which commit message would be
appropriate to contain that information.
I do intend to remove these limitations and future work bullets as they
are implemented in later patches.
>
>> +- Generation numbers are not computed in the current version. The file
>> + format supports storing them, along with a mechanism to upgrade from
>> + a file without generation numbers to one that uses them.
> Exactly the same comment as above applies to this item.
>
>> +- The commit graph is currently incompatible with commit grafts. This can be
>> + remedied by duplicating or refactoring the current graft logic.
> Hmm. Can it be lifted without first allowing us to use more than
> one commit graph file (i.e. one for "traverse while honoring the
> grafts", the other for "traverse while ignoring the grafts")?
I consider this list unordered, but will move this bullet to the top and
replace its first sentence with:
The commit graph feature currently does not honor commit grafts.
>
>> +- After computing and storing generation numbers, we must make graph
>> + walks aware of generation numbers to gain the performance benefits they
>> + enable. This will mostly be accomplished by swapping a commit-date-ordered
>> + priority queue with one ordered by generation number. The following
>> + operations are important candidates:
>> +
>> + - paint_down_to_common()
>> + - 'log --topo-order'
> Yes.
>
>> +- The graph currently only adds commits to a previously existing graph.
>> + When writing a new graph, we could check that the ODB still contains
>> + the commits and choose to remove the commits that are deleted from the
>> + ODB. For performance reasons, this check should remain optional.
> The last sentence is somehow unconvincing. It probably is not
> appropriate for the "Future Work" section to be making a hurried
> design decision before having any working verification code to run
> benchmark on.
I'll remove this entire block, since it is not relevant starting at v4.
I dropped this "additive only" step in v4 and forgot to remove the bullet.
>
>> +- Currently, parse_commit_gently() requires filling in the root tree
>> + object for a commit. This passes through lookup_tree() and consequently
>> + lookup_object(). Also, it calls lookup_commit() when loading the parents.
>> + These method calls check the ODB for object existence, even if the
>> + consumer does not need the content. For example, we do not need the
>> + tree contents when computing merge bases. Now that commit parsing is
>> + removed from the computation time, these lookup operations are the
>> + slowest operations keeping graph walks from being fast. Consider
>> + loading these objects without verifying their existence in the ODB and
>> + only loading them fully when consumers need them. Consider a method
>> + such as "ensure_tree_loaded(commit)" that fully loads a tree before
>> + using commit->tree.
> Very good idea.
I will likely submit an orthogonal patch that does this, as it will save
time even without the commit graph. The time spent in 'lookup_tree()' is
less significant when the majority of the time is spent parsing commits,
but it is still 1-2% in some cases.
>
>> +- The current design uses the 'commit-graph' builtin to generate the graph.
>> + When this feature stabilizes enough to recommend to most users, we should
>> + add automatic graph writes to common operations that create many commits.
>> + For example, one coulde compute a graph on 'clone', 'fetch', or 'repack'
>> + commands.
> s/coulde/could/.
>
> Also do not forget "fsck" that calls "verify". That is more urgent
> than intergration with any other subcommand.
Noted.
>
>> +- A server could provide a commit graph file as part of the network protocol
>> + to avoid extra calculations by clients.
> We need to assess the riskiness and threat models regarding this, if
> we really want to follow this "could" through. I would imagine that
> the cost for verification is comparable to the cost for regenerating,
> in which case it may not be worth doing this _unless_ the user opts
> into it saying that the other side over the wire is trusted without
> any reservation.
I agree. There is a certain level of trust that is required here.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 03/13] commit-graph: create git-commit-graph builtin
2018-02-21 18:58 ` Junio C Hamano@ 2018-02-23 16:07 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 16:07 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/21/2018 1:58 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> +int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>>> +{
>>> + static struct option builtin_commit_graph_options[] = {
>>> + { OPTION_STRING, 'p', "object-dir", &opts.obj_dir,
>>> + N_("dir"),
>>> + N_("The object directory to store the graph") },
>> I have a suspicion that this was modeled after some other built-in
>> that has a similar issue (perhaps written long time ago), but isn't
>> OPT_STRING() sufficient to define this element these days?
>>
>> Or am I missing something?
You are not. There are several places in this history of this patch
where I was using old patterns because I was using old code as my model
(places like 'index-pack').
>> Why squat on short-and-sweet "-p"? For that matter, since this is
>> not expected to be end-user facing command anyway, I suspect that we
>> do not want to allocate a single letter option from day one, which
>> paints ourselves into a corner from where we cannot escape.
I'll drop all single-letter shortcuts.
> I suspect that exactly the same comment applies to patches in this
> series that add other subcommands (I just saw one in the patch for
> adding 'write').
>
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
2018-02-20 22:57 ` Junio C Hamano@ 2018-02-23 17:23 ` Derrick Stolee
2018-02-23 19:30 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:23 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/20/2018 5:57 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +#define GRAPH_OID_VERSION_SHA1 1
>> +#define GRAPH_OID_LEN_SHA1 20
> This hardcoded 20 on the right hand side of this #define is probably
> problematic. Unless you are planning to possibly store truncated
> hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should
> always be the same as GIT_$HASH_RAWSZ, I would think. IOW
>
> #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ
>
> perhaps?
Yes.
>
>> +static void write_graph_chunk_fanout(struct sha1file *f,
>> + struct commit **commits,
>> + int nr_commits)
>> +{
>> + uint32_t i, count = 0;
>> + struct commit **list = commits;
>> + struct commit **last = commits + nr_commits;
>> +
>> + /*
>> + * Write the first-level table (the list is sorted,
>> + * but we use a 256-entry lookup to be able to avoid
>> + * having to do eight extra binary search iterations).
>> + */
>> + for (i = 0; i < 256; i++) {
>> + while (list < last) {
>> + if ((*list)->object.oid.hash[0] != i)
>> + break;
>> + count++;
>> + list++;
>> + }
> If count and list are always incremented in unison, perhaps you do
> not need an extra variable "last". If typeof(nr_commits) is wider
> than typeof(count), this loop and the next write-be32 is screwed
> anyway ;-)
>
> This comment probably applies equally to some other uses of the same
> "compute last pointer to compare with running pointer for
> termination" pattern in this patch.
Yes. Also turning i and count into int to match nr_commits.
>
>> + sha1write_be32(f, count);
>> + }
>> +}
>> +static int commit_pos(struct commit **commits, int nr_commits,
>> + const struct object_id *oid, uint32_t *pos)
>> +{
> It is a bit unusual to see something_pos() that returns an integer
> that does *NOT* return the position as its return value. Dropping
> the *pos parameter, and returning "mid" when commits[mid] is what we
> wanted to see, and otherwise returning "-1 - first" to signal the
> position at which we _would_ have found the object, if it were in
> the table, would make it more consistent with the usual convention.
I can make this change. I found the boolean return to make the
consumer's logic simpler, but it isn't that much simpler.
> Don't we even have such a generalized binary search helper already
> somewhere in the system?
jt/binsearch-with-fanout introduces one when there is a 256-entry fanout
table (not the case here).
The bsearch() method in search.h (and used in
pack-write.c:need_large_offset) does not return the _position_ of a
found element.
Neither of these suit my needs, but I could just be searching for the
wrong strings. Also, I could divert my energies in this area to create a
generic search in the style of jt/binsearch-with-fanout.
>
>> +static void write_graph_chunk_data(struct sha1file *f, int hash_len,
>> + struct commit **commits, int nr_commits)
>> +{
>> + struct commit **list = commits;
>> + struct commit **last = commits + nr_commits;
>> + uint32_t num_large_edges = 0;
>> +
>> + while (list < last) {
>> + struct commit_list *parent;
>> + uint32_t int_id;
>> + uint32_t packedDate[2];
>> +
>> +...
>> + if (!parent)
>> + int_id = GRAPH_PARENT_NONE;
>> + else if (parent->next)
>> + int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges;
>> + else if (!commit_pos(commits, nr_commits,
>> + &(parent->item->object.oid), &int_id))
>> + int_id = GRAPH_PARENT_MISSING;
>> +
>> + sha1write_be32(f, int_id);
>> +
>> + if (parent && parent->next) {
> This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED",
> right? Not suggesting to use the other form of checks, but trying
> to see what's the best way to express it in the most readable way.
You're right, we already set the bit above, so let's make use of that
check. Important to note that GRAPH_LARGE_EDGES_NEEDED &
GRAPH_PARENT_MISSING == 0.
>
>> + do {
>> + num_large_edges++;
>> + parent = parent->next;
>> + } while (parent);
> It feels somewhat wasteful to traverse the commit's parents list
> only to count, without populating the octopus table (which I
> understand is assumed to be minority case under this design).
Since we are writing the commit graph file in-order, we cannot write the
octopus table until after the chunk lengths are known. We could store
the octopus table in memory and then dump it into the file later, but
walking the parents is quite fast after all the commits are loaded. I'm
not sure the time optimization merits the extra complexity here. (I'm
happy to revisit this if we do see this performance lacking.)
P.S. I really like the name "octopus table" and will use that for
informal discussions of this format.
>
>> + }
>> +
>> + if (sizeof((*list)->date) > 4)
>> + packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
>> + else
>> + packedDate[0] = 0;
> OK, the undefined pattern in the previous round is now gone ;-) Good.
>
>> + packedDate[1] = htonl((*list)->date);
>> + sha1write(f, packedDate, 8);
>> +
>> + list++;
>> + }
>> +}
>> +
>> +static void write_graph_chunk_large_edges(struct sha1file *f,
>> + struct commit **commits,
>> + int nr_commits)
>> +{
>> + struct commit **list = commits;
>> + struct commit **last = commits + nr_commits;
>> + struct commit_list *parent;
>> +
>> + while (list < last) {
>> + int num_parents = 0;
>> + for (parent = (*list)->parents; num_parents < 3 && parent;
>> + parent = parent->next)
>> + num_parents++;
>> +
>> + if (num_parents <= 2) {
>> + list++;
>> + continue;
>> + }
>> +
>> + /* Since num_parents > 2, this initializer is safe. */
>> + for (parent = (*list)->parents->next; parent; parent = parent->next) {
>> + uint32_t int_id, swap_int_id;
>> + uint32_t last_edge = 0;
>> + if (!parent->next)
>> + last_edge |= GRAPH_LAST_EDGE;
>> +
>> + if (commit_pos(commits, nr_commits,
>> + &(parent->item->object.oid),
>> + &int_id))
>> + swap_int_id = htonl(int_id | last_edge);
>> + else
>> + swap_int_id = htonl(GRAPH_PARENT_MISSING | last_edge);
>> + sha1write(f, &swap_int_id, 4);
> What does "swap_" in the name of this variable mean? For some
> archs, there is no swap. The only difference between int_id and the
> variable is that its MSB may possibly be smudged with last_edge bit.
Sorry, I tried to catch all of these, but some fell through the cracks.
I should be using sha1write_be32() after modifying int_id directly.
This whole block is a bit of a mess. I'll replace it with something like:
uint32_t int_id;
if (!commit_pos(commits, nr_commits,
&(parent->item->object.oid),
&int_id))
int_id = GRAPH_PARENT_MISSING;
else if (!parent->next)
int_id |= GRAPH_LAST_EDGE;
sha1write_be32(f, int_id);
> This is a tangent, but after having seen many instances of "int_id",
> I started to feel that it is grossly misnamed. We do not care about
> its "int" ness---what's more significant about it is that we use can
> it as a short identifier in place for a full object name, given the
> table of known OIDs. "oid_table_index" may be a better name (but
> others may be able to suggest even better one).
>
> int pos;
> pos = commit_pos(commits, nr_commits, parent->item->object.oid);
> oid_table_pos = (pos < 0) ? GRAPH_PARENT_MISSING : pos;
> if (!parent->net)
> oid_table_pos |= GRAPH_LAST_EDGE;
> oid_table_pos = htonl(oid_table_pos);
> sha1write(f, &oid_table_pos, sizeof(oid_table_pos));
>
> or something like that, perhaps?
You're right that int_id isn't great, and your more-specific
"oid_table_pos" shows an extra reason why it isn't great: when we add
the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value is
NOT a table position.
I'll rework references of "int_id" into "edge_value" to store a value
that goes into - or is read from - the file, either in the two parent
columns or the octopus table.
>
>> +static int commit_compare(const void *_a, const void *_b)
>> +{
>> + struct object_id *a = (struct object_id *)_a;
>> + struct object_id *b = (struct object_id *)_b;
>> + return oidcmp(a, b);
>> +}
> I think oidcmp() takes const pointers, so there is no need to
> discard constness from the parameter like this code does. Also I
> think we tend to prefer writing a_/b_ (instead of _a/_b) to appease
> language lawyers who do not want us mere mortals to use names that
> begin with underscore.
>
>> +static int if_packed_commit_add_to_list(const struct object_id *oid,
>> + struct packed_git *pack,
>> + uint32_t pos,
>> + void *data)
> That is a strange name. "collect packed commits", perhaps?
We are walking all objects in the pack-index and calling this method. If
the object is a commit, we add it to the list; otherwise do nothing.
"data" points to the list.
I think the current name makes the following call very clear:
for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
i.e. "for each object in the pack p: if it is a commit, then add it to
the list of oids".
>
>> +char *write_commit_graph(const char *obj_dir)
>> +{
>> + struct packed_oid_list oids;
>> + struct packed_commit_list commits;
>> + struct sha1file *f;
>> + int i, count_distinct = 0;
>> + DIR *info_dir;
>> + struct strbuf tmp_file = STRBUF_INIT;
>> + struct strbuf graph_file = STRBUF_INIT;
>> + unsigned char final_hash[GIT_MAX_RAWSZ];
>> + char *graph_name;
>> + int fd;
>> + uint32_t chunk_ids[5];
>> + uint64_t chunk_offsets[5];
>> + int num_chunks;
>> + int num_long_edges;
>> + struct commit_list *parent;
>> +
>> + oids.nr = 0;
>> + oids.alloc = (int)(0.15 * approximate_object_count());
> Heh, traditionalist would probably avoid unnecessary use of float
> and use something like 1/4 or 1/8 ;-) After all, it is merely a
> ballpark guestimate.
>
>> + num_long_edges = 0;
> This again is about naming, but I find it a bit unnatural to call
> the edge between a chind and its octopus parents "long". Individual
> edges are not long--the only thing that is long is your "list of
> edges". Some other codepaths in this patch seems to call the same
> concept with s/long/large/, which I found somewhat puzzling.
How about "num_extra_edges" which counts the "overflow" into the octopus
table. Note: "num_octopus_edges" sounds like summing the out-degree of
octopus merges, but this count is really "(total number of parents of
octopus merges) - (number of octopus merges)".
>
>> + for (i = 0; i < oids.nr; i++) {
>> + int num_parents = 0;
>> + if (i > 0 && !oidcmp(&oids.list[i-1], &oids.list[i]))
>> + continue;
>> +
>> + commits.list[commits.nr] = lookup_commit(&oids.list[i]);
>> + parse_commit(commits.list[commits.nr]);
>> +
>> + for (parent = commits.list[commits.nr]->parents;
>> + parent; parent = parent->next)
>> + num_parents++;
>> +
>> + if (num_parents > 2)
>> + num_long_edges += num_parents - 1;
> OK, so we count how many entries we will record in the overflow
> parent table, and...
>
>> +
>> + commits.nr++;
>> + }
>> + num_chunks = num_long_edges ? 4 : 3;
> ... if we do not have any octopus commit, we do not need the chunk
> for the overflow parent table. Makes sense.
>
>> + strbuf_addf(&tmp_file, "%s/info", obj_dir);
>> + info_dir = opendir(tmp_file.buf);
>> +
>> + if (!info_dir && mkdir(tmp_file.buf, 0777) < 0)
>> + die_errno(_("cannot mkdir %s"), tmp_file.buf);
>> + if (info_dir)
>> + closedir(info_dir);
>> + strbuf_addstr(&tmp_file, "/tmp_graph_XXXXXX");
>> +
>> + fd = git_mkstemp_mode(tmp_file.buf, 0444);
>> + if (fd < 0)
>> + die_errno("unable to create '%s'", tmp_file.buf);
> It is not performance critical, but it feels a bit wasteful to
> opendir merely to see if something exists as a directory, and it is
> misleading to the readers (it looks as if we care about what files
> we already have in the directory).
>
> The approach that optimizes for the most common case would be to
>
> - prepare full path to the tempfile first
> - try create with mkstemp
> - if successful, you do not have to worry about creating
> the directory at all, which is the most common case
> - see why mkstemp step above failed. Was it because you
> did not have the surrounding directory?
> - if not, there is no point continuing. Just error out.
> - if it was due to missing directory, try creating one.
> - try create with mkstemp
> - if successful, all is well.
> - otherwise there isn't anything more we can do here.
It looks like sha1_file.c:create_tmpfile() has code I can use as a model
here. Thanks.
I wonder: should we move that method into wrapper.c and have its
external definition be available in cache.h? Then I can just consume it
from here.
>
>> +
>> + f = sha1fd(fd, tmp_file.buf);
>> +
>> + sha1write_be32(f, GRAPH_SIGNATURE);
>> +
>> + sha1write_u8(f, GRAPH_VERSION);
>> + sha1write_u8(f, GRAPH_OID_VERSION);
>> + sha1write_u8(f, num_chunks);
>> + sha1write_u8(f, 0); /* unused padding byte */
>> +
>> + chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
>> + chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
>> + chunk_ids[2] = GRAPH_CHUNKID_DATA;
>> + if (num_long_edges)
>> + chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
>> + else
>> + chunk_ids[3] = 0;
>> + chunk_ids[4] = 0;
>> +
>> + chunk_offsets[0] = 8 + GRAPH_CHUNKLOOKUP_SIZE;
>> + chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
>> + chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.nr;
>> + chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * commits.nr;
>> + chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
> Do we have to care about overflowing any of the above? For example,
> the format allows only up to (1<<31)-1 commits, but did something
> actually check if commits.nr at this point stayed under that limit?
Thanks for pointing this out. It should be a while before we have a repo
with 2 billion commits, but there's no time like the present to be safe.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-21 21:34 ` Stefan Beller@ 2018-02-23 17:43 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:43 UTC (permalink / raw)
To: Stefan Beller
Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, SZEDER Gábor,
Junio C Hamano, Derrick Stolee
On 2/21/2018 4:34 PM, Stefan Beller wrote:
> On Mon, Feb 19, 2018 at 10:53 AM, Derrick Stolee <stolee@gmail.com> wrote:
>
>> graph_name = write_commit_graph(opts.obj_dir);
>>
>> if (graph_name) {
>> if (opts.set_latest)
>> set_latest_file(opts.obj_dir, graph_name);
>>
>> + if (opts.delete_expired)
>> + do_delete_expired(opts.obj_dir,
>> + old_graph_name,
>> + graph_name);
>> +
> So this only allows to delete expired things and setting the latest
> when writing a new graph. Would we ever envision a user to produce
> a new graph (e.g. via obtaining a graph that they got from a server) and
> then manually rerouting the latest to that new graph file without writing
> that graph file in the same process? The same for expired.
>
> I guess these operations are just available via editing the
> latest or deleting files manually, which slightly contradicts
> e.g. "git update-ref", which in olden times was just a fancy way
> of rewriting the refs file manually. (though it claims to be less
> prone to errors as it takes lock files)
I imagine these alternatives for placing a new, latest commit graph file
would want Git to handle rewriting the "graph-latest" file. Given such a
use case, we could consider extending the 'commit-graph' interface, but
I don't want to plan for it now.
>
>> extern char *get_graph_latest_filename(const char *obj_dir);
>> +extern char *get_graph_latest_contents(const char *obj_dir);
> Did
> https://public-inbox.org/git/20180208213806.GA6381@sigill.intra.peff.net/
> ever make it into tree? (It is sort of new, but I feel we'd want to
> strive for consistency in the code base, eventually.)
Thank you for the reminder. I've removed the externs from 'commit-graph.h'.
Should I also remove the externs from other methods I introduce even
though their surrounding definitions include 'extern'?
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 07/13] commit-graph: implement --set-latest
2018-02-22 18:31 ` Junio C Hamano@ 2018-02-23 17:53 ` Derrick Stolee0 siblings, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:53 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/22/2018 1:31 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> static struct opts_commit_graph {
>> const char *obj_dir;
>> const char *graph_file;
>> + int set_latest;
>> } opts;
>> ...
>> @@ -89,6 +106,8 @@ static int graph_write(int argc, const char **argv)
>> { OPTION_STRING, 'o', "object-dir", &opts.obj_dir,
>> N_("dir"),
>> N_("The object directory to store the graph") },
>> + OPT_BOOL('u', "set-latest", &opts.set_latest,
>> + N_("update graph-head to written graph file")),
>> OPT_END(),
>> };
>>
>> @@ -102,6 +121,9 @@ static int graph_write(int argc, const char **argv)
>> graph_name = write_commit_graph(opts.obj_dir);
>>
>> if (graph_name) {
>> + if (opts.set_latest)
>> + set_latest_file(opts.obj_dir, graph_name);
>> +
> This feels like a very strange API from potential caller's point of
> view. Because you have to decide that you are going to mark it as
> the latest one upfront before actually writing the graph file, if
> you forget to pass --set-latest, you have to know how to manually
> mark the file as latest anyway. I would understand if it were one
> of the following:
>
> (1) whenever a new commit graph file is written in the
> objects/info/ directory, always mark it as the latest (drop
> --set-latest option altogether); or
>
> (2) make set-latest command that takes a name of an existing graph
> file in the objects/info/ directory, and sets the latest
> pointer to point at it (make it separate from 'write' command).
>
> though.
Perhaps the 'write' subcommand should be replaced with 'replace' which
does the following:
1. Write a new commit graph based on the starting commits (from all
packs, from specified packs, from OIDs).
2. Update 'graph-latest' to point to that new file.
3. Delete all "expired" commit graph files.
(Hence, we would drop the "--set-latest" and "--delete-expired" options.)
Due to the concerns with concurrency, I really don't want to split these
operations into independent processes that consumers need to call in
series. Since this sequence of events is the only real interaction we
expect (for now), this interface will simplify the design.
The biggest reason I didn't design it like this in the first place is
that the behavior changes as the patch develops.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-22 18:48 ` Junio C Hamano@ 2018-02-23 17:59 ` Derrick Stolee
2018-02-23 19:33 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 17:59 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/22/2018 1:48 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Teach git-commit-graph to delete the .graph files that are siblings of a
>> newly-written graph file, except for the file referenced by 'graph-latest'
>> at the beginning of the process and the newly-written file. If we fail to
>> delete a graph file, only report a warning because another git process may
>> be using that file. In a multi-process environment, we expect the previoius
>> graph file to be used by a concurrent process, so we do not delete it to
>> avoid race conditions.
> I do not understand the later part of the above. On some operating
> systems, you actually can remove a file that is open by another
> process without any ill effect. There are systems that do not allow
> removing a file that is in use, and an attempt to unlink it may
> fail. The need to handle such a failure gracefully is not limited
> to the case of removing a commit graph file---we need to deal with
> it when removing file of _any_ type.
My thought is that we should _warn_ when we fail to delete a .graph file
that we think should be safe to delete. However, if we are warning for a
file that is currently being accessed (as is the case on Windows, at
least), then we will add a lot of noise. This is especially true when
using IDEs that run 'status' or 'fetch' in the background, frequently.
> Especially the last sentence "we do not delete it to avoid race
> conditions" I find problematic. If a system does not allow removing
> a file in use and we detect a failure after an attempt to do so, it
> is not "we do not delete it" --- even if you do, you won't succeed
> anyway, so there is no point saying that. And on systems that do
> allow safe removal of a file in use (i.e. they allow an open file to
> be used by processes that have open filehandles to it after its
> removal), there is no point refraining to delete it "to avoid race
> conditions", either---in fact it is unlikely that you would even know
> somebody else had it open and was using it.
The (unlikely, but possible) race condition involves two processes (P1
and P2):
1. P1 reads from graph-latest to see commit graph file F1.
2. P2 updates graph-latest to point to F2 and deletes F1.
3. P1 tries to read F1 and fails.
I could explicitly mention this condition in the message, or we can just
let P2 fail by deleting all files other than the one referenced by
'graph-latest'. Thoughts?
> In any case, I do not think '--delete-expired' option that can be
> given only when you are writing makes much sense as an API. An
> 'expire' command, just like 'set-latest' command, that is a separate
> command from 'write', may make sense, though.
In another message, I proposed dropping the argument and assuming
expires happen on every write.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
2018-02-23 17:23 ` Derrick Stolee@ 2018-02-23 19:30 ` Junio C Hamano
2018-02-23 19:48 ` Junio C Hamano
2018-02-23 20:02 ` Derrick Stolee0 siblings, 2 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:30 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> jt/binsearch-with-fanout introduces one when there is a 256-entry
> fanout table (not the case here).
>
> The bsearch() method in search.h (and used in
> pack-write.c:need_large_offset) does not return the _position_ of a
> found element.
>
> Neither of these suit my needs, but I could just be searching for the
> wrong strings. Also, I could divert my energies in this area to create
> a generic search in the style of jt/binsearch-with-fanout.
... me goes and digs ...
What I had in mind was the one in sha1-lookup.c, actually. Having
said that, hand-rolling another one is not too huge a deal;
eventually people will notice and consolidate code after the series
stabilizes anyway ;-)
>>> + num_large_edges++;
>>> + parent = parent->next;
>>> + } while (parent);
>> It feels somewhat wasteful to traverse the commit's parents list
>> only to count, without populating the octopus table (which I
>> understand is assumed to be minority case under this design).
>
> Since we are writing the commit graph file in-order, we cannot write
> the octopus table until after the chunk lengths are known.
Oh, my "minority case" comment was me wondering "since we expect
there are only a few, why not keep them in memory while we discover
them here, so that the writing phase that come later do not have to
go through all commits again counting their parents? would it be
more performant and a better trade-off?" We can certainly do such
an optimization later (iow, it is not all that crucial issue and
certainly I didn't mention the above as something that needs to be
"fixed"--there is nothing broken).
> store the octopus table in memory and then dump it into the file
> later, but walking the parents is quite fast after all the commits are
> loaded. I'm not sure the time optimization merits the extra complexity
> here. (I'm happy to revisit this if we do see this performance
> lacking.)
>
> P.S. I really like the name "octopus table" and will use that for
> informal discussions of this format.
I actually do not mind that name used as the official term. I find
it far more descriptive and understandable than "long edge" / "large
edge" at least to a Git person.
> You're right that int_id isn't great, and your more-specific
> "oid_table_pos" shows an extra reason why it isn't great: when we add
> the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value
> is NOT a table position.
Perhaps I am somewhat biased, but it is quite natural for our
codebase and internal API to say something like this:
x_pos(table, key) function's return value is the non-negative
position for the key in the table when the key is there; when it
returns a negative value, it is (-1 - position) where the "position"
is the position in the table they key would have been found if
it was in there.
and store the return value from such a function in a variable called
"pos". Surely, sometimes "pos" does not have _the_ position, but
that does not make it a bad name.
Saying "MISSING is a special value that denotes 'nothing is here'"
and allowing it to be set to a variable that meant to hold the
position is not such a bad thing, though. After all, that is how
you use NULL as a special value for a pointer variable ;-).
Same for using the high bit to mean something else. Taking these
together you would explain "low 31-bit of pos holds the position for
the item in the table. MISSING is a special value that you can use
to denote there is nothing. The LAST_EDGE bit denotes that one
group of positions ends there", or something like that.
> I think the current name makes the following call very clear:
It is still a strange name nevertheless.
>>> +char *write_commit_graph(const char *obj_dir)
>>> +{
>>> + struct packed_oid_list oids;
>>> + struct packed_commit_list commits;
>>> + struct sha1file *f;
>>> + int i, count_distinct = 0;
>>> + DIR *info_dir;
>>> + struct strbuf tmp_file = STRBUF_INIT;
>>> + struct strbuf graph_file = STRBUF_INIT;
>>> + unsigned char final_hash[GIT_MAX_RAWSZ];
>>> + char *graph_name;
>>> + int fd;
>>> + uint32_t chunk_ids[5];
>>> + uint64_t chunk_offsets[5];
>>> + int num_chunks;
>>> + int num_long_edges;
>>> + struct commit_list *parent;
>>> +
>>> + oids.nr = 0;
>>> + oids.alloc = (int)(0.15 * approximate_object_count());
>> Heh, traditionalist would probably avoid unnecessary use of float
>> and use something like 1/4 or 1/8 ;-) After all, it is merely a
>> ballpark guestimate.
>>
>>> + num_long_edges = 0;
>> This again is about naming, but I find it a bit unnatural to call
>> the edge between a chind and its octopus parents "long". Individual
>> edges are not long--the only thing that is long is your "list of
>> edges". Some other codepaths in this patch seems to call the same
>> concept with s/long/large/, which I found somewhat puzzling.
>
> How about "num_extra_edges"...
Yes, "extra" in the name makes it very understandable.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-23 17:59 ` Derrick Stolee@ 2018-02-23 19:33 ` Junio C Hamano
2018-02-23 19:41 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:33 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> The (unlikely, but possible) race condition involves two processes (P1
> and P2):
>
> 1. P1 reads from graph-latest to see commit graph file F1.
> 2. P2 updates graph-latest to point to F2 and deletes F1.
> 3. P1 tries to read F1 and fails.
>
> I could explicitly mention this condition in the message, or we can
> just let P2 fail by deleting all files other than the one referenced
> by 'graph-latest'. Thoughts?
The established way we do this is not to have -latest pointer, I
would think, and instead, make -latest be the actual thing. That is
how $GIT_DIR/index is updated, for example, by first writing into a
temporary file and then moving it to the final destination. Is
there a reason why the same pattern cannot be used?
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-23 19:33 ` Junio C Hamano@ 2018-02-23 19:41 ` Derrick Stolee
2018-02-23 19:51 ` Junio C Hamano0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 19:41 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/23/2018 2:33 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> The (unlikely, but possible) race condition involves two processes (P1
>> and P2):
>>
>> 1. P1 reads from graph-latest to see commit graph file F1.
>> 2. P2 updates graph-latest to point to F2 and deletes F1.
>> 3. P1 tries to read F1 and fails.
>>
>> I could explicitly mention this condition in the message, or we can
>> just let P2 fail by deleting all files other than the one referenced
>> by 'graph-latest'. Thoughts?
> The established way we do this is not to have -latest pointer, I
> would think, and instead, make -latest be the actual thing. That is
> how $GIT_DIR/index is updated, for example, by first writing into a
> temporary file and then moving it to the final destination. Is
> there a reason why the same pattern cannot be used?
My thought was that using a fixed name (e.g.
.git/objects/info/commit-graph) would block making the graph
incremental. Upon thinking again, this is not the case. That feature
could be designed with a fixed name for the small, frequently-updated
file and use .../info/graph-<hash>.graph names for the "base" graph files.
I'll spend some time thinking about the ramifications of this fixed-name
concept. At the moment, it would remove two commits from this patch
series, which is nice.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
2018-02-23 19:30 ` Junio C Hamano@ 2018-02-23 19:48 ` Junio C Hamano
2018-02-23 20:02 ` Derrick Stolee1 sibling, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:48 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Junio C Hamano <gitster@pobox.com> writes:
>> I think the current name makes the following call very clear:
>
> It is still a strange name nevertheless.
Sorry for simply repeating "strange" without spelling out why in the
previous message. This certainly is subjective and depends on your
cultural background, but in our codebase, I tried to name functions
after "what" they do and "why", rather than "how" they do so. In a
way, it's the same kind of uneasiness I feel when I see variables
named in hungarian notation.
You would inspect the object and treat 'data' as a list and add to
the object if it is a commit, and if_packed_commit_add_to_list()
certainly is a name that describes all of that well, but does it
give readers a good answer when they wonder why the function is
doing so? You described with the name of the function how it
collects commits that are in the pack, without explicitly saying
that you want to collect packed commits and that is why you are
inspecting for type and doing so only for commit (i.e.
"if_packed_commit" part of the name) and why you are adding to a
list.
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 08/13] commit-graph: implement --delete-expired
2018-02-23 19:41 ` Derrick Stolee@ 2018-02-23 19:51 ` Junio C Hamano0 siblings, 0 replies; 146+ messages in thread
From: Junio C Hamano @ 2018-02-23 19:51 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> My thought was that using a fixed name
> (e.g. .git/objects/info/commit-graph) would block making the graph
> incremental. Upon thinking again, this is not the case. That feature
> could be designed with a fixed name for the small, frequently-updated
> file and use .../info/graph-<hash>.graph names for the "base" graph
> files.
I guess that is in line with the way how split-index folks did their
thing ;-)
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
2018-02-23 19:30 ` Junio C Hamano
2018-02-23 19:48 ` Junio C Hamano@ 2018-02-23 20:02 ` Derrick Stolee1 sibling, 0 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-02-23 20:02 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, Derrick Stolee
On 2/23/2018 2:30 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> jt/binsearch-with-fanout introduces one when there is a 256-entry
>> fanout table (not the case here).
>>
>> The bsearch() method in search.h (and used in
>> pack-write.c:need_large_offset) does not return the _position_ of a
>> found element.
>>
>> Neither of these suit my needs, but I could just be searching for the
>> wrong strings. Also, I could divert my energies in this area to create
>> a generic search in the style of jt/binsearch-with-fanout.
> ... me goes and digs ...
>
> What I had in mind was the one in sha1-lookup.c, actually. Having
> said that, hand-rolling another one is not too huge a deal;
> eventually people will notice and consolidate code after the series
> stabilizes anyway ;-)
Ah, sha1_pos(). That definitely satisfies my use case. Thanks! My local
branch has this replacement.
>
>>>> + num_large_edges++;
>>>> + parent = parent->next;
>>>> + } while (parent);
>>> It feels somewhat wasteful to traverse the commit's parents list
>>> only to count, without populating the octopus table (which I
>>> understand is assumed to be minority case under this design).
>> Since we are writing the commit graph file in-order, we cannot write
>> the octopus table until after the chunk lengths are known.
> Oh, my "minority case" comment was me wondering "since we expect
> there are only a few, why not keep them in memory while we discover
> them here, so that the writing phase that come later do not have to
> go through all commits again counting their parents? would it be
> more performant and a better trade-off?" We can certainly do such
> an optimization later (iow, it is not all that crucial issue and
> certainly I didn't mention the above as something that needs to be
> "fixed"--there is nothing broken).
>
>> store the octopus table in memory and then dump it into the file
>> later, but walking the parents is quite fast after all the commits are
>> loaded. I'm not sure the time optimization merits the extra complexity
>> here. (I'm happy to revisit this if we do see this performance
>> lacking.)
>>
>> P.S. I really like the name "octopus table" and will use that for
>> informal discussions of this format.
> I actually do not mind that name used as the official term. I find
> it far more descriptive and understandable than "long edge" / "large
> edge" at least to a Git person.
I will consider using this in the format and design documents.
>
>> You're right that int_id isn't great, and your more-specific
>> "oid_table_pos" shows an extra reason why it isn't great: when we add
>> the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value
>> is NOT a table position.
> Perhaps I am somewhat biased, but it is quite natural for our
> codebase and internal API to say something like this:
>
> x_pos(table, key) function's return value is the non-negative
> position for the key in the table when the key is there; when it
> returns a negative value, it is (-1 - position) where the "position"
> is the position in the table they key would have been found if
> it was in there.
>
> and store the return value from such a function in a variable called
> "pos". Surely, sometimes "pos" does not have _the_ position, but
> that does not make it a bad name.
>
> Saying "MISSING is a special value that denotes 'nothing is here'"
> and allowing it to be set to a variable that meant to hold the
> position is not such a bad thing, though. After all, that is how
> you use NULL as a special value for a pointer variable ;-).
>
> Same for using the high bit to mean something else. Taking these
> together you would explain "low 31-bit of pos holds the position for
> the item in the table. MISSING is a special value that you can use
> to denote there is nothing. The LAST_EDGE bit denotes that one
> group of positions ends there", or something like that.
>
>> I think the current name makes the following call very clear:
> It is still a strange name nevertheless.
>
>>>> +char *write_commit_graph(const char *obj_dir)
>>>> +{
>>>> + struct packed_oid_list oids;
>>>> + struct packed_commit_list commits;
>>>> + struct sha1file *f;
>>>> + int i, count_distinct = 0;
>>>> + DIR *info_dir;
>>>> + struct strbuf tmp_file = STRBUF_INIT;
>>>> + struct strbuf graph_file = STRBUF_INIT;
>>>> + unsigned char final_hash[GIT_MAX_RAWSZ];
>>>> + char *graph_name;
>>>> + int fd;
>>>> + uint32_t chunk_ids[5];
>>>> + uint64_t chunk_offsets[5];
>>>> + int num_chunks;
>>>> + int num_long_edges;
>>>> + struct commit_list *parent;
>>>> +
>>>> + oids.nr = 0;
>>>> + oids.alloc = (int)(0.15 * approximate_object_count());
>>> Heh, traditionalist would probably avoid unnecessary use of float
>>> and use something like 1/4 or 1/8 ;-) After all, it is merely a
>>> ballpark guestimate.
>>>
>>>> + num_long_edges = 0;
>>> This again is about naming, but I find it a bit unnatural to call
>>> the edge between a chind and its octopus parents "long". Individual
>>> edges are not long--the only thing that is long is your "list of
>>> edges". Some other codepaths in this patch seems to call the same
>>> concept with s/long/large/, which I found somewhat puzzling.
>> How about "num_extra_edges"...
> Yes, "extra" in the name makes it very understandable.
>
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-02-19 18:53 ` [PATCH v4 00/13] " Derrick Stolee
` (12 preceding siblings ...)
2018-02-19 18:53 ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
@ 2018-03-30 11:10 ` Jakub Narebski
2018-04-02 13:02 ` Derrick Stolee13 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-03-30 11:10 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
I hope that I am addressing the most recent version of this series.
Derrick Stolee <stolee@gmail.com> writes:
> As promised [1], this patch contains a way to serialize the commit graph.
> The current implementation defines a new file format to store the graph
> structure (parent relationships) and basic commit metadata (commit date,
> root tree OID) in order to prevent parsing raw commits while performing
> basic graph walks. For example, we do not need to parse the full commit
> when performing these walks:
>
> * 'git log --topo-order -1000' walks all reachable commits to avoid
> incorrect topological orders, but only needs the commit message for
> the top 1000 commits.
>
> * 'git merge-base <A> <B>' may walk many commits to find the correct
> boundary between the commits reachable from A and those reachable
> from B. No commit messages are needed.
>
> * 'git branch -vv' checks ahead/behind status for all local branches
> compared to their upstream remote branches. This is essentially as
> hard as computing merge bases for each.
>
> The current patch speeds up these calculations by injecting a check in
> parse_commit_gently() to check if there is a graph file and using that
> to provide the required metadata to the struct commit.
That's nice.
What are the assumptions about the serialized commit graph format? Does
it need to be:
- extensible without rewriting (e.g. append-only)?
- like the above, but may need rewriting for optimal performance?
- extending it needs to rewrite whole file?
Excuse me if it waas already asked and answered.
>
> The file format has room to store generation numbers, which will be
> provided as a patch after this framework is merged. Generation numbers
> are referenced by the design document but not implemented in order to
> make the current patch focus on the graph construction process. Once
> that is stable, it will be easier to add generation numbers and make
> graph walks aware of generation numbers one-by-one.
As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it FELINE
index (pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Olnine Search Approach" (2014) - available at
http://openproceedings.org/EDBT/2014/paper_166.pdf
The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is place for
indices that are local-only and need to be recalculated from scratch
when graph changes (to cover all graph).
>
> Here are some performance results for a copy of the Linux repository
> where 'master' has 704,766 reachable commits and is behind 'origin/master'
> by 19,610 commits.
>
> | Command | Before | After | Rel % |
> |----------------------------------|--------|--------|-------|
> | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% |
> | branch -vv | 0.42s | 0.27s | -35% |
> | rev-list --all | 6.4s | 1.0s | -84% |
> | rev-list --all --objects | 32.6s | 27.6s | -15% |
That's the "Rel %" of "Before", that is delta/before, isn't it?
> To test this yourself, run the following on your repo:
>
> git config core.commitGraph true
> git show-ref -s | git commit-graph write --set-latest --stdin-commits
>
> The second command writes a commit graph file containing every commit
> reachable from your refs. Now, all git commands that walk commits will
> check your graph first before consulting the ODB. You can run your own
> performance comparisions by toggling the 'core.commitgraph' setting.
Good. It is nicely similar to how bitmap indices (of reachability) are
handled.
I just wonder what happens in the (rare) presence of grafts (old
mechanism), or "git replace"-d commits...
Regards,
--
Jakub Narębski
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-02-19 18:53 ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
2018-02-20 20:49 ` Junio C Hamano
2018-02-21 19:23 ` Stefan Beller@ 2018-03-30 13:25 ` Jakub Narebski
2018-04-02 13:09 ` Derrick Stolee2 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-03-30 13:25 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> +== graph-*.graph files have the following format:
What is this '*' here?
[...]
> + The remaining data in the body is described one chunk at a time, and
> + these chunks may be given in any order. Chunks are required unless
> + otherwise specified.
Does Git need to understand all chunks, or could there be optional
chunks that can be safely ignored (like in PNG format)? Though this may
be overkill, and could be left for later revision of the format if
deemed necessary.
> +
> +CHUNK DATA:
> +
> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
> + The ith entry, F[i], stores the number of OIDs with first
> + byte at most i. Thus F[255] stores the total
> + number of commits (N).
All right, it is small enough that can be required even for a very small
number of commits.
> +
> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
> + The OIDs for all commits in the graph, sorted in ascending order.
> +
> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
Do commits need to be put here in the ascending order of OIDs?
If so, this would mean that it is not possible to add information about
new commits by only appending data and maybe overwriting some fields, I
think. You would need to do full rewrite to insert new commit in
appropriate place.
> + * The first H bytes are for the OID of the root tree.
> + * The next 8 bytes are for the int-ids of the first two parents
> + of the ith commit. Stores value 0xffffffff if no parent in that
> + position. If there are more than two parents, the second value
> + has its most-significant bit on and the other bits store an array
> + position into the Large Edge List chunk.
> + * The next 8 bytes store the generation number of the commit and
> + the commit time in seconds since EPOCH. The generation number
> + uses the higher 30 bits of the first 4 bytes, while the commit
> + time uses the 32 bits of the second 4 bytes, along with the lowest
> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> + commit time.
> +
> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> + This list of 4-byte values store the second through nth parents for
> + all octopus merges. The second parent value in the commit data stores
> + an array position within this list along with the most-significant bit
> + on. Starting at that array position, iterate through this list of int-ids
> + for the parents until reaching a value with the most-significant bit on.
> + The other bits correspond to the int-id of the last parent.
All right, that is one chunk that cannot use fixed-length records; this
shouldn't matter much, as we iterate only up to the number of parents
less two.
A question: what happens to the last list of parents? Is there a
guardian value of 0xffffffff at last place?
> +
> +TRAILER:
> +
> + H-byte HASH-checksum of all of the above.
> +
Best,
--
Jakub Narębski
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-03-30 11:10 ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
@ 2018-04-02 13:02 ` Derrick Stolee
2018-04-02 14:46 ` Jakub Narebski0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 13:02 UTC (permalink / raw)
To: Jakub Narebski
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
On 3/30/2018 7:10 AM, Jakub Narebski wrote:
> I hope that I am addressing the most recent version of this series.
Hi Jakub. Thanks for the interest in this patch series.
The most-recent version is v6 [1], but I will re-roll to v7 soon (after
v2.17.0 is marked).
[1]
https://public-inbox.org/git/20180314192736.70602-1-dstolee@microsoft.com/T/#u> Derrick Stolee <stolee@gmail.com> writes:
>
>> As promised [1], this patch contains a way to serialize the commit graph.
>> The current implementation defines a new file format to store the graph
>> structure (parent relationships) and basic commit metadata (commit date,
>> root tree OID) in order to prevent parsing raw commits while performing
>> basic graph walks. For example, we do not need to parse the full commit
>> when performing these walks:
>>
>> * 'git log --topo-order -1000' walks all reachable commits to avoid
>> incorrect topological orders, but only needs the commit message for
>> the top 1000 commits.
>>
>> * 'git merge-base <A> <B>' may walk many commits to find the correct
>> boundary between the commits reachable from A and those reachable
>> from B. No commit messages are needed.
>>
>> * 'git branch -vv' checks ahead/behind status for all local branches
>> compared to their upstream remote branches. This is essentially as
>> hard as computing merge bases for each.
>>
>> The current patch speeds up these calculations by injecting a check in
>> parse_commit_gently() to check if there is a graph file and using that
>> to provide the required metadata to the struct commit.
> That's nice.
>
> What are the assumptions about the serialized commit graph format? Does
> it need to be:
> - extensible without rewriting (e.g. append-only)?
> - like the above, but may need rewriting for optimal performance?
> - extending it needs to rewrite whole file?
>
> Excuse me if it waas already asked and answered.
It is not extensible without rewriting. Reducing write time was not a
main goal, since the graph will be written only occasionally during data
management phases (like 'gc' or 'repack'; this integration is not
implemented yet).
>
>> The file format has room to store generation numbers, which will be
>> provided as a patch after this framework is merged. Generation numbers
>> are referenced by the design document but not implemented in order to
>> make the current patch focus on the graph construction process. Once
>> that is stable, it will be easier to add generation numbers and make
>> graph walks aware of generation numbers one-by-one.
> As the serialized commit graph format is versioned, I wonder if it would
> be possible to speed up graph walks even more by adding to it FELINE
> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
> A Fast Refined Olnine Search Approach" (2014) - available at
> http://openproceedings.org/EDBT/2014/paper_166.pdf
>
> The implementation would probably need adjustments to make it
> unambiguous and unambiguously extensible; unless there is place for
> indices that are local-only and need to be recalculated from scratch
> when graph changes (to cover all graph).
The chunk-based format is intended to allow extra indexes like the one
you recommend, without needing to increase the version number. Using an
optional chunk allows older versions of Git to read the file without
error, since the data is "extra", and newer versions can take advantage
of the acceleration.
At one point, I was investigating these reachability indexes (I read
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
Ruan, Dey, and Xu [2]) but find the question that these indexes target
to be lacking for most of the Git uses. That is, they ask the boolean
question "Can X reach Y?". More often, Git needs to answer "What is the
set of commits reachable from X but not from Y" or "Topologically sort
commits reachable from X" or "How many commits are in each part of the
symmetric difference between reachable from X or reachable from Y?"
The case for "Can X reach Y?" is mostly for commands like 'git branch
--contains', when 'git fetch' checks for forced-updates of branches, or
when the server decides enough negotiation has occurred during a 'git
fetch'. While these may be worth investigating, they also benefit
greatly from the accelerated graph walk introduced in the current format.
I would be happy to review any effort to extend the commit-graph format
to include such indexes, as long as the performance benefits outweigh
the complexity to create them.
[2]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf>
>> Here are some performance results for a copy of the Linux repository
>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>> by 19,610 commits.
>>
>> | Command | Before | After | Rel % |
>> |----------------------------------|--------|--------|-------|
>> | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% |
>> | branch -vv | 0.42s | 0.27s | -35% |
>> | rev-list --all | 6.4s | 1.0s | -84% |
>> | rev-list --all --objects | 32.6s | 27.6s | -15% |
> That's the "Rel %" of "Before", that is delta/before, isn't it?
I do mean the relative change.
>
>> To test this yourself, run the following on your repo:
>>
>> git config core.commitGraph true
>> git show-ref -s | git commit-graph write --set-latest --stdin-commits
>>
>> The second command writes a commit graph file containing every commit
>> reachable from your refs. Now, all git commands that walk commits will
>> check your graph first before consulting the ODB. You can run your own
>> performance comparisions by toggling the 'core.commitgraph' setting.
> Good. It is nicely similar to how bitmap indices (of reachability) are
> handled.
>
> I just wonder what happens in the (rare) presence of grafts (old
> mechanism), or "git replace"-d commits...
In the design document, I mention that the current implementation does
not work with grafts (it will ignore them). A later patch will refactor
the graft code so we can access it from the commit-graph parsing of a
commit without copy-pasting the code out of parse_commit_gently().
The commit-graph is only a compact representation of the object
database. If a commit is replaced with 'git replace' before 'git
commit-graph write' then the commit-graph write will write the replaced
object. I haven't tested what happens when a commit-graph is written and
then a commit is replaced, but my guess is that the replacement does not
occur until a full parse is attempted (i.e. when reading author or
commit message information). This will lead to unknown results.
Thanks for pointing out the interaction with 'git replace'. I have items
to fix grafts and replaced commits before integrating commit-graph
writes into automatic actions like 'gc.auto'.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-03-30 13:25 ` Jakub Narebski@ 2018-04-02 13:09 ` Derrick Stolee
2018-04-02 14:09 ` Jakub Narebski0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 13:09 UTC (permalink / raw)
To: Jakub Narebski
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
On 3/30/2018 9:25 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> +== graph-*.graph files have the following format:
> What is this '*' here?
No longer necessary. It used to be a placeholder for a hash value, but
now the graph is stored in objects/info/commit-graph.
>
> [...]
>> + The remaining data in the body is described one chunk at a time, and
>> + these chunks may be given in any order. Chunks are required unless
>> + otherwise specified.
> Does Git need to understand all chunks, or could there be optional
> chunks that can be safely ignored (like in PNG format)? Though this may
> be overkill, and could be left for later revision of the format if
> deemed necessary.
In v6, the format and design documents are edited to make clear the use
of optional chunks, specifically for future extension without increasing
the version number.
>
>> +
>> +CHUNK DATA:
>> +
>> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>> + The ith entry, F[i], stores the number of OIDs with first
>> + byte at most i. Thus F[255] stores the total
>> + number of commits (N).
> All right, it is small enough that can be required even for a very small
> number of commits.
>
>> +
>> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>> + The OIDs for all commits in the graph, sorted in ascending order.
>> +
>> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
> Do commits need to be put here in the ascending order of OIDs?
Yes.
> If so, this would mean that it is not possible to add information about
> new commits by only appending data and maybe overwriting some fields, I
> think. You would need to do full rewrite to insert new commit in
> appropriate place.
That is the idea. This file is not updated with every new commit, but
instead will be updated on some scheduled cleanup events. The
commit-graph file is designed in a way to be non-critical, and not tied
to the packfile layout. This allows flexibility for when to do the write.
For example, in GVFS, we will write a new commit-graph when there are
new daily prefetch packs.
This could also integrate with 'gc' and 'repack' so whenever they are
triggered the commit-graph is written as well.
Commits that do not exist in the commit-graph file will load from the
object database as normal (after a failed lookup in the commit-graph file).
>> + * The first H bytes are for the OID of the root tree.
>> + * The next 8 bytes are for the int-ids of the first two parents
>> + of the ith commit. Stores value 0xffffffff if no parent in that
>> + position. If there are more than two parents, the second value
>> + has its most-significant bit on and the other bits store an array
>> + position into the Large Edge List chunk.
>> + * The next 8 bytes store the generation number of the commit and
>> + the commit time in seconds since EPOCH. The generation number
>> + uses the higher 30 bits of the first 4 bytes, while the commit
>> + time uses the 32 bits of the second 4 bytes, along with the lowest
>> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
>> + commit time.
>> +
>> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>> + This list of 4-byte values store the second through nth parents for
>> + all octopus merges. The second parent value in the commit data stores
>> + an array position within this list along with the most-significant bit
>> + on. Starting at that array position, iterate through this list of int-ids
>> + for the parents until reaching a value with the most-significant bit on.
>> + The other bits correspond to the int-id of the last parent.
> All right, that is one chunk that cannot use fixed-length records; this
> shouldn't matter much, as we iterate only up to the number of parents
> less two.
Less one: the second "parent" column of the commit data chunk is used to
point into this list, so (P-1) parents are in this chunk for a commit
with P parents.
> A question: what happens to the last list of parents? Is there a
> guardian value of 0xffffffff at last place?
The termination condition is in the position of the last parent, since
the most-significant bit is on. The other 31 bits contain the int-id of
the parent.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 01/13] commit-graph: add format document
2018-04-02 13:09 ` Derrick Stolee@ 2018-04-02 14:09 ` Jakub Narebski0 siblings, 0 replies; 146+ messages in thread
From: Jakub Narebski @ 2018-04-02 14:09 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> On 3/30/2018 9:25 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> +== graph-*.graph files have the following format:
>> What is this '*' here?
>
> No longer necessary. It used to be a placeholder for a hash value, but
> now the graph is stored in objects/info/commit-graph.
All right.
Excuse me replying to v4 instead of v6 of the patch series, where it
would be answered or rather made moot already.
>>
>> [...]
>>> + The remaining data in the body is described one chunk at a time, and
>>> + these chunks may be given in any order. Chunks are required unless
>>> + otherwise specified.
>> Does Git need to understand all chunks, or could there be optional
>> chunks that can be safely ignored (like in PNG format)? Though this may
>> be overkill, and could be left for later revision of the format if
>> deemed necessary.
>
> In v6, the format and design documents are edited to make clear the
> use of optional chunks, specifically for future extension without
> increasing the version number.
That's good.
>>> +CHUNK DATA:
>>> +
>>> + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
>>> + The ith entry, F[i], stores the number of OIDs with first
>>> + byte at most i. Thus F[255] stores the total
>>> + number of commits (N).
>> All right, it is small enough that can be required even for a very small
>> number of commits.
>>
>>> +
>>> + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
>>> + The OIDs for all commits in the graph, sorted in ascending order.
>>> +
>>> + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>> Do commits need to be put here in the ascending order of OIDs?
>
> Yes.
>
>> If so, this would mean that it is not possible to add information about
>> new commits by only appending data and maybe overwriting some fields, I
>> think. You would need to do full rewrite to insert new commit in
>> appropriate place.
>
> That is the idea. This file is not updated with every new commit, but
> instead will be updated on some scheduled cleanup events. The
> commit-graph file is designed in a way to be non-critical, and not
> tied to the packfile layout. This allows flexibility for when to do
> the write.
>
> For example, in GVFS, we will write a new commit-graph when there are
> new daily prefetch packs.
>
> This could also integrate with 'gc' and 'repack' so whenever they are
> triggered the commit-graph is written as well.
I wonder if it would be possible to use existing hooks...
> Commits that do not exist in the commit-graph file will load from the
> object database as normal (after a failed lookup in the commit-graph
> file).
Ah. I thought wrongly that it would (or at least could) be something
that can be kept up to date, and extended when adding any new commit.
>>> + * The first H bytes are for the OID of the root tree.
>>> + * The next 8 bytes are for the int-ids of the first two parents
>>> + of the ith commit. Stores value 0xffffffff if no parent in that
>>> + position. If there are more than two parents, the second value
>>> + has its most-significant bit on and the other bits store an array
>>> + position into the Large Edge List chunk.
>>> + * The next 8 bytes store the generation number of the commit and
>>> + the commit time in seconds since EPOCH. The generation number
>>> + uses the higher 30 bits of the first 4 bytes, while the commit
>>> + time uses the 32 bits of the second 4 bytes, along with the lowest
>>> + 2 bits of the lowest byte, storing the 33rd and 34th bit of the
>>> + commit time.
>>> +
>>> + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>>> + This list of 4-byte values store the second through nth parents for
>>> + all octopus merges. The second parent value in the commit data stores
>>> + an array position within this list along with the most-significant bit
>>> + on. Starting at that array position, iterate through this list of int-ids
>>> + for the parents until reaching a value with the most-significant bit on.
>>> + The other bits correspond to the int-id of the last parent.
>>
>> All right, that is one chunk that cannot use fixed-length records; this
>> shouldn't matter much, as we iterate only up to the number of parents
>> less two.
>
> Less one: the second "parent" column of the commit data chunk is used
> to point into this list, so (P-1) parents are in this chunk for a
> commit with P parents.
Right.
>> A question: what happens to the last list of parents? Is there a
>> guardian value of 0xffffffff at last place?
>
> The termination condition is in the position of the last parent, since
> the most-significant bit is on. The other 31 bits contain the int-id
> of the parent.
Ah. I have misunderstood the format: I thought that first entry is
marked with most-significant bit set to 1, and all the rest to 0, while
it is last entry (last parent) has most-significant bit set, while all
others (if any) do not. So there is no need for guardian value.
Best regards,
--
Jakub Narębski
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 13:02 ` Derrick Stolee@ 2018-04-02 14:46 ` Jakub Narebski
2018-04-02 15:02 ` Derrick Stolee0 siblings, 1 reply; 146+ messages in thread
From: Jakub Narebski @ 2018-04-02 14:46 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> On 3/30/2018 7:10 AM, Jakub Narebski wrote:
>> I hope that I am addressing the most recent version of this series.
>
> Hi Jakub. Thanks for the interest in this patch series.
>
> The most-recent version is v6 [1], but I will re-roll to v7 soon
> (after v2.17.0 is marked).
>
> [1] https://public-inbox.org/git/20180314192736.70602-1-dstolee@microsoft.com/T/#u
Ooops. Sorry about that.
>> Derrick Stolee <stolee@gmail.com> writes:
[...]
>> What are the assumptions about the serialized commit graph format? Does
>> it need to be:
>> - extensible without rewriting (e.g. append-only)?
>> - like the above, but may need rewriting for optimal performance?
>> - extending it needs to rewrite whole file?
>>
>> Excuse me if it waas already asked and answered.
>
> It is not extensible without rewriting. Reducing write time was not a
> main goal, since the graph will be written only occasionally during
> data management phases (like 'gc' or 'repack'; this integration is not
> implemented yet).
Ah. I thought that it could be something easily extensible in-place,
and thus easy to keep up to date on each commit.
Recalculating it on 'gc' or 'repack' is still good, especially that it
works even when there are come commits outside commit-graph, without
this information.
>>
>>> The file format has room to store generation numbers, which will be
>>> provided as a patch after this framework is merged. Generation numbers
>>> are referenced by the design document but not implemented in order to
>>> make the current patch focus on the graph construction process. Once
>>> that is stable, it will be easier to add generation numbers and make
>>> graph walks aware of generation numbers one-by-one.
>>>
>> As the serialized commit graph format is versioned, I wonder if it would
>> be possible to speed up graph walks even more by adding to it FELINE
>> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
>> A Fast Refined Olnine Search Approach" (2014) - available at
>> http://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The implementation would probably need adjustments to make it
>> unambiguous and unambiguously extensible; unless there is place for
>> indices that are local-only and need to be recalculated from scratch
>> when graph changes (to cover all graph).
>
> The chunk-based format is intended to allow extra indexes like the one
> you recommend, without needing to increase the version number. Using
> an optional chunk allows older versions of Git to read the file
> without error, since the data is "extra", and newer versions can take
> advantage of the acceleration.
That's good.
> At one point, I was investigating these reachability indexes (I read
> "SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
> Ruan, Dey, and Xu [2]) but find the question that these indexes target
> to be lacking for most of the Git uses. That is, they ask the boolean
> question "Can X reach Y?". More often, Git needs to answer "What is
> the set of commits reachable from X but not from Y" or "Topologically
> sort commits reachable from X" or "How many commits are in each part
> of the symmetric difference between reachable from X or reachable from
> Y?"
In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).
I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).
Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph. That would help to answer
even more Git questions.
> The case for "Can X reach Y?" is mostly for commands like 'git branch
> --contains', when 'git fetch' checks for forced-updates of branches,
> or when the server decides enough negotiation has occurred during a
> 'git fetch'. While these may be worth investigating, they also benefit
> greatly from the accelerated graph walk introduced in the current
> format.
>
> I would be happy to review any effort to extend the commit-graph
> format to include such indexes, as long as the performance benefits
> outweigh the complexity to create them.
>
> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.
>>
>>> Here are some performance results for a copy of the Linux repository
>>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>>> by 19,610 commits.
>>>
>>> | Command | Before | After | Rel % |
>>> |----------------------------------|--------|--------|-------|
>>> | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% |
>>> | branch -vv | 0.42s | 0.27s | -35% |
>>> | rev-list --all | 6.4s | 1.0s | -84% |
>>> | rev-list --all --objects | 32.6s | 27.6s | -15% |
>>
>> That's the "Rel %" of "Before", that is delta/before, isn't it?
>
> I do mean the relative change.
But is it relative to the state before, or relative to the state after?
[...]
>> I just wonder what happens in the (rare) presence of grafts (old
>> mechanism), or "git replace"-d commits...
>
> In the design document, I mention that the current implementation does
> not work with grafts (it will ignore them). A later patch will
> refactor the graft code so we can access it from the commit-graph
> parsing of a commit without copy-pasting the code out of
> parse_commit_gently().
>
> The commit-graph is only a compact representation of the object
> database. If a commit is replaced with 'git replace' before 'git
> commit-graph write' then the commit-graph write will write the
> replaced object. I haven't tested what happens when a commit-graph is
> written and then a commit is replaced, but my guess is that the
> replacement does not occur until a full parse is attempted (i.e. when
> reading author or commit message information). This will lead to
> unknown results.
>
> Thanks for pointing out the interaction with 'git replace'. I have
> items to fix grafts and replaced commits before integrating
> commit-graph writes into automatic actions like 'gc.auto'.
Note that you can make Git ignore replacements with appropriate command
line option for "git" wrapper; the transfer mechanism can safely
ignore replacements (treating refs/replacements just like it would any
other refs/).
Best regards,
--
Jakub Narębski
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 14:46 ` Jakub Narebski@ 2018-04-02 15:02 ` Derrick Stolee
2018-04-02 17:35 ` Stefan Beller
2018-04-07 22:37 ` Jakub Narebski0 siblings, 2 replies; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 15:02 UTC (permalink / raw)
To: Jakub Narebski
Cc: git, git, peff, jonathantanmy, szeder.dev, sbeller, gitster,
Derrick Stolee
On 4/2/2018 10:46 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
[...]
>> At one point, I was investigating these reachability indexes (I read
>> "SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
>> Ruan, Dey, and Xu [2]) but find the question that these indexes target
>> to be lacking for most of the Git uses. That is, they ask the boolean
>> question "Can X reach Y?". More often, Git needs to answer "What is
>> the set of commits reachable from X but not from Y" or "Topologically
>> sort commits reachable from X" or "How many commits are in each part
>> of the symmetric difference between reachable from X or reachable from
>> Y?"
> In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
> Meira and Zaki FELINE-index work, authors mention SCARAB as something
> that can be used in addition to FELINE-index, as a complementary data
> (FELINE-SCARAB in the work, section 4.4).
>
> I see the FELINE-index as a stronger form of generation numbers (called
> also level of the vertex / node), in that it allows to negative-cut even
> more, pruning paths that are known to be unreachable (or marking nodes
> known to be unreachable in the "calculate difference" scenario).
>
> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
> one of those indices can be the topological numbering (topological
> sorting order) of nodes in the commit graph. That would help to answer
> even more Git questions.
This two-dimensional generation number is helpful for non-reachability
queries, but is something that needs the "full" commit graph in order to
define the value for a single commit (hence the O(N lg N) performance
mentioned below). Generation numbers are effective while being easy to
compute and immutable.
I wonder if FELINE was compared directly to a one-dimensional index (I
apologize that I have not read the paper in detail, so I don't
understand the indexes they compare against). It also appears the graphs
they use for their tests are not commit graphs, which have a different
shape than many of the digraphs studies by that work.
This is all to say: I would love to see an interesting study in this
direction, specifically comparing this series' definition of generation
numbers to the 2-dimensional system in FELINE, and on a large sample of
commit graphs available in open-source data sets (Linux kernel, Git,
etc.) and possibly on interesting closed-source data sets.
>
>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>> --contains', when 'git fetch' checks for forced-updates of branches,
>> or when the server decides enough negotiation has occurred during a
>> 'git fetch'. While these may be worth investigating, they also benefit
>> greatly from the accelerated graph walk introduced in the current
>> format.
>>
>> I would be happy to review any effort to extend the commit-graph
>> format to include such indexes, as long as the performance benefits
>> outweigh the complexity to create them.
>>
>> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
> storage complexity is 2*|V|.
>
This would be very easy to add as an optional chunk, since it can use
one row per commit.
Thanks,
-Stolee
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 15:02 ` Derrick Stolee@ 2018-04-02 17:35 ` Stefan Beller
2018-04-02 17:54 ` Derrick Stolee
2018-04-07 22:37 ` Jakub Narebski1 sibling, 1 reply; 146+ messages in thread
From: Stefan Beller @ 2018-04-02 17:35 UTC (permalink / raw)
To: Derrick Stolee
Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
SZEDER Gábor, Junio C Hamano, Derrick Stolee
On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2]
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use one
> row per commit.
Given this discussion, I wonder if we want to include generation numbers
as a first class citizen in the current format. They could also go as
an optional
chunk and we may want to discuss further if we want generation numbers or
FELINE numbers or GRAIL or SCARAB, which are all graph related speedup
mechanism AFAICT.
In case we decide against generation numbers in the long run,
the row of mandatory generation numbers would be dead weight
that we'd need to carry.
I only glanced at the paper, but it looks like a "more advanced 2d
generation number" that seems to be able to answer questions
that gen numbers can answer, but that paper also refers
to SCARAB as well as GRAIL as the state of the art, so maybe
there are even more papers to explore?
Stefan
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 17:35 ` Stefan Beller@ 2018-04-02 17:54 ` Derrick Stolee
2018-04-02 18:02 ` Stefan Beller0 siblings, 1 reply; 146+ messages in thread
From: Derrick Stolee @ 2018-04-02 17:54 UTC (permalink / raw)
To: Stefan Beller
Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
SZEDER Gábor, Junio C Hamano, Derrick Stolee
On 4/2/2018 1:35 PM, Stefan Beller wrote:
> On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <stolee@gmail.com> wrote:
>>>> I would be happy to review any effort to extend the commit-graph
>>>> format to include such indexes, as long as the performance benefits
>>>> outweigh the complexity to create them.
>>>>
>>>> [2]
>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
>>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>>> storage complexity is 2*|V|.
>>>
>> This would be very easy to add as an optional chunk, since it can use one
>> row per commit.
> Given this discussion, I wonder if we want to include generation numbers
> as a first class citizen in the current format. They could also go as
> an optional
> chunk and we may want to discuss further if we want generation numbers or
> FELINE numbers or GRAIL or SCARAB, which are all graph related speedup
> mechanism AFAICT.
> In case we decide against generation numbers in the long run,
> the row of mandatory generation numbers would be dead weight
> that we'd need to carry.
Currently, the format includes 8 bytes to share between the generation
number and commit date. Due to alignment concerns, we will want to keep
this as 8 bytes or truncate it to 4-bytes. Either we would be wasting at
least 3 bytes or truncating dates too much (presenting the 2038 problem
[1] since dates are signed).
> I only glanced at the paper, but it looks like a "more advanced 2d
> generation number" that seems to be able to answer questions
> that gen numbers can answer, but that paper also refers
> to SCARAB as well as GRAIL as the state of the art, so maybe
> there are even more papers to explore?
The biggest reason I can say to advance this series (and the small
follow-up series that computes and consumes generation numbers) is that
generation numbers are _extremely simple_. You only need to know your
parents and their generation numbers to compute your own. These other
reachability indexes require examining the entire graph to create "good"
index values.
The hard part about using generation numbers (or any other reachability
index) in Git is refactoring the revision-walk machinery to take
advantage of them; current code requires O(reachable commits) to
topo-order instead of O(commits that will be output). I think we should
table any discussion of these advanced indexes until that work is done
and a valuable comparison can be done. "Premature optimization is the
root of all evil" and all that.
Thanks,
-Stolee
[1] https://en.wikipedia.org/wiki/Year_2038_problem^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 17:54 ` Derrick Stolee@ 2018-04-02 18:02 ` Stefan Beller0 siblings, 0 replies; 146+ messages in thread
From: Stefan Beller @ 2018-04-02 18:02 UTC (permalink / raw)
To: Derrick Stolee
Cc: Jakub Narebski, git, Jeff Hostetler, Jeff King, Jonathan Tan,
SZEDER Gábor, Junio C Hamano, Derrick Stolee
> Currently, the format includes 8 bytes to share between the generation
> number and commit date. Due to alignment concerns, we will want to keep this
> as 8 bytes or truncate it to 4-bytes. Either we would be wasting at least 3
> bytes or truncating dates too much (presenting the 2038 problem [1] since
> dates are signed).
Good point. I forgot about them while writing the previous email.
That is reason enough to keep the generation numbers, sorry
for the noise.
>
>> I only glanced at the paper, but it looks like a "more advanced 2d
>> generation number" that seems to be able to answer questions
>> that gen numbers can answer, but that paper also refers
>> to SCARAB as well as GRAIL as the state of the art, so maybe
>> there are even more papers to explore?
>
>
> The biggest reason I can say to advance this series (and the small follow-up
> series that computes and consumes generation numbers) is that generation
> numbers are _extremely simple_. You only need to know your parents and their
> generation numbers to compute your own. These other reachability indexes
> require examining the entire graph to create "good" index values.
Yes, that is a good point, too. Generation numbers can be computed
"commit locally" and do not need expensive setups, which the others
presumably need.
> The hard part about using generation numbers (or any other reachability
> index) in Git is refactoring the revision-walk machinery to take advantage
> of them; current code requires O(reachable commits) to topo-order instead of
> O(commits that will be output). I think we should table any discussion of
> these advanced indexes until that work is done and a valuable comparison can
> be done. "Premature optimization is the root of all evil" and all that.
agreed,
Stefan
^permalinkrawreply [flat|nested] 146+ messages in thread

*Re: [PATCH v4 00/13] Serialized Git Commit Graph
2018-04-02 15:02 ` Derrick Stolee
2018-04-02 17:35 ` Stefan Beller@ 2018-04-07 22:37 ` Jakub Narebski1 sibling, 0 replies; 146+ messages in thread
From: Jakub Narebski @ 2018-04-07 22:37 UTC (permalink / raw)
To: Derrick Stolee
Cc: git, Jeff Hostetler, Jeff King, Jonathan Tan, Szeder Gábor,
Stefan Beller, Junio C Hamano, Derrick Stolee
Derrick Stolee <stolee@gmail.com> writes:
> On 4/2/2018 10:46 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
> [...]
>> I see the FELINE-index as a stronger form of generation numbers (called
>> also level of the vertex / node), in that it allows to negative-cut even
>> more, pruning paths that are known to be unreachable (or marking nodes
>> known to be unreachable in the "calculate difference" scenario).
>>
>> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
>> one of those indices can be the topological numbering (topological
>> sorting order) of nodes in the commit graph. That would help to answer
>> even more Git questions.
>
> This two-dimensional generation number is helpful for non-reachability
> queries, but is something that needs the "full" commit graph in order
> to define the value for a single commit (hence the O(N lg N)
> performance mentioned below). Generation numbers are effective while
> being easy to compute and immutable.
O(N log N) is the cost of sort algorithm, namely peforming topological
sorting of commits. Which Git can currently do.
We can use the main idea behind FELINE index, namely weak dominance
drawing, while perhaps changing the detail. The main idea is that if
you draw edges of DAG always to be in selected quadrant -- the default
is drawing them up and to the right, then paths would also always be in
the same quadrant; if target node is not in given quadrant with respect
to source node, then target node cannot be reached from source node.
For fast answering of reachability queries it is important to have
minimal number of false positives (falsely implied paths), that is
situation where FELINE index does imply that there could be a path, but
in fact target is not reachable from the source. The authors propose a
heuristics: use positions in [some] topological order for X coordinates
of FELINE index, and generate new optimal locally topological sorting to
use for Y coordinates.
Generation numbers (which can be used together with FELINE index, and
what paper does use -- calling them Level Filter) have the advantage of
being able to be calculated incrementally. I think this is what you
wanted to say.
I think it would be possible to calculate FELINE index incrementally,
too. If Git's topological order is stable, at least for X coordinates
of FELINE index it would be easy.
I have started experimenting with this approach in Jupyter Notebook,
available on Google Colaboratory (you can examine it, run it and edit it
from web browser -- the default is to use remote runtime on Google
cloud). Currently I am at the stage of reproducing the theoretical
parts of the FELINE paper.
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzghttps://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing> I wonder if FELINE was compared directly to a one-dimensional index (I
> apologize that I have not read the paper in detail, so I don't
> understand the indexes they compare against).
They have compared FELINE using real graphs (like ArXiv, Pubmed,
CiteseerX, Uniprot150m) and synthetic DAG against:
- GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling)
- FERRARI (Flexible and Efficient Reachability Range Assignment for
gRapth Indexing), with interval set compression to approximate ones
- Nuutila's INTERVAL, which extracts complete transitive closure of
the graph and stores it using PWAH bit-vector compression
- TF-Label (Topological Folding LABELling)
> It also appears the
> graphs they use for their tests are not commit graphs, which have a
> different shape than many of the digraphs studies by that work.
I plan to check how FELINE index works for commit graphs in
above-mentioned Jupyter Notebook.
> This is all to say: I would love to see an interesting study in this
> direction, specifically comparing this series' definition of
> generation numbers to the 2-dimensional system in FELINE, and on a
> large sample of commit graphs available in open-source data sets
> (Linux kernel, Git, etc.) and possibly on interesting closed-source
> data sets.
I wonder if there are more recent works, with even faster solutions.
>>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>>> --contains', when 'git fetch' checks for forced-updates of branches,
>>> or when the server decides enough negotiation has occurred during a
>>> 'git fetch'. While these may be worth investigating, they also benefit
>>> greatly from the accelerated graph walk introduced in the current
>>> format.
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
I wonder if next low-hanging branch would be to store topological
ordering of commits. It could be done, I think, with two chunks (or two
parts of one chunk): first to store position in topological order for
each commit (entries sorted by hash), second to store list of commits in
topological order (entries sorted by topological sort).
>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use
> one row per commit.
Right.
--
Jakub Narębski
^permalinkrawreply [flat|nested] 146+ messages in thread