Martin Fick <mf...@codeaurora.org> writes:
> I wanted to explore the idea of exploiting knowledge about
> previous repacks to help speed up future repacks.
>
> I had various ideas that seemed like they might be good
> places to start, but things quickly got away from me.
> Mainly I wanted to focus on reducing and even sometimes
> eliminating reachability calculations, since that seems to
> be the one major unsolved slow piece during repacking.
>
> My first line of thinking goes like this: "After a full
> repack, reachability of the current refs is known. Exploit
> that knowledge for future repacks." There are some very
> simple scenarios where if we could figure out how to
> identify them reliably, I think we could simply avoid
> reachability calculations entirely, and yet end up with the
> same repacked files as if we had done the reachability
> calculations. Let me outline some to see if they make sense
> as starting place for further discussion.
>
> -------------
>
> * Setup 1:
>
> Do a full repack. All loose and packed objects are added
> to a single pack file (assumes git config repack options do
> not create multiple packs).
>
> * Scenario 1:
>
> Start with Setup 1. Nothing has changed on the repo
> contents (no new object/packs, refs all the same), but
> repacking config options have changed (for example
> compression level has changed).
>
> * Scenario 2:
>
> Starts with Setup 1. Add one new pack file that was
> pushed to the repo by adding a new ref to the repo (existing
> refs did not change).
>
> * Scenario 3:
>
> Starts with Setup 1. Add one new pack file that was
> pushed to the repo by updating an existing ref with a fast
> forward.
>
> * Scenario 4:
>
> Starts with Setup 1. Add some loose objects to the repo
> via a local fast forward ref update (I am assuming this is
> possible without adding any new unreferenced objects?)
>
>
> In all 4 scenarios, I believe we should be able to skip
> history traversal and simply grab all objects and repack
> them into a new file?

If nothing else has happened in the repository, perhaps, but I
suspect that the real problem is how you would prove it. For
example, I am guessing that your Scenario 4 could be something like:

    : setup #1
    $ git repack -a -d -f
    $ git prune

    : scenario #4
    $ git commit --allow-empty -m 'new commit'

which would add a single loose object to the repository, advancing
the current branch ref by one commit, fast-forwarding relative to
the state you were in after setup #1.
But how would you efficiently prove that it was the only thing that
happened? The user could have done this instead of a single commit:

    : scenario #4 look-alike
    $ git commit --allow-empty -m 'lost commit'
    $ git reset --hard HEAD^
    $ git commit --allow-empty -m 'new commit'

and the reflog entry for HEAD or the current branch ref for that
lost commit may already be ancient by the time you look at this
state. Your object database now has two loose commits, and you
would want to lose the older one, 'lost commit', which is not
reachable.
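The look-alike sequence can be reproduced in a throwaway repository
to see the problem directly (a sketch; the 'demo' directory name and
user config values are illustrative):

```shell
# Sketch: reproduce the scenario #4 look-alike and show that one of
# the resulting loose commits is unreachable.  The repository name
# and user settings below are illustrative, not prescribed.
git init -q demo && cd demo
git config user.email 'you@example.com'
git config user.name 'You'
git commit -q --allow-empty -m 'initial'
git repack -a -d -f && git prune      # setup #1: one pack, no loose objects
git commit -q --allow-empty -m 'lost commit'
git reset -q --hard HEAD^
git commit -q --allow-empty -m 'new commit'
# Two loose commits now exist; ignoring reflogs, fsck reports the
# 'lost commit' object as an unreachable commit:
git fsck --unreachable --no-reflogs
```

A naive "grab all objects and repack" pass over this repository would
keep the unreachable commit that a real repack && prune would drop.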

Also with Scenario #2, how would you prove that the new pack does
not contain any cruft that is not reachable? When receiving a pack
and updating our refs, we only verify that we have all the objects
needed to complete the updated refs; we do not reject packs that
contain cruft we do not need.
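One way to check a pack for such cruft after the fact (a sketch; the
repository name and temporary file paths are illustrative) is to diff
the objects stored in the packs against the set of objects reachable
from the refs:

```shell
# Sketch: list every object in the repository's packs and diff that
# against everything reachable from the refs; anything only in the
# pack is potential cruft.  Names and paths are illustrative.
git init -q pack-demo && cd pack-demo
git config user.email 'you@example.com'
git config user.name 'You'
git commit -q --allow-empty -m 'initial'
git repack -q -a -d
# Object IDs stored in the pack(s), filtered from verify-pack output:
git verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 ~ /^(commit|tree|blob|tag)$/ {print $1}' | sort -u > /tmp/in-pack
# Object IDs reachable from all refs:
git rev-list --objects --all | cut -d' ' -f1 | sort -u > /tmp/reachable
# Objects present in a pack but not reachable from any ref
# (empty here, since this toy repo has no cruft):
comm -23 /tmp/in-pack /tmp/reachable
```

This is far too expensive to run per-push, of course, which is
exactly the proof-of-cleanliness problem described above.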

These two are only examples, and we might be able to convince
ourselves that not pruning (or not ejecting cruft from packs) is
acceptable. But that introduces a different mode of operation
rather than optimizing repacking without changing what "repacking"
means. I am not saying it is bad to change the meaning if we can
make a good argument weighing the pros and cons; a small amount of
bloat might be acceptable in exchange for a large enough performance
gain, but not if the user is relying on repack && prune as a way to
eradicate undesirable contents from the object database.