My masterplan for git migration (+ looking for infra to test it)

Hi,

I'm quite tired of promises and all that perfectionist nonsense which
locks us up with CVS for the next 10 years of bikeshedding. Therefore, I have
prepared a plan for how to do the git migration, and I believe it's doable in
less than 2 weeks (plus the testing). Of course, that assumes infra is
going to cooperate quickly or someone else is willing to provide the
infra for it.

I can provide some testing repos once someone is willing to provide
the hardware.

What needs to be done
---------------------

I can do most of the scripting. What I need others to do is provide
the hosting for git repos. We can't use public services like github
since they don't allow us to set our own update hook, so we can't
enforce signing policies etc.

Once basic infra is ready, I think the following is the best way to
switch:

1. send announcement to devs to explain how to use git,

2. lock CVS out to read-only,

3. create all the git repos, get hooks rolling,

4. enable R/W access to the repos.

With some luck, no more than 2 hours downtime.

The infra
---------

The general idea is based on a 3-level structure that's an extension of how
Funtoo works. The following ultimately pretty picture explains that:

We have a main developer repo where developers work & commit and are
relatively happy. For every push into the developer repo, an automated magic
thingie merges stuff into the user sync repo and updates the metadata cache
there.

The user sync repo is for power users that want to fetch via git. It's quite
fast and efficient for frequent updates, and also saves space by being free
of ChangeLogs.

The rsync tree is propagated on top of the user sync repo. It is populated
with all the old ChangeLogs copied from CVS (stored in a 30M git repo); new
ChangeLogs are generated from git logs and Manifests are expanded.
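
As a rough illustration of what "Manifests are expanded" could mean in practice, here is a minimal Python sketch that turns a thin (DIST-only) Manifest into a thicker one by adding per-file entries. The line layout (type, name, size, then hash name/value pairs), the choice of SHA256 and SHA512, and the function names are assumptions made for this example; it also skips the files/ directory (AUX entries) for brevity.

```python
import hashlib
from pathlib import Path


def manifest_entry(kind, path, hashes=("sha256", "sha512")):
    """Return one Manifest-style line: TYPE name size HASH hex [HASH hex ...]."""
    data = path.read_bytes()
    fields = [kind, path.name, str(len(data))]
    for name in hashes:
        fields += [name.upper(), hashlib.new(name, data).hexdigest()]
    return " ".join(fields)


def thicken(pkg_dir, thin_lines):
    """Add EBUILD and MISC entries for files in pkg_dir to a DIST-only Manifest."""
    extra = []
    for f in sorted(pkg_dir.iterdir()):
        if f.suffix == ".ebuild":
            extra.append(manifest_entry("EBUILD", f))
        elif f.name in ("metadata.xml", "ChangeLog"):
            extra.append(manifest_entry("MISC", f))
    # Keep the existing DIST lines and sort everything for stable output.
    return sorted(list(thin_lines) + extra)
```

A script along these lines would run once per package directory while the rsync tree is being assembled, leaving the developer and user sync repos with thin Manifests only.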

Main developer repo
-------------------

I was able to create an initial git repository that takes around 66M
as a git pack (this is how much you will have to fetch to start working
with it). The repository is stripped clean of history and ChangeLogs,
and has thin Manifests only.

This means we don't have to wait till someone figures out the perfect
way of converting the old CVS repository. You don't need that history
most of the time, and you can play with CVS to get it if you really do.
In any case, we would likely strip the history anyway to get a small
repo to work with.

I have prepared a basic git update hook that keeps master clean
and attached it to the bug [1]. It enforces basic policies, prevents
forced updates and checks GPG signatures on left-most history line. It
can also be extended to do more extensive tree checks.

For GPG signing, I relied upon gpg to do the right thing. That is, git
checks the signatures and we accept only trusted signatures. So
an external tool (gentoo-keys) needs to play with gpg to import, trust
and revoke developer keys.
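
The policy described above (no forced updates, trusted signatures along the left-most history line) can be modelled in a few lines of Python. This is only a toy model over an in-memory commit graph, not the hook attached to the bug: the `Commit` type, the `check_push` function and the rule that the old tip must sit on the first-parent line of the new tip are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Commit:
    parents: Tuple[str, ...] = ()   # parent ids; parents[0] is the left-most one
    signer: Optional[str] = None    # key id of a valid signature, None if unsigned


def check_push(commits, old, new, trusted):
    """Return (accepted, reason) for moving a branch from `old` to `new`."""
    cur = new
    while cur != old:
        commit = commits[cur]
        # Every commit on the left-most line must carry a trusted signature.
        if commit.signer not in trusted:
            return False, f"{cur}: missing or untrusted signature"
        # Ran out of history without meeting the old tip: not a clean update.
        if not commit.parents:
            return False, "old tip is not on the first-parent line"
        cur = commit.parents[0]
    return True, "ok"
```

In this model an unsigned commit pushed as a plain fast-forward is refused, while the same commit brought in as the second parent of a signed merge is accepted, because only the first-parent line is inspected.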

I think we should also merge gentoo-news & glsa & herds.xml into
the repository. They all reference Gentoo packages at a particular
state in time, and it would be much nicer to have them synced properly.

User syncing repo
-----------------

IMO this will be the most useful syncing method. The user syncing repo
is updated automatically for developer repo commits, and afterwards
md5-cache is regenerated and committed. Also other repositories (like
DTDs, glsas and others if you dislike the previous idea) are merged
into it.

This repo is still free of ChangeLogs (since git logs are more
efficient) and has thin Manifests. It's the space-efficient Gentoo
variant. And commits are signed so users can verify the trust.

The rsync tree
--------------

We'd also propagate things to rsync. We'd have to populate it with old
ChangeLogs, new ChangeLog entries (autogenerated from git) and thick
Manifests. So users won't notice much of a change.
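
To make "new ChangeLog entries (autogenerated from git)" a bit more concrete, here is a small Python sketch that formats commit data into ChangeLog-style entries. The entry layout (date, author, file list, indented message), the dictionary keys and the function names are assumptions for this example; how the commit data is read out of git is left out on purpose.

```python
from datetime import date


def changelog_entry(day, author, files, message):
    """Format one commit as a ChangeLog-style entry."""
    header = f"  {day.strftime('%d %b %Y')}; {author} {', '.join(files)}:"
    body = "\n".join(f"  {line}" for line in message.strip().splitlines())
    return f"{header}\n{body}\n"


def generate_changelog(commits):
    """Join entries newest first, separated by blank lines."""
    ordered = sorted(commits, key=lambda c: c["date"], reverse=True)
    return "\n".join(
        changelog_entry(c["date"], c["author"], c["files"], c["message"])
        for c in ordered
    )
```

The generated text would then be placed in front of the old ChangeLog content copied from CVS, so that the rsync tree keeps looking the way users expect.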

The remaining issue is signing of stuff. We could supposedly sign
Manifests but IMO it's a waste of resources considering how poor
the signing system is for non-git repos.

Re: My masterplan for git migration (+ looking for infra to test it)

Hi,

14.09.14 14:03, Michał Górny wrote:
> Hi,
>
> I'm quite tired of promises and all that perfectionist non-sense which
> locks us up with CVS for next 10 years of bikeshed. Therefore, I have
> prepared a plan how to do git migration, and I believe it's doable in
> less than 2 weeks (plus the testing). Of course, that assumes infra is
> going to cooperate quickly or someone else is willing to provide the
> infra for it.
>

as always, nice effort, but I foresee lots of bikeshedding in this thread. )

> This means we don't have to wait till someone figures out the perfect
> way of converting the old CVS repository. You don't need that history
> most of the time, and you can play with CVS to get it if you really do.
> In any case, we would likely strip the history anyway to get a small
> repo to work with.
>

Is it so difficult to convert CVS history?

>
> The rsync tree
> --------------
>
> We'd also propagate things to rsync. We'd have to populate it with old
> ChangeLogs, new ChangeLog entries (autogenerated from git) and thick
> Manifests. So users won't notice much of a change.
>

How will a user check the ebuild integrity with thick Manifests using rsync?

> The remaining issue is signing of stuff. We could supposedly sign
> Manifests but IMO it's a waste of resources considered how poor
> the signing system is for non-git repos.
>

Again, how will a user check the integrity and authenticity if Manifests are unsigned?

Also, it would be a good idea to add automatic signature checking to portage for overlays that use signing (or is it already done?).

Re: My masterplan for git migration (+ looking for infra to test it)

> I think we should also merge gentoo-news & glsa & herds.xml into the
> repository. They all reference Gentoo packages at a particular state
> in time, and it would be much nicer to have them synced properly.

Not a good idea, because we may want to grant commit access to these
repos for people who are not necessarily ebuild devs.

Re: My masterplan for git migration (+ looking for infra to test it)

On Sunday, 14 September 2014, 15:17:41, Ulrich Mueller wrote:
> >>>>> On Sun, 14 Sep 2014, Michał Górny wrote:
> > I think we should also merge gentoo-news & glsa & herds.xml into the
> > repository. They all reference Gentoo packages at a particular state
> > in time, and it would be much nicer to have them synced properly.
>
> Not a good idea, because we may want to grant commit access to these
> repos for people who are not necessarily ebuild devs.
>
> Ulrich

This could be solved by a pull requests review tool (gerrit, reviewboard,
gitlab etc).

Re: My masterplan for git migration (+ looking for infra to test it)

Another question: will it be possible to maintain a copy of the tree on github to make contributions simpler for users (similarly to e.g. the science overlay)? (Can it somehow be combined with the proposed signing mechanism?)

Re: My masterplan for git migration (+ looking for infra to test it)

14.09.14 15:23, Jauhien Piatlicki wrote:
> Another question: will it be possible to maintain a copy of tree on github to make contributions for users simpler (similarly to e.g. science overlay)? (Can it somehow be combined with proposed signing mechanism?)
>
>

Or well, have our own pull requests review tool.

Re: My masterplan for git migration (+ looking for infra to test it)

On 09/14/14 08:24 PM, Jauhien Piatlicki wrote:
> 14.09.14 15:23, Jauhien Piatlicki wrote:
>> Another question: will it be possible to maintain a copy of tree on github to make contributions for users simpler (similarly to e.g. science overlay)? (Can it somehow be combined with proposed signing mechanism?)
>>
>>
> Or well, have our own pull requests review tool.
NIH? What would be the benefit of that? Before going down this path, I
think there are some good tools around which may at least serve as a base
to fork from before starting a ground-up project.

Sorry to jump in the middle of the conversation, but I know 1st hand how
much is involved here.

Re: My masterplan for git migration (+ looking for infra to test it)

14.09.14 15:25, "C. Bergström" wrote:

> On 09/14/14 08:24 PM, Jauhien Piatlicki wrote:
>> 14.09.14 15:23, Jauhien Piatlicki wrote:
>>> Another question: will it be possible to maintain a copy of tree on github to make contributions for users simpler (similarly to e.g. science overlay)? (Can it somehow be combined with proposed signing mechanism?)
>>>
>>>
>> Or well, have our own pull requests review tool.
> NIH? What would be the benefit of that.. before going down this path.. I think there's some good tools around which may at least serve as a base to (fork) from before starting a ground up project.
>
> Sorry to jump in the middle of the conversation, but I know 1st hand how much is involved here.
>

Re: My masterplan for git migration (+ looking for infra to test it)

> On Sunday, 14 September 2014, 15:17:41, Ulrich Mueller wrote:
>> >>>>> On Sun, 14 Sep 2014, Michał Górny wrote:
>> > I think we should also merge gentoo-news & glsa & herds.xml into the
>> > repository. They all reference Gentoo packages at a particular state
>> > in time, and it would be much nicer to have them synced properly.
>>
>> Not a good idea, because we may want to grant commit access to
>> these repos for people who are not necessarily ebuild devs.

> This could be solved by a pull requests review tool (gerrit,
> reviewboard, gitlab etc).

The second argument is that gentoo-x86 is large enough as it is, and we
shouldn't make it even larger by merging in things that are not
strictly necessary. Especially glsa has a non-negligible size.

How long does the md5-cache regeneration process take? Are you sure it
will be able to keep up with the rate of pushes to the repo during
"peak hours"? If not, maybe we could use a time-based thing similar to
the current cvs->rsync synchronization.

[...]

> Main developer repo
> -------------------
>
> I was able to create a start git repository that takes around 66M
> as a git pack (this is how much you will have to fetch to start working
> with it). The repository is stripped clean of history and ChangeLogs,
> and has thin Manifests only.
>
> This means we don't have to wait till someone figures out the perfect
> way of converting the old CVS repository. You don't need that history
> most of the time, and you can play with CVS to get it if you really do.

+1

> In any case, we would likely strip the history anyway to get a small
> repo to work with.
>
> I have prepared a basic git update hook that keeps master clean
> and attached it to the bug [1]. It enforces basic policies, prevents
> forced updates and checks GPG signatures on left-most history line. It
> can also be extended to do more extensive tree checks.

Are we going to disallow merge commits and ask devs to rebase local
changes in order to keep the history "clean"?

Re: My masterplan for git migration (+ looking for infra to test it)

Jauhien Piatlicki:
>
> Again, how will user check the integrity and authenticity if Manifests are unsigned?
>

While this is an issue to be solved, it shouldn't be a blocker for the
git migration.

There is no regression if this isn't solved. There is no sane automated
method for verifying signed Manifests yet (that should be on PM level)
and signing them isn't even enforced throughout the tree. Moreover I
highly doubt that there is any user who runs around ebuild directories
and checks Manifest signatures by hand.

People who really care use emerge-webrsync.
If we use the proposed solution, then there is an additional method via
the User syncing repo, so it's a win.

We can put more effort into solving this for rsync mirrors later, but
I'd rather focus on the git migration.

Re: My masterplan for git migration (+ looking for infra to test it)

>> Main developer repo
>> -------------------
>>
>> I was able to create a start git repository that takes around 66M
>> as a git pack (this is how much you will have to fetch to start working
>> with it). The repository is stripped clean of history and ChangeLogs,
>> and has thin Manifests only.
>>
>> This means we don't have to wait till someone figures out the perfect
>> way of converting the old CVS repository. You don't need that history
>> most of the time, and you can play with CVS to get it if you really do.
>
> +1
>

+1

>> In any case, we would likely strip the history anyway to get a small
>> repo to work with.
>>
>> I have prepared a basic git update hook that keeps master clean
>> and attached it to the bug [1]. It enforces basic policies, prevents
>> forced updates and checks GPG signatures on left-most history line. It
>> can also be extended to do more extensive tree checks.
>
> Are we going to disallow merge commits and ask devs to rebase local
> changes in order to keep the history "clean"?
>

I'd say it doesn't make sense to create merge commits for conflicts that
arise by someone having pushed earlier than you.

Merge commits should only be there if they give useful information.

Also... if you merge from a _user_ who is untrusted and allow a
fast-forward merge, then the signature verification fails. That means
for such pull requests you either have to use "git am" or "git merge
--no-ff".

Re: My masterplan for git migration (+ looking for infra to test it)

> Davide Pesavento:
>>> In any case, we would likely strip the history anyway to get a small
>>> repo to work with.
>>>
>>> I have prepared a basic git update hook that keeps master clean
>>> and attached it to the bug [1]. It enforces basic policies, prevents
>>> forced updates and checks GPG signatures on left-most history line. It
>>> can also be extended to do more extensive tree checks.
>>
>> Are we going to disallow merge commits and ask devs to rebase local
>> changes in order to keep the history "clean"?
>>
>
> I'd say it doesn't make sense to create merge commits for conflicts that
> arise by someone having pushed earlier than you.
>
> Merge commits should only be there if they give useful information.
>

I totally agree. But is there a way to automatically enforce this?

> Also... if you merge from a _user_ who is untrusted and allow a
> fast-forward merge, then the signature verification fails. That means
> for such pull requests you either have to use "git am" or "git merge
> --no-ff".
>

Right. In that case you can either sign the merge commit or amend the
user's commit and sign it yourself (re-signing could be needed anyway
if you have to rebase).

Re: My masterplan for git migration (+ looking for infra to test it)

> This means we don't have to wait till someone figures out the perfect
> way of converting the old CVS repository. You don't need that history
> most of the time, and you can play with CVS to get it if you really do.

Once somebody works this out, you can also simply make it available as a "replacement" ref.

See 'git replace'

This would mean, essentially, you could push a replacement ref (git keeps these under 'refs/replace/', named after the commit being replaced and pointing at its replacement) that ties the first commit of the new repository to the converted CVS history. Anyone who wanted it could manually fetch it, and anyone who did fetch it would get the full history in all of its glory; git would then transparently pretend that history was always there anyway.
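
A toy Python model may help show why a replacement ref makes the old history appear without rewriting anything. The commit names, the `parents` table and the `log` function below are invented for illustration; with real git the replacement would be created with 'git replace' rather than a dictionary.

```python
def log(parents, tip, replace=None):
    """Walk first parents from `tip`, honouring a table of replacement objects."""
    replace = replace or {}
    seen = []
    cur = tip
    while cur is not None:
        seen.append(cur)                 # the commit keeps its original name
        obj = replace.get(cur, cur)      # but its content comes from the replacement
        ps = parents[obj]
        cur = ps[0] if ps else None
    return seen


# New repository: new1 is a parentless root, new2 sits on top of it.
# Converted CVS history: cvs1 <- cvs2. new1-grafted has new1's content
# plus cvs2 as its parent.
parents = {
    "new2": ["new1"],
    "new1": [],
    "cvs2": ["cvs1"],
    "cvs1": [],
    "new1-grafted": ["cvs2"],
}
```

Without the replacement table the walk stops at new1; with `{"new1": "new1-grafted"}` it continues into cvs2 and cvs1, which is the effect people who fetch the replacement ref would see.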

Re: My masterplan for git migration (+ looking for infra to test it)

> 14.09.14 14:03, Michał Górny wrote:
> > Hi,
> >
> > I'm quite tired of promises and all that perfectionist non-sense which
> > locks us up with CVS for next 10 years of bikeshed. Therefore, I have
> > prepared a plan how to do git migration, and I believe it's doable in
> > less than 2 weeks (plus the testing). Of course, that assumes infra is
> > going to cooperate quickly or someone else is willing to provide the
> > infra for it.
> >
>
> as always, nice effort, but I foresee lots of bikeshedding in this thread. )

Yes. I'm planning to ignore most of the bikeshedding and take only serious
answers into consideration. Otherwise, we will be stuck with CVS.

> > This means we don't have to wait till someone figures out the perfect
> > way of converting the old CVS repository. You don't need that history
> > most of the time, and you can play with CVS to get it if you really do.
> > In any case, we would likely strip the history anyway to get a small
> > repo to work with.
>
> Is it so difficult to convert CVS history?

It may be difficult to convert it properly, especially considering
the splitting of ebuild+Manifest commits. Then we need to somehow check
if it was converted properly. I don't even want to waste my time on
this. IMO the history doesn't have such a great value.

> > The rsync tree
> > --------------
> >
> > We'd also propagate things to rsync. We'd have to populate it with old
> > ChangeLogs, new ChangeLog entries (autogenerated from git) and thick
> > Manifests. So users won't notice much of a change.
> >
>
> How will user check the ebuild integrity with thick manifests using rsync?

The same way he currently does :).

> > The remaining issue is signing of stuff. We could supposedly sign
> > Manifests but IMO it's a waste of resources considered how poor
> > the signing system is for non-git repos.
>
> Again, how will user check the integrity and authenticity if Manifests are unsigned?

As far as I'm concerned, users can use the user git tree to get proper
signatures or any other method that has proper signing support already.

Re: My masterplan for git migration (+ looking for infra to test it)

On Sun, Sep 14, 2014 at 8:03 AM, Michał Górny <[hidden email]> wrote:
>
> I'm quite tired of promises and all that perfectionist non-sense which
> locks us up with CVS for next 10 years of bikeshed.

While I tend to agree with the sentiment, I don't think you're
actually targeting the problems that aren't already solved here.

> Of course, that assumes infra is
> going to cooperate quickly or someone else is willing to provide the
> infra for it.

The infra components to a git infrastructure are one of the main
blockers at this point. I don't really see cooperation as the issue -
just lack of manpower or interest.

>
> I can provide some testing repos once someone is willing to provide
> the hardware.

We already have plenty of testing repos (well, minus all the back-end stuff).

>
> 1. send announcement to devs to explain how to use git,

This is one of the blockers. We haven't actually decided how we want
to use git.

Sure, everybody knows how to use git. The problem is that there are a
dozen different ways we COULD use git, and nobody has picked the ONE
way we WILL use it.

This isn't as trivial as you might think. We have a fairly high
commit rate and with a single repository that means that in-between a
pull-merge/rebase-push there is a decent chance of another commit that
will make the resulting push a non-fast-forward.

People love to point out linux and its insane commit rate. The thing
is, the mainline git repo with all those commits has exactly one
committer - Linus himself. They don't have one big repo with one
master branch that everybody pushes to. At least, that is my
understanding (and there are certainly others here who are more
involved with kernel development).

>
> 2. lock CVS out to read-only,
>
> 3. create all the git repos, get hooks rolling,
>
> 4. enable R/W access to the repos.
>
> With some luck, no more than 2 hours downtime.

I agree that the actual conversion should be able to be done quickly.

> On top of user sync repo rsync is propagated. The rsync tree is populated
> with all old ChangeLogs copied from CVS (stored in 30M git repo), new
> ChangeLogs are generated from git logs and Manifests are expanded.

So, I don't really have a problem with your design. I still question
whether we still need to be generating changelogs - they seem
incredibly redundant. But, if people really want a redundant copy of
the git log, whatever...

> Main developer repo
> -------------------
>
> I was able to create a start git repository that takes around 66M
> as a git pack (this is how much you will have to fetch to start working
> with it). The repository is stripped clean of history and ChangeLogs,
> and has thin Manifests only.
>
> This means we don't have to wait till someone figures out the perfect
> way of converting the old CVS repository. You don't need that history
> most of the time, and you can play with CVS to get it if you really do.
> In any case, we would likely strip the history anyway to get a small
> repo to work with.

We already have a migration process that converts the old CVS
repository, generating both a shallow repository that lacks history
and a full repository that contains all of history. Additionally,
these two are consistent - that is, the last branch of the full
repository has the same commit ID as the base of the shallow
repository. Basically we generate the full history and then trim out
99% of it so that the commit in the shallow repository points to a
parent that isn't in the packed repository.

Actually doing the conversion is basically a solved problem. If this
were actually the blocker I'd be all for just sticking the history in
a different repo and starting from scratch with a new one.

>
> I think we should also merge gentoo-news & glsa & herds.xml into
> the repository. They all reference Gentoo packages at a particular
> state in time, and it would be much nicer to have them synced properly.
>

I can see the pros/cons here, but I don't personally have an issue
with merging them. As has been brought up elsewhere herds.xml may
just go away.

If somebody can come up with a set of hooks/scripts that will create
the various trees and the only thing that is left is to get infra to
host them, I think we can make real progress. I don't think this is
something that needs to take a long time. The pieces are mostly there
- they just have to be assembled.

Re: My masterplan for git migration (+ looking for infra to test it)

> >>>>> On Sun, 14 Sep 2014, Michał Górny wrote:
>
> > I think we should also merge gentoo-news & glsa & herds.xml into the
> > repository. They all reference Gentoo packages at a particular state
> > in time, and it would be much nicer to have them synced properly.
>
> Not a good idea, because we may want to grant commit access to these
> repos for people who are not necessarily ebuild devs.

We may want to add metadata.xml access to those people too.

If you really are that distrustful of our contributors, I believe we
can do per-path filtering in the 'update' hook, or use pull request
or intermediate-repository based workflow.
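
Per-path filtering of the kind mentioned above could look roughly like this in Python. The group names, the path patterns and the `rejected_paths` function are hypothetical; a real 'update' hook would first have to work out who is pushing and which paths the pushed commits touch.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: which path patterns each group may modify.
ACL = {
    "ebuild-devs": ["*"],
    "news-writers": ["metadata/news/*"],
    "security": ["metadata/glsa/*"],
}


def rejected_paths(group, changed, acl=ACL):
    """Return the changed paths that `group` is not allowed to touch."""
    patterns = acl.get(group, [])
    return [
        path
        for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

The hook would refuse the push whenever the returned list is non-empty, and could print that list so the contributor knows which files were out of bounds.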

Re: My masterplan for git migration (+ looking for infra to test it)

> Another question: will it be possible to maintain a copy of tree on github to make contributions for users simpler (similarly to e.g. science overlay)? (Can it somehow be combined with proposed signing mechanism?)

Yes. I'm planning to have a mirror on github and bitbucket,
and auto-pushing to both.

However, I'm wondering if it would be possible to restrict people from
accidentally committing straight into github (e.g. merging pull
requests there instead of to our main server).

In fact, I would start my experiments straight on github if not for
the fact that they don't allow us to set our own update hooks.

Re: My masterplan for git migration (+ looking for infra to test it)

> On Sun, Sep 14, 2014 at 2:03 PM, Michał Górny <[hidden email]> wrote:
> > We have main developer repo where developers work & commit and are
> > relatively happy. For every push into developer repo, automated magic
> > thingie merges stuff into user sync repo and updates the metadata cache
> > there.
>
> How long does the md5-cache regeneration process take? Are you sure it
> will be able to keep up with the rate of pushes to the repo during
> "peak hours"? If not, maybe we could use a time-based thing similar to
> the current cvs->rsync synchronization.

This strongly depends on how much data there is to update. A few
ebuilds are quite fast, an eclass change isn't ;). I was thinking of
something along these lines, in pseudo-code:

systemctl restart cache-regen

That is, we start the regen on every update. If it finishes in time, it
commits the new metadata. If another update occurs during regen, we
just restart it to let it catch the new data.

Of course, if we can't spare the resources to do intermediate updates,
we may as well switch to a cron-based update method.
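
The restart-on-every-push behaviour described above can be sketched in Python with asyncio. This is a stand-in for what the service restart would do, under the assumption that a cancelled regen simply discards its partial work; the `RegenRunner` class and its `committed` list are invented for the example.

```python
import asyncio


class RegenRunner:
    """Restart a long-running regen whenever a new push arrives."""

    def __init__(self, regen):
        self._regen = regen     # coroutine function: regen(rev)
        self._task = None
        self.committed = []     # revisions whose regen ran to completion

    async def _run(self, rev):
        await self._regen(rev)
        self.committed.append(rev)   # stands in for committing the new metadata

    def on_push(self, rev):
        """Call from inside a running event loop for every push."""
        if self._task is not None and not self._task.done():
            self._task.cancel()      # an update arrived mid-regen: start over
        self._task = asyncio.get_running_loop().create_task(self._run(rev))

    async def wait(self):
        """Wait for the most recently started regen, if any."""
        if self._task is not None:
            try:
                await self._task
            except asyncio.CancelledError:
                pass
```

If pushes keep arriving faster than a regen can finish, nothing is ever committed in this scheme, which is the case where the cron-based fallback becomes attractive.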

> [...]
> > In any case, we would likely strip the history anyway to get a small
> > repo to work with.
> >
> > I have prepared a basic git update hook that keeps master clean
> > and attached it to the bug [1]. It enforces basic policies, prevents
> > forced updates and checks GPG signatures on left-most history line. It
> > can also be extended to do more extensive tree checks.
>
> Are we going to disallow merge commits and ask devs to rebase local
> changes in order to keep the history "clean"?

I don't think we should cripple git. Just to be clear, 'accidental'
merges won't happen because the automatic merges are unsigned
and the 'update' hook will refuse them.

The developers will have to either rebase and re-sign the commits, or
use a signed merge commit, whichever makes more sense in the particular
context.

Signed merge commits will also allow merging user-submitted changes
while preserving original history.