Re: autopkgtest-build-lxd failing with bionic

On 14.02.2018 22:03, Dimitri John Ledkov wrote:

> Hi,
>
> I am on bionic and managed to build bionic container for testing using:
>
> $ autopkgtest-build-lxd ubuntu-daily:bionic/amd64
>
> Note this uses Ubuntu Foundations provided container as the base,
> rather than the third-party image that you are using from "images"
> remote.
>
> Why are you using images: remote?

Because that's what the manpage suggests :)

> Is the failure reproducible with ubuntu-daily:bionic?
>
> If you can build images with ubuntu-daily:bionic, then you need to
> contact and file an issue with images: remote provider.

ubuntu-daily: works, images: fails for artful and bionic while xenial
works, and the image server is:

https://images.linuxcontainers.org/

Right, and quite deliberately. At least back in "my days", the ubuntu: and
ubuntu-daily: images had a lot of fat in them which made them both
unnecessarily slow (extra download time, requires more RAM/disk, etc.) and also
undesirable for test correctness, as having all of the unnecessary bits
preinstalled easily hides missing dependencies.

The latter can be alleviated by purging stuff of course, and that's what
happens for the cloud VM images in OpenStack:

https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/setup-commands/setup-testbed#n242

But this takes even more time, and so far just hasn't been necessary as the
images: ones were just right - they contain exactly what a generic container
image is supposed to contain and are pleasantly small and fast.

> > Is the failure reproducible with ubuntu-daily:bionic?
> >
> > If you can build images with ubuntu-daily:bionic, then you need to
> > contact and file an issue with images: remote provider.
>
> ubuntu-daily: works, images: fails for artful and bionic while xenial
> works, and the image server is:
>
> https://images.linuxcontainers.org/

These are being advertised and used a lot, so maybe Stephane's LXD team can
help with fixing these? Them having no network at all sounds like a grave bug
which should be fixed either way.

Re: autopkgtest-build-lxd failing with bionic

On Thu, Feb 15, 2018 at 04:10:01PM +0100, Martin Pitt wrote:

> Hello Timo,
>
> Timo Aaltonen [2018-02-15 16:50 +0200]:
> > On 14.02.2018 22:03, Dimitri John Ledkov wrote:
> > > Hi,
> > >
> > > I am on bionic and managed to build bionic container for testing using:
> > >
> > > $ autopkgtest-build-lxd ubuntu-daily:bionic/amd64
> > >
> > > Note this uses Ubuntu Foundations provided container as the base,
> > > rather than the third-party image that you are using from "images"
> > > remote.
> > >
> > > Why are you using images: remote?
> >
> > Because that's what the manpage suggests :)
>
> Right, and quite deliberately. At least back in "my days", the ubuntu: and
> ubuntu-daily: images had a lot of fat in them which made them both
> unnecessarily slow (extra download time, requires more RAM/disk, etc.) and also
> undesirable for test correctness, as having all of the unnecessary bits
> preinstalled easily hides missing dependencies.
>
> The latter can be alleviated by purging stuff of course, and that's what
> happens for the cloud VM images in OpenStack:
>
> https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/setup-commands/setup-testbed#n242
>
> But this takes even more time, and so far just hasn't been necessary as the
> images: ones were just right - they contain exactly what a generic container
> image is supposed to contain and are pleasantly small and fast.
>
> > > Is the failure reproducible with ubuntu-daily:bionic?
> > >
> > > If you can build images with ubuntu-daily:bionic, then you need to
> > > contact and file an issue with images: remote provider.
> >
> > ubuntu-daily: works, images: fails for artful and bionic while xenial
> > works, and the image server is:
> >
> > https://images.linuxcontainers.org/
>
> These are being advertised and used a lot, so maybe Stephane's LXD team can
> help with fixing these? Them having no network at all sounds like a grave bug
> which should be fixed either way.
>
> That said, it could of course be that the setup script just needs some
> adjustments for the netplan changes:
> https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/setup-commands/setup-testbed
>
> As this doesn't know about netplan at all, just ifupdown.
>
> Martin

Re: autopkgtest-build-lxd failing with bionic

> Hello Timo,
>
> Timo Aaltonen [2018-02-15 16:50 +0200]:
> > On 14.02.2018 22:03, Dimitri John Ledkov wrote:
> > > Hi,
> > >
> > > I am on bionic and managed to build bionic container for testing using:
> > >
> > > $ autopkgtest-build-lxd ubuntu-daily:bionic/amd64
> > >
> > > Note this uses Ubuntu Foundations provided container as the base,
> > > rather than the third-party image that you are using from "images"
> > > remote.
> > >
> > > Why are you using images: remote?
> >
> > Because that's what the manpage suggests :)
>
> Right, and quite deliberately. At least back in "my days", the ubuntu: and
> ubuntu-daily: images had a lot of fat in them which made them both
> unnecessarily slow (extra download time, requires more RAM/disk, etc.) and also
> undesirable for test correctness, as having all of the unnecessary bits
> preinstalled easily hides missing dependencies.
>
> The latter can be alleviated by purging stuff of course, and that's what
> happens for the cloud VM images in OpenStack:
>
> https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/setup-commands/setup-testbed#n242
>
> But this takes even more time, and so far just hasn't been necessary as the
> images: ones were just right - they contain exactly what a generic container
> image is supposed to contain and are pleasantly small and fast.
>
> > > Is the failure reproducible with ubuntu-daily:bionic?
> > >
> > > If you can build images with ubuntu-daily:bionic, then you need to
> > > contact and file an issue with images: remote provider.
> >
> > ubuntu-daily: works, images: fails for artful and bionic while xenial
> > works, and the image server is:
> >
> > https://images.linuxcontainers.org/
>
> These are being advertised and used a lot, so maybe Stephane's LXD team can
> help with fixing these? Them having no network at all sounds like a grave bug
> which should be fixed either way.

That's not what's going on at all. They do have working networking, but the
network does not come up fast enough. The apt update is not retried because
it exits with 0 because all it sees are transient errors.

Re: autopkgtest-build-lxd failing with bionic

On 15.02.2018 18:04, Julian Andres Klode wrote:

> On Thu, Feb 15, 2018 at 04:10:01PM +0100, Martin Pitt wrote:
>> Hello Timo,
>>
>> Timo Aaltonen [2018-02-15 16:50 +0200]:
>>> On 14.02.2018 22:03, Dimitri John Ledkov wrote:
>>>> Hi,
>>>>
>>>> I am on bionic and managed to build bionic container for testing using:
>>>>
>>>> $ autopkgtest-build-lxd ubuntu-daily:bionic/amd64
>>>>
>>>> Note this uses Ubuntu Foundations provided container as the base,
>>>> rather than the third-party image that you are using from "images"
>>>> remote.
>>>>
>>>> Why are you using images: remote?
>>>
>>> Because that's what the manpage suggests :)
>>
>> Right, and quite deliberately. At least back in "my days", the ubuntu: and
>> ubuntu-daily: images had a lot of fat in them which made them both
>> unnecessarily slow (extra download time, requires more RAM/disk, etc.) and also
>> undesirable for test correctness, as having all of the unnecessary bits
>> preinstalled easily hides missing dependencies.
>>
>> The latter can be alleviated by purging stuff of course, and that's what
>> happens for the cloud VM images in OpenStack:
>>
>> https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/tree/setup-commands/setup-testbed#n242
>>
>> But this takes even more time, and so far just hasn't been necessary as the
>> images: ones were just right - they contain exactly what a generic container
>> image is supposed to contain and are pleasantly small and fast.
>>
>>>> Is the failure reproducible with ubuntu-daily:bionic?
>>>>
>>>> If you can build images with ubuntu-daily:bionic, then you need to
>>>> contact and file an issue with images: remote provider.
>>>
>>> ubuntu-daily: works, images: fails for artful and bionic while xenial
>>> works, and the image server is:
>>>
>>> https://images.linuxcontainers.org/
>>
>> These are being advertised and used a lot, so maybe Stephane's LXD team can
>> help with fixing these? Them having no network at all sounds like a grave bug
>> which should be fixed either way.
>
> That's not what's going on at all. They do have working networking, but the
> network does not come up fast enough. The apt update is not retried because
> it exits with 0 because all it sees are transient errors.

True, I added a 'sleep 10' in front of the apt-get update line, and now
it works.

On Thu, Feb 15, 2018 at 10:28:05AM -0500, Stéphane Graber wrote:
> […]
> And confirmed that networking inside both of them works fine here.
>
> I wonder if it's a netplan vs ifupdown thing hitting autopkgtest in this case?

I can build images: images(!) quite fine here, but when actually using
them I see these temporary resolution failures most of the time during
the initial apt-get update.

I tracked this down to a race condition - basically we try to do the
`apt-get update' before networking is fully up. (OK, I just saw Julian's
post which came in while I was writing this and says the same thing...)

There's a patch attached here which fixes the problem for me. I'm not
sure if there's a better way to do this - basically it starts
network-online.target and waits for it to become active, with a timeout.
Review appreciated.
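
The attached patch is not reproduced in this thread, so the following is only a minimal sketch of the approach described above: start network-online.target, then poll until it is active or a timeout expires. The function name wait_network_online and the one-second polling interval are assumptions for illustration, not taken from the patch.

```shell
#!/bin/sh
# Sketch only: pull in network-online.target and wait until it is active.
# Usage: wait_network_online TIMEOUT_SECONDS
wait_network_online() {
    remaining=$1
    # --no-block queues the start job and returns without waiting for it.
    systemctl start --no-block network-online.target || return 1
    while [ "$remaining" -gt 0 ]; do
        # is-active exits 0 only once the target has been reached.
        if systemctl is-active --quiet network-online.target; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    # Timed out without the target becoming active.
    return 1
}

# Example (inside the container, as root):
#   wait_network_online 60 && apt-get update
```

The explicit start matters because nothing else in the boot transaction may have pulled the target in; polling is-active alone could then wait forever.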

> > I wonder if it's a netplan vs ifupdown thing hitting autopkgtest in this case?

> I can build images: images(!) quite fine here, but when actually using
> them I see these temporary resolution failures most of the time during
> the initial apt-get update.

> I tracked this down to a race condition - basically we try to do the
> `apt-get update' before networking is fully up. (OK, I just saw Julian's
> post which came in while I was writing this and says the same thing...)

> There's a patch attached here which fixes the problem for me. I'm not
> sure if there's a better way to do this - basically it starts
> network-online.target and waits for it to become active, with a timeout.
> Review appreciated.

It's a bit odd to be "start"ing a target in this manner. Is it even
necessary to start the target, or would it be sufficient to just check
is-active in a loop?

Re: autopkgtest-build-lxd failing with bionic

Iain Lane [2018-02-15 18:48 +0000]:
> There's a patch attached here which fixes the problem for me. I'm not
> sure if there's a better way to do this - basically it starts
> network-online.target and waits for it to become active, with a timeout.
> Review appreciated.

I wouldn't pick on any of these: network-online.target is a sloppily defined
shim for SysV init backwards compatibility, and may not ever get started (in
fact, that's the goal ☺); and the container might not use networkd, so I
wouldn't use s-n-wait-online either. I think querying

[ -n "$(ip route show to 0/0)" ]

is asking the question more directly, i.e. "do I have a default route", and is
ignorant of exactly how the network is brought up (by networkd, NM, ifupdown,
or not explicitly at all as the container might share the host's network
namespace).
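
A minimal sketch of how the default-route check above could be wrapped in a polling loop with a timeout. The helper name wait_for_output, the 60-second limit, and the one-second interval are assumptions for illustration; this is not the code from autopkgtest itself.

```shell
#!/bin/sh
# Hypothetical helper: poll until PROBE prints something, or give up.
# Usage: wait_for_output TIMEOUT_SECONDS PROBE [ARGS...]
wait_for_output() {
    remaining=$1
    shift
    while [ "$remaining" -gt 0 ]; do
        # Non-empty output from the probe counts as "ready".
        if [ -n "$("$@" 2>/dev/null)" ]; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}

# Default-route check from the message above, with a 60 second limit:
#   wait_for_output 60 ip route show to 0/0
```

Because the probe is passed as arguments, the same loop works no matter which tool brought the network up.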

Re: autopkgtest-build-lxd failing with bionic

On Thu, Feb 15, 2018 at 09:55:47PM +0100, Martin Pitt wrote:

> Hello Iain, all,
>
> Iain Lane [2018-02-15 18:48 +0000]:
> > There's a patch attached here which fixes the problem for me. I'm not
> > sure if there's a better way to do this - basically it starts
> > network-online.target and waits for it to become active, with a timeout.
> > Review appreciated.
>
> I wouldn't pick on any of these: network-online.target is a sloppily defined
> shim for SysV init backwards compatibility, and may not ever get started (in
> fact, that's the goal ☺); and the container might not use networkd, so I
> wouldn't use s-n-wait-online either. I think querying

Interesting. I thought that it was the systemd way to say 'I am online
now' --- i.e. nm-online or systemd-networkd-wait-online, which is the
question I wanted to get a positive answer to. I can see that the SysV
implementation isn't great, but it's not clear to me that it was ill
defined for this case.

> [ -n "$(ip route show to 0/0)" ]

This is better though, and works too. Please take a look at the attached
patch. Thanks! :-)

Re: autopkgtest-build-lxd failing with bionic

On Thu, Feb 15, 2018 at 12:00:41PM -0800, Steve Langasek wrote:
> It's a bit odd to be "start"ing a target in this manner. Is it even
> necessary to start the target, or would it be sufficient to just check
> is-active in a loop?

Yeah, it is - it needs to be pulled in by something to get started, but
in this case it's not so we do the same thing in code. It's like this so
you don't end up blocking the boot unnecessarily waiting for the network
to be "up" when nothing needs it to be.

Re: autopkgtest-build-lxd failing with bionic

> > Iain Lane [2018-02-15 18:48 +0000]:
> > > There's a patch attached here which fixes the problem for me. I'm not
> > > sure if there's a better way to do this - basically it starts
> > > network-online.target and waits for it to become active, with a timeout.
> > > Review appreciated.

> > I wouldn't pick on any of these: network-online.target is a sloppily defined
> > shim for SysV init backwards compatibility, and may not ever get started (in
> > fact, that's the goal ☺); and the container might not use networkd, so I
> > wouldn't use s-n-wait-online either. I think querying

> Interesting. I thought that it was the systemd way to say 'I am online
> now' --- i.e. nm-online or systemd-networkd-wait-online, which is the
> question I wanted to get a positive answer to. I can see that the SysV
> implementation isn't great, but it's not clear to me that it was ill
> defined for this case.

> > [ -n "$(ip route show to 0/0)" ]

> This is better though, and works too. Please take a look at the attached
> patch. Thanks! :-)

Actually no, this is racy, because the route comes up before DNS resolution
is in place.

Re: autopkgtest-build-lxd failing with bionic

On Fri, Feb 16, 2018 at 11:12:32AM -0800, Steve Langasek wrote:

> On Fri, Feb 16, 2018 at 11:52:05AM +0000, Iain Lane wrote:
> > On Thu, Feb 15, 2018 at 09:55:47PM +0100, Martin Pitt wrote:
> > > Hello Iain, all,
>
> > > Iain Lane [2018-02-15 18:48 +0000]:
> > > > There's a patch attached here which fixes the problem for me. I'm not
> > > > sure if there's a better way to do this - basically it starts
> > > > network-online.target and waits for it to become active, with a timeout.
> > > > Review appreciated.
>
> > > I wouldn't pick on any of these: network-online.target is a sloppily defined
> > > shim for SysV init backwards compatibility, and may not ever get started (in
> > > fact, that's the goal ☺); and the container might not use networkd, so I
> > > wouldn't use s-n-wait-online either. I think querying
>
> > Interesting. I thought that it was the systemd way to say 'I am online
> > now' --- i.e. nm-online or systemd-networkd-wait-online, which is the
> > question I wanted to get a positive answer to. I can see that the SysV
> > implementation isn't great, but it's not clear to me that it was ill
> > defined for this case.
>
> > > [ -n "$(ip route show to 0/0)" ]
>
> > This is better though, and works too. Please take a look at the attached
> > patch. Thanks! :-)
>
> Actually no, this is racy, because the route comes up before DNS resolution
> is in place.
>
> It's also not forwards-compatible with ipv6-only deploys.
>
> I think the network-online.target is the better thing to key on.

I think we should just grep the apt output and retry if it fails with
connection error messages. This should be fine until I have an improved
solution in apt itself, one of

(1) "there are no transient errors"
(2) one source must have updated
(3) all sources must have updated
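
As a stopgap, the grep-and-retry idea could look roughly like the sketch below. The function name update_with_retry and the two message patterns are assumptions (they are commonly seen apt warnings, not an exhaustive or authoritative list), and this is not the code that apt or autopkgtest uses.

```shell
#!/bin/sh
# Hypothetical helper: run an update command and retry while its output
# looks like a transient network failure, even if the exit status is 0.
# Usage: update_with_retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
update_with_retry() {
    max_tries=$1
    delay=$2
    shift 2
    # Assumed patterns; adjust to the messages actually observed.
    transient_re='Temporary failure resolving|Some index files failed to download'
    try=1
    while :; do
        # Capture stdout and stderr together, then echo them for the log.
        output=$("$@" 2>&1)
        status=$?
        printf '%s\n' "$output"
        if [ "$status" -eq 0 ] &&
            ! printf '%s\n' "$output" | grep -Eq "$transient_re"; then
            return 0
        fi
        if [ "$try" -ge "$max_tries" ]; then
            return 1
        fi
        try=$((try + 1))
        sleep "$delay"
    done
}

# Example (as root on a Debian or Ubuntu system):
#   update_with_retry 5 3 apt-get update
```

Checking the output rather than the exit status is the point here, since the update can exit 0 when all it saw were transient errors.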

Re: autopkgtest-build-lxd failing with bionic

On Fri, Feb 16, 2018 at 08:15:35PM +0100, Julian Andres Klode wrote:
> > I think the network-online.target is the better thing to key on.
>
> I think we should just grep the apt output and retry if it fails with
> connection error messages.

The problem is a general one though. It's not specific to apt. Any time
we use automation on a container or VM, we need to wait until it's
finished booting.

In uvtool this is what "uvt-kvm wait" provides, which currently waits
for upstart runlevel 2 or systemd runlevel 5 and then asks cloud-init
(since a script might also have asked cloud-init to do things it expects
done when the container is "ready"). Of course that's cloud-init
specific.

The script may need fixing, but Ubuntu should agree upon a general
answer to the common question. Even if the answer provides multiple
specified points if multiple points in time are appropriate to solve
different problems.

Re: autopkgtest-build-lxd failing with bionic

> > I wouldn't pick on any of these: network-online.target is a sloppily defined
> > shim for SysV init backwards compatibility, and may not ever get started (in
> > fact, that's the goal ☺); and the container might not use networkd, so I
> > wouldn't use s-n-wait-online either. I think querying
>
> Interesting. I thought that it was the systemd way to say 'I am online
> now' --- i.e. nm-online or systemd-networkd-wait-online, which is the
> question I wanted to get a positive answer to. I can see that the SysV
> implementation isn't great, but it's not clear to me that it was ill
> defined for this case.

"ill defined" is too strong, but it's "sloppy", just as the mere question of
what "the network is up" means in a world of dynamic interfaces, proxies, VPNs,
dynamic resolvers, etc.

> > [ -n "$(ip route show to 0/0)" ]
>
> This is better though, and works too. Please take a look at the attached
> patch. Thanks! :-)

Cheers! I reworked it a bit, applied the same strategy to LXC (which is
equally affected), tested it, and landed

https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/commit/?id=20f479254

Re: autopkgtest-build-lxd failing with bionic

Steve Langasek [2018-02-16 11:12 -0800]:
> > > [ -n "$(ip route show to 0/0)" ]
>
> > This is better though, and works too. Please take a look at the attached
> > patch. Thanks! :-)
>
> Actually no, this is racy, because the route comes up before DNS resolution
> is in place.

I'm not actually sure if network-online.target would actually guard against
that with all implementations. But in practice, in most cases you'll get DNS
either via static configuration (in which case there's nothing further to wait
for) or via DHCP (in which case your address and DNS resolvers ought to arrive at
the same time). And there's still the "apt retries several times" fallback
(which is why I do see the initial apt failure, but the retry works).

> It's also not forwards-compatible with ipv6-only deploys.

Right now the container network config created by lxc/lxd/netplan assumes IPv4
only, so let's cross that bridge when we get to it. Indeed adding an
alternative `ip -6 route show...` would easily rectify that.
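
A possible shape for such a dual-stack check, as a sketch only: the function name has_default_route is made up here, and using `ip -6 route show default` for the IPv6 side is an assumption about how the alternative would be spelled.

```shell
#!/bin/sh
# Hypothetical check: succeed if either address family has a default route.
has_default_route() {
    # IPv4 default route, as in the check discussed in this thread.
    [ -n "$(ip route show to 0/0 2>/dev/null)" ] ||
        # IPv6 default route (assumed spelling of the alternative).
        [ -n "$(ip -6 route show default 2>/dev/null)" ]
}

# Example:
#   until has_default_route; do sleep 1; done
```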

> I think the network-online.target is the better thing to key on.

I still don't like that much, though:
- there is no requirement that this actually gets "implemented" or even
started (it's a passive target)

- it's supposed to be a SysV backwards compat shim for LSB's "network"
dependency, and not well-defined

- These tools should also work with Debian containers, which in theory could
also run sysvinit. This is also the reason why they still use `runlevel`
instead of `systemctl is-system-running` or something similar.

All of these are just heuristics, though; you could have all sorts of cases
where all of these break, like sharing the host's network namespace, having no
default route but a route to the configured apt proxy, etc. Maybe the closest
approximation to this would be to grab the archive URL from
/etc/apt/sources.list and put it in a curl loop, but (1) neither wget nor curl
are in minimal installs, and (2) at that point it could just as well be an
apt-get retry loop.

So in summary, IMHO the "wait for default route" heuristic is simple and
effective enough for now.

Re: autopkgtest-build-lxd failing with bionic

> > I think the network-online.target is the better thing to key on.
>
> I still don't like that much, though:
> - there is no requirement that this actually gets "implemented" or even
> started (it's a passive target)
>
> - it's supposed to be a SysV backwards compat shim for LSB's "network"
> dependency, and not well-defined
>
> - These tools should also work with Debian containers, which in theory
> could also run sysvinit. This is also the reason why they still use
> `runlevel` instead of `systemctl is-system-running` or something similar.
>
> All of these are just heuristics, though; you could have all sorts of cases
> where all of these break, like sharing the host's network namespace, having
> no default route but a route to the configured apt proxy, etc. Maybe the
> closest approximation to this would be to grab the archive URL from
> /etc/apt/sources.list and put it in a curl loop, but (1) neither wget nor
> curl are in minimal installs, and (2) at that point it could just as well
> be an apt-get retry loop.

So what's the right systemd way to ensure the network is up? I continue to
fight bugs in the postfix unit file both in Debian and Ubuntu over things
happening before the network is up. As far as I can determine from the
documentation, network-online.target should work, but I agree it doesn't do so
reliably.

Re: autopkgtest-build-lxd failing with bionic

> Hello all,
>
> Iain Lane [2018-02-16 11:52 +0000]:
> > > I wouldn't pick on any of these: network-online.target is a sloppily defined
> > > shim for SysV init backwards compatibility, and may not ever get started (in
> > > fact, that's the goal ☺); and the container might not use networkd, so I
> > > wouldn't use s-n-wait-online either. I think querying
> >
> > Interesting. I thought that it was the systemd way to say 'I am online
> > now' --- i.e. nm-online or systemd-networkd-wait-online, which is the
> > question I wanted to get a positive answer to. I can see that the SysV
> > implementation isn't great, but it's not clear to me that it was ill
> > defined for this case.
>
> "ill defined" is too strong, but it's "sloppy", just as the mere question of
> what "the network is up" means in a world of dynamic interfaces, proxies, VPNs,
> dynamic resolvers, etc.
>
> > > [ -n "$(ip route show to 0/0)" ]
> >
> > This is better though, and works too. Please take a look at the attached
> > patch. Thanks! :-)
>
> Cheers! I reworked it a bit, applied the same strategy to LXC (which is
> equally affected), tested it, and landed
>
> https://anonscm.debian.org/cgit/autopkgtest/autopkgtest.git/commit/?id=20f479254

Aren't _all_ types of testbed affected by this in some way or another?