I think it would make sense to reintroduce the MAC/IP in the VM status with this PR. @vladikr what do you think?

I think putting the IP address in the Status section makes a lot of sense. The IP address is actually something we know before the VM gets scheduled, so we could write it to the Status field in virt-controller, even before the VM is assigned to a node.

I'm uncertain whether reporting the MAC in the VM's status matters. I don't see a reason to expose that unless we have a use case. I can think of lots of use cases for the IP address though.
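
For illustration, something along these lines is what I have in mind; the types, field names, and json tags below are hypothetical, not the actual KubeVirt API:

```go
// Hypothetical status types for illustration only; the real KubeVirt API
// types, field names, and json tags may differ.
package api

type VirtualMachineStatus struct {
	// Interfaces would report the network interfaces assigned to the VM.
	Interfaces []InterfaceStatus `json:"interfaces,omitempty"`
}

type InterfaceStatus struct {
	// The IP is known before the VM gets scheduled, so virt-controller could
	// fill it in before the VM is assigned to a node.
	IP string `json:"ipAddress,omitempty"`
	// Exposing the MAC is optional; there is no clear use case for it yet.
	MAC string `json:"macAddress,omitempty"`
}
```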

Since we are changing the pod interface, we either need to collect all the data about the interface first and save it in an extra file for retries (which is what the code currently tries to do), or we declare the pod failed if we can't set this up (exit with a non-zero exit code).
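
A rough sketch of the "collect and save for retries" approach, with illustrative names and a hypothetical cache path (not the actual virt-launcher code):

```go
// Sketch of the "collect once, save for retries" idea. The struct fields,
// helper names, and cache path are illustrative.
package network

import (
	"encoding/json"
	"os"
)

// podInterfaceInfo captures what we need to know about the pod interface
// before we start modifying it.
type podInterfaceInfo struct {
	IP      string `json:"ip"`
	MAC     string `json:"mac"`
	Gateway string `json:"gateway"`
}

const ifaceCacheFile = "/var/run/kubevirt-private/pod-interface.json" // hypothetical path

// savePodInterfaceInfo persists the collected data so a retry can reuse it
// even after the interface has already been changed.
func savePodInterfaceInfo(info *podInterfaceInfo) error {
	data, err := json.Marshal(info)
	if err != nil {
		return err
	}
	return os.WriteFile(ifaceCacheFile, data, 0600)
}

// loadPodInterfaceInfo returns the cached data and whether a cache existed.
func loadPodInterfaceInfo() (*podInterfaceInfo, bool, error) {
	data, err := os.ReadFile(ifaceCacheFile)
	if os.IsNotExist(err) {
		return nil, false, nil // first attempt: nothing cached yet
	}
	if err != nil {
		return nil, false, err
	}
	info := &podInterfaceInfo{}
	if err := json.Unmarshal(data, info); err != nil {
		return nil, false, err
	}
	return info, true, nil
}
```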

There's an entrypoint.sh script that is the actual PID 1 of the container. I'm going to add logic to that script to send SIGTERM and eventually SIGKILL to any qemu processes that could still be running as a result of virt-launcher panicking. This means we'll still get a sane shutdown for qemu even if virt-launcher segfaults, panics, etc.

That still looks like we would retry starting the VM. Once this has failed, it is very likely that we can't configure the DHCP server correctly anymore, since the network device is most likely already partly deconfigured. Retrying would be pointless.

Ugh... you're right. It's technically possible for this preStartHook to get called twice (for example, if there's an error defining the domain with libvirt after the preStartHook completes, the preStartHook will run again on the next try).

As it's written now, a second call to SetupPodNetwork will 100% cause virt-launcher to panic.

Here's how I think we can fix this:

1. Make SetupPodNetwork idempotent (sketched below). This could be done by caching the result at the end of SetupPodNetwork to a flat file and checking whether that file exists at the beginning of the function. If the file exists, ensure that the domain object has the right info in it and fast succeed.
2. Be certain that any error we return from SetupPodNetwork can be recovered from on a second execution of SetupPodNetwork. Any unrecoverable error should result in virt-launcher exiting.
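
A minimal sketch of that idempotency guard, assuming hypothetical helper names and marker path; the real SetupPodNetwork signature may differ:

```go
// Sketch of the idempotency guard; names and the marker path are hypothetical.
package network

import "os"

const networkDoneFile = "/var/run/kubevirt-private/network-setup-done"

// setupPodNetworkOnce performs the network setup exactly once per pod; any
// later call fast-succeeds. setup reconfigures the pod interface and starts
// the DHCP server; restoreCached re-applies the saved interface config to the
// domain object without touching the interface again.
func setupPodNetworkOnce(setup, restoreCached func() error) error {
	if _, err := os.Stat(networkDoneFile); err == nil {
		return restoreCached()
	} else if !os.IsNotExist(err) {
		return err
	}
	if err := setup(); err != nil {
		// An error returned here must be recoverable on the next attempt;
		// anything unrecoverable should make virt-launcher exit instead.
		return err
	}
	// Record success only after the interface and domain config are in place.
	return os.WriteFile(networkDoneFile, nil, 0600)
}
```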

Let's be pragmatic. virt-handler guarantees that it does not try to work on the same VM in parallel. That means we can just touch a file after we have collected the data. We can add something to this file, or to another file, once we have successfully started the DHCP server the first time. Later invocations can then just skip the network setup. The DHCP server can panic if something goes wrong.

@vladikr @rmohr What would you all think about a DHCP manager goroutine that starts up when virt-launcher is initializing?

There would be a cmd channel and an error channel used to communicate with the DHCP manager.

This DHCP manager would wait for a start command on the cmd channel. A start cmd would result in another goroutine launching to run dhcp.Serve(). Any additional calls to start the DHCP server would just fast succeed.

DHCP errors would be sent back on the error channel. We'd have the ability to inspect the state of the actual VM process to decide how to handle the error.
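
Roughly what I have in mind, as a sketch with illustrative names (dhcp.Serve stands in for whatever actually serves DHCP in virt-launcher):

```go
// Sketch of the proposed DHCP manager; all names here are illustrative.
package dhcpmanager

import "sync"

type command int

const cmdStart command = iota

// Manager owns the DHCP server lifecycle: start commands arrive on Cmd and
// failures from the serving goroutine are reported on Errors.
type Manager struct {
	Cmd    chan command
	Errors chan error

	once  sync.Once
	serve func() error // e.g. a closure around dhcp.Serve with the pod's config
}

// NewManager starts the manager goroutine during virt-launcher initialization.
func NewManager(serve func() error) *Manager {
	m := &Manager{
		Cmd:    make(chan command),
		Errors: make(chan error, 1),
		serve:  serve,
	}
	go m.run()
	return m
}

func (m *Manager) run() {
	for cmd := range m.Cmd {
		if cmd != cmdStart {
			continue
		}
		// Only the first start command launches the server; later start
		// commands fast-succeed by doing nothing.
		m.once.Do(func() {
			go func() {
				if err := m.serve(); err != nil {
					// The receiver can inspect the VM process state before
					// deciding how to react.
					m.Errors <- err
				}
			}()
		})
	}
}
```

virt-launcher would create the manager during initialization; the preStartHook would then just send a start command, so a repeated hook stays harmless.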

Ha, that's one way to do it. I guess ginkgo doesn't have a BeforeAll() function. I like the idea of setting up the environment in parallel before the tests start. It cuts down on test time quite a bit.
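
One possible shape for that kind of parallel setup, sketched with a suite-level Ginkgo hook and placeholder setup functions (not the actual test code):

```go
// Sketch only: run the per-suite environment setup steps concurrently in a
// Ginkgo BeforeSuite. The setup functions below are placeholders.
package tests_test

import (
	"sync"

	. "github.com/onsi/ginkgo"
)

func setupNamespaces() { /* ... */ }
func setupTestVMs()    { /* ... */ }
func setupNetworking() { /* ... */ }

var _ = BeforeSuite(func() {
	steps := []func(){setupNamespaces, setupTestVMs, setupNetworking}

	var wg sync.WaitGroup
	for _, step := range steps {
		wg.Add(1)
		go func(f func()) {
			defer wg.Done()
			f() // each step runs concurrently, cutting total setup time
		}(step)
	}
	wg.Wait()
})
```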

> @vladikr @rmohr What would you all think about a DHCP manager goroutine that starts up when virt-launcher is initializing?

I think we can pretty much leave it as it is. In any case, we need to make sure that the network setup is either reentrant or skipped after the first successful run. I don't see any issue with starting the DHCP server with the right config directly.

I found one minor thing when testing this locally that's just related to running the functional tests on a single node. I made a comment about how to address that.

For CI, Jenkins has been able to verify that VM<->pod and node->VM connectivity works. However, something is preventing VM->internet connectivity from working. I believe that issue is most likely related to DNS on the Jenkins host machines. If the VM->internet connectivity continues to fail in Jenkins, I'm in favor of disabling that test and following up on the issue in a separate PR. I've tested these changes locally in multiple cluster configurations and never encountered the connectivity issue Jenkins hit.
