Bug Description

I've recently encountered a problem with deploying nodes on which Secure Boot is enabled. The symptoms are:

1. The node enlists and commissions fine
2. The node boots and begins deploying fine
3. After deployment completes, the node reboots
4. When booting at this point, after showing a few routine messages, including a GRUB menu, the node displays the following text on its screen:

Disabling Secure Boot on the node enables it to boot. If this is done quickly enough, deployment will succeed.

I've encountered this problem on two systems managed by two MAAS servers: an Intel NUC DC53247HYE and a Cisco UCS C-240 M4 (VIC). One MAAS server is running 2.2.2 (6099-g8751f91-0ubuntu1~16.04.1) and the other is running 2.2.1 (6078-g2a6d96e-0ubuntu1~16.04.1). I'm attaching log files from the first server to this bug report. The affected node is brennan on that server.

Further observations:

* Once booted, I see that there's no kernel with a .efi.signed extension on
the hard disk. Installing such a kernel does NOT fix the problem;
however, it may be necessary to install such a kernel for a proper fix.
* If I force a boot directly through the Shim and GRUB installed on the
hard disk, the system boots correctly, even with Secure Boot enabled.

I found a copy of the error message in Shim source code, and reports of this message on Fedora as early as 2014:

It looks to me as if the Shim that MAAS uses for the post-deployment boot has been updated/changed to include this strict verification that the kernel is honoring Secure Boot rules; but the Shim installed to the hard disk, and used during enlistment and commissioning, does not perform this check. OTOH, I can't find any evidence of separate Shim binaries on the MAAS server.

Not sure which machines in the logs to look at, so walking through a few:

brennan:
installs grub-efi-amd64-signed, shim, shim-signed
during grub install, it reports that it can't access efi vars
I don't recall if that's fatal. Nothing for *curtin* here; if
anything, maybe a bug against grub or other secureboot packages in their post-inst scripts?

Ryan, neither of the bugs you reference is a duplicate of this one. This bug is new. Both the nodes I've tested have been successfully deployed with Secure Boot active in the past. In fact, brennan had been successfully deployed with Secure Boot but then failed to boot from that very deployment some days or weeks later (I'm not sure how long it had been since I last booted it), which suggests to me that the Shim/GRUB provided by the MAAS server when PXE-booting had changed in that period.

Bug #1680917 is about what happens when the MAAS server becomes unavailable and a node tries to boot. This problem would exist with or without Secure Boot being active.

Bug #1687729 is more similar to this new bug, but that problem does NOT affect all systems. My read on that bug report is that it was an incompatibility between our Shim and the Secure Boot implementation in some computers. The bug I'm reporting now appears to be a problem caused by Shim refusing to allow a GRUB that doesn't check the validity of a loaded kernel to boot.

In some sense, if my analysis is correct, the problem is caused by Shim "tightening the screws" on Secure Boot policy; however, those changes are done for a reason (to improve security), so the solution should be to ensure that the GRUB versions MAAS and curtin deploy perform the checks that Shim wants, and that the kernels we install are signed. AFAIK, we have all the required pieces in the standard Ubuntu toolset, but clearly, a deployed system does not have signed kernels. As my tests show, though, that doesn't seem to be enough; it LOOKS LIKE the GRUB that MAAS is using does not enforce Secure Boot checks on the kernels it loads. This used to be the case for Ubuntu until (IIRC) 16.04, but our more recent GRUB binaries do perform such checks. As noted in my original report, though, I couldn't find the exact binary that's to blame. This calls into question at least some of my analysis, so take the above with a grain of salt -- but I might just not know where MAAS tucks away all its boot loader files, so I may have missed the file.

In the logs from the MAAS server I've provided, you can ignore kzanol; that system does not support Secure Boot. Brennan is the machine I used for testing, and that exhibits the problem. (The other computer is on another MAAS server with dozens of deployed nodes, so its log files would be VERY cluttered by comparison.) I recall noticing warnings about an inability to access efivars filesystems in the past, but AFAIK this is not correlated with any problem. In fact, this problem manifests before the Linux kernel is loaded; the problem reported in this bug is precisely that the kernel won't load, after which the node shuts down.

> In some sense, if my analysis is correct, the problem is caused by Shim
> "tightening the screws" on Secure Boot policy; however, those changes
> are done for a reason (to improve security), so the solution should be
> to ensure that the GRUB versions MAAS and curtin deploy perform the
> checks that Shim wants, and that the kernels we install are signed.

Curtin/MAAS will install the linux-image-generic kernel for the specific release unless otherwise specified by MAAS in their kernel config mapping.

If there is a specific kernel package that *should* be selected instead of
the linux-image-generic kernel then MAAS/Curtin need to know:

1) what is that package name
2) how to know when to use (1) instead of linux-image-generic

A quick search of apt-cache shows

linux-signed-image-< >

Which appears to be what we'd want to use in the Secure Boot path.
In one of the other bugs I believe I had asked how curtin or MAAS can
detect whether a platform is configured for SecureBoot, but I didn't see
a definitive answer.

> Curtin/MAAS will install the linux-image-generic kernel for the specific
> release unless otherwise specified by MAAS in their kernel config mapping.

It should be noted that there are changes pending to the kernel packaging, such that installing linux-image *always* gives you the UEFI-signed vmlinuz, and you don't have to worry about having two copies of vmlinuz (signed and unsigned) installed to /boot.

However, in the meantime it would be best if curtin always used linux-signed-image-generic when installing on any UEFI system, so that the system isn't rendered unbootable if the user enables SecureBoot post-install. This is the current behavior of ubiquity and d-i.
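For reference, a minimal sketch of how that selection could be expressed in a curtin config (this assumes curtin's documented kernel stanza; the package names are the ones under discussion in this thread):

# curtin config sketch: prefer the signed kernel on UEFI, with a fallback
kernel:
  package: linux-signed-image-generic
  fallback-package: linux-image-generic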

None of this explains the behavior of an unsigned kernel failing to boot post-install. The current boot process is:
- boot shim
- verify signature on grub and boot it
- if kernel signature verifies, boot it
- if kernel signature does not verify, call ExitBootServices() and boot it anyway

However, we don't even package this binary (gsbx64.efi.signed; see below) in grub-efi-amd64-signed; it's only available for download from the archive. If maas or curtin are pulling this binary in at install time, instead of using the grubx64.efi.signed from the package, that's definitely a bug.

And if nothing is pulling gsbx64.efi.signed, then the bug is somewhere else but I'm not sure where. It's worth checking whether this problem mysteriously resolves once linux-signed is being pulled in; if it does, then it's possible we have a bug in grub (enforcing signature when it's not supposed to) or simply a bug in firmware.

I agree with Steve that installing linux-signed-image-generic as a default is best at the moment. Such kernels should boot whether or not Secure Boot is enabled, and AFAIK they'll even boot on BIOS-based computers (but I've not checked that, and there may be dependency issues on such systems). There is the caveat that this will increase the space used in /boot, which could cause out-of-space errors if /boot is a separate partition that's too small (see bug #1465050). I recommend 500 MB as a MINIMUM size for /boot these days.

If you want to detect whether Secure Boot is enabled, you can check the file /sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c. The first four bytes of an efivarfs file are an attribute header; if the byte after that (the variable's value) is 0x01, then Secure Boot is enabled. For instance:
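(A minimal sketch; the od output shown is what I'd expect on a machine with Secure Boot enabled.)

$ sb=/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c
$ od -An -tu1 "$sb"
   6   0   0   0   1
# four attribute-header bytes, then the SecureBoot value: 1 = enabled, 0 = disabled

On systems with mokutil installed, "mokutil --sb-state" reports the same information.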

As Steve notes, though, the user might enable Secure Boot post-deployment, and if a signed kernel is not installed, the boot would then fail.

One more point: In the past, nodes deployed via MAAS have booted by MAAS's PXE server delivering GRUB to the node, and that GRUB then loading the GRUB on the hard disk. If this hasn't changed, it could be that Shim or GRUB is getting confused by this sequence. (Shim unloads itself after a handoff from one EFI program to another.)

If it would help for debugging this, I can give whoever needs it access to one of the affected systems.

I don't think this requires gsbx64.efi at all; it looks to me like a different issue, but one that we'd likely hit in the future anyway when enforcing signatures on kernels.

So, how this works is that a shim retrieved over TFTP loads a grub retrieved over TFTP, which is provided some config that tells it to chainload more stuff (in the "boot from disk" case which is failing here).
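A sketch of what that served config's local-boot entry amounts to (illustrative only; the echo text matches output quoted later in this thread, but the exact template contents may differ):

menuentry 'Local' {
    echo 'Booting local disk...'
    search --set=root --file /efi/ubuntu/shimx64.efi
    chainloader /efi/ubuntu/shimx64.efi
}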

Since booting straight to disk works with SB enabled, we can infer that the on-disk portion of the chain is valid and working correctly -- otherwise you would see validation failures when trying that.

I can only work with the assumption that all the bits in the chain are signed either by a Microsoft key that is known by the firmware (for shim), or by the Canonical key (which itself is known by shim) in the case of grub. Everything is signed, so there's no reason for things to fail validation -- it has to be that something isn't validating signatures.

Now, that points to a grub bug, but I'm not sure how it fails here -- by my read, you'd have grub go through validating even chainloaded images for their key by asking shim to validate them. This can fail, but then we'll need the output of that boot process with:

set debug="chain,secureboot"

You set this in grub.cfg.
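In the MAAS case that presumably means the config served to the node; a sketch, assuming the template path that comes up later in this thread:

# in /usr/lib/python3/dist-packages/provisioningserver/templates/uefi/config.local.amd64.template
set debug="chain,secureboot"
set default="0"
set timeout=0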

We don't do multiple builds of shim -- there's only one shim in the archive, so it's not likely to be the culprit, and grub already manages to validate things successfully and chainload them correctly in the context of UEFI Secure Boot; in fact, a grub bug that caused it to fail to chainload Windows in UEFI was recently fixed.

One further thing to try would be to grab grubnetx64.efi from the archive and test with it replacing the grubx64.efi file in MAAS. That would establish whether the issue is a regression in grub:
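(A sketch of the swap; the MAAS path is the one given in the next comment, and the source path for grubnetx64.efi.signed is a placeholder for wherever you downloaded it from the archive.)

cd /var/lib/maas/boot-resources/current/bootloader/uefi/amd64
cp grubx64.efi grubx64.efi.orig                 # keep a backup for restoring
cp /path/to/grubnetx64.efi.signed grubx64.efi   # placeholder download location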

To get this in the bug report: Replacing /var/lib/maas/boot-resources/current/bootloader/uefi/amd64/grubx64.efi with the grubnetx64.efi.signed file specified by Mathieu results in a successful boot. My understanding is that this is NOT good news. :(

Set the Grub2 task to High to grab attention (and because it's at least a High, if not Critical, bug). My gut says this should be critical as it's blocking the deployment of systems from multiple vendors in multiple datacenter and lab environments anytime SecureBoot is enabled.

If maas+curtin are not installing the signed variant of the linux-image package on UEFI systems, this is not invalid for maas+curtin - when we rev the grub secureboot policy (ETA January), these systems will be unbootable BY DESIGN. Regardless of whether this configuration has tickled a regression in grub, this MUST be fixed.

To be clear, although installing the signed kernel package is necessary, a failure to do this is NOT the source of this bug, which seems to relate to how Shim and/or GRUB handle the MAAS boot path, which involves Shim and GRUB being PXE-booted and then chainloaded to (Shim and?) GRUB on the hard disk. I am available for testing of proposed fixes; I have one system with Secure Boot available on my home network and sporadic access to others in 1SS (from OIL; we can transfer them over to the certification network from time to time).

I've tried this and the problem persists. Note that MAAS *IS* installing the signed kernel, which is necessary but insufficient for a fix; the problem seems to be that Shim/GRUB is becoming confused by the handoff from the PXE-boot version of GRUB to the GRUB stored on the hard disk. If my analysis is correct, this will require either:

* Changes to Shim/GRUB so that it works in this configuration. This
configuration used to work, but Shim/GRUB have been tightening
security, which introduced this bug as a side effect.
* A change in the way MAAS/curtin configures the PXE-booted GRUB so that it
boots the system directly, without chainloading to GRUB on the hard disk.
Note that this approach to a solution used to be used on ARM64 EFI
systems, but that created a (now-fixed) bug #1582070. Thus, if this
approach is used, care will have to be taken to not cause a regression on
that bug.

I've updated lp:maas-images to produce new images using the linux-signed kernel on AMD64. New images are produced when http://cloud-images.ubuntu.com/daily/ adds new images so it may take a few days for signed kernels to appear in the stream. Unsupported releases are no longer updated so we'll have to manually regenerate them if we want signed kernels.

The stream also contains all bootloaders including the shim. Once a new shim-signed package is released to Xenial the stream will automatically ingest the update. Let me know if we want to test an updated bootloader; I can produce a new proposed stream.

I'd just like to emphasize that, although a change to always install the linux-signed kernel on AMD64 systems is necessary to fix this bug, it's not sufficient to fix the bug. As noted in my comment #25 (and elsewhere), another change is also required -- either a change to Shim or GRUB (I don't know which) or a change to how MAAS handles the boot process (to have the PXE-booted GRUB read the configuration file from the hard disk rather than chainload to GRUB on the hard disk; or perhaps a change to the way the handoff is done, if some tweak could bypass the bug).

> It's worth checking whether this problem
> mysteriously resolves once linux-signed is being pulled in; if it does,
> then it's possible we have a bug in grub (enforcing signature when it's
> not supposed to) or simply a bug in firmware.

It would appear that despite the change to linux-signed, there is still a bug. In that light, can we get next steps on debugging grub, firmware, or whatever else is needed to push this along?

> I'd just like to emphasize that, although a change to always install the
> linux-signed kernel on AMD64 systems is necessary to fix this bug, it's
> not sufficient to fix the bug. As noted in my comment #25 (and
> elsewhere), another change is also required -- either a change to Shim
> or GRUB (I don't know which) or a change to how MAAS handles the boot
> process (to have the PXE-booted GRUB read the configuration file from
> the hard disk rather than chainload to GRUB on the hard disk; or perhaps
> a change to the way the handoff is done, if some tweak could bypass the
> bug).
>
> As before, I remain able and willing to test potential fixes.

Yes, it's absolutely possible to recreate the environment for testing this without MAAS -- there's nothing all that special to it, chainloading *any* image should work and maintain a Secure Boot-verified chain provided all the links in the chain validate images.

This looks to be pretty clearly a bug in the chainloader's validation of images; it used to work, but only because it wasn't actually verifying much in the first place.

While reading through #1730493 and #1437024 I noticed both had various UEFI bootloader issues fixed by switching to the Artful version of grub and the shim. I've updated http://162.213.35.187/proposed/streams/v1/index.json to use boot loaders from Artful in case anyone wants to test.

That's not going to change anything -- grub is doing exactly what it should: ask shim to validate the image it tries to chainload; and the image *does* validate successfully. The chain of trust is technically preserved, but shim doesn't manage to make sense of things, and refuses to continue loading.

This is a "bug" in shim, in that it's not a use case that was anticipated. Shim makes sense of the shim->fallback->shim->grub case because in that case things do go through the steps of calling load_image() and start_image() in firmware.

It also seems to me like a bug in grub because we ought to be loading things in such a way that shim would be able to make sense of it -- currently, that's not quite the case because some relocations and other image mangling needs to happen. I have an idea of a hack to fix this, but I think the "right" fix would be in shim.

What happens is that given that load_image() isn't called directly, when the second shim runs it doesn't uninstall the protocols and we end up validating against the first loaded shim when we try to verify the kernel's signature. This is effectively a variation on an issue that was fixed in shim for the fallback EFI binary.

In the meantime, there's also a valid workaround: you should be able to chainload *grub* rather than shim from the disk, and thus maintain the chain of trust for Secure Boot:
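(A hypothetical sketch of such an entry; the /efi/ubuntu/grubx64.efi path matches the error output quoted later in this thread.)

menuentry 'Local' {
    echo 'Booting local disk...'
    search --set=root --file /efi/ubuntu/grubx64.efi
    chainloader /efi/ubuntu/grubx64.efi
}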

Lee, I tried http://162.213.35.187/proposed/streams/v1/index.json earlier, in response to Andres' suggestion, and that stream did not help. (See comments #24 and #25.) If you think that stream has changed since I did my testing on November 27, I'm happy to try again; but if not, it doesn't help.

I have provided a workaround in comment #36, has this not been applied? Landing a fix for this is going to take time, as it depends on a full roundtrip of getting shim prepared, tested, and signed by Microsoft.

I'm at a loss to explain that. This works quite well in my netboot testing when I remove MAAS from the equation. You *are* meant to be able to chainload grub from another grub; and the reason why grub can't chainload shim is that you then get the wrong set of shim protocols to properly validate the next binary. This will need more testing; I will need to know what hardware this is and what exactly is the content of the grub configs.

So I've enabled secure boot on my Intel NUCs and have *not* used the workaround in #36, and the machines deployed just fine (that is, they PXE boot off MAAS and are told to load the shim). The same happens when using the workaround in #36.

That said, the interesting bit is that I remember testing these machines with secure boot enabled back when the images had the non-signed kernel, and they didn't deploy. With the signed kernel, they started deploying.

So, I would like to test and see the difference on some machine other than a NUC.

MAAS version: 2.3.0 (6434-gd354690-0ubuntu1~16.04.1)
This is my observation on a Lenovo RS140 with the workaround from comment #36 enabled:
Also, to be sure it's not something we've injected, I am using the default curtin_userdata, NOT our customized cert one.

Booting local disk...
error: no such device: /efi/ubuntu/grubx64.efi.
error: File not found.

Press any key to continue...

Failed to boot both default and fallback entries.

Press any key to continue.

I retried this with Xenial and got the same failure to boot on the initial reboot.

This is what I have in the template per comments #36 and #38 above:
bladernr@critical-maas:/usr/lib/python3/dist-packages/provisioningserver/templates/uefi$ cat config.local.amd64.template
set default="0"
set timeout=0

Now, at this point, I'm stuck unbooted on the initial post-deployment reboot. So I reset the node by hand (poked the reset button) and disabled SecureBoot in the config and rebooted it again.

This time, the node PXE booted, got the edict to boot local, and successfully booted locally.

If I do not take this step to disable secure boot during this post-deployment reboot cycle, the system fails to boot and eventually is marked as "Failed Deployment" once MAAS times out waiting for an update.

By manually intervening here, MAAS gets the proper message from the node and marks the deployment as successful (sets the node to Deployed state).

The workaround in #36 is now working for me on my home network, too. Perhaps when I tested it in December (comment #39) I had different software versions; or maybe I didn't correctly reproduce the changes in comment #36.

I did a diff on what you posted in #48, Jeff, and it exactly matches what I'm using, and what Andres put on weavile, so I don't think your result is caused by an error in your configuration file.

> > Is /efi/ubuntu/grubx64.efi on your EFI System Partition definitely the
> > Canonical-signed image from grub-efi-amd64-signed?
>
> I presume so? dpkg says it is:
>
> ubuntu@xwing:/boot/efi/EFI/ubuntu$ dpkg -S grubx64.efi
> grub-efi-amd64-signed: /usr/lib/grub/x86_64-efi-signed/grubx64.efi.signed
>
> That's the only thing that provides the file (that I can tell).
>
> > Which version of Ubuntu's grub are you booting via pxe?
>
> ubuntu@xwing:/boot/efi/EFI/ubuntu$ dpkg -l |grep grub|awk '{print $2":
> "$3}'
> grub-common: 2.02~beta2-36ubuntu3.16
> grub-efi-amd64: 2.02~beta2-36ubuntu3.16
> grub-efi-amd64-bin: 2.02~beta2-36ubuntu3.16
> grub-efi-amd64-signed: 1.66.16+2.02~beta2-36ubuntu3.16
> grub-pc: 2.02~beta2-36ubuntu3.16
> grub-pc-bin: 2.02~beta2-36ubuntu3.16
> grub2-common: 2.02~beta2-36ubuntu3.16
>
> That is what is installed on the node.
>
> > If you re-enable SecureBoot and configure this system to boot directly
> from
> > local disk instead of booting pxe first and chainloading, does it boot
> > successfully?
>
> So I re-enabled SecureBoot and removed all NICs from the boot order. I
> added in the HDD (since this is an EFI boot, the HDD is an entry called
> "Ubuntu" under "OTHER" in the boot order)
>
> This fails to boot, I get an error from the system:
>
> Error 1962: No operating system found. Boot sequence will automatically
> repeat.
>
> Because I have no NICs listed in the boot order, this just churns as it
> keeps retrying the HDD entry.
>
> So next, I went back and disabled SecureBoot once more. It immediately
> booted straight from the HDD.
>
> I also just tried a USB install with Secure Boot enabled. I was able to
> install bionic from USB, but it too fails to boot with the same error.
>
> To be fair at this point, given that this does work elsewhere, I'm
> suspicious that this is possibly an issue with my server.
>
> That said, I'd like to see this verified on that Cisco C240 system as an
> extra data point.

>> > Which version of Ubuntu's grub are you booting via pxe?
>
>> ubuntu@xwing:/boot/efi/EFI/ubuntu$ dpkg -l |grep grub|awk '{print $2": "$3}'
>> grub-common: 2.02~beta2-36ubuntu3.16
>> grub-efi-amd64: 2.02~beta2-36ubuntu3.16
>> grub-efi-amd64-bin: 2.02~beta2-36ubuntu3.16
>> grub-efi-amd64-signed: 1.66.16+2.02~beta2-36ubuntu3.16
>> grub-pc: 2.02~beta2-36ubuntu3.16
>> grub-pc-bin: 2.02~beta2-36ubuntu3.16
>> grub2-common: 2.02~beta2-36ubuntu3.16
>
>> That is what is installed on the node.
>
> Sorry, I was asking about the other end of this: what version of
> grubnetx64.efi is being served by maas?

I have no idea. Andres?

As far as I can tell, it's serving up a copy of grubx64.efi out of
/var/lib/maas/boot-resources/current

>
> (But it is also good to confirm what version of grub is installed on the
> node's disk.)

Whichever is the latest version in -updates at the time the streams were
created.

But yes, the latest version on the bootloader stream.


The Cisco C-240 M4 (boldore) that originally produced this bug seems to have been returned to OIL, so I can't test with it, at least not quickly; however, I did just run a test with feebas, a Cisco C220 M4. Using the workaround in post #36, I was able to deploy Ubuntu 16.04 and boot it with Secure Boot enabled, and I verified that SB was enabled on the deployed system.