Working on updating from Mojave to Catalina. I ran into some issues and will post about them in a bit, but I wanted to drop this here for anyone who needs to unlock their BIOS. If you have a specific BIOS you need unlocked, let me know (I can also preset the settings needed for your Hackintosh).

Hi RehabMan, I can't figure out why my mouse sometimes isn't working after sleep. Other devices work perfectly. I properly injected the USB-UIAC/USBX SSDT, as you can see in my IOReg. I have another question: how can I see the log output of verbose mode? I can't find any boot.log or kernel.log in...

After much trouble I got past apfs_module_start 1683, then the prohibition sign, and now I'm stuck on gIOScreenLockState 3, which causes it to just reboot. Attached is my Clover folder. Any help would be greatly appreciated.

Specs:
MB: Asus X99-Pro
CPU: 5930K
GPU: RX 5700 XT

I updated my EFI distribution with the EC fix. I was stumped by the Catalina installer not booting, so I researched the fix. The EC patches are added for those who have a Sabertooth X99 setup. They should work on other Asus X99 boards as well.

Sounds like a kernel panic during graphics initialization. You didn't say if this is booting the installer, or the installed system. If the latter, you might have a report left behind in /Library/Logs/DiagnosticReports that you can read from the installer in Terminal.

Yes, I have seen that panic a number of times. I can tell you exactly what the cause is too, and how to fix it... or that you can't fix it, depending. If you aren't interested in the cause, just skip all this technical mumbo jumbo and go to the bottom of this post, where it says, 'OK, good news/bad news time.'

You may also see panics that reference TLB invalidation, amongst other things. There are a few different situations that lead to this panic, so there are a few superficially different panic messages you may see, but they're all actually caused by the same issue. The smoking gun is that this panic always manifests as one or more CPUs locking up, so you'll see something like 'unresponsive CPU', timeout, NMI (non-maskable interrupt - something that should override whatever a CPU is doing and force it to perform a priority task), IPI timeouts, etc.

The BSD process it occurs in is irrelevant - this is simply whatever process was running on that CPU before a context switch (switching to a different process so it might have its share of CPU time) occurred (or more accurately, before the context switch was attempted... because instead of completing, the machine locked up and/or got a kernel panic).

You might notice that 'tlb' is in the name of the function in the stack trace you referenced. These panics always occur due to something involving the translation lookaside buffer (TLB). This is a small, dedicated cache inside each CPU core, usually organized into its own first and second levels. If you're curious, there is a fairly technical description of TLBs on Wikipedia, but the quick and dirty description is that a TLB is a table that maps the virtual memory addresses being used by the process currently running on a CPU to physical memory addresses. In other words, it lets a CPU know where to look for whatever a running process has in memory.

Unfortunately, the moment a context switch occurs that switches to a different process (versus merely switching to a different thread of the same process), a completely different program is effectively now being run by that CPU, and the data it has stored in memory and the addresses of that data are, understandably, completely different.

This causes a TLB invalidation, which is a fancy way of saying that the TLB is outdated and a new one must be generated. TLB flushing is part of this process, specifically clearing out the old table. So this is something that happens on most (though there are certain exceptions) process-level context switches on any CPU core.
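To make the idea concrete, here is a minimal toy model in Python. It is only a sketch of the concept described above, not how any real CPU or kernel implements it. The names `ToyTLB` and `context_switch` and the page tables are made up for illustration.

```python
class ToyTLB:
    """Toy model: a cache of virtual page -> physical frame mappings."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_page, page_table):
        # On a miss, fall back to the (slow) page table and cache the result.
        if virtual_page in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[virtual_page] = page_table[virtual_page]
        return self.entries[virtual_page]

    def flush(self):
        # "TLB flushing": throw away every cached mapping.
        self.entries.clear()


def context_switch(tlb, old_pid, new_pid):
    """Flush only when switching to a different process (hypothetical helper)."""
    if old_pid != new_pid:
        tlb.flush()
        return True
    return False


# Two made-up processes that use the same virtual pages for different memory.
page_table_a = {0: 7, 1: 3}
page_table_b = {0: 12, 1: 9}

tlb = ToyTLB()
tlb.translate(0, page_table_a)      # miss: first lookup for process A
tlb.translate(0, page_table_a)      # hit: mapping is cached
context_switch(tlb, old_pid=1, new_pid=2)
tlb.translate(0, page_table_b)      # miss again: the old mapping was flushed
```

Without the flush, process B would have been handed process A's frame for virtual page 0, which is why the old table has to go.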

And a bug in Intel's microcode for certain CPU families causes TLB invalidation (which is handled in hardware by the CPU) to (very rarely) take much, much longer than it should.

This bug affects all CPUs in the following families:

Intel Xeon E5-2600-v4

Intel Xeon E5-4600-v4

Intel Xeon E7-8800-v4

Intel Xeon E7-4800-v4

Intel Xeon D-1500

In other words, your CPU, a Xeon E5-2690 v4, is most definitely one of the CPUs affected by this microcode bug. Understand that this has nothing to do with macOS, and you can expect similar panics to occur (usually mentioning the same things: TLB invalidation/flushing, and/or failed acks/NMI/IPI/processor lockup/unresponsive processor) in any operating system. It will occur in Linux as well as Windows, though it may simply manifest as an unrecoverable freeze/lockup or may trigger a panic; it really comes down to how the kernel is set up to handle this. And this (TLB invalidation taking an impossibly long time) is not really something that should ever happen, hence it being handled in ungraceful ways like just locking up without even printing panic text.

This actually is likely happening to you several times without you even realizing it, but only sometimes is it severe enough to cause a panic.

A CPU should respond to an interrupt within a certain amount of time. Regardless, a CPU can only respond to an interrupt in between instructions, in other words at the transition where it has completed one instruction but has yet to start the next one. TLB invalidation has a specific assembly instruction on x86 processors, INVLPG. So this bug, which causes TLB invalidation to take much longer than it should, also occurs entirely within a single instruction, meaning that the CPU will not respond to any interrupt during this time. The panics are perfectly accurate - that CPU core was, in fact, completely unresponsive because it was processing this TLB invalidation instruction.
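A rough way to picture why a single slow instruction blocks interrupts is the sketch below. It is a toy illustration only. The durations are in arbitrary, made-up time units, and `interrupt_latency` is a hypothetical helper, not anything from a real kernel.

```python
def interrupt_latency(instruction_durations, arrival_time):
    """Time an interrupt waits until the next instruction boundary.

    instruction_durations: how long each instruction takes, in order.
    arrival_time: when the interrupt arrives, measured from the start.
    """
    boundary = 0
    for duration in instruction_durations:
        boundary += duration          # the moment this instruction completes
        if boundary >= arrival_time:
            return boundary - arrival_time
    raise ValueError("interrupt arrived after the last instruction finished")


# Made-up numbers: the third instruction stands in for the TLB invalidation.
normal = [1, 1, 2, 1]
stalled = [1, 1, 5000, 1]

TIMEOUT = 1000  # assumed OS timeout, purely illustrative

print(interrupt_latency(normal, 3) > TIMEOUT)   # False: serviced quickly
print(interrupt_latency(stalled, 3) > TIMEOUT)  # True: core looks unresponsive
```

The interrupt itself is identical in both cases. The only difference is how long the instruction that happened to be executing takes to finish.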

There are 3 ways this can play out:

1. A TLB invalidation takes an unusually long amount of time to complete, but it still completes quickly enough that the CPU core responds to an interrupt before some hard-coded kernel/OS-specific timeout elapses. As long as this timeout is never reached, there is never a panic. This sort of erroneous behavior, for reasons (maybe) only known to Intel, tends to be clustered, meaning many will begin occurring over the span of a few seconds. This can manifest as sudden cursor choppiness or the system clock failing to update in the menu for a few seconds, only to skip ahead once the issue resolves. Besides briefly reducing system performance slightly, these are generally harmless, albeit annoying if you notice.
2. A TLB invalidation again takes a lot longer than it should, but this time it runs out an interrupt timeout timer in the OS, which results in a panic. The system is at least able to log the panic text etc.
3. Sometimes another core will time out while the panic for the initial timeout is still being logged. What results is kind of hilarious: a panic log that is truncated by a second panic occurring in the middle of writing out the log of the first panic. Literally a panic inside a panic. This can be very bizarre to see in a panic log. Often, though, there will not be a log at all when this happens. Depending on the timing, this often simply results in a hard lockup of the system, just totally frozen (cursor movement may or may not be functional), with all other screen redraws simply stopping. No log, no panic text, the computer simply... stops. It'll burn watts in this state as well, since power management is no longer functional. I have personally observed this one occur in Windows and Ubuntu in addition to macOS.

No one outside of Intel knows the exact nature of this bug, but it seems to have a very strong stochastic element to it. Don't try to look for any sort of pattern - these panics truly are random. The only thing that really has an impact is how many context switches are occurring. If you imagine that every context switch has an extremely small chance of causing this CPU hang, then the more context switches that occur in a given time, the higher the chances of this lockup occurring. If you max out every core and all your RAM running a bunch of Docker containers, you can almost certainly trigger this reliably in a span of 4-6 hours. Other times, it can occur every 1-3 days, and sometimes it seems to vanish for a week or even 2 weeks, but it will never go away. It will inevitably crash your system. It is just as likely to occur while your computer is mostly idle or you aren't even using it (but it is awake) as when you are actively using it.
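The 'small chance per context switch' picture can be put into numbers with a short sketch. This assumes every switch is independent and has the same hang probability p, which is a simplification, and the value of p used here is invented purely for illustration since nobody outside Intel knows the real figure. Under that assumption the chance of at least one hang in n switches is 1 - (1 - p)^n.

```python
import math


def hang_probability(p_per_switch, switches):
    """P(at least one hang) = 1 - (1 - p)^n, assuming independent switches."""
    # log1p/expm1 keep the result accurate when p is tiny and n is huge.
    return -math.expm1(switches * math.log1p(-p_per_switch))


P = 1e-10  # made-up per-switch probability, NOT a measured value

for switches in (10**6, 10**8, 10**10):
    print(switches, hang_probability(P, switches))
```

Whatever the true p is, the shape is the same: the probability only ever climbs toward 1 as switches accumulate, which matches the observation that a busy machine crashes sooner but an idle one still crashes eventually.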

OK, good news/bad news time.

The good news is that if you have a production/retail version of any of the CPUs in any of the Xeon families I listed earlier (E5-2600-v4, E5-4600-v4, E7-8800-v4, and D-1500), then Intel has released a microcode update that fixes the bug entirely. It will completely resolve these panics, and TLB invalidation will no longer take anything except the normal amount of time to complete. Microcode is included in the BIOS, so the fix is, hopefully, as simple as updating your motherboard's BIOS to the latest version. Many mobo manufacturers aren't the best at making sure to include the latest microcode in their BIOSes, however, so if the latest Asus BIOS doesn't have microcode newer than, in your case, 0x0b00001b, you're still not out of luck. You can replace the microcode with the newer version in the BIOS file before you flash it. The Win-raid forums will have all the information you need to do this safely, but such an involved procedure is beyond the scope of this post.
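If you can boot Linux on the machine, one way to check which microcode revision is loaded is to look at the `microcode` lines in `/proc/cpuinfo`. The sketch below parses text in that format and compares it against the 0x0b00001b revision mentioned above. The helper names are made up, the sample text is not real output from any machine, and treating 'anything newer than 0x0b00001b' as fixed is simply taking this post's wording at face value.

```python
import re

THRESHOLD = 0x0B00001B  # revision quoted in the post; newer is assumed to be fixed


def parse_microcode_revisions(cpuinfo_text):
    """Return the microcode revisions found in /proc/cpuinfo-style text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return [int(value, 16) for value in matches]


def has_newer_microcode(cpuinfo_text, threshold=THRESHOLD):
    """True only if every reported revision is newer than the threshold."""
    revisions = parse_microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) > threshold


# Illustrative sample input only, not captured from real hardware.
sample = "processor\t: 0\nmicrocode\t: 0xb00001b\nprocessor\t: 1\nmicrocode\t: 0xb00001b\n"
print(has_newer_microcode(sample))  # False: equal to the threshold, not newer
```

On a real Linux system you would pass in the contents of `/proc/cpuinfo`, for example `has_newer_microcode(open("/proc/cpuinfo").read())`.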

The bad news is that if you do not have a retail version of the E5-2690 v4, and instead have any of the 'ES' (engineering sample) processor steppings (you can tell by clock speed - retail is clocked at 2.6GHz, the engineering samples are clocked at either 2.1GHz or 2.4GHz)... then your CPU does not use the microcode mentioned above. It has a completely different microcode specific to the ES steppings, and Intel has not released any updated microcode for these CPUs. The bug is still there, and there is no fix. Simply put, you're **** out of luck. Intel has not even released updated microcode to fix critical security vulnerabilities like Spectre or Meltdown, let alone this bug, and it's been over 2 years since the fix was issued for the retail CPU microcode. It is safe to say that Intel is never going to release updated microcode that fixes this bug for ES versions of these CPUs.

I hope you have the retail version of the E5-2690 v4. If you have an engineering sample, there is nothing you can do. You will suffer sporadic crashes that occur at random, usually every few days to a couple of weeks, for as long as you use an E5 v4 engineering sample of any kind. Your options are to simply live with it, shell out the (very high) price for a full retail Xeon CPU, or downgrade to a non-Xeon Core CPU that fits in an LGA2011-v3 socket. Another option might be to switch to a v3 (Haswell) Xeon, engineering samples included. I do not believe this bug affects the Haswell versions, retail or otherwise. I am not 100% certain about that, however.

But I mean, this is the kind of risk one takes on when purchasing an engineering sample. The discounted price isn't due to being clocked 100 or 200MHz lower (and really, who cares), it is due to the chance of things like these. If microcode bugs are discovered, or security exploits, or what have you, then they aren't going to be fixed for the engineering sample steppings. Most of the time, ES versions of chips are totally fine and you get a sweet chip at a sweet price. But in this case, if you bought an ES version of any E5-26xx v4 Xeon, your luck ran out.

@metacollin, Wow, thanks for that! I definitely didn't skip to the end - it was fascinating!

As for my exact choice of CPU, yeah, I'm out of luck. I knew that I wouldn't be getting any updates from Intel when I got it, heh. That's too bad, though. I can live with it. The lockup is more like every couple weeks, for me.

This will give me an interesting story to tell the students who use this computer, if it happens while they're working on something. Thanks!