I'm developing a system on a custom board that runs a Yocto-built Linux (kernel 4.1.27) on a relatively new 4-core Intel SoC in x64 mode. We continuously run a fairly intensive application that does a lot of data movement and number crunching.

I want to start with an apology. Sorry for the excessive detail. Feel free to skip over this post if you want to, or to skip straight to my questions at the end if it helps.

I think I'm seeing a hardware issue, but I might be misunderstanding something. I've laid out all the detail so you can see whether my understanding is incorrect, and because it might help someone else understand something about debugging.

So let's launch into it...

Now, at unpredictable intervals, about every few days or once a week, my system crashes with a “BUG: unable to handle kernel paging request”. So, I’ve built the kernel with KGDB and KDB enabled and set Oops to panic. I've connected GDB over the console and broken at panic() when the issue occurred.

If I’m understanding correctly, it seems to show that the application has invoked the brk() syscall to request more memory. This invoked do_brk() which in turn is using find_vma_links() to check the existing vm allocation. Finally, a page fault has occurred inside find_vma_links() and this page fault could not be resolved.
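For readers unfamiliar with this code path, the overlap check that find_vma_links() performs can be modelled roughly as follows. This is a simplified Python sketch, not the kernel code: the `Vma` class, `range_is_free()` and the example addresses are hypothetical stand-ins, and a plain binary tree replaces the kernel's red-black tree. It only illustrates that the function walks a tree of existing mappings by following pointers, which is where a bad pointer load would surface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vma:
    """Hypothetical stand-in for one mapping covering [vm_start, vm_end)."""
    vm_start: int
    vm_end: int
    left: Optional["Vma"] = None   # subtree of lower addresses
    right: Optional["Vma"] = None  # subtree of higher addresses


def range_is_free(root: Optional[Vma], addr: int, end: int) -> bool:
    """Return False if [addr, end) overlaps an existing mapping.

    Mirrors the shape of the kernel walk: descend left when the current
    mapping ends above addr, otherwise descend right.
    """
    node = root
    while node is not None:
        if node.vm_end > addr:
            if node.vm_start < end:
                return False  # overlap: the kernel would fail the request here
            node = node.left
        else:
            node = node.right
    return True


# Made-up address space with three mappings, ordered by address.
root = Vma(
    0x600000, 0x700000,
    left=Vma(0x400000, 0x500000),
    right=Vma(0x7F0000000000, 0x7F0000100000),
)

if __name__ == "__main__":
    print(range_is_free(root, 0x700000, 0x800000))  # growing just past a mapping
    print(range_is_free(root, 0x6FF000, 0x701000))  # straddles an existing mapping
```

Each loop iteration dereferences a node pointer, so a corrupted pointer or a misread instruction in this loop would show up as a fault on an address that has nothing to do with the tree.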

Looking deeper into the situation by disassembling find_vma_links() in frame 8 I see the instruction pointer at the following point

Now, this is the point at which I am uncertain, due to my lack of familiarity with both GDB and Intel processors. (My background is bare-metal ARM.) However, the way it looks to me is that the instruction at 0xffffffff8119bfad, mov (%rbx),%rax, caused the page fault. That instruction would read the value from the address stored in rbx and store the value in rax. The address in rbx is 0xffff880073245508 (seen in the console backtrace and confirmed in GDB), which I have confirmed is a valid vm address. The address that caused the page fault is 0x0000000040916000 = 1083269120 (in the CR2 register and the page fault calls). This value exists in rax and r12, but I don't see any instructions in the disassembly using an address from those registers.
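The two addresses above can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: it assumes 48-bit virtual addresses (4-level paging), 4 KiB pages, and the direct-mapping region documented for x86-64 kernels of this era (starting at 0xffff880000000000, 64 TB long, no randomization). The function names are made up for illustration.

```python
PAGE_SIZE = 4096                       # assumption: 4 KiB pages
USER_MAX = 0x0000_7FFF_FFFF_FFFF       # top of the lower canonical half
KERNEL_MIN = 0xFFFF_8000_0000_0000     # bottom of the upper canonical half
DIRECT_MAP_BASE = 0xFFFF_8800_0000_0000  # assumption: 4.x-era layout
DIRECT_MAP_END = 0xFFFF_C7FF_FFFF_FFFF   # assumption: 64 TB direct map


def classify(addr: int) -> str:
    """Say which half of the 48-bit canonical address space addr lies in."""
    if addr <= USER_MAX:
        return "user"
    if addr >= KERNEL_MIN:
        return "kernel"
    return "non-canonical"


def direct_map_offset(addr: int):
    """Offset into the direct map (a physical address), or None if outside it."""
    if DIRECT_MAP_BASE <= addr <= DIRECT_MAP_END:
        return addr - DIRECT_MAP_BASE
    return None


rbx = 0xFFFF880073245508   # pointer the instruction should have dereferenced
cr2 = 0x0000000040916000   # address the CPU reported as faulting

if __name__ == "__main__":
    print(classify(rbx), hex(direct_map_offset(rbx)))
    print(classify(cr2), cr2, cr2 % PAGE_SIZE)
```

Under those assumptions, the rbx value is a kernel direct-map pointer while the CR2 value is a page-aligned user-space address, which is why the two do not look related.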

So, I'm now left with the following questions to ponder...

It looks to me like the instruction at 0xffffffff8119bfad caused the page fault and would be re-executed if the page fault was resolved. Is my understanding correct?

The address in rbx is not the address that caused the page fault, but that address is present in another register. What could cause this page fault other than a hardware error?

Am I missing something or misunderstanding something?

So there you have it. I'll post a conclusion if I actually come to one.

The important part of all this is that the kernel was attempting to access address 0x0000000040916000, but the system wouldn't allow it. "Unable to handle kernel paging request" means there's an invalid pointer in the code. You can just report the bug and let the kernel team fix it.

Thanks @AwesomeMachine. That is my understanding of the exception too. And thanks for the advice about reporting the bug.

In the meantime...

A. I've read up more about exception handling on Intel. Fault exceptions are arranged so that the RIP saved on the stack is pointing to the instruction that caused the fault. That way if the fault is resolved the instruction will be restarted. So my understanding of that is correct also.

B. Regarding the kernel version, 4.1.27 is quite old. The latest 4.1 release is 4.1.46 and the file where the exception occurred, mm/mmap.c, has a lot of changes between the two versions. I am limited to the specific kernel versions for which Intel have published the patches that I require to run OpenCL for the Intel GPU. However, they have recently published the patches for kernel 4.1.42 so I'll try that out before reporting a kernel bug.

OK, an update to the information above, after spending some time studying the issue.

Further results
I have now been studying several systems for the past few weeks and have analysed several similar crashes. The one thing they all had in common was that there always seemed to have been a hardware malfunction, most likely an instruction being read incorrectly from RAM, resulting in various different failures.

Finally
I have just updated the BIOS for our system with one change - enabled spread-spectrum clocking for the eMMC interface. I don't know why, but this has caused the problem to disappear. We implemented this change to the BIOS for another reason. Anyway - happy days - it's not often you get a fix for free.