Reply inline…
On Nov 16, 2016, at 4:29 PM, Bernhard Urban via android-devel <android-devel at lists.dot.net> wrote:
> everytime I look at a runtime bug on Android, something doesn't feel right. Reports look different to each other. So I tried to get a better understanding on how we handle a SIGSEGV in the runtime and what the output is supposed to be. There are three basic steps [1]:
>> (1) we print a managed stacktrace.
> (2) we print a native stacktrace: we do that either via libunwind or libcorkscrew depending on what is available. if neither is available, we do nothing.
> (3) we call `exit (-1)`, which might give us more information such as a register dump.
Unfortunately, there are (implicitly!) *more* than three basic steps, and I’m fairly sure I still don’t understand what all is going on. For more wonderful context:
https://github.com/mono/mono/commit/5d07b77a67f61576318a30e8b1c5f65f7f26b1cf
> when a process crashes on Android, ideally:
>> 1. The Android signal handler is executed,
> 2. Bionic will attempt to connect to /system/bin/debuggerd.
> 3. debuggerd will try to connect to the crashing process, then
> retrieve "useful" information from the crashing process (stack
> trace, register values, etc.)
The “fun” is in trying to intermix Mono’s SIGSEGV handling mechanism in with Android’s infrastructure, which involves having an extra process (`debuggerd`) connect to the process to dump process state.
Additionally, I *believe* — but have not retested or reverified — that the `exit(-1)` within `mini-exceptions.c` won’t be executed, because of the Xamarin.Android calls `mono_set_crash_chaining(1)`, which sets `mono_do_crash_chaining` to 1:
https://github.com/xamarin/xamarin-android/blob/f862032/src/monodroid/jni/monodroid-glue.c#L2802
Not that any of the above in any way helps further improve reliability…
> That's the idea, unfortunately that is not always what we get. In order to see the behaviour across different devices and versions of Android, I made this simple crashing app: [2]. As soon as you click the button the application segfaults. For that I wrote a UI test and sent it off to Xamarin Test Cloud and collected the logs: [3]. Note that every device ran the same APK.
>> Out of 19 devices, there are really only two devices where the crash report looks like it should: samsung_google_nexus_10-4.4.txt and xiaomi_mi_4-4.4.4.txt. On many devices we only get a managed stacktrace and then the fun is over.
>> Why?
>> Good question. Luckily I have a device on my desk where this is the case, so I did a bit of printf debugging. What I figured out is, that the call to `mono_exception_native_unwind ()` in [4] is where the fun stops. The message I see on adb logcat:
>> 11-15 20:51:44.790 7093 7093 E audit : type=1701 msg=audit(1479239504.790:1839): auid=4294967295 uid=10288 gid=10288 ses=4294967295 subj=u:r:untrusted_app:s0:c512,c768 pid=14937 comm="artup.lulzcrash" exe="/system/bin/app_process32" sig=11
Are there any other `adb logcat` messages? The above looks like an SELinux-related message. (I have no idea what it *means*, but that’s what it looks like…)
> I see the text of a printf right before that call. printf at the beginning of the function doesn't happen. If I move `mono_exception_native_unwind ()` right before the managed stack unwinding, it crashes there, so it isn't a timeout mechanism. I have no idea why on earth this is the case. Unfortunately there is no clue from which PC the signal is coming from (maybe we cause another fault in the handler? maybe android interferes somehow?)
`debuggerd`?
> Anyone has some idea? Please tell me I overlook something obvious here. (I haven't had success yet with gdb/lldb)
I’ve only had success with gdb when using 32-bit targets. 64-bit targets give me gdb protocol mismatch errors. :-(
> Regardless, I want to suggest some things:
>> (a) we should get rid of the dynamic loading of libunwind/libcorkscrew. Some devices don't ship it. Instead, we should include it in the runtime. I think it's worth the extra footprint (if that is the concern why it wasn't done in the first place).
This is *absolutely* something we should consider. This is even more important in the context of Android 7.0 Nougat, which won’t allow us to load those native libraries, even if they do exist.
- Jon