Here is the kernel patch for this tool. The idea is to output the user-space stack call chain from /proc/xxx/stack; currently, /proc/xxx/stack outputs only the kernel stack call chain. We extend it to output the user-space call chain in hex format.


On Thu, 2012-04-19 at 11:50 +0800, Cong Wang wrote:
> On 04/17/2012 10:37 PM, Tu, Xiaobing wrote:
> > Resend the patch because the log is too long on a single line.
> >
> > From: xiaobing tu <xiaobing.tu [at] intel>
> >
> > Here is the kernel patch for this tool. The idea is to output the
> > user-space stack call chain from /proc/xxx/stack; currently,
> > /proc/xxx/stack outputs only the kernel stack call chain. We extend
> > it to output the user-space call chain in hex format.
>
> Can you teach me why we still need this as we have pstack?

Cong,

Sorry for replying so late. Xiaobing told me you sent him email, and I didn't receive the first one you sent out.

I tried pstack and it does work, which shows developers have wanted this kind of tool for a long time.

Although I haven't checked the source code of pstack (sorry, I'm busy debugging many critical issues), I think pstack is based on the ptrace interface, which means:
1) It needs to trap into the kernel many times to collect the call frames of one task.
2) It needs to send a signal to the ptraced process to stop it. That might have some impact if the ptraced process also handles many signals.
3) The data parsing to get symbols might not be split from the data collection. I mean, it collects the call frames of one process, then parses them; then it collects the 2nd task's. If there are many processes, it can't collect the data exactly at the monitoring time point.

Why did we work out this tool? The original requirement comes from real work. We are enabling Android on Medfield. One typical error on Android is an ANR. When a process can't respond within 5 seconds, Android reports an ANR error and dumps the Java call stack. However, it can't dump user-space libraries (such as bionic, written in C or C++). In addition, Android dumps only the stack of the non-responding process; it doesn't dump the stacks of others. As binder is a basic framework in Android, processes communicate through binder in a client/server model. When one process is not responding quickly, maybe another process is blocking it. We need to dump that process's status.

Many teams complained that it's hard to debug such ANR issues, especially the ones triggered during MTBF testing. Sometimes an ANR happens only after MTBF testing has run for a week. Developers have asked us to implement such a tool over and over again.

Besides ANR, sometimes the system might not respond to any user operation. Usually, the kernel or firmware would then reset the system. In that case, we also need to get the call chains of all the user-space processes before the system is reset.

With our tool:
1) We can collect the hex-format call chain data and /proc/XXX/maps of all processes quickly, then parse them either after rebooting or after the issue is reported. It can capture the scene right at the time point when the error happens. Our experiments show the tool can collect the data of all processes within 200 ms.
2) The new tool doesn't stop the processes and has less impact on them. In a performance bottleneck investigation, statistics collection shouldn't have a big impact on the running processes.
3) It supports both i386 and x86-64. I tried pstack and it doesn't work with x86-64.
4) It follows the /proc/XXX/stack interface and is easy to use.
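For illustration, the collection side amounts to little more than walking /proc and copying out each task's maps (and, with the patched kernel, its /proc/&lt;pid&gt;/stack, which usually needs root). A hedged sketch, with snapshot_all as a made-up name — the posted tool's real code may differ:

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/* Copy /proc/<pid>/maps of every process into `out`.  The same loop
 * would also read /proc/<pid>/stack on a kernel with the patch
 * applied.  Returns the number of processes snapshotted. */
static int snapshot_all(FILE *out)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    char path[64], line[512];
    int n = 0;

    if (!proc)
        return -1;
    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                       /* not a pid directory */
        snprintf(path, sizeof(path), "/proc/%s/maps", de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                       /* exited or no permission */
        fprintf(out, "=== pid %s ===\n", de->d_name);
        while (fgets(line, sizeof(line), f))
            fputs(line, out);
        fclose(f);
        n++;
    }
    closedir(proc);
    return n;
}
```

Because nothing here stops the target tasks, the whole pass is one sequential read per process; symbol resolution happens later, offline, against the saved trace file.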

Besides this tool, we are considering extending it so that the kernel collects the user-space call chain of the current process when it detects some other abnormal behavior.

On 04/19/2012 01:17 PM, Yanmin Zhang wrote:
> On Thu, 2012-04-19 at 11:50 +0800, Cong Wang wrote:
>> On 04/17/2012 10:37 PM, Tu, Xiaobing wrote:
>>> Resend the patch because the log is too long on a single line.
>>>
>>> From: xiaobing tu <xiaobing.tu [at] intel>
>>>
>>> Here is the kernel patch for this tool. The idea is to output the
>>> user-space stack call chain from /proc/xxx/stack; currently,
>>> /proc/xxx/stack outputs only the kernel stack call chain. We extend
>>> it to output the user-space call chain in hex format.
>>
>> Can you teach me why we still need this as we have pstack?
>
> Cong,
>
> Sorry for replying so late. Xiaobing told me you sent him email, and I
> didn't receive the first one you sent out.

Based on the length of your reply and the description of the patch, your patch description leaves out a lot of information.

> I tried pstack and it does work, which shows developers have wanted
> this kind of tool for a long time.
>
> Although I haven't checked the source code of pstack (sorry, I'm busy
> debugging many critical issues), I think pstack is based on the ptrace
> interface, which means:
> 1) It needs to trap into the kernel many times to collect the call
> frames of one task.
> 2) It needs to send a signal to the ptraced process to stop it. That
> might have some impact if the ptraced process also handles many signals.
> 3) The data parsing to get symbols might not be split from the data
> collection. I mean, it collects the call frames of one process, then
> parses them; then it collects the 2nd task's. If there are many
> processes, it can't collect the data exactly at the monitoring time point.

Yet another one who wants to "fix" ptrace. ;-)

> Why did we work out this tool? The original requirement comes from real
> work. We are enabling Android on Medfield. One typical error on Android
> is an ANR. When a process can't respond within 5 seconds, Android
> reports an ANR error and dumps the Java call stack. However, it can't
> dump user-space libraries (such as bionic, written in C or C++). In
> addition, Android dumps only the stack of the non-responding process;
> it doesn't dump the stacks of others. As binder is a basic framework in
> Android, processes communicate through binder in a client/server model.
> When one process is not responding quickly, maybe another process is
> blocking it. We need to dump that process's status.
>
> Many teams complained that it's hard to debug such ANR issues,
> especially the ones triggered during MTBF testing. Sometimes an ANR
> happens only after MTBF testing has run for a week. Developers have
> asked us to implement such a tool over and over again.
>
> Besides ANR, sometimes the system might not respond to any user
> operation. Usually, the kernel or firmware would then reset the system.
> In that case, we also need to get the call chains of all the user-space
> processes before the system is reset.

I am not familiar with Android at all, so a quick question: if this is only for Android, why do you introduce this for everyone? IOW, why not provide a Kconfig option?

BTW, I am sure you need to put the above paragraphs into your patch description, to make it clear why the patch is needed.

> With our tool:
> 1) We can collect the hex-format call chain data and /proc/XXX/maps of
> all processes quickly, then parse them either after rebooting or after
> the issue is reported. It can capture the scene right at the time point
> when the error happens. Our experiments show the tool can collect the
> data of all processes within 200 ms.
> 2) The new tool doesn't stop the processes and has less impact on them.
> In a performance bottleneck investigation, statistics collection
> shouldn't have a big impact on the running processes.
> 3) It supports both i386 and x86-64. I tried pstack and it doesn't work
> with x86-64.
> 4) It follows the /proc/XXX/stack interface and is easy to use.
>
> Besides this tool, we are considering extending it so that the kernel
> collects the user-space call chain of the current process when it
> detects some other abnormal behavior.

In my previous reply, I ran 'pstack' on my x86-64 machine; I don't understand why you said it doesn't work with x86-64. I guess pstack supports more than just x86, as ptrace is available on other arches too.

On Thu, 2012-04-19 at 14:13 +0800, Cong Wang wrote:
> On 04/19/2012 01:17 PM, Yanmin Zhang wrote:
> > On Thu, 2012-04-19 at 11:50 +0800, Cong Wang wrote:
> >> On 04/17/2012 10:37 PM, Tu, Xiaobing wrote:
> >>> Resend the patch because the log is too long on a single line.
> >>>
> >>> From: xiaobing tu <xiaobing.tu [at] intel>
> >>>
> >>> Here is the kernel patch for this tool. The idea is to output the
> >>> user-space stack call chain from /proc/xxx/stack; currently,
> >>> /proc/xxx/stack outputs only the kernel stack call chain. We extend
> >>> it to output the user-space call chain in hex format.
> >>
> >> Can you teach me why we still need this as we have pstack?
> > Cong,
> >
> > Sorry for replying so late. Xiaobing told me you sent him email, and I
> > didn't receive the first one you sent out.
>
> Based on the length of your reply and the description of the patch, your
> patch description leaves out a lot of information.

Indeed, we need to add more info there.

> > I tried pstack and it does work, which shows developers have wanted
> > this kind of tool for a long time.
> >
> > Although I haven't checked the source code of pstack (sorry, I'm busy
> > debugging many critical issues), I think pstack is based on the ptrace
> > interface, which means:
> > 1) It needs to trap into the kernel many times to collect the call
> > frames of one task.
> > 2) It needs to send a signal to the ptraced process to stop it. That
> > might have some impact if the ptraced process also handles many signals.
> > 3) The data parsing to get symbols might not be split from the data
> > collection. I mean, it collects the call frames of one process, then
> > parses them; then it collects the 2nd task's. If there are many
> > processes, it can't collect the data exactly at the monitoring time point.
>
> Yet another one who wants to "fix" ptrace. ;-)

Agreed. But usually it's hard to fix very old code. Ptrace is used by gdb, and people don't touch the kernel part.

> > Why did we work out this tool? The original requirement comes from real
> > work. We are enabling Android on Medfield. One typical error on Android
> > is an ANR. When a process can't respond within 5 seconds, Android
> > reports an ANR error and dumps the Java call stack. However, it can't
> > dump user-space libraries (such as bionic, written in C or C++). In
> > addition, Android dumps only the stack of the non-responding process;
> > it doesn't dump the stacks of others. As binder is a basic framework in
> > Android, processes communicate through binder in a client/server model.
> > When one process is not responding quickly, maybe another process is
> > blocking it. We need to dump that process's status.
> >
> > Many teams complained that it's hard to debug such ANR issues,
> > especially the ones triggered during MTBF testing. Sometimes an ANR
> > happens only after MTBF testing has run for a week. Developers have
> > asked us to implement such a tool over and over again.
> >
> > Besides ANR, sometimes the system might not respond to any user
> > operation. Usually, the kernel or firmware would then reset the system.
> > In that case, we also need to get the call chains of all the user-space
> > processes before the system is reset.
>
> I am not familiar with Android at all, so a quick question: if this is
> only for Android, why do you introduce this for everyone? IOW, why not
> provide a Kconfig option?

Although we are working on Android, we think the tool might be useful for resolving similar issues elsewhere. For example, I worked on performance tuning years ago and had a headache over a performance drop on a large-scale server. From the kernel side, I couldn't find enough info to debug it. Eventually, I root-caused some issues by attaching gdb and manually checking the user-space call chains. It was painful.

In addition, the new tool consists of a kernel patch and a user-space parsing tool. The kernel patch is quite simple and shouldn't hurt the system. It reuses the existing CONFIG_USER_STACKTRACE_SUPPORT.

> BTW, I am sure you need to put the above paragraphs into your patch
> description, to make it clear why the patch is needed.

That's definitely a good idea.

> > With our tool:
> > 1) We can collect the hex-format call chain data and /proc/XXX/maps of
> > all processes quickly, then parse them either after rebooting or after
> > the issue is reported. It can capture the scene right at the time point
> > when the error happens. Our experiments show the tool can collect the
> > data of all processes within 200 ms.
> > 2) The new tool doesn't stop the processes and has less impact on them.
> > In a performance bottleneck investigation, statistics collection
> > shouldn't have a big impact on the running processes.
> > 3) It supports both i386 and x86-64. I tried pstack and it doesn't work
> > with x86-64.
> > 4) It follows the /proc/XXX/stack interface and is easy to use.
> >
> > Besides this tool, we are considering extending it so that the kernel
> > collects the user-space call chain of the current process when it
> > detects some other abnormal behavior.
>
> In my previous reply, I ran 'pstack' on my x86-64 machine; I don't
> understand why you said it doesn't work with x86-64. I guess pstack
> supports more than just x86, as ptrace is available on other arches too.

OK. I use the latest Ubuntu on my workstation and installed pstack with apt-get, without recompiling it. The default pstack executable reported a failure on the 64-bit OS. I might have been wrong and will check pstack again.

On Thu, 2012-04-19 at 13:17 +0800, Yanmin Zhang wrote:
> Although I haven't checked the source code of pstack (sorry, I'm busy
> debugging many critical issues), I think pstack is based on the ptrace
> interface, which means:
> 1) It needs to trap into the kernel many times to collect the call
> frames of one task.
> 2) It needs to send a signal to the ptraced process to stop it. That
> might have some impact if the ptraced process also handles many signals.

Yeah, but who cares.. it's debugging stuff..

> 3) The data parsing to get symbols might not be split from the data
> collection. I mean, it collects the call frames of one process, then
> parses them; then it collects the 2nd task's. If there are many
> processes, it can't collect the data exactly at the monitoring time point.

This is equally true for your silly patch.

On Thu, 2012-04-19 at 13:17 +0800, Yanmin Zhang wrote:
> 1) We can collect the hex-format call chain data and /proc/XXX/maps of
> all processes quickly, then parse them either after rebooting or after
> the issue is reported. It can capture the scene right at the time point
> when the error happens. Our experiments show the tool can collect the
> data of all processes within 200 ms.

No you can't, ever heard of address space randomization?

> 2) The new tool doesn't stop the processes and has less impact on them.
> In a performance bottleneck investigation, statistics collection
> shouldn't have a big impact on the running processes.

Maybe.. on these tiny systems you're working on, most tasks will not be runnable anyway, since you only have 1 (maybe 2) CPUs and what's running is your dumper process; so almost everything isn't runnable, and attaching and dumping the stacks of all tasks isn't really much more expensive than this.

The open/read/close you do on the proc files, along with the readdir etc., are system calls just like the ptrace alternative.

> 3) It supports both i386 and x86-64. I tried pstack and it doesn't work
> with x86-64.

Yeah, and you'll need to extend it to ARM/MIPS/etc., whereas there is plenty of userspace around that can already work on all those platforms -- if pstack cannot, that's weird; I'd think it would use all the regular binutils muck that already supports all the platforms.

> 4) It follows the /proc/XXX/stack interface and is easy to use.

Uhm, not so very much, see your ASLR issue. Furthermore, it requires all userspace to be built with frame pointers enabled -- which I think would be a good thing anyway -- but with which reality seems to disagree.
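For reference, the frame-pointer chain being argued about can be sketched in-process. This assumes the x86 frame layout (saved frame pointer followed by return address) and code built with frame pointers (-fno-omit-frame-pointer or an unoptimized build); walk_frames is an illustrative name, not the patch's code, which walks the target task's stack from kernel space instead:

```c
#include <stdint.h>

/* Walk the current thread's frame-pointer chain.  On x86 with frame
 * pointers, each frame starts with [saved fp][return address], so
 * following the saved-fp links yields one return address per frame.
 * Stores up to `max` return addresses in `out`; returns the count. */
static int walk_frames(uintptr_t *out, int max)
{
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
    int n = 0;

    while (fp && n < max) {
        uintptr_t ret = fp[1];          /* return address above saved fp */
        if (!ret)
            break;
        out[n++] = ret;
        uintptr_t *next = (uintptr_t *)fp[0];
        if (next <= fp)                 /* stack grows down; bail on nonsense */
            break;
        fp = next;
    }
    return n;
}
```

If userspace is built without frame pointers, fp[0] is just some spilled value and the walk stops (or wanders) immediately, which is exactly Peter's objection.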

On Fri, 2012-04-20 at 11:38 +0200, Peter Zijlstra wrote:
> On Thu, 2012-04-19 at 13:17 +0800, Yanmin Zhang wrote:
> > Although I haven't checked the source code of pstack (sorry, I'm busy
> > debugging many critical issues), I think pstack is based on the ptrace
> > interface, which means:
> > 1) It needs to trap into the kernel many times to collect the call
> > frames of one task.
> > 2) It needs to send a signal to the ptraced process to stop it. That
> > might have some impact if the ptraced process also handles many signals.
>
> Yeah, but who cares.. it's debugging stuff..

Real developers and real debuggers care. End users don't.

> > 3) The data parsing to get symbols might not be split from the data
> > collection. I mean, it collects the call frames of one process, then
> > parses them; then it collects the 2nd task's. If there are many
> > processes, it can't collect the data exactly at the monitoring time point.
>
> This is equally true for your silly patch.

Not true for my patch. We did many experiments. Originally, we used a ptrace-based method similar to pstack and found it very slow to get the stacks of all processes when the system reports an issue.

> secondly the implementation is crappy,

I agree it's a little crappy. Other methods like ptrace are slow, although the code looks clean. When ptrace is slow and might have other bad impacts, could we also say it's crappy?

When implementing it, we had similar questions:
1) Can the parser get the correct data in time, especially when the process is running fast and doesn't sleep?
2) If /proc/XXX/maps is changing, i.e. the process mmaps/munmaps frequently, can the parser parse the data correctly? That's an issue indeed, but most applications don't do that.

Currently, in user space, we collect both the hex stack data and the maps, then save them to a trace file. After all the collection is done, we do the symbol parsing. That mitigates issue 2) dramatically.

The safest way is to stop the process in a special state, like TASK_STOPPED or TASK_TRACED; then get the data and parse it; then let the process resume. That approach is used to check more detailed data like variables, and to change them with gdb. But it's too slow and might have a bad impact on a running system. You might say it's debugging stuff and debuggers should look for a good approach, but I don't think saying that resolves the issue. We need to provide more tools to help developers.

Usage scenarios:
A) Performance debugging: when debuggers want a quick check, they don't care whether the first try gets the exact data, as the system is RUNNING. They would collect the data many times, so question 1) above doesn't hurt the tool's usefulness.
B) Android ANR debugging: we did root-cause some ANR issues with our tools.

On Fri, 2012-04-20 at 11:54 +0200, Peter Zijlstra wrote:
> On Thu, 2012-04-19 at 13:17 +0800, Yanmin Zhang wrote:
> > 1) We can collect the hex-format call chain data and /proc/XXX/maps of
> > all processes quickly, then parse them either after rebooting or after
> > the issue is reported. It can capture the scene right at the time point
> > when the error happens. Our experiments show the tool can collect the
> > data of all processes within 200 ms.
>
> No you can't, ever heard of address space randomization?

No. I googled it a moment ago. Here is my understanding: ASLR is a security feature. The OS arranges the mmap areas randomly, which means the mmap layout might change when the same executable runs twice.

Is my understanding correct?

Answer: with our tool, we collect both the user-space stack data and /proc/XXX/maps and save them to a trace file, then parse them either immediately or after the system reboots.
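Concretely, because the maps snapshot is saved next to the raw addresses, the offline parser can translate each address into a module-plus-file-offset pair, which is stable under ASLR. A minimal sketch (resolve and the sample maps line below are illustrative, not code from the posted tool):

```c
#include <inttypes.h>
#include <stdio.h>

/* Resolve a raw user-space address against one saved /proc/<pid>/maps
 * line ("start-end perms offset dev inode path").  Returns 1 and fills
 * module/file_off if addr falls inside the mapping, 0 if it does not,
 * -1 on a malformed line. */
static int resolve(const char *maps_line, uint64_t addr,
                   char *module, size_t modlen, uint64_t *file_off)
{
    uint64_t start, end, off;
    char perms[8], path[256] = "";

    if (sscanf(maps_line, "%" SCNx64 "-%" SCNx64 " %7s %" SCNx64
               " %*s %*s %255s", &start, &end, perms, &off, path) < 4)
        return -1;
    if (addr < start || addr >= end)
        return 0;
    snprintf(module, modlen, "%s", path);
    *file_off = off + (addr - start);   /* ASLR-independent offset */
    return 1;
}
```

The resulting "libfoo.so+0x1234" pairs can then be symbolized at leisure with the matching binaries, even after a reboot, since the randomized load addresses were captured in the same snapshot.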

> > 2) The new tool doesn't stop the processes and has less impact on them.
> > In a performance bottleneck investigation, statistics collection
> > shouldn't have a big impact on the running processes.
>
> Maybe.. on these tiny systems you're working on, most tasks will not be
> runnable anyway, since you only have 1 (maybe 2) CPUs and what's running
> is your dumper process; so almost everything isn't runnable, and
> attaching and dumping the stacks of all tasks isn't really much more
> expensive than this.

I raised at least two usage scenarios. One is the Android ANR issue; the other is performance bottleneck investigation on a _server_. Android does run on small systems with 1 or 2 CPUs (some have 4 cores, but those are not popular now). It's not as simple as you say to collect the user-space stacks of all processes through the ptrace interface. We did the experiment, and the collection is time-consuming, sometimes not even endurable.

In addition, we extended the patch on our systems to dump the user stacks of all processes when the system hangs. The current patch sent to LKML doesn't include that.

> The open/read/close you do on the proc files, along with the readdir
> etc., are system calls just like the ptrace alternative.

Good point.
1) With ptrace, there is one syscall to fetch each call frame; with our tool, there is mostly only one in total.
2) With ptrace, we need to stop the processes. On a small Android system, that seems OK, as you said; but for performance tuning on a large server, it's not OK.

> > 3) It supports both i386 and x86-64. I tried pstack and it doesn't work
> > with x86-64.
>
> Yeah, and you'll need to extend it to ARM/MIPS/etc..

It's a problem. We implemented it on x86 first. If it proves good, others could port it to other platforms.

> whereas there is plenty of userspace around that can already work on all
> those platforms -- if pstack cannot, that's weird; I'd think it would use
> all the regular binutils muck that already supports all the platforms.

Would you like to give me a pointer to those tools in binutils?

> > 4) It follows the /proc/XXX/stack interface and is easy to use.
>
> Uhm, not so very much, see your ASLR issue.

> Furthermore, it requires all userspace to be built with frame pointers
> enabled -- which I think would be a good thing anyway -- but with which
> reality seems to disagree.

You are right indeed. The tool is for debuggers and developers.

In addition, I am thinking we might extend the tool to dump the user stack with blurred (imprecise) data, just dumping the stack starting from $esp. The final symbol dump would look like the lines marked with '?' in the output of dump_stack. For example, we could define the shortest distance between calls on the stack and check whether the data on the stack maps into a real VMA. With the blurred data, developers could at least get a good hint. As you know, sometimes we can't get the source code of some libraries to recompile them.

Yeah, but what is the above meant to achieve? It doesn't actually stop the task or anything; it will just trap the remote CPU. By the time you do your stack walk below, the CPU might be running another task entirely, or you're walking a live stack with all the 'fun' issues that brings.

On Tue, 2012-04-24 at 09:30 +0800, Yanmin Zhang wrote:
> Would you like to point out a workable userspace stack walker?
> If there is one, we would check whether we could reuse it.

arch/x86/kernel/cpu/perf_event.c:perf_callchain_user(); it also deals with the compat stuff.

On Tue, 2012-04-24 at 12:11 +0200, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 09:30 +0800, Yanmin Zhang wrote:
> > Would you like to point out a workable userspace stack walker?
> > If there is one, we would check whether we could reuse it.
>
> arch/x86/kernel/cpu/perf_event.c:perf_callchain_user(); it also deals
> with the compat stuff.

Yes, it does. But it collects the user-space stack call chain of the _current_ task only.

When Xiaobing worked out the patch, we did consider whether we could implement it on top of perf. We also checked ftrace. Both ftrace and perf collect the user stack of the _current_ task only.

Xiaobing wrote a similar tool based on ptrace a year ago and gave it up because it was slow.

I was thinking we might use the powerful symbol-parsing capability of perf to do the user-space parsing. I was busy, and Xiaobing just reworked his old code into a prototype quickly, as other developers were pushing hard for the tool.

The IPI is to make sure the task traps into the kernel at least once, so we can get its regs->bp. If the task has been running on another CPU for a long time, regs->bp might be too old. I am also a little worried that if the task returns to user space and runs right after the IPI, regs->bp might be clobbered. If that's true, we might get bad data, or no useful data at all.