Thursday, 5 September 2013

health-check: a tool to diagnose resource usage

Recently I have been focused on ways to reduce power consumption on Ubuntu phones.
To identify resource hungry issues one can use tools such as top, strace and gdb, or inspect per process properties in /proc/$pid, however this can be time consuming and not practical with many processes in a system. Instead, I decided it would be profitable to write a tool to inspect and monitor a running program and report on areas where it appeared to be sub-optimal. And so health-check was born.

One provides health-check with a list of one or more processes and it will monitor all the associated threads (and child processes) and then report back on the resources used. Heath-check will report on:

CPU utilisation

Wakeup events

Context Switches

File I/O operations (Open/Read/Write/Close using fnotify)

System calls (using ptrace)

Analysis of polling system calls

Memory utilisation (including memory growth)

Network connections

Wakelock activity

CPU utilisation, wakeup events and context switches are useful just to see how much CPU related activity is occurring. For example, a program that has multiple threads that frequently ping-pongs between them will have a high context switch rate, which may indicate that it is busy passing data or messages between threads. A process may be frequently polling on short timer delays may show up as generating a high level of wakeup events.

Some applications may be sub-optimally writing out data frequently, causing dirty pages and meta data that needs to be written back to the file system. Health-check will capture file I/O activity and report on the names of the files being opened, read, written and closed.

To help identify excessive or heavy system call usage, health-check uses ptrace to trap and monitor all the system calls that the program makes. For example, it has been observed that some applications excessively call poll() and nanosleep() with poorly chosen timeouts causing excessive CPU utilisation. For system calls such as these where they can wait until an event or a timeout occur, health-check has some deeper monitoring. It inspects the given timeout delay and checks to see if the call timed out, for example, health-check can identify CPU sucking repeated polling where zero timeouts are being used or excessive nanosleeps with zero or negative delays.

The ptrace ability of heath-check also allow it to monitor per-process wake lock writes. Abuse of wakelocks can keep the a kernel from suspending into deep sleep so it is useful to keep track of wakelock activity on some processes. This is not enabled by default as it is an expensive operation to monitor this via ptrace and also some kernels may not have wakelocks, so one has to use the -W option to enable this.

Health-check also inspects /proc/$pid/smaps and will determine if memory utilisation has grown or shrunk. Unusually high heap growth over time may indicate that an application has a memory leak.

Finally, health-check will inspect /proc/$pid/fd and from this determine any open sockets and then try and resolve the host names of the IP addresses. For example, it is entirely possible for an application to be making spurious or unwanted connections to various machines, so it is helpful to check up on this kind of activity.

Health-check is still very early alpha quality, so beware of possible bugs. However, it has been helpful in identifying some misbehaving applications, so it is already proving to be rather useful.