Introduction

Tracing and debugging large systems, well, even smaller systems is often a tricky and challenging business. This howto introduces Onviso, a tool to trace systems distributed across several nodes and analyze the data collected. The Onviso tool is a part of the ProTest Research project.

Note: Onviso is still very much a work in-progress, i.e things may change!

Feedback on the tool, the user-guide, anything really, is much appreciated. Just drop me an e-mail with your thoughts <marcus dot ljungblad at erlang-consulting.com>.

Overview

The Onviso application wraps the OTP tool Inviso in an easy to use API. It is designed offer adequate defaults while still retaining the advantages that
Inviso offers while tracing in large system environments.

Using only two functions it is possible to set up tracing across multiple nodes and merge these in any possible way. Additionally, for convenience when using Onviso as an ad-hoc tracing tool, it is also possible to retrieve the status of the recent traces run and the configuration that was used.

For merging the traces there are a number of defaults provided, the simplest of them simply writing every trace to a file in the order as they are generated. The merge functionality can also be used to conduct property checking or profiling the system that was traced. The possibilities are endless and only restricted to the three funs that are passed to the merge function as you will see soon.

Background

Onviso merely wraps the functionality of Inviso. Inviso is part of Erlang/OTP, at the time of writing in version 0.6, and is still maturing. Onviso in itself doesn't add any new functionality to Inviso (apart from some information on the traces set). In essence, Onviso is an attempt to make Inviso easier to use.

Comparison to dbg

Onviso's strengths comes with the advanced possibilities of merging the traces collected as well as easily setting up tracing across multiple nodes. Onviso can be used in a similar fashion as dbg, however, it is not intended to be a replacement.

Setting up a trace

To give some meaning to this guide we'll provide a system that we can trace on. In the following examples we'll play with two nodes: 'server@linux' and 'client@linux'. Our test node will be called 'inviso@linux'.

If you have used dbg before this will look very familiar. To set up a trace in Onviso you must specify what to trace on, this is done by specifying modules, functions and possibly arguments passed to the functions. Onviso uses the following format to specify a so called pattern:

{module, function, arguments, matchspecification}

Say for example that we want to trace the 'put' function in the client module the pattern would look like:

{client, put, '_', []}

This pattern would match any calls made to the client:put/1 function. Furthermore, if we would like to match only when the function is passed the atom none as an argument then we can write the following:

{client, put, [none], []}

Wildcards can also be used to match all functions of a module:

{client, '_', '_', []}

Note however that

{client, '_', [none], []}

is illegal.

Onviso exports one function to set-up a trace, not surprisingly called trace.

trace(Patterns, Flags) % only traces on the local node
trace(Patterns, Nodes, Flags) % no overload protection is used
trace(Patterns, Nodes, Flags, OverloadProtection)

I wont go into details about match specifications, if you want you can read about them here instead. However, Onviso has a nice trick up its sleeve (which is honestly inspired by Mats Cronqvist's work on redbug), namely match specification shortcuts. Instead of typing:

{client, put, '_', [{'_',[],[{return_trace}]}]}

you simply type:

{client, put, '_', return}

and you get the same result; a match specification which will give you the return information from the calls which are matched by the pattern. One can also use caller to get the calling function.

Onviso supports all flags that the erlang:trace bif supports, for a full list I recommend you to have a look at the manual. There is one difference though, Onviso will always add the timestamp flag to the list of flags so that it is possible to merge the traces in chronological order.

Tracing example

So, lets set up a trace on the loop function of the server module and a trace on the client functions that access the server node. Remember we have to specify patterns, nodes and flags.

If Onviso could connect to the nodes it will return with ok and a reference number. We will see the use of this shortly. Make sure that you are using the same magic cookie on all the nodes you want to connect to.

erlang:set_cookie(node(), inviso).

Note that the trace is activated once the function is executed. This means it will now start collecting traces on the respective nodes, eventually to be written to file.

Collecting the results

Once a trace is started it can be stopped in one of two ways: using onviso:stop/1 or onviso:merge/4.

2> onviso:stop(1). % note that the 1 is the reference number returned by the trace function.

Will cause the activated nodes to stop tracing on all patterns, write the traces collected to files on each respective node and finally transfer these files (which are stored in a binary format) using distributed Erlang to the Onviso control node. Depending on the amount of trace data collected this may take some time.

However, the binary files by themselves are not much use to us. The fun only starts once we analyze the traces collected. This is done with the merge/4-5 function. The most simplest of merges is:

3> onviso:merge(1, void, void, shell).

Which will cause all traces collected to be printed to the Erlang shell. Arguably, this is not very useful if we've collected a lot of traces. Instead we can:

4> onviso:merge(1, void, void, file).

print all traces to a file. It will be called outputREF.txt and stored in your default path (the root of your Erlang node).

Imagine you for some reason want to count all the traces collected (this is actually done for you when merging, but let's play with the idea) then we can write our own funs to do this. Let's have a look at the following example:

Note: We can merge the same trace case over and over again through the initial reference. A new reference is generated for every trace.

Protecting live systems

The architecture of Inviso has already some protection for live systems built-in, but there's actually more to it. You can also specify own functions to monitor the load of the node. This section will cover an introduction to the architecture used by Inviso and we'll also set-up a trace using the so called overload protection.

Architecture

Inviso consists of two parts. One is called Inviso and exists in the Erlang/OTP library as an application called just Inviso. The other half of Inviso, and equally important, is part of the runtime_tools application. Thus, the requirements to use Onviso is that runtime_tools are installed on the nodes you want to trace, and that the controlling node also has Inviso installed.

For every node you trace inviso_rt activates the trace patterns using the erlang:trace BIF and collects the traces in RAM. When the control node instructs all traced nodes to stop tracing the RAM is flushed to a file (in fact this uses the same file driver as dbg). To transfer the binary files from each of the node a process is spawned to send this file using distributed Erlang. Moreover, this process runs with a low priority so as to minimize the system impact as much as possible. By relaying the raw traces to the control node, which doesn't have to be a part of the system under trace, we essentially remove all set-up and performance load on the system nodes. In fact, as long as the magic cookie is the same and runtime_tools is running, you don't need to change anything on the system.

Inevitably, large numbers of patterns will generate a lot of traces. Equally, if there's heavy usage of the node, a lot of traces will be generated and files grow quickly. For the time being, Onviso does not support wrapping of log files. Just beware.

Overload protection

There are however some precautions we can take before running on production systems. We can specify own functions that measure the load of the system and have Onviso regularly use this to ensure everything still runs smoothly. In case a threshold is reached, the tracing will be suspended automatically.

Since Onviso has no idea about the system it is tracing and absolutely not even the faintest clue when it is being close overloaded this must be handled outside Onviso. Below is a rudimentary example module which checks the amount of free RAM of a Linux system.

check/1 actually accepts the return value of the start/0 function, and is passed this value every time the check function executes. In addition, start can also be passed any number of arguments. How to pass arguments to the start function is defined below. It is important, however, that the check function returns ok when everything is alright, and {suspend, Reason} when it is not.

Here's how we integrate the overload_handler describe above in our trace:

As you will see, start and stop are both followed by empty lists, which can be used to pass any number of arguments to the functions respectively. The interval with which Onviso will run the check function is defined to 15000 ms. Also note that the overload protection will apply to all nodes.

It is very likely you have a better way of checking for potential overload which is way more specific to your system. This is simply an example.

More merging

As mentioned in the introduction there are endless ways one can merge and analyze the traces collected. To use a well known cliché: only your imagination sets the limit.

In this section we'll look at two examples to illustrate some of the possibilities and also take a look at some traces in more detail. Please feel free to add more examples to this section.

In this example all the traces are stored in a record which is first created by BeginFun. There is a fun head matching each trace type, either call or return_from. It is in this way we can selectively choose what to do with each trace. Finally, the results are written to file.

Do you have ideas of what to do with the traces collected? Feel free to try them out and add more examples to this list.

Summary

As you probably realize Onviso can be used for so much more than simple debugging. It is a competent tool for easily setting up tracing across multiple nodes and offers plenty of possibilities for analyzing the data collected.

I welcome any suggestions for improvement, ideas (wild or not), and general feedback on the API of Onviso. Just drop me line!