Building the right thing, building it right, fast

Who I am

My name is Jakub Holy and I’m a software craftsmanship enthusiast and a (mainly JVM-based) developer since ~ 2005, consultant, and occasionally a project manager, working currently with TeliaSonera AS in Norway. More under About and #opinion.

Archive for October, 2008

A word of warning: Information on this page originates from my exploration of Glassbox performed in Oct 2008 and may be inaccurate. Ron Bodkin, the mastermind behind Glassbox, was so kind as to review this but still there may be some mistakes or inexact informatiom left. In any case blame me🙂

Introduction

There’re a few open source java webapp monitoring tools: Glassbox (latest release 2008-09-02), JAMon (2007-09-20), InfraRED (2006-05-17), UseMon (2008-10-06). Among these, Glassbox is both still actively developed and mature. It has also nice though only basic user documentation. Unfortunately there is only little docs about its architecture and about customizing it for monitoring a specific webapp. Therefore I’ve decided to dive into its code to learn more about it to be able to decide whether it’s suitable for our needs of monitoring the Pentaho BI Platform web application.

The unique features of Glassbox are that it tries to point out existing performance problems together with their likely causes and advices what to do about them and that it displays aggregated statistics about the high level "operations" in the monitored web app.

Basic Info about Glassbox

Glassbox consists of a webapp UI that displays statistics and of a set of monitors that are injected into the observed webapp and collect data about its performance. Some of its characteristics are:

Preferably Java 5+ though it’s also possible to use AspectJ with Java 1.4 and maybe 1.3.

Data not persistent (though this is on the roadmap).

Layer aware (UI, logic – EJB, resources – DB, I/O, …).

Customized monitors for popular frameworks (Spring MVC, …).

UI concentrates on providing the right information at the right time with the necessary context. In other words it doesn’t display tons of detailed data but a list of top-level operations highlighting the problematic (slow/failing) ones and providing detailes on likely cause(s) of the problems.

If you are new to Glassbox I’d recommend you to have first a look at the presentation Glassbox Architecture & Design (Oct 25, 2008). You may also want to watch an older but still usefull Glassbox Tech Talk video (Sep 25, 2006). And if you’re the visual type there’s a screenshot of Glassbox UI.

How it works

Glossary

Aspect (wiki): There’re many definitions, for us this is a piece of code injected into the monitored application to perform the actual monitoring of a particular method in a particular class.

Components and resources: Glassbox breaks time spent in an operation down by the type of activity that consumed it distinguishing between ‘components’ such as database access or remote calls and ‘resources’ such as running java code, running native code, or thread waiting (usually I/O).

Layer: A monitored method can belong to one of application’s layers of processing such as ui.controller or resource.database.statement.

Operation: An "entry point" of user requests into the application. Usually this is some high-level method such as a Struts Action’s execute or JSP’s service. Glassbox uses
operations as the unit of monitoring service level agreements (SLA) and to analyze where time goes. Anything
that’s used within an operation is considered a resource (EJB call, DB query…). (Though the notion of components shall be added in the future.)

In the source code, the classes having Operation in their name are often also used to track monitored methods.

Request: a) Vaguely synonymous to a monitored method; b) In this text it often means a user-invoked request that resulted in some activity in the monitored web application (though I try to use the term user request to distinguish it from a) ).

Basic concepts

Glassbox doesn’t bother to monitor all methods that are invoked as a result of user request because it is not necessary and would incur higher overhead. It only monitors some especially significant methods of the call tree such as a Struts Action invocation or invocation of a data access method (being it simple JDBC or Hibernate). I call them "monitored methods", while in Glassbox’ terms they are either operations (the top-level ones) or resource/component request.

The framework is aware of and takes care of tracking the hierarchy of monitored methods invoked during a single user request processing. This basically means that a monitored method may have a parent monitored method and that Glassbox stores both the aggregated duration of the higher level method and that of the nested one. For example, a JDBC executeQuery may be invoked from an EJB call, which is itself invoked from a Struts Action processing. Thus it can tell which processing layer or component/resource caused a particular top-level operation to be slow.

Glassbox is also aware of layers and monitored methods may be marked as belonging to a particular layer, e.g. the JDBC executeQuery would belong to the layer resource.database.statement.

Agent(s):
Handle communication between the monitors and a client. In a
clustered environment, a separate instance is on each server.

Client: Collects monitoring data from agents and displays
an aggregated view to a user. The most used one is the Glassbox web application but you can also use e.g. jconsole or implement another client. Usually you have only one client even in a clustered environment.

Monitors

Monitors are mostly aspects injected into a monitored web application or the related code. Usually a monitor observes one or few methods in a single class. When one of the method is invoked, it creates a Response object storing the details of the invocation (its id, parameters, …) and, once it finishes, its performance data. If the method has been invoked as a part of processing of another monitored method, its Response will have the Response object of that method as its parent, in other words monitored methods can be (even indirectly) nested and this structure is preserved in the monitoring data structures. A typical example is a servlet calling an EJB doing some JDBC stuff.

Responses are stored in a ThreadLocal stack, which makes it possible to really connect those Responses belonging to the processing of a single request, and once the top-level operation completes it can be considered as finished.

Agent

There is one agent per server (i.e. JVM). It collects the detailed monitoring data from monitors and aggregates them by monitored method into "statistics" (Stats) objects. Stats contain summarized data for a monitored method over all of its execution (avg, min, max time…) together with detailed data for a limited number of those executions that were either slow or failed. There are statistics both for top-level monitored methods and for nested ones and they’re aware of each other. These stats are stored in a global StatisticsRegistry (namely OperationTracker’s registry).

There is also a special background job (ThreadMonitor) that monitors threads to detect CPU-intensive methods. This is done by taking snapshots of the thread stack in regular intervals and – if its execution time exceeds a predefined limit – by drawing a conclusion from these snapshots. (Note: This is an example of a timer task monitor.)

Web UI

The Glassbox web user interface queries all registered agents for their data (see OperationsHelper.updateListOfOperations). The detailed and hierarchical statistics are turned into an OperationSummary for the top-level operation having also a list of "findings", i.e. textual descriptions of problems with the operation. It contains average statistics and stats when slow and when failing.

Main components’ behavior in detail

Monitors

To monitor a web application, Glassbox inserts
monitoring aspects into its code and related server code (e.g. its
javax.servlet.Servlet implementation) during server startup using
AspectJ. In its simplest form an aspect is nothing more than an XML
file mapping an existing aspect class to a method(s) of a particular
class.

The sequence of actions triggered by a user request towards a monitored webapp is:

A monitored method in the target web application is going to be invoked.

An
aspect associated with that method is invoked. It’s connected to the
monitoring framework by extending a base monitor aspect class and
optionally by calling some of the framework’s methods.

The
framework stores data about the monitored method/operation’s name, layer and
component/resource and its start time. It also checks the thread local
stack of responses to find out whether it’s invoked as a
part of processing some higher level monitored method and if it is the case
then it sets it as its parent.

The monitored method is
invoked and when it finishes, the framework stores its end time. If it
finished with an exception then it’s first checked whether the
exception should be indeed regarded as a failure and if yes, the
exception data is stored as well. If the operation was slow or failed, the framework puts it on the list of slowest/recently failed operations and captures its parameters.

When the original user
request is processed completely, the framework aggregates data about
its performance. This happens in StatsSummarizer, a ResponseListener. If the request processing was too slow with respect to predefined limits
(SLA` usually <1s in 90% of time) then it’s stored into the slow operations list and the
layer/component that caused the delay is marked.

Note: Monitors interact with a ResponseFactory to create and finish Responses and these actions trigger ResponseListener events. The most important ResponseListener is the StatsSummarizer but there’re also others, for instance a listener that can log slow responses and a new one that captures a trace for a specific request. Since response listeners are invoked directly by the response factory, they execute in the same thread as the monitored method itself and therefore can safely use ThreadLocal variables to keep info of requests/responses belonging to a single user interaction.

Agent

As said above, performance statistics are aggregated by monitored methods and are collected into a global statistics registry. The registry contains instances of PerfStats or CompositePerfStats, which also implements StatisticsRegistry to hold nested statistics, or a subclass such as OperationPerfStats. A registry is actualy a map of OperationDescriptions to their OperationPerfStats and can return a subset for a particular StatisticsType, for example UI, Database, Remote Call.

Let’s see a partial example produced by OperationTracker.registry.dump(new StringBuffer(), 0):

*Stats also hold all other necessary information, for instance about slow/failing cases.

OperationTracker also uses an instance of OperationAnalyzer, which is responsible for preparing data for all the nice output you can see in the UI. This includes summarizing the stats into OperationSummaries and detecting (based on the collected stats) what is the cause of a slow/failing operation and providing this info in the form of OperationAnalysis.

Web UI

The web UI, aside of handling installation of Glassbox monitoring into a server, maintains connections to all the agents (or to the single local agent in a non-clustered environment) and retrieves all the needed summaries and analysis from them via its OperationHelper.

Currently the web UI only provides statistics about a top-level operation and its list of detected problems (slow SQL, excessive CPU in method XY, …) with troubleshooting details but you cannot use it to view statistics for its nested monitored methods/components. However you can access those detailed statistics e.g. via the JMX interface.

Additional notes

You can view results and manage Glassbox using JMX: jconsole
service:jmx:rmi:///jndi/rmi://localhost:7232/GlassboxTroubleshooter . Check $jboss/lib/glassbox/glassbox.properties and glassbox.war/WEB-INF/lib/agent.jar/beans.xml for settings.

You can disable/enable some monitors at runtime via JMX, for example RemoteCallMonitor or JdbcMonitor. I’m not sure whether there is a way to dis/enable on a more granular level.

Glassbox API – Main classes

Here we will learn about the most important Glassbox classes, what they can do for you, and how they relate to each other. Classes without an extension are regular java classes while those with .aj are AspectJ classes and need to be compiled by the aspectj compiler.

Response API

This API is used by the monitoring aspects to produce the monitoring data that is than further analyzed and presented to the user by Glassbox.

glassbox.response.Response

Collects data about the system’s response while processing a request. These are typically nested, i.e., we track times, parameters, etc. for Servlet requests that result in Struts action requests that result in a database query. A response belongs to a particular layer and may have a parent response (when nested). It has also a duration and a status (processing/suceeded/failed/…). Actually it can hold any context, so a monitor can store whatever relevant data is
needed (e.g., this can be useful for a custom metric that your
application wants to track).

This is a helper class for manipulation requests including their creation while taking care about their proper nesting and setting their start/end times. It uses a thread local stack to keep track of nested requests and System.currentTimeMillis() for timing.

As noticed elsewhere, its used by monitors to create/finish Responses and produces the appropriate events for that and also manages a list of ResponseListeners.

Monitor API

The monitoring aspects extend this API and it also includes many specific monitoring aspects such as EjbCallMonitor.aj and StrutsRequestMonitor.aj.

glassbox.monitor.OperationFactory – create OperationDescription(Impl) from JSP path or from a class name – see e.g. MvcFrameworkMonitor.aj

Ron’s note: One reason for having AbstractMonitorClass is to allow using Java-5
annotation-based aspects with the Glassbox framework, either for
AspectJ extensions written in that style or for Spring annotation-based
aspects.

Tracking API

An addition to the Response API to keep track of requests (monitored methods) etc.

glassbox.track.api.Request – represent a specific instance of a request to something, i.e. an invocation of a monitored method. They can be compared based on elapsed time.

glassbox.track.api.FailureDetectionStrategy – shall an exception thrown by a monitored method be regarded as its failure or not?

glassbox.track.api. Call/Failure/Operation/SQLFailure Description – OperationDescription has a type (e.g. "HttpServlet"), a name (e.g. the servlet’s name), context (e.g. the web app’s context root) and perhaps a parent OperationDescription if nested.

glassbox.track.api.SlowRequestDescriptor – describes a request whose processing was too slow; it has among others the attributes StackTraceElement slowestTraceElement and mean/slow/total counts.

Agent API

Collect data from monitors, summarize it, analyze problems, and provide the outputs to the Web UI.

glassbox.client.persistence.jdbc.BackupDaemon – stores agent connections (but not any monitored data) into a database (by default an embedded hsqldb – see the myDataSource below).

glassbox.monitor.thread.ThreadMonitor15Impl: This monitor periodically grabs thread dumps for all threads that are processing user requests. It runs in a daemon thread collecting the dumps in preset intervals. Creates instances of glassbox.monitor.thread. OperationSample when sampling a monitored thread. When a monitored thread finishes, it results perhaps in a call to ThreadSummarizer.summarize, which updates the assoc. CompositePerfStats.

glassbox.monitor.thread.ThreadMonitorIntegration.aj: starts a ThreadMonitor after StatsSummarizer.startTopLevelStats|startNestedLaterOperation

uses glassbox.track.api.StatisticsRegistry stored in a thread local variable of the type ThreadStats together with first/last operation key (OperationPerfStats)

glassbox.track.OperationTracker (singleton): used by GlassboxServiceImpl to analyze/list/… operations. Holds a global StatisticsRegistry registry and an OperationAnalyzer.

glassbox.analysis.OperationAnalyzer: Analyses the collected statisticts to detect problems and their causes. It uses (Default)TimeDecomposition.It also makes OperationSummaries from the stats further used by the UI.

Note: in agent.jar the Spring config file beans.xml defines a bean operationTracker of the type glassbox.track.OperationTrackerImpl (implements OperationTracker, StatisticsRegistry) – this is used to collect all the stats in the monitored app. It’s used by the bean glassboxService (glassbox.agent.control.GlassboxServiceImp).

glassbox.agent.control.GlassboxServiceImpl (singleton): its listOperations() (delegation to OperationTrackerImpl.listOperations()) is invoked using a local/remote call from the glassbox UI webapp (OperationHelper) to collect operations (stats) from the given server and it also provides problem analysis to the UI in a similar manner.. I suppose that there is only a single instance of this class in a JVM.

UI Web App’s API

The Spring configuration file glassbox.war/WEB-INF/applicationContext.xml defines among others the following beans:

a backupDaemon – regarding its function see glassbox.client.persistence.jdbc.BackupDaemon above. See also glassbox.client.persistence.jdbc.PersistChanges. It’s schedule is def. ibidem by scheduledTask with the default period of 10000 and it uses indirectly myDataSource defined there as well.

Note: To override the data source used to store client configuration
including remote connections to open, you can define the System
property glassbox.config.ds (this is configured in
the applicationContext.xml).

glassbox.client.helper.OperationHelper – updates statistics to display by calling listOperations on the remote agents.

glassbox.client.pojo.OperationData – monit. data collected and used by the UI It only adds source and agent identification to its nested OperationSummary, holding the actual statistic (operation count, is failing, is slow, avg time; nested OperationDescription that might have a parent too).

Customizing Glassbox for your webapp

There’re two ways of customizing Glassbox for monitoring of a particular webapp:

Adding monitors

Glassbox plugins mechanism for advanced customization.

1. Adding monitors

You can apply an existing monitoring aspect to a new method using AspectJ weaving rules described in an aop.xml file thus turning the method into a monitored method or you can even create a new monitoring aspect extending a base Glassbox aspect, putting it – perhaps together with an aop.xml – on the monited webapp’s classpath.

Note: If you only want to monitor another method invoked during processing of something already regarded as an operation you should rather extend the glassbox.monitor.MethodMonitor.

You can then add this file to a META-INF subdirectory of a
directory on your classpath or add a jar containing the file at the
location META-INF/aop.xml. For Tomcat, you might just create the
directory common/classes/META-INF and install your custom aop.xml
file there.

Implementing a new monitor

Warning: This is only my idea what is necessary to be done and may contain mistakes and false ideas.

To implement a new monitoring aspect you should do at least the following:

Extend glassbox.monitor.AbstractMonitor

Define the pointcuts (i.e. to what methods to apply this monitor)

Redefine the abstract pointcut monitorEnd() to apply to the monitored method(s) so that the parent class detects when it finishes.

Either redefine tha abstract pointcut monitorBegin(Object identifier) to allow the parent class to automatically register the beginning of the monitored operation or define your own advice (method) that is run when a custom pointcut is encountered; usually this is a before advice. The identifier should be e.g. an OperationDescription. Inside a custom advice:

(Re)implement some methods such as getLayer().

Create an OperationDescription for the operation, likely using the inherited operationFactory.

Call one of the inherited begin(..) methods (see AbstractMonitorClass.aj), passing the OperationDescription as the 1st argument, i.e. as a key. This will return a glassbox.response.Response object.

Note: If you have it override getLayer() to return Response.RESOURCE_SERVICE and the monitored method is slow then Glassbox will report it as a slow remote call.

2. Glassbox plugins mechanism

It’s possible to extend Glassbox with application-specific extensions using the API glassbox.config.extension.api. Simply adding monitors doesn’t need it: you can just deploy a monitor jar with aspects to the classpath, and an app can simply call on the response API. The PluginRegistry supports deeper extensions, like adding custom operations (see interface glassbox.config.extension.api.OperationPlugin).

Ron explains (2008-10-25): Glassbox lets you customize a variety of facets using plugins.
Operation plugins let you add an operation type that can extend how
Glassbox summarizes and analyzes the operation (to detect service level
violations) and how the UI renders these. Glassbox plugins also let you
define a connection provider so you can write custom code to define
what connections in a cluster/server farm should be opened (allowing
discovery instead of manual configuration). Most recently I’ve also
added a runtime controller that lets you change behavior at runtime
(e.g., requesting that a request on a specific thread be monitored).

Operation plugins really shine if you need to add in different service levels, or
want a custom display to summarize problems (e.g., Ron used this to
detect an out of date cache for one custom project).

Hopefully some documentation for creating plugins will be created soon.

3. Restricting what to monitor

Currently Glassbox monitors any web application deployed on the app. server including itself and also some common code like JDBC drivers (which may be invoked not only from a web application but also by the server’s daemons etc.). If you don’t want to monitor all of that, for instance to decrease the overhead and to make the outputs easier to read, there are few things you can do.

Glassbox yet doesn’t support any filtering but you can:

You can avoid instrumentation of specific
applications on the server by deploying a META-INF/aop.xml file in
their classpath that disables weaving into any classes (although that
would still track calls to common classes like JDBC drivers and app
server servlets).Example (MyApp.war/WEB-INF/classes|MyApp.ear)/META-INF/aop.xml:

<aspectj>
<weaver>
<exclude within="*"/>
</weaver>
</aspectj>

Instead of load-time weaving (LTW) you can perform offline weaving of Glassbox aspects only into the code of the web applications that interest you and perhaps into some common server code that you also need to be monitored, for instance a JDBC driver. An additional benefit is that you’ll get rid of the longer class loading at server startup.

A word about memory overhead

According to the aforementioned presentation of Glassbox design & architecture it can consume roughly 20% more memory than without Glassbox. Ron further explains:

20% is a rough guideline – it varies quite a bit in specific cases. The
biggest area where Glassbox adds overhead is indeed from AspectJ
load-time weaving, most specifically the memory overhead from handling
JSP’s – the load-time weaving system uses a little more than 1 megabyte
of memory for each loader, and each JSP gets its own loader. The
AspectJ project has been working on this area – see Reducing weaver memory usage over time.
We need to merge in the updated work on AspectJ to the version of
AspectJ we’re using with Glassbox, which also reduces memory overhead a
lot. You could try the patch in that bug report with AspectJ 1.6.1’s
aspectjweaver.jar to see if it has better memory performance.

Another approach that might be simpler and yet could help a lot is a
hybrid one: if you just precompile your application’s JSPs you will see
far lower overhead.

Glassbox does work best if it is weaving some server classes, but in
many cases you can get the desired visibility if you do offline weaving
of your application and a few key libraries, like your JDBC driver and
e.g., web services callers. If you want to try that, I’d be glad to
help.

There is one other area where Glassbox can consume significant memory:
it records statistics based on the structure of components and
resources and how calls are nested. For a fairly static application
this is normally constrained and quite small, but some applications
generate names/queries/etc. dynamically and Glassbox can build
increasingly large trees of statistics, which consume memory also. We
definitely want to address this area – I’m leaning towards not
recording details for quick operations, and only recording information
for things that run often and are taking noticeable time. We’d like to
know about cases where this happens so we can test better approaches.

–

Update 2009-01-21

I’ve implemented persistence for the detailed monitoring data of Glassbox. Unfotunately Glassbox generates too many uninteresting entries and for the methods of interest it doesn’t provide enough data, you can read more about this in the glassbox forums linked to above. I had unfortunately no time to try to deal with these problems.

Have you ever been strucked by a completely useless exception message somewhere from the depth of a 3rd party application or library you had to use in your code? Have you ever wanted the bloody nameless programmer to have written a truly informative and helpful error message so that you wouldn’t need to spend hours trying to figure out what was the problem that would have been easily discovered if only more context information available at the moment when the exception occured was included in its error message? Have you wondered how only you could inject the necessary logging into the spot? Read on to get the answer.

Recently I was testing my extension to Saba, a monstrous J2EE learning management system, and got an exception from a Saba class saying that the expected and actual data types of a custom attribute don’t match. The problem was that I had no idea which one of the 10s of custom attributes could be the cause and I even wasn’t sure which object’s attributes I should check. It would be so much easier if the “nameless bloody Saba programmer” (no offense :-)) included the attribute’s name and preferably also its actual & expected data types and the actual and new values. How could I insert there logging of these properties? Needless to say that I had to use Java 1.4 (no agents…) and couldn’t afford more than modifying this single class file (i.e. no additional libraries etc.) because the changes I could do to the development environment where the application ran were limited.

Of course the easiest would have been to decomile the class, add the loging before the exception is thrown, recompile it and replace it on the server. But not only is decompiling illegal here, it also sometimes simply doesn’t work. Fortunatelly there is another solution – JBoss Javassist is a byte code manipulation library that supports not only runtime manipulation but also post-comilation time manipulation, i.e. you can modify and save the class and use it to replace the original file. There are quite a few byte code manipulation libraries but Javassist has the great advantage that you don’t need to know much about bytecode, you can simply pass a String with java statements that should be inserted before/after/… method call or into a new catch statement. There is a nice tutorial that describes it (see part 4.1, Inserting source text at the beginning/end of a method body):

addCatch() inserts a code fragment into a method body so that the code fragment is executed when the method body
throws an exception and the control returns to the caller. In the source text representing the inserted code fragment,
the exception value is referred to with the special variable $e.
For example, this program:
ClassPool pool = ClassPool.getDefault();
CtClass compiledClass = pool.get("mypackage.MyClass");
CtMethod m = compiledClass.getDeclaredMethod("myExceptionThrowingMethod");
CtClass etype = ClassPool.getDefault().get("java.io.IOException");
m.addCatch("{ System.out.println($e); throw $e; }", etype);
translates the method body represented by m into something like this:
try {
the original method body
}
catch (java.io.IOException e) {
System.out.println(e);
throw e;
}
Note that the inserted code fragment must end with a throw or return statement.

You can use $e to access the exception, $0 to access “this”, $1 to access the 1st parameter of the method etc. At the end you just call compiledClass.writeFile(); and use the modified mypackage/MyClass.class to replace the original class in the application.