Monitoring your Alfresco solution

Hi folks, this post follows my previous post about capacity planning and provides you with the tools (and a VMware image ready to run) for you to implement it.

I would like to start with a huge thank-you message to Miguel Rodriguez (Alfresco Support Engineer). He is the creator of this monitoring solution and also the person responsible for setting up the VMware image with all the tools, scripts, etc. My hero list just got bigger: Miguel got a place just after Spider-Man and the Silver Surfer.

Monitoring Alfresco with OpenSource tools

Monitoring your Alfresco architecture is a known best practice. It allows you to track and store all relevant system metrics and events, which can help you:

Troubleshoot possible problems

Verify system health

Check user behavior

Build a robust historical data warehouse for later analysis and capacity planning

This post explains a typical monitoring scenario for an Alfresco deployment, using only open source tools.

I’m proposing a fully open source stack of monitoring tools that together build the global monitoring solution. The solution makes use of several open source products, introduced throughout this post.

The solution monitors all layers of the application, producing valuable data on all critical aspects of the infrastructure. This allows pro-active system administration, as opposed to a reactive way of facing problems: you can predict problems before they happen and take the necessary measures to keep the system healthy on all layers.

I see this approach as both a monitoring and a capacity planning system, providing near real-time information updates, customizable reporting and a custom search mechanism over the collected data.

The diagram below shows how the different components of the solution integrate. Note that we centralize data from all nodes and the various layers of the application in a single location.

The sample architecture being monitored consists of a cluster of two Alfresco/Share nodes for serving user requests and two Alfresco/Solr nodes for indexing and searching content.

An ElasticSearch server collects all the logs from the various components of the application and hosts the graphical user interfaces (Kibana and Grafana) used to view the monitoring data.

About JavaMelody

JavaMelody is used to monitor Java or Java EE application servers in QA and production environments. It measures and calculates statistics on the real operation of an application, based on how users actually use it. It is very easy to integrate into most applications and is lightweight, with almost no impact on target systems.

This tool is mainly based on request statistics and evolution charts; for that reason it is an important add-on to our benchmarking project, as it allows us to see in real time the evolution charts of the most important aspects of our application.

It includes summary charts showing the evolution over time of the following indicators:

Number of executions, mean execution times and percentage of errors of HTTP requests, SQL requests, JSP pages or methods of business façades (if EJB3, Spring or Guice)

Java memory

Java CPU

Number of user sessions

Number of JDBC connections

These charts can be viewed for the current day, week, month, year or a custom period.

It’s really easy to attach the JavaMelody monitor to all Alfresco applications (alfresco.war and share.war) and to every other web application deployed on your application server.

Step 1

Configure JavaMelody monitoring on the Alfresco Tomcat by copying itextpdf-5.5.2.jar, javamelody.jar and jrobin-1.5.9.1.jar to the Tomcat shared lib folder under <tomcat_install_dir>\shared\lib, or to your application server’s global classloader location (if not Tomcat).

Step 2

Edit the global Tomcat web.xml file (D:\alfresco\tomcat\conf\web.xml) to enable JavaMelody monitoring on every application. Add the following filter:
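The standard JavaMelody filter declaration looks like this (the class names are JavaMelody’s own; place the elements inside the <web-app> element):

```xml
<!-- JavaMelody monitoring filter, applied to all requests -->
<filter>
    <filter-name>javamelody</filter-name>
    <filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>javamelody</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
<!-- Tracks HTTP session counts for the session charts -->
<listener>
    <listener-class>net.bull.javamelody.SessionListener</listener-class>
</listener>
```

Restart Tomcat afterwards; each deployed application then exposes its monitoring page under its own /monitoring path.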

Stage 3 – Trending and Analysis (Kibana, Grafana)

To analyze the data and the trends we install two different GUIs on the monitoring server (Kibana and Grafana).

Kibana allows us to check the indexed logs with metadata and to troubleshoot specific log traces. It provides a very robust search mechanism on top of the ElasticSearch indexes, delivering a global overview of all layers of the platform and actionable insights in real time from almost any type of structured or unstructured data source.

In the flow diagram above we can see how the information and statistics get to Grafana.

Grafana is a beautiful dashboard for displaying various Graphite metrics through a web browser. It has enormous potential and is easy to set up and customize for different business needs.

Let’s have a closer look at the remaining components in the flow diagram.

StatsD is a network daemon that listens for statistics, such as counters and timers, sent over UDP, and forwards them to Carbon.

Carbon accepts metrics over various protocols and caches them in RAM as they are received, flushing them to disk at an interval using the underlying Whisper library.

Whisper provides fast, reliable storage of numeric data over time.

Grafana is an easy-to-use and feature-rich Graphite dashboard.
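To make the flow concrete, here is a minimal sketch of a client emitting a metric to StatsD over UDP (the host and metric name are illustrative assumptions; 8125 is StatsD’s conventional UDP port):

```python
import socket

def statsd_packet(metric, value, mtype):
    # StatsD line protocol: <name>:<value>|<type>
    # where type is c (counter), ms (timer) or g (gauge)
    return f"{metric}:{value}|{mtype}".encode("ascii")

def send_metric(metric, value, mtype, host="127.0.0.1", port=8125):
    # Fire-and-forget UDP datagram; StatsD aggregates the values
    # and periodically flushes them to Carbon.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(metric, value, mtype), (host, port))

# Example: report the current number of Share sessions as a gauge
send_metric("alfresco.share.sessions", 42, "g")
```

Because the transport is UDP, the sender never blocks on the monitoring server, which is what keeps the instrumentation cheap for the monitored nodes.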

Stage 4 – Monitoring

We use scheduled commands to index data into ElasticSearch, checking the following monitoring information from the Alfresco and Solr servers:

JVM Memory Usage

Server Memory

Alfresco CPU utilization

Overall server CPU utilization

Solr Indexing Information

Number of documents on Alfresco “live” store

Number of documents on Alfresco “archive” store

Number of concurrent users on Alfresco repository

Alfresco Database pool occupation

Number of active sessions on Alfresco Share

Number of active sessions on Alfresco Workdesk

Number of busy Tomcat threads

Number of current Tomcat threads

Maximum number of Tomcat threads

These can be extended at any time to monitor any target relevant to your use case.
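As a sketch of how one of these scheduled checks could feed ElasticSearch, here is the kind of JSON document a collector script might build and then POST to an index (the field names and values are illustrative, not a fixed Alfresco schema):

```python
import json
from datetime import datetime, timezone

def metric_doc(host, metric, value):
    """Shape of a document a scheduled collector might index
    into ElasticSearch for one sampled metric."""
    return {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "host": host,
        "metric": metric,
        "value": value,
    }

# Example: one sample of the busy-Tomcat-threads metric
doc = metric_doc("alfresco-node1", "tomcat.threads.busy", 17)
print(json.dumps(doc))
```

Keeping one small document per sample is what later lets Kibana search the raw events and Grafana plot the trends over time.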

Stage 5 – Troubleshooting

While troubleshooting we use Kibana/Grafana and JavaMelody.

Kibana allows us to check the indexed logs with metadata and verify exactly which classes are related to the problem, as well as the number of occurrences and the root of the exceptions.

Grafana shows us what, how and when server resources are being affected by the problem.

JavaMelody provides detailed information on crucial sections of the application in both QA and production environments.

Using these three tools, troubleshooting a possible problem becomes a friendly task and the investigations speed up considerably; normally it would take ages to gather all the information necessary to get to the root cause of an issue.

Stage 6 – Notification and Reporting

We use Icinga to notify the delegated Alfresco administrator (by email) when there is a problem with the Alfresco system. Icinga is an enterprise-grade open source monitoring system that keeps watch over networks and resources, notifies the user of errors and recoveries, and generates performance data for reporting.

Icinga Web is highly dynamic and laid out as a dashboard with tabs, which allow the user to flip between the different views they need at any one time.

Stage 7 – Sizing Adjustments

Sizing remains a human action on top of the capacity and monitoring solution. By performing a regular analysis of the monitoring/capacity planning data, we know exactly when and how we need to scale our architecture.

The more data gets into ElasticSearch over the application life cycle, the more accurate the capacity predictions become, because they represent the real application usage during the defined period.

This plays a very important role when modeling and sizing the architecture for future business requirements.

7.1 – Peak Period Methodology

The peak period methodology is the most efficient way to implement a capacity planning strategy, as it allows you to analyze vital performance information when the system is under the most load/stress. In essence, the peak period methodology collects and analyzes data during a configurable peak period. This allows you to estimate the number of CPUs, the memory and the cluster nodes required on the different layers of the application to support a given expected load.

The peak period may be an hour, a day, 15 minutes or any other period used to analyze the collected utilization statistics. Assumptions may be estimated based on business requirements or on specific benchmarks of a similar implementation.
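As a worked illustration of the peak period arithmetic (all the numbers below are assumptions for the example, not benchmarks):

```python
import math

# Illustrative inputs from a hypothetical peak-period analysis
peak_requests_per_sec = 120        # observed throughput during the peak period
avg_cpu_seconds_per_request = 0.05 # measured CPU cost of one request
target_utilization = 0.5           # keep each core at most 50% busy for headroom

# CPU-seconds consumed per wall-clock second at peak
cpu_seconds_needed = peak_requests_per_sec * avg_cpu_seconds_per_request

# Cores required so that peak load stays under the utilization target
cores_needed = math.ceil(cpu_seconds_needed / target_utilization)
print(cores_needed)
```

The same pattern (peak demand divided by a per-unit capacity, with headroom) extends to memory and to the number of cluster nodes per layer.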

Your monitoring targets in an Alfresco installation

I’ve identified the following targets as candidates to participate in the monitoring system and have their data indexed and stored in ElasticSearch.

The Alfresco Audit Trail

The monitoring solution also uses and indexes the Alfresco audit trail log when auditing is enabled. Alfresco auditing should be used with caution, as auditing too many events may have a negative impact on performance.

Alfresco has the option of enabling and configuring an audit trail log, which stores specific (configurable) user actions in a dedicated log file (the audit trail).
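For reference, auditing is typically switched on through alfresco-global.properties; property names can vary between Alfresco versions, so treat this fragment as a starting point:

```properties
# Enable the audit framework and the alfresco-access audit application
audit.enabled=true
audit.alfresco-access.enabled=true
```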

Building on the auditing architecture, the data producer org.alfresco.repo.audit.access.AccessAuditor gathers lower-level events into user-recognizable events. For example, the download or preview of content is recorded as a single read; similarly, the upload of a new version of a document is recorded as a single create version. By contrast, the AuditMethodInterceptor data producer would typically record multiple events.

A default audit configuration file, located at <alfresco.war>/WEB-INF/classes/alfresco/audit/alfresco-audit-access.xml, is provided that persists audit data for general use. It may be enhanced to extract additional data of interest to specific installations. For ease of use, login success, login failure and logout events are also persisted by the default configuration.

Default audit filter settings are also provided for the AccessAuditor data producer, so that internal events are not reported. These settings may be customized (by setting global properties) to include or exclude auditing of specific areas of the repository, users or some other value included in the audit data created by AccessAuditor.

No additional functionality is provided for retrieving persisted audit data, as all data is stored in the standard way and so is accessible via the AuditService search, the audit web scripts, database queries and the Alfresco Explorer show_audit.ftl preview.