Scaling a Zabbix Monitoring System to Accommodate Business Growth

The following is adapted from a presentation delivered by RingCentral operations lead Leo Yulenets at Zabbix Conference 2013.

RingCentral provides a variety of cloud services for customers in the US, Canada and Europe. In order to handle tens or hundreds of thousands of calls and faxes daily, RingCentral requires a robust monitoring solution.

Currently, we use a system based on network management platform Zabbix. The challenge is to scale our Zabbix instance as the RingCentral cloud infrastructure continues to grow rapidly.

When the number of hosts is 1,000 and the number of items is not more than 100,000, the system may work fine with several proxies and a single Zabbix server (i.e., no data delays and no reporting gaps).

With 5,000 hosts and about 1 million items, performance tricks can still be used to make the Zabbix monitoring system deliver data without delays. But what if the system becomes 10 times bigger? How can data delays be avoided?

We could increase the number of Zabbix proxies, and even the number of Zabbix servers by switching to a multi-node architecture, but delays would remain an issue.

Installing high-performance storage is prohibitively expensive. And reducing the number of items and triggers (or extending our monitoring intervals) to decrease the processing load is not a good solution. We would probably even need to turn off one part of the monitoring system, in order to make another part deliver the data on time.

We could split the system into 10 independent Zabbix systems. But we wouldn’t be happy with 10 different dashboards. In addition, history data would not be consolidated. Finally, what if we needed to install an 11th Zabbix system? How would we quickly port our monitoring configuration?

The architectural solution currently under development at RingCentral uses several Zabbix monitoring systems, consolidated on different levels, to meet our monitoring requirements: data delivery with no delays, calculation of items and triggers within a short period of time, retention of a long period of history data, and visibility on all levels. The following distributed architecture is proposed.

First of all, we should keep only one day of real-time data! This does not require separate network storage; real-time data can be stored on local drives or even in memory.
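As a sketch of how such a short retention window could be enforced, each item's history retention can be set to one day and the Zabbix housekeeper run frequently. The fragment below is illustrative; `HousekeepingFrequency` is a real zabbix_server.conf parameter (in hours), while the exact retention values would depend on the deployment.

```
# zabbix_server.conf — run the housekeeper every hour so expired
# history values are purged promptly from local storage
HousekeepingFrequency=1
```

In the item configuration, "Keep history (in days)" would likewise be set to 1 for every monitored item, e.g. via templates.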

The most complex part of the system is the historical data storage. We propose using a NoSQL data warehouse based on MongoDB. Configuration must certainly be kept separate from history; a MySQL-based consolidated database should be used for that purpose.
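One possible shape for such a warehouse is one document per collected value, keyed by item ID so that all values of an item land on the same shard. This is a hypothetical schema and a simplified hash-routing function for illustration, not the format RingCentral actually used (in a real deployment, MongoDB's `mongos` router handles shard placement):

```python
import time

def history_to_document(itemid, clock, value, ns=0):
    """Convert one Zabbix history row into a document for a MongoDB
    history collection (hypothetical schema)."""
    return {
        "itemid": itemid,  # shard key candidate: keeps one item's
        "clock": clock,    # values together on a single shard
        "ns": ns,
        "value": value,
    }

def shard_for(doc, num_shards):
    """Pick a shard by hashing the item ID (simplified routing)."""
    return doc["itemid"] % num_shards

doc = history_to_document(10054, int(time.time()), 0.42)
print(shard_for(doc, 3))
```

Keying on `itemid` means a frontend graph for one item reads from a single shard, while inserts from many items spread across all shards.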

The next step is to consolidate alerts from all the systems on a single dashboard. Here we can use the Zabbix API and some programming. All we need is to combine the triggers from every instance on a single page.
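A minimal sketch of that consolidation: query each frontend's JSON-RPC API for triggers in a problem state (`trigger.get` is a real Zabbix API method; the URLs and auth tokens below are placeholders), then merge the results by severity for a single page.

```python
import json
from urllib import request

def fetch_triggers(url, auth_token):
    """Ask one Zabbix frontend for triggers currently in PROBLEM state.
    url is the frontend's api_jsonrpc.php endpoint (placeholder here)."""
    payload = {
        "jsonrpc": "2.0",
        "method": "trigger.get",
        "params": {
            "filter": {"value": 1},  # 1 = trigger in PROBLEM state
            "output": ["triggerid", "description", "priority"],
        },
        "auth": auth_token,
        "id": 1,
    }
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["result"]

def consolidate(trigger_lists):
    """Merge trigger lists from several Zabbix instances into one view,
    sorted by severity (highest priority first)."""
    merged = [t for lst in trigger_lists for t in lst]
    return sorted(merged, key=lambda t: int(t["priority"]), reverse=True)
```

A consolidated dashboard would call `fetch_triggers` once per Zabbix instance and render the output of `consolidate` as a single trigger table.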

The next problem is passing the data to the history database and porting the configuration. A simple binary log parser may be used for this purpose, feeding two modules that take the data from a queue and write it into the consolidated databases: a history inserter writes to the MongoDB history database, while a configuration updater writes to the MySQL configuration database. Both run in real time, and delays are not an issue because the real-time information is always up to date.
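The routing step between the two modules can be sketched as a single dispatcher draining parsed binary-log events from the queue. The event format here (a dict with a `table` field naming the source table) is an assumption for illustration, and the two list sinks stand in for the MongoDB insert and MySQL update paths:

```python
from queue import Queue

def dispatch(event_queue, history_sink, config_sink):
    """Drain parsed binary-log events and route each to the right
    consolidated store: history tables go to the history inserter,
    everything else (items, triggers, hosts, ...) to the
    configuration updater."""
    while not event_queue.empty():
        event = event_queue.get()
        if event["table"].startswith("history"):
            history_sink.append(event)  # stands in for a MongoDB insert
        else:
            config_sink.append(event)   # stands in for a MySQL upsert
```

In production the two writers would run continuously as separate consumers, so a slow history insert never blocks a configuration update.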

The last custom change is to the Zabbix frontend, to make it work with MongoDB. MongoDB scales horizontally using a technology called sharding. The picture below shows 3 shards as an example, but we may use as many as required. For better database performance, we can also use parallel (mirror) sharding to separate reads and writes into different databases.
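The mirror idea can be sketched as follows: writes always go to the primary copy of a shard, while frontend reads are served from its mirror, so heavy graph queries do not compete with the insert stream. This is an illustrative in-memory model only; in a real MongoDB deployment the same effect comes from replica sets with a secondary read preference.

```python
class MirroredShardSet:
    """Toy model of mirror sharding: writes land on the primary copy
    of each shard, reads are served from its mirror."""

    def __init__(self, num_shards):
        self.primary = [dict() for _ in range(num_shards)]
        self.mirror = [dict() for _ in range(num_shards)]

    def write(self, itemid, clock, value):
        shard = itemid % len(self.primary)
        self.primary[shard][(itemid, clock)] = value
        # replication to the mirror would normally be asynchronous
        self.mirror[shard][(itemid, clock)] = value

    def read(self, itemid, clock):
        shard = itemid % len(self.mirror)
        return self.mirror[shard].get((itemid, clock))
```

The trade-off is the usual one with asynchronous replication: reads from the mirror may lag the primary by a short interval, which is acceptable for historical graphs.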

The advantages and disadvantages of this new approach are as follows:

Advantages

System scalability is improved with MongoDB

No data delays and no gaps, even with high load

A very long period of history data can be kept

Reads from and writes to history may be separated

Disadvantages

Template synchronization is tricky

A custom consolidated dashboard for triggers must be built

The Zabbix frontend needs changes to view historical data

With this new architectural solution, we can extend the length of time we store historical data from one to 13 months. That is valuable for troubleshooting as well as analysis and reporting.