Putting the RUM in API Monitoring

Last week I was very fortunate to be part of a webinar on API monitoring together with Alan Ho from Apigee. The plan was to have a face-off between our respective monitoring technologies – Synthetic Transaction Monitoring from SmartBear and Real User Monitoring from Apigee. But what was supposed to be a heated discussion with swearing in our respective native languages (Chinese and Swedish) turned into a rather courteous acknowledgement of each other's merits, thanks to our somewhat civil personalities (or was it the California heat?). In the end, we both agreed that not only are STM and RUM complementary offerings, but the combination of the two even opens the door to some additional analytical bells and whistles.

Let’s back up a bit here to give you the whole picture. First of all, it’s pretty obvious that bad app performance will quickly ruin an app’s reputation – bad reviews, Twitter storms, malignant hashtags – nobody wants that kind of publicity. With the increasing use and deployment of back-end APIs to fuel all the apps out there, the reason your app isn’t giving your users what they expect can often be traced to those APIs, which can suffer from slow response times, unexpected changes, error messages – or even downtime. As an app provider and API consumer, you need to stay on top of the APIs you are using to build your apps. You want to be the first to know when they aren’t performing as they should so you can take appropriate action – before your users take to the airwaves.

Synthetic vs. RUM API Monitoring

Enter the world of API monitoring – which is all about keeping a watchful eye on the performance of those APIs so you can take relevant action should they not perform as required. These could be either your own APIs or those developed by others – it doesn’t really matter because, from a user perspective, they are all part of “your package” and must be monitored and treated as such.

And now for the debate between RUM and STM:

Synthetic Transaction Monitoring (STM) uses simulated API calls to make sure your APIs are working as they should. These API monitors should consist of fairly complex multi-step transactions that “exercise” your APIs the same way your users would when they use your app: login, search, buy, review, logout – the works. They are then configured to run periodically from the desired geographical locations over available networks to continuously measure response times and functional status. If these monitors fail, notifications should be sent to the appropriate operational teams so they can get on top of things and fix any errors fast.
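To make the idea concrete, here is a minimal sketch of such a monitor runner in Python. The step names, the stub step bodies, and the latency threshold are all illustrative assumptions – a real monitor would issue actual HTTP calls against your endpoints and feed a notification system instead of printing.

```python
import time

def run_transaction(steps, threshold_ms=500):
    """Run each named step in order, timing it; return per-step results.

    Each result records whether the step succeeded and whether it was
    slower than the (assumed) alert threshold.
    """
    results = []
    for name, step in steps:
        start = time.perf_counter()
        try:
            step()
            ok = True
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append({"step": name, "ok": ok,
                        "elapsed_ms": elapsed_ms,
                        "slow": elapsed_ms > threshold_ms})
    return results

# Stub steps standing in for real API calls (login, search, logout, ...).
steps = [
    ("login", lambda: None),
    ("search", lambda: None),
    ("logout", lambda: None),
]

for r in run_transaction(steps):
    print(r["step"], "OK" if r["ok"] and not r["slow"] else "ALERT")
```

Scheduling this from multiple geographic locations and wiring the ALERT branch to your on-call tooling gives you the periodic, multi-step exercise described above.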

Real User Monitoring (RUM), on the other hand, measures the response times and result status of your APIs on the actual devices running your app. The results are collected and aggregated on the device and periodically reported back to the analytics backend to show you trends in response times for specific APIs, devices, geographic locations, etc. Since the transactions you measure and the errors you see are exactly the same as those your users are getting, RUM gives you the “real deal” with regard to your users’ experience.
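The collect-aggregate-report cycle can be sketched as follows. This is not Apigee's SDK – the class, the endpoint names, and the summary fields are assumptions chosen to illustrate how an on-device collector might roll up raw latencies into a compact payload before flushing it to the analytics backend.

```python
from collections import defaultdict

class RumCollector:
    """Hypothetical on-device RUM aggregator: record each API call's
    latency and outcome, then flush a per-endpoint summary periodically."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, endpoint, latency_ms, ok=True):
        # Called from the app's networking layer after each API call.
        self.samples[endpoint].append((latency_ms, ok))

    def flush(self):
        """Return the aggregated payload and reset local state.

        In a real client this payload would be sent to the analytics
        backend; here we just return it.
        """
        payload = {}
        for endpoint, obs in self.samples.items():
            latencies = [ms for ms, _ in obs]
            payload[endpoint] = {
                "count": len(obs),
                "avg_ms": sum(latencies) / len(latencies),
                "max_ms": max(latencies),
                "errors": sum(1 for _, ok in obs if not ok),
            }
        self.samples.clear()
        return payload
```

Aggregating on the device keeps the reporting traffic small while still letting the backend chart trends per API, device, and location.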

Combining the Two Approaches

As you can probably figure out, both of these approaches have their pros and cons. But the interesting aspect is that the combination of the two turns out to be an extremely potent solution (it reminds me of Remy combining chocolate and strawberries in Ratatouille – wow!). Here are some of the things we came up with:

Test-Driven Monitoring

You could use transactions identified in RUM to create synthetic monitors allowing you to embrace a “test-driven” approach to fix errors detected in the wild:

Detect errors or bad response times on the actual device with RUM

Convert those failing/erroneous RUM transactions to an STM

Configure that STM so that it fails in line with the RUM measurement

Fix/handle the underlying errors accordingly

Deploy the fixes so the monitor doesn’t fail any more.
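The steps above can be sketched in a few lines. The function name, the observation fields, and the thresholds are hypothetical – the point is simply that a failing RUM observation carries enough information (endpoint, observed latency, status) to generate a synthetic check that fails under the same conditions, and then passes once the fix is deployed.

```python
def stm_from_rum(rum_observation):
    """Build a synthetic check that fails in line with a RUM observation.

    rum_observation is an assumed dict like:
      {"endpoint": "/search", "observed_ms": 1800, "status": 504}
    """
    threshold = rum_observation["observed_ms"]  # fail at the latency seen in the wild

    def check(measured_ms, status):
        # The check passes only if the call succeeds AND beats the
        # latency that real users experienced when things went wrong.
        return status == 200 and measured_ms < threshold

    return check

# A failing transaction seen by RUM becomes a synthetic monitor:
check = stm_from_rum({"endpoint": "/search", "observed_ms": 1800, "status": 504})
print(check(1800, 504))  # reproduces the failure seen in the wild
print(check(300, 200))   # what a passing run looks like after the fix
```

This mirrors test-driven development: the monitor is red until the underlying error is fixed and deployed, at which point it turns green and stays on as a regression guard.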

Correlation Analysis

Correlating errors in “unrelated” APIs can also pay off – for example, trends identified in STM might give you a “smell” that something is about to go wrong; increasing response times in one API might indirectly indicate, or be a precursor of, failures in another API that shares the same backend resources. Detecting this in your STM allows you to proactively handle these conditions before your users are affected.
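One simple way to surface such a relationship is to correlate the latency series of two monitors. Below is a plain-Python Pearson correlation over hypothetical per-minute samples from two endpoints; the sample data and the 0.8 cutoff are illustrative, not a recommendation from either product.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-minute latency samples from two STM monitors whose
# endpoints happen to share a backend database.
search_ms = [120, 130, 180, 260, 400, 650]
checkout_ms = [90, 95, 140, 210, 330, 520]

r = pearson(search_ms, checkout_ms)
if r > 0.8:
    print(f"correlated (r={r:.2f}): check shared backend resources")
```

A strongly correlated climb in two nominally unrelated APIs is exactly the kind of early warning that lets you act before either one actually starts failing.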

Because we’re geeks about this kind of stuff, our conversation continued after the webcast ended. We started discussing the addition of another type of real user monitoring – namely one that analyzes actual user feedback on Twitter, app stores, Facebook, etc. (for an example of this kind of monitoring, see Applause by uTest). Adding the results gathered by such a solution to the RUM and STM offerings above seemed even more exciting: an end-to-end monitoring solution that lets you detect and correlate error conditions in your apps and APIs across the board. I’ll leave it to your imagination to come up with all the cool monitoring correlation analysis this enables – and stay tuned to hear more as we continue our thinking along these lines.

Comments

One question regarding both approaches: when it comes to RUM testing, are the test scenarios run against a specific customer data set, or a local “instance” of the server backed by a fake database? I’m thinking about multi-tenancy for cloud providers that host many instances of “prod” data for real cloud customers.

Regarding STM, you mention real data metrics being captured at the device level. Is that on the consumer’s device or on the server side?

One issue with proactive monitoring is that it can potentially modify data on the server end, which a cloud provider may not be allowed to do if the data belongs to a specific cloud tenant. So you end up testing the software on internal instances, which is no guarantee that it runs as well with customer-specific data. The best approach is usually some kind of traffic routing where any call coming to prod customer instances is duplicated in your test environment, allowing you to capture real-life scenarios on a real data set while letting test monitoring change that data without consequences for the end users/customers.