Monitoring Async Applications

I am a software engineer at New Relic. New Relic provides multiple monitoring products, most notably APM.

Recently I was working on gathering Async performance metrics for Python. This turned out to be a non-trivial task, so I thought I'd write a talk to share with y'all some of the things I learned.

There is some confusion on the internet as to what the terms surrounding concurrency mean. So I'm going to define them for you, and provide an explanation to serve as a framework for understanding the rest of this talk.

Synchronous

Concurrent

Parallel

Asynchronous

The synchronous paradigm is the traditional way of doing things: executing only one task at a time, one after the other.

A concurrent program is one that is able to work on multiple tasks in the same period of time.

Parallel execution means running different tasks in the same instant of time. For this we must have multiple CPU cores.

Asynchronous is interleaved task execution. We can alternately run chunks of different tasks.

A talk about async wouldn't be complete without an analogy.

Let's say you have a litter of puppies. Each puppy needs to be fed and cuddled.

Synchronous

Explain diagram.

If you were to do this in a totally synchronous fashion, you would give a puppy its food, stand there and wait for her to finish, then clean up. Then you'd do that for the next puppy and the next etc.
Then you would pick up and cuddle each puppy in turn as well.

Parallel

Parallel is if you had friends that could feed and cuddle the puppies, one each, all at the same time.

Asynchronous

Asynchronous fashion would allow you to give a puppy her food, and while she is eating you can move on to another puppy and give her her food. Since cuddling is an active task, you can't do anything else while you cuddle a puppy.

Human = CPU

Puppy = Task

Feeding = I/O call

Cuddling = CPU intensive code

Examples will be in Pseudocode Python

Mostly because Python is my favorite language, but also because pseudocode looks like Python. My hope is that even if you don't know python that you'll be able to follow the examples. I'll also be using the Tornado web framework because it has an elegant API that allows for concise examples, where we'll be able to stay at a high level.

I am taking an example-based approach.

If you are familiar with advanced python, you may notice that I have sacrificed best practices/safety for the sake of simplicity. Do not copy and use this code as is.

There are also some places where I'm going to gloss over the details. My goal here is to outline the fundamental concepts, not show python tricks.

What this code demonstrates is the ability to wait on multiple I/O calls for the same task at the same time.

For each handler we have 10 I/O calls to make (in this case 10 HTTP fetches). In the synchronous case we have to wait for each fetch to return before we start the next one, and this is a blocking wait: nothing else is able to run while the I/O is taking place.

In the Async handler we can fire off each fetch, and then wait for them all to complete at the same time. While we are waiting we can also execute any other code that needs to run.
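The difference can be sketched with Python's standard asyncio library (Tornado's async HTTP client follows the same pattern). The fetch function here is a hypothetical stand-in that simulates the I/O wait with asyncio.sleep rather than making real HTTP requests:

```python
import asyncio
import time

async def fetch(url):
    # stand-in for an HTTP fetch; the sleep simulates the I/O wait
    await asyncio.sleep(0.1)
    return f"response from {url}"

async def sync_style(urls):
    # wait for each fetch to finish before starting the next one
    return [await fetch(u) for u in urls]

async def async_style(urls):
    # fire off all fetches, then wait for them together
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"http://example.com/{i}" for i in range(10)]

start = time.monotonic()
asyncio.run(sync_style(urls))
sequential = time.monotonic() - start

start = time.monotonic()
asyncio.run(async_style(urls))
concurrent = time.monotonic() - start

print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

The sequential version pays for each wait in turn (roughly 1 second here), while the gathered version overlaps all ten waits and finishes in roughly the time of the single slowest fetch.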

Winning!

Wait on all I/O at the same time

Other code can use the CPU while I/O is executing

Do work after response is sent back

No thread messiness

This supports multiple long-lived connections to a user

Most common manifestation is event loops
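A toy event loop can be sketched with plain generators. This is illustrative only, not how any real event loop is implemented: each yield marks a point where a task gives up the CPU, as it would while waiting on I/O, and the scheduler runs tasks round-robin.

```python
from collections import deque

log = []

def puppy(name, meals):
    # each yield is a point where the task gives up the CPU,
    # as if it were waiting on I/O
    for meal in range(meals):
        log.append(f"{name} meal {meal}")
        yield

def event_loop(tasks):
    # minimal round-robin scheduler: run each task until its next yield
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)           # run until the task yields
            queue.append(task)   # reschedule it at the back
        except StopIteration:
            pass                 # task finished, drop it

event_loop([puppy("Rex", 2), puppy("Fido", 2)])
print(log)  # the two tasks' entries interleave: Rex, Fido, Rex, Fido
```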

Now that we know what an async app is, and why we should use it, I'm going to talk just a bit about monitoring it.

First of all, what do I mean when I say monitoring?

Collecting data on your app in production

Monitoring is Collecting and processing data about your application as it is running

Not Profiling

high overhead

critically, for async, doesn't give context

What data are we going to collect?

Execution times for: handlers, queries, layers of stack

Throughput

Error rate

Some of the most common types of data we would want to monitor are timing data, throughput, and error rate. Since they're not specific to async, we're not going to look at throughput or error rate in this talk.

Monitoring is a VERY large topic, not covering most of it here

I'm going to focus on what is specific to asynchronous apps

Your users should not be your monitoring system

Why should you have monitoring in the first place? I'm hoping if you're at this talk, I don't need to sell you on the importance of monitoring in general.

If your app is a source of income (or pride), and it's broken, you're losing money/sleep. So we'd prefer to find problems sooner, ideally before they become a real problem for our users.

By broken I mean not just totally unavailable or throwing errors.

Performance matters. Slow websites erode the sanity of your users.

Visualize your data in a way that is consumable

Tailing a log file is not monitoring

Visualize your data in a way that is consumable

Why is my website slow? hint: it's the database

Our example app

import time

import tornado.ioloop
from tornado.web import Application, RequestHandler

async def feed_puppy():
    ...  # stand-in for an I/O call, e.g. an async HTTP fetch

def cuddle():
    # pretend to do CPU-bound processing
    time.sleep(0.01)

class ASyncRequestHandler(RequestHandler):
    async def get(self):
        future = feed_puppy()
        await future
        self.finish('Pup is full!\n')
        cuddle()  # work continues after the response is sent

app = Application([('/', ASyncRequestHandler)])

if __name__ == '__main__':
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

What to Measure

Response time

Duration

CPU time

External time

We may not always want to, or be able to measure all of these.

Response Time

Duration

CPU time

External Time
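One way to capture these measurements in Python, as a sketch: wall-clock time gives us duration, process_time gives us CPU time, and their difference approximates external (I/O) time. The Stopwatch name is illustrative, not a real library:

```python
import time

class Stopwatch:
    # hypothetical timer capturing the metrics above
    def __enter__(self):
        self.start_wall = time.monotonic()      # for duration / response time
        self.start_cpu = time.process_time()    # for CPU time
        return self

    def __exit__(self, *exc):
        self.duration = time.monotonic() - self.start_wall
        self.cpu_time = time.process_time() - self.start_cpu
        # time spent waiting on I/O or yielding to other tasks is the remainder
        self.external_time = max(self.duration - self.cpu_time, 0.0)

with Stopwatch() as sw:
    time.sleep(0.05)  # stands in for an external (I/O) call

print(sw.duration, sw.cpu_time, sw.external_time)
```

Response time would additionally need a reference point outside the handler, such as the timestamp when the request hit the server, which is why we may not always be able to measure all four.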

Aggregate and collect data in monitor service

I've created a simple plotting service which I can post data to, and it will use bokeh to chart the data.

This part is not async specific

I do get to use an async http client library to send data to my monitoring service, which keeps overhead low.

Percentiles are better than the mean

times.sort()
index95 = int(len(times) * 0.95)
data_point = times[index95]

If your web service has a mean latency of 100 ms, your top 1% of requests may take 5 seconds. That is a bad user experience on its own if it is a stand-alone service. However, with today's trend towards microservice architectures, if several such services are needed to render a page, the 99th percentile of one backend may become the median of what the user experiences.
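A quick sketch of why the mean hides the tail, using a hypothetical latency sample where 99 requests are fast and one is very slow:

```python
import statistics

# hypothetical latency sample, in seconds, already sorted:
# 99 fast requests and one 5-second outlier
times = [0.1] * 99 + [5.0]

mean = statistics.mean(times)
p99 = times[int(len(times) * 0.99)]

print(f"mean: {mean:.3f}s, p99: {p99:.1f}s")  # mean: 0.149s, p99: 5.0s
```

The mean suggests everything is fine at ~150 ms, while the 99th percentile reveals the 5-second experience that 1 in 100 users is actually getting.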

To the demo!

CPU intensive tasks are bad news for async architecture

Now that we've seen an example of what gathering this data looks like, I want to talk more generally about how to collect async data: how we can apply gathering async metrics broadly across our app.

Strategies for a general solution

Black box

Bake it in

Monkey patch code base

Of course, for reduced effort, but also reduced insight into our app, we can monitor just response time using an outside service that watches requests and responses.

Like I showed in the example, we can embed stopwatches into our code directly.

If we have a lot of async code and this becomes tiresome, or we want to decouple our monitoring from our service, we can put the monitoring code in its own modules or functions and patch that into our app.

However, if we have several async calls, which one will be done first? How do we know when to stop collecting and process our data? In this case our "general solution" becomes necessary to get any data at all.

Looking back at this example where we had multiple external calls, we saw before how to keep track of CPU time, which we know is important for throughput.

I'm going to depart from our puppy analogy a bit and introduce some other IO calls.

Thinking about this task by itself, what if one of the 3 chores we are doing in this task is really slow, like a puppy that is a slow eater? We want to know which particular call(s) is the limiting factor for us to finish. This is called the critical path.
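One way to sketch finding the critical path, with hypothetical chore names: wrap each awaited call so we record its duration, and in a flat gather like this the slowest call is the one that bounds the total time.

```python
import asyncio
import time

async def timed(name, coro):
    # wrap each awaitable so we record how long it took
    start = time.monotonic()
    await coro
    return name, time.monotonic() - start

async def handler():
    # hypothetical I/O chores with different durations
    chores = {
        "fetch_food": asyncio.sleep(0.05),
        "fetch_water": asyncio.sleep(0.02),
        "fetch_toys": asyncio.sleep(0.10),
    }
    results = await asyncio.gather(*(timed(n, c) for n, c in chores.items()))
    # with a flat gather, the slowest call is the critical path
    slowest = max(results, key=lambda r: r[1])
    return slowest[0]

print(asyncio.run(handler()))  # fetch_toys
```

Real handlers rarely stay this flat: once calls depend on each other's results, the critical path is the longest chain of dependent calls, not simply the single slowest one.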