Description

Because of the heavyweight DOM manipulation happening in some endpoints, other endpoints are being starved: the event loop is blocked for too long, which produces GC problems and memory pressure and worsens the response times of endpoints that could otherwise have responded earlier.

We should ensure that the event loop is never blocked for long stretches, so that the server can respond promptly to requests whose work is already done.

What

Figure out the best way to improve the performance of the MCS server and services

How

Suggested:

Investigate different alternatives, benchmark their performance characteristics, and document the impact on the code/development, in order to decide which one we should apply to all the code/services. Suggested approaches to look into:

Splitting synchronous CPU-bound algorithms into asynchronous chunks that never run for longer than X ms, deferring the remaining work to the next tick once that budget is spent.

Having a pool of threads for CPU-bound work (DOM manipulations) and a pool for network handling, with some sort of queue to send work to the CPU-bound threads.


That's not easy to do in the Node environment - Node has a libuv thread pool (UV_THREAD_POOL), but you can't access it without native code. There probably exist libraries to offload work to threads, but that would require a native dependency and might not really help with performance: for thread safety, all the data has to be copied on the way out of the main thread and again on the way back in. This would increase memory consumption and GC footprint quite significantly, since we would be copying Parsoid HTML, which can be quite big.

I think the easiest thing to start with is to break long blocking tasks into smaller portions chained together as promises, or to wrap each step in process.nextTick.
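A minimal sketch of that idea, assuming a per-pass time budget; `processChunked`, `workFn`, and `budgetMs` are illustrative names, not from the MCS codebase. It uses setImmediate rather than process.nextTick so that pending I/O gets a chance to run between passes:

```javascript
// Process a large array of items without blocking the event loop for
// much longer than ~budgetMs at a time. Each pass runs synchronously
// until the budget is spent, then yields back to the event loop.
function processChunked(items, workFn, budgetMs = 10) {
  return new Promise((resolve) => {
    let i = 0;
    function pass() {
      const start = Date.now();
      while (i < items.length && Date.now() - start < budgetMs) {
        workFn(items[i++]);
      }
      if (i < items.length) {
        // setImmediate lets pending I/O run; process.nextTick would not.
        setImmediate(pass);
      } else {
        resolve();
      }
    }
    pass();
  });
}
```

The trade-off is that total wall-clock time for the heavy task goes up slightly, in exchange for other requests being serviced between passes.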

I spent some time going over the codebase and increasing its asynchronicity by promise-wrapping expensive operations (such as domino.createDocument) and replacing series of synchronous processing steps with promise chains. I focused on the summary endpoint, although changes ended up being rather far-reaching due to the amount of shared code involved in summary processing. In the end, this effort turned out to have no measurable impact on the endpoint's performance as measured by ab.
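For reference, promise-wrapping a synchronous step might look like the hypothetical helper below. Note that this only defers the work to a later event-loop turn; the wrapped call is still one uninterruptible synchronous block once it starts, which may explain the lack of measurable impact:

```javascript
// Hypothetical helper: turn a synchronous function into one that runs
// on a later event-loop turn and returns a promise. The wrapped call
// still blocks the loop for its full duration once it begins.
function deferred(fn) {
  return (...args) => new Promise((resolve, reject) => {
    setImmediate(() => {
      try {
        resolve(fn(...args));
      } catch (e) {
        reject(e);
      }
    });
  });
}

// Usage would be something like:
//   const createDocumentAsync = deferred(domino.createDocument);
//   createDocumentAsync(html).then((doc) => ...);
```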

So, I tried another experiment. I knew from profiling that quite a bit of our processing effort is being expended constructing a domino Document object in domino.createDocument, so I tried simply caching the constructed Document with lru-cache, using the etag from the HTML request as the cache key. This turned out to improve performance considerably:
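A sketch of that caching experiment, with a tiny Map-based LRU standing in for the lru-cache package actually used; `TinyLRU`, `getDocument`, and `parse` are illustrative names, and the cache key is the etag from the HTML request:

```javascript
// Minimal LRU stand-in (the experiment used the lru-cache package).
// A Map preserves insertion order, so the first key is the oldest.
class TinyLRU {
  constructor(max) { this.max = max; this.map = new Map(); }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const val = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, val); // re-insert to mark as most recently used
    return val;
  }
  set(key, val) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, val);
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value); // evict oldest
    }
  }
}

const docCache = new TinyLRU(100);

// Hypothetical wrapper: reuse the parsed Document when the etag matches,
// skipping the expensive parse (e.g. domino.createDocument).
function getDocument(etag, html, parse) {
  let doc = docCache.get(etag);
  if (!doc) {
    doc = parse(html);
    docCache.set(etag, doc);
  }
  return doc;
}
```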

That said, I doubt we'd see gains this dramatic in production, for a few reasons. First, of course, we wouldn't just be hammering away at a single page and enjoying a 100% cache hit rate; also, we'd expect that many of the requests that reach MCS are in response to page content changes, meaning that cached Document objects from previous renders wouldn't help us. OTOH, most MCS endpoints for both page and feed content include domino.createDocument as a processing step, so at least for popular pages we could expect cached Document objects to be reused at least several times.

Another point to consider is that these Document objects can be rather large (up to 8 or more MB for very large pages, as measured by this method), so we might need to allocate a considerable amount of memory to such a cache in order to see a benefit.


MCS is only touched if the HTML content has changed - thus the etag changes - thus the cache will not help. As for sharing between different endpoints - this will probably not work either, because given the number of workers multiplied by the number of nodes, the probability of two requests landing in the same worker is negligible.

The only place where I think it could be beneficial is the feed endpoints - their results are not stored, and they touch a small subset of pages over a long period of time, so the probability of a cache hit is much better.

Like @Pchelolo, I also doubt that a cache would help us in production, considering our service architecture has many workers. Hitting the same page in your ab test is not indicative of production behavior.

@Mholloway I'd be interested to learn more about the results of the promise chains. I think we're not really after absolute performance gains but more responsiveness of workers, so that a worker's main thread is not blocked for too long when it's already processing a large page.

@Pchelolo would adding more workers to MCS be a reasonable course of action?

Thanks @Pchelolo and @bearND. I had been meaning to check on whether each worker was a separate process. Looks like the Document cache idea is a bust.

(Maybe we could take pressure off the main workers by creating a separate, internal-only service within mobileapps, with the sole purpose of parsing pages into Document objects on demand?)

I wasn't sure how to test the effect of the promise-chain approach on event loop blockage except as reflected in overall performance. I'd expect the version with the promise chains to take less of a performance hit as concurrency increases, but in fact there appears to be no change.
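One way to observe blockage more directly than via end-to-end latency is to measure event-loop lag: schedule a repeating timer and record how late each tick fires. A rough sketch (`sampleLoopLag` is an illustrative name; newer Node versions also expose `perf_hooks.monitorEventLoopDelay` for this):

```javascript
// Sample event-loop lag: a setInterval callback can only fire when the
// loop is free, so any synchronous work delays it, and the delay beyond
// the scheduled interval is a direct measure of blockage.
function sampleLoopLag(intervalMs, onSample) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(now - last - intervalMs); // lag beyond the scheduled interval
    last = now;
  }, intervalMs);
  return () => clearInterval(timer); // call to stop sampling
}
```

Running this alongside an ab test would show whether the promise-chain version actually shortens the longest uninterrupted stretches of synchronous work, even if mean latency is unchanged.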

The -c 10 numbers for the promise chain version do look a bit better in this sample, but I think that only reflects a transient network connection improvement. After running multiple times, I haven't seen the numbers consistently better or worse.


Actually, in the numbers you've posted the p95/p99 performance dramatically degraded, from 5s to almost 9s. However, for meaningful p95/p99 figures we need to perform way more than 100 requests.

I've repeated your experiments but cached the Parsoid and MW API results in memory to eliminate networking. The variant without promisifying the long synchronous processing is faster on average, but the promisified version has a much flatter distribution of request latencies.

It is normal to either see no change in the numbers or to see slightly higher numbers when the process is promise-heavier. However, what we are interested in here is the performance of other requests while one request holds the event loop. In that case, results for the other requests should be better. It is acceptable to have slightly higher latencies for the CPU-heavy request if the other requests' latencies improve.

In general, there are two types of promisification we can do: (i) inter-method; and (ii) intra-method. The former is what you were testing (putting calls to the next methods on nextTick). This can help, but as you correctly note, most of the work gets done inside the manipulation methods. Hence, we need to find a (reasonable) way of implementing intra-method promisification. The logic would be something like: assess/calculate the number of transformations/manipulations that need to be done and then split them over a number of event loop ticks. For example, if divs are being removed, we could say that removing up to 10 of them in a single loop pass is acceptable. Since parsing is usually the most expensive step (especially for long documents), IMHO we should first try to focus on how to promisify that step in a sustainable manner.
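The div-removal example could be sketched roughly like this, with the batch size capping how much DOM manipulation runs per event-loop pass; `removeInBatches` and `batchSize` are illustrative names:

```javascript
// Intra-method promisification sketch: remove at most batchSize nodes
// per event-loop pass, yielding via setImmediate between passes.
// `nodes` could be e.g. Array.from(doc.querySelectorAll('div')).
function removeInBatches(nodes, batchSize = 10) {
  return new Promise((resolve) => {
    let i = 0;
    function pass() {
      const end = Math.min(i + batchSize, nodes.length);
      for (; i < end; i++) {
        const n = nodes[i];
        if (n.parentNode) n.parentNode.removeChild(n);
      }
      if (i < nodes.length) setImmediate(pass);
      else resolve();
    }
    pass();
  });
}
```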

As for the threaded model, I agree that using it in Node is not a good approach. We would need to use a different technology/framework to achieve this. That would naturally complicate the architecture, so we'd need to carefully assess the impact before moving ahead with this idea. For this reason, I don't think we should take it off the table, but I do think we should first see how much we can improve the current situation by modifying the existing code base.

And the fact that it increased the performance of /onthisday 2x suggests there is a lot of room for improvement in MCS itself - we're now making 5 times more RB->MCS requests for that endpoint, so more CPU time goes into making, parsing, and routing the requests in MCS and combining the results, but just by splitting the actual processing across different nodes we get a lot of improvement.