Diagnosing a MongoDB issue

You might have read a recent post by our developers about performance analysis tools, and its follow-up about sysdig.
In the database world these tools come in handy almost every day. In this blog post
I will show a case where I put them to work diagnosing a MongoDB issue.

The case

Some time ago one of our scrum teams alerted us that MongoDB
response times had jumped sky high. The New Relic graphs were undeniable.

Environment

The MongoDB environment in question is a sharded cluster, currently comprising four shards.
Each shard is a three-node replica set. The design places one physical host in one data center
and a second physical host in another. The third member of the set is a cloud machine spawned
in an OpenStack private cloud. This member is by definition weaker than the physical nodes
in terms of performance capabilities and system resources. It is there to form a quorum
and to keep a third copy of the data, just in case. It also has its priority set to zero,
so it is never promoted to primary.

Analysis

Since the issue concerned data modification, the natural first step was to check the primary
for any global problems. This is where the tools mentioned earlier came in handy:

top showed negligible load on the server,

iostat showed just a few percent utilization of our SSDs,

vmstat showed no swap activity.

It was safe to conclude that the issue was not caused by insufficient hardware resources.

Why did a simple modification take so long, then? I learned that the application sets
the write concern to replica acknowledged. That pointed to two further areas of analysis:

network issue,

replica set issue.
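Before digging in, it helps to pin down what these two write concern levels mean. A minimal sketch, written as the option documents a driver would send with each write (the field name w is MongoDB's; the helper function is purely illustrative):

```javascript
// The two write concern levels involved, as write option documents.
const replicaAcknowledged = { w: 2 }; // primary + at least one secondary must confirm
const acknowledged = { w: 1 };        // the primary alone confirms

// Numeric w counts the confirming members, primary included.
function confirmingMembers(writeConcern) {
  return writeConcern.w;
}

console.log(confirmingMembers(replicaAcknowledged)); // 2
console.log(confirmingMembers(acknowledged));        // 1
```

The extra confirming member is exactly why both the network and the replica set became suspects: with replica acknowledged, a write is only as fast as the fastest confirming secondary.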

I was inclined to believe it was a network issue, because of an outage we had experienced
a few days earlier. It had caused severe degradation of network performance,
as well as connectivity issues. Fortunately, there is an easy way to check:
a simple test using the ping utility showed a stable and expected RTT (round-trip time).

It is worth mentioning the importance of RTT when building large-scale,
multi-datacenter applications with OLTP (Online Transaction Processing) characteristics.
This value affects every operation that interacts with the database.
In MongoDB, due to the semantics of write concern, the relationship
is even tighter: every operation changing data state will be delayed,
and the bigger the RTT, the bigger the delay.
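To put rough numbers on this, here is a back-of-the-envelope latency model (all figures are hypothetical, not our measurements): with w: 2 the primary-to-secondary round trip is added on top of the client's own round trip to the primary.

```javascript
// Simplified write latency model for the two write concerns.
// With { w: 1 } the client waits one round trip to the primary.
// With { w: 2 } the primary must additionally hear back from a secondary.
function writeLatencyMs(clientPrimaryRttMs, primarySecondaryRttMs, w) {
  return w >= 2
    ? clientPrimaryRttMs + primarySecondaryRttMs
    : clientPrimaryRttMs;
}

console.log(writeLatencyMs(1, 2, 1)); // 1 ms: acknowledged, one hop
console.log(writeLatencyMs(1, 2, 2)); // 3 ms: replica acknowledged adds the replication hop
```

The model ignores replication lag entirely; as the rest of this story shows, lag on the confirming secondary can dwarf both RTT terms.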

So the network was doing great. That left only one suspect: the state of the replica set itself.
A quick look at the replica set status showed nothing obvious.

The cloud node was a bit behind, but the physical node was up to date. This is what I would expect
given our environment. The replica acknowledged write concern requires confirmation
from any one secondary, so a single lagging secondary should not be an issue. However, as stated
in the documentation, this data may not represent the current state.
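The check itself can be sketched against a trimmed-down, simulated rs.status() document. The field names (members, stateStr, optimeDate) match MongoDB's output, but the primary's host name and all timestamps here are made up for illustration:

```javascript
// Compute each secondary's replication lag by comparing its optimeDate
// against the primary's.
function secondaryLagsSeconds(status) {
  const primary = status.members.find(m => m.stateStr === "PRIMARY");
  return status.members
    .filter(m => m.stateStr === "SECONDARY")
    .map(m => ({
      name: m.name,
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

const status = {
  members: [
    { name: "physical-node-2.dc4:27017", stateStr: "PRIMARY",   // hypothetical primary
      optimeDate: new Date("2015-06-01T12:00:10Z") },
    { name: "cloud-node.dc4:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2015-06-01T12:00:02Z") },
    { name: "physical-node-1.dc5:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2015-06-01T12:00:10Z") },
  ],
};

console.log(secondaryLagsSeconds(status));
// cloud node a few seconds behind, physical secondary fully caught up
```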

To make sure the issue was connected with replication itself, a code fix changing the
write concern from replica acknowledged to acknowledged was applied.
The results were obvious.

So the replication was to blame. Great, but where was the root cause?

Some time later it hit me. The answer had been in the rs.status() output all along: the value of the syncingTo attribute.

The replica set had formed a chain: physical primary, cloud secondary, physical secondary.
This way physical-node-1.dc5 did not get a chance to confirm writes promptly, since it was
syncing from the lagging cloud-node.dc4.
This is how chained replication, a feature designed to offload the primary, triggered a problem in our environment.
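With the syncingTo field the chain can be reconstructed mechanically. A sketch over simulated member documents (the primary's name is hypothetical; the two secondary names are the ones from our cluster):

```javascript
// Follow syncingTo pointers upward to expose a member's replication chain.
function replicationChain(members, startName) {
  const byName = new Map(members.map(m => [m.name, m]));
  const chain = [startName];
  let current = byName.get(startName);
  while (current && current.syncingTo) {
    chain.push(current.syncingTo);
    current = byName.get(current.syncingTo);
  }
  return chain;
}

const members = [
  { name: "physical-node-2.dc4:27017" }, // primary, syncs from no one
  { name: "cloud-node.dc4:27017", syncingTo: "physical-node-2.dc4:27017" },
  { name: "physical-node-1.dc5:27017", syncingTo: "cloud-node.dc4:27017" },
];

console.log(replicationChain(members, "physical-node-1.dc5:27017"));
// the physical secondary reaches the primary only via the lagging cloud node
```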

To address the issue permanently, I followed the documentation and disabled chained replication.
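As described in the MongoDB documentation, this is a replica set reconfiguration performed in the mongo shell against the primary (shown as a fragment only; it needs a live replica set to run):

```javascript
// Run in the mongo shell, connected to the primary.
cfg = rs.config()
cfg.settings = cfg.settings || {}    // settings may be absent in the config
cfg.settings.chainingAllowed = false
rs.reconfig(cfg)
```

After this, secondaries sync directly from the primary, so a lagging intermediate member can no longer sit in the confirmation path of a replica acknowledged write.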

Final note

Rafal is a DBA with over 10 years of professional experience. With a team of 5 he is responsible for overall management, troubleshooting and performance tuning of Allegro’s critical 24/7 database systems. He specialises in Oracle and MongoDB databases.