Take it literally.
If you configured your MongoDB config servers as a replica set and, for some reason (say, a network outage), the Mongos server loses its connection to all of them and cannot reconnect within maxConsecutiveFailedChecks attempts then, surprise, it becomes useless. Even once the network is up and running again, Mongos will not reconnect to the config servers, and you won’t be able to authenticate to your sharded cluster until Mongos is restarted.

static int maxConsecutiveFailedChecks = 30
If a ReplicaSetMonitor has been refreshed more than this many times in a row without finding any live nodes claiming to be in the set, the ReplicaSetMonitorWatcher will stop periodic background refreshes of this set.
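
To make the consequence of that comment concrete, here is a toy Python model of the behaviour it describes. It is only a sketch: the class and method names are made up for illustration and are not MongoDB's code, and the only value taken from the source is the limit of 30.

```python
MAX_CONSECUTIVE_FAILED_CHECKS = 30  # value quoted in the comment above


class MonitorModel:
    """Toy model (hypothetical names) of a monitor that gives up for good."""

    def __init__(self, max_failed=MAX_CONSECUTIVE_FAILED_CHECKS):
        self.max_failed = max_failed
        self.consecutive_failed = 0
        self.watching = True  # False once background refreshes have stopped

    def refresh(self, found_live_node):
        """Record one refresh; return True while the set is still watched."""
        if not self.watching:
            # Nothing in this model re-enables the refresh loop once it has
            # stopped, which mirrors the "restart required" symptom.
            return False
        if found_live_node:
            self.consecutive_failed = 0
        else:
            self.consecutive_failed += 1
            # "more than this many times in a row" -> strictly greater than
            if self.consecutive_failed > self.max_failed:
                self.watching = False
        return self.watching
```

In this model a single successful refresh resets the counter, but once the limit has been exceeded even a healthy network does not bring the monitor back.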

And if you check the source code of the 3.2.x branch (3.2.12 as of this writing), you will see the following (./src/mongo/client/replica_set_monitor.cpp):

It wouldn’t have been a problem had the Zookeeper server used an IPv4 address, but it was configured with IPv6. So tools that used gethostbyname2(), e.g. getent, were still ok, and only those using gethostbyname() were failing me. Luckily, netcat and other important libraries had newer versions I could use. Once again, if you are on an old and rusty Linux distro, be aware that the gethostbyname*() and gethostbyaddr*() functions are obsolete.

Update
As Anton mentioned in his comment below, getaddrinfo() has its own gotchas which, if I got it right, are caused by the AI_ADDRCONFIG flag. There is a good summary page which goes into more detail about AI_ADDRCONFIG and the peculiarities of its current implementation in glibc.
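
For comparison, here is a small Python sketch of a family-agnostic lookup built on getaddrinfo(), which returns IPv4 and IPv6 results alike. The helper name resolve() is made up for illustration, and the flags argument is there only so that the effect of AI_ADDRCONFIG can be tried on a given host.

```python
import socket


def resolve(host, port, flags=0):
    """Return sorted unique (family name, address) pairs for host."""
    # AF_UNSPEC asks for both IPv4 and IPv6 results; SOCK_STREAM keeps the
    # list to one entry per address instead of one per socket type.
    infos = socket.getaddrinfo(
        host, port, socket.AF_UNSPEC, socket.SOCK_STREAM, 0, flags
    )
    return sorted(
        {
            (socket.AddressFamily(family).name, sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos
        }
    )
```

Calling resolve("zk.example.org", 2181) and then resolve("zk.example.org", 2181, socket.AI_ADDRCONFIG) on the same machine (the hostname and port are placeholders) should show whether the flag filters out one of the address families there.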

Have you ever wondered why jbd2 (or jbd if you are still using ext3) is sitting at the top of iotop and consuming most of the I/O bandwidth? Well, it’s certainly not doing that just to drive you nuts; there is a reason. And the reason is most probably an app that is doing a lot of sys_fsync(), sys_fdatasync() or sys_msync() calls.
In case you are not on the latest and greatest kernel and BPF is not available, there is an easy way to confirm that using ftrace.
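
Once ftrace is emitting syscall-entry events, the remaining job is counting who issues them. The Python sketch below does that for text trace lines. It assumes lines shaped like "task-pid [cpu] flags timestamp: sys_fsync(fd: N)"; the exact layout varies between kernels, so treat the regular expression as a starting point rather than a fixed format.

```python
import re
from collections import Counter

# Assumed line layout, for example:
#   "  mysqld-2345  [001] ....  100.000001: sys_fsync(fd: 7)"
# Adjust the pattern if your kernel prints the events differently.
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[\d+\].*?:\s+"
    r"sys_(?P<call>fsync|fdatasync|msync)\("
)


def count_sync_calls(lines):
    """Count fsync/fdatasync/msync entries per (task, pid, syscall)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # header and unrelated lines are skipped
            key = (match.group("task"), int(match.group("pid")), match.group("call"))
            counts[key] += 1
    return counts
```

Feeding it the lines read from the trace output and printing counts.most_common(5) should point at the process that keeps the journal thread busy.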

Not closing a cursor in MongoDB could hurt you big, so it’s generally not recommended to use no_cursor_timeout=True (pymongo3) or timeout=False (pymongo2), especially when you run a shared MongoDB installation:

PyMongo does “close” cursors when they are garbage collected, but they aren’t closed immediately and closing a cursor in all current versions of MongoDB is asynchronous. Depending on the python implementation, relying on garbage collection to close the cursor is not a great idea. Discarded, not fully iterated cursors can live for some time when using Jython or PyPy which do not do reference counting garbage collection. That’s why the Cursor object has an explicit close() method.
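
One way to make the explicit close() hard to forget is to wrap the iteration so the cursor is closed even when processing fails halfway. This is a minimal sketch: consume() is a made-up helper, and it relies only on the cursor being iterable and having the close() method mentioned above.

```python
from contextlib import closing


def consume(cursor, handle):
    """Call handle(doc) for every document, then always close the cursor."""
    # closing() calls cursor.close() on exit, whether the loop finished,
    # stopped early, or raised an exception.
    with closing(cursor):
        for doc in cursor:
            handle(doc)
```

With PyMongo this would be used as consume(collection.find(query, no_cursor_timeout=True), process), where collection, query and process are placeholders for your own objects.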

Not calling the close() method could potentially be the reason behind the following lines in MongoDB’s log file:

SHARDING [RangeDeleter] waiting for open cursors before removing range

And even if it looks innocuous, it’s actually not quite, since it means that the source shard can’t delete its copy of the documents, which is step 7 of the chunk migration procedure.

At the time of this writing there is still no way (SERVER-3090) to glean more information about a given cursor id, so the only way out that I was able to come up with was to kill those dangling cursors using the killCursors command, which is still undocumented:
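
As a rough Python sketch of what issuing that command could look like, the helper below only builds the command document. It assumes the shape {killCursors: <collection>, cursors: [<ids>]} described in later MongoDB manuals; the helper name, collection name and cursor ids are placeholders.

```python
def kill_cursors_command(collection_name, cursor_ids):
    """Build a killCursors command document (assumed shape, see above)."""
    ids = [int(cursor_id) for cursor_id in cursor_ids]
    if not ids:
        raise ValueError("at least one cursor id is required")
    # The command name must be the first key; dicts keep insertion order.
    return {"killCursors": collection_name, "cursors": ids}
```

With PyMongo the document would then be passed to the database object, for example db.command(kill_cursors_command("mycoll", ids)), where db is the pymongo Database that owns the collection.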

After several happy years with FreeBSD running in AWS I have finally switched to Digital Ocean. That happened a few days ago and was driven mainly by the lack of console support, which “aws ec2 get-console-output”, in my opinion, certainly is not.
After the upgrade to FreeBSD 11 I found my instance unreachable and had absolutely no clue what was wrong with it. In that situation “aws ec2 get-console-output” was totally useless with its succinct single-word output: “Output”. The last time I had a similar issue after another upgrade, I was at least able to glean some helpful information with get-console-output to fix the problem. Not this time though.
So without further ado and armed with Tarsnap backups, I jumped onto DO’s bandwagon with ZFS and an HTML5 console which, I hope, will be able to save me should I hit the same boot problem again. As an extra bonus, the DO instance is a bit cheaper and beefier than the one I had in AWS. But as always… horses for courses.

After 5 exciting and tumultuous years in enterprise IT as a Unix and SAN engineer it’s time to switch gears and taste something new. Of course, the alley I’m stepping into is not universally novel, but for me personally it’s uncharted territory and something I could only dream about. The idea of a change had been ripening for quite a while, so it was an effortless decision to say goodbye and move forward without looking back and with no regrets.
I don’t want to paint with a broad brush, but enterprise IT is notorious for its conservatism, red tape and reluctance toward change of any sort. That’s ok for its business goals but, based solely on my personal experience, it could turn an IT engineer into a bench sitter. Hopefully, in my new position I will be more exposed to bleeding-edge technologies, systems’ internals and programming, and could become a better practitioner. Time will certainly tell, but for now I’m eagerly waiting to face new challenges…