All Posts

What I Learned in a Two Week Microservice Bug Hunt

After a few years building platforms using (mostly) Node.js microservices, I thought I could troubleshoot problematic situations in minimal time, regardless of coding style, or lack thereof. It turns out – nope.

The two week bughunt

There was this tiny bug where occasionally, a group-routed phone call displayed as if someone else answered. Should be easy to find. Except call logs touch severalservices:

service for doing the calls and mashing data from the call into calllogs

internal service for CRUDding the calllogs

external gateway api for pulling back calls that massages and transforms thedata

native app display

2 databases

Tripup One: complex local environmentsetup

We skipped setting things up with docker-compose or another tool that will spin up the whole environment locally, in one command. This is a must, these days. It would take 7 terminals to fire up the whole environment, plus a few databases and background services – and each service needs its own local config. There would still be phone service dependencies (this would be mocked in an ideal world) and external messaging dependencies(Respoke).

Not being able to spin up the whole environment means you better have excellentlogging.

Tripup Two: not enoughlogging

Aggregated logs are the lifeblood of microservices, especially dockerized or load-balancedones.

We use the ELK stack for log management. Elasticsearch, Logstash, and Kibana are wonderful tools when they have consumed all server resources and blocked the user interface processingdata.

For these particular bugs, there was insufficient logging, and the problems only occurred when all of the microservices talked together. Because we have some special Asterisk hardware and phone number providers, it is a lot of work (if not impossible) to spin up the entire environmentlocally.

Thus, at first I started by adding a few logs here and there in the service. It was a round of Add Logs – PR – Deploy – Test – Add Logs – PR – Deploy – Test. Eventually I just added a ton of logging,everywhere.

I have this fear that I will add too much logging and it will cause things to go down, or get in the way. With few exceptions, you can’t have too much logging when things break. You can have bad log search. Also, at this point I have decided that the ELK stack will always consume all resources, so you might as well log everythinganyways.

Tripup Three: forgotten internal supportinglibrary

There was an internal library, written in the early days of the project, whichhad:

a unique codingstyle

no tests

poor commit messages

no comments

generic naming of variables andmethods

several basic bugs in unused codepaths

As it turned out, none of the bugs in this library were causing problems because those code paths were not in use. Nonetheless, I spent a full day understandingit.

Tripup Four: code generation in functional and unittests

I am a firm believer, now, that DRY (don’t repeat yourself) has no place in unit tests, and probably not in functional tests either. Here are common things I raninto:

test setup has multiple layers of describe() and each has beforeEach()

beforeEach() blocks used factories which assigned many values to uuid.v4(), then further manipulated theoutput

layers of generated test values are impossible todebug

It’s best just to be explicit. Use string literals everywhere in unit tests. Minimize or eliminate nested describe()s.

Tripup Five: too much logic in onespot

In Node.js land, there’s no reason to have functions with complexity higher than 6 or 7 because adding a utility or library is cheap. It takes little effort to extract things into smaller and smaller functions, and use explicit and specificnaming.

We had a ton of logic in Express routes/controllers. This is hard to unit test, because the realistic way to get at that logic is using supertest over mock HTTP. It’s better to make small functions and unit test input-output on thosefunctions.

Conclusion

Eventually, I never found the bug – I found four, after careful refactoring of code and tests to the tune of several thousandSLOC.

The actual bug could have been only a one-line-fix, but finding it took weeks ofcoding.

Situations like this are often unavoidable. I am sure if I haven’t inflicted similar situations on colleagues in the past, I will in the future. It’s the nature of trade-offs you face when moving fast to test a business idea. The following things might help minimize that,though:

agree to a single .eslintrc file and neverdeviate

use a lot of small functions, and unit testthem

don’t make a separate module withouttests

be as explicit and repeat yourself intests

be able to spin up a local dev environment with minimal commands, or run against a presetup testingenvironment