The Art of Debugging Complex Systems

The first recorded computer bug was a literal one: a sizeable moth stuck between the relays of the Harvard Mark II electromechanical computer. Debugging that one involved, well, walking around the computer and looking at it for a few minutes.

Fortunately for all of us, computers have come a considerable distance since then. But as computing has become more capable of augmenting our lives, it has also become more complex. And that complexity means that bugs can be ever more subtle and harder to pinpoint.

The nastiest bugs can take hundreds of hours to debug--sometimes it’s easier to just work around them or live with them! There is no cut-and-dried solution for complex debugging--the strongest engineers, determined to master the monsters they have created, are forced to get creative and invent novel debugging techniques.

At ThoughtSpot, we have managed to reduce all the complexity of a relational database into one search bar, so when a search goes wrong it’s not always easy to find out why. In hopes of helping the world squash a few more bugs, we would like to share our approach to some tricky debugging. Enjoy!

The world’s first recorded computer bug, “debugged” on September 9, 1947 by the Harvard Mark II team, which included U.S. Navy officer and famous computer scientist Grace Hopper.

Search Suggestions: The What

The ThoughtSpot search interface provides Google-like type-ahead suggestions as the user types. These suggestions are the queries that ThoughtSpot considers most relevant to the user given their current search text. For example, when a user types in “re”, the type-ahead suggestions could be something like:

revenue city

region color

revenue asia

This might seem simple enough, but these suggestions are influenced by a number of dynamic factors that make reproducing search results a challenge. These factors include:

Database schema: We tend to suggest queries that have a simpler execution plan. For example, in a search with columns from two different tables, we would not use a join path that goes through three other tables if the two can be joined directly.

Probability of Query: We suggest queries that are most probable. For example, if the query is only a single measure, such as “revenue”, then it is likely that the user wants to break down total revenue by some attribute column. Therefore we would suggest an attribute column like “city”.

Prior Search History: We pay a lot of attention to a user’s search history. If ThoughtSpot has learned from a particular user’s search behavior that they tend to use “revenue” and “marketplace” together in a query, then we tend to prefer “marketplace” in our suggestions whenever “revenue” is present (and vice versa).

Data Characteristics: The shape of the data itself can impact suggestions. For example, if the user types “ca”, ThoughtSpot is less likely to suggest one particular customer out of a million, named “Carl Johnson”, and more likely to suggest one state out of 50: “California”.
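To make two of these factors concrete, here is a minimal Python sketch of how search history and data cardinality could feed a single score. It is an illustration only, not ThoughtSpot’s ranking algorithm: the functions cooccurrence_counts and score, the additive scoring formula, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(history):
    """Count how often each unordered pair of tokens shares a past query."""
    counts = Counter()
    for query in history:
        for pair in combinations(sorted(set(query)), 2):
            counts[pair] += 1
    return counts


def score(candidate, typed_tokens, counts, cardinality):
    """Hypothetical score: higher means a more likely suggestion."""
    # Prior search history: reward candidates seen alongside the typed tokens.
    history_bonus = sum(
        counts[tuple(sorted((candidate, token)))] for token in typed_tokens
    )
    # Data characteristics: one value out of 50 beats one out of a million.
    rarity_bonus = 1.0 / cardinality[candidate] if candidate in cardinality else 0.0
    return history_bonus + rarity_bonus


# Assumed sample data: each past query is a list of tokens.
history = [["revenue", "marketplace"], ["revenue", "marketplace"], ["revenue", "city"]]
counts = cooccurrence_counts(history)
# Assumed number of distinct values in the column each value comes from.
cardinality = {"california": 50, "carl johnson": 1_000_000}

ranked = sorted(
    ["city", "marketplace"],
    key=lambda c: score(c, ["revenue"], counts, cardinality),
    reverse=True,
)
print(ranked)  # ['marketplace', 'city']
```

With this toy history, “marketplace” outranks “city” because it co-occurred with “revenue” twice rather than once, and “california” scores above “carl johnson” purely because its column has far fewer distinct values.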

These highly dynamic factors, which vary with each customer’s usage history and data, make analyzing issues very challenging. But more on that later.

Search Suggestions: The How

Let’s consider a given set of services in ThoughtSpot as described at a very high level in the diagram below. These services are responsible for providing ranked type-ahead suggestions.

The Auto-Completion service is responsible for generating type-ahead suggestions in ThoughtSpot. For example, when someone types a query like “re”, the possible type-ahead suggestions could be “revenue city”, “region color”, or “revenue asia”. These suggestions are ranked in the order that makes the most sense for the current user. To achieve this, the Auto-Completion service talks to the Index Server to fetch prefix matches for the given string, and to the Personalized Ranking Server (PRS) to rank the auto-completions based on the current user’s search history.
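The flow described here, prefix matching followed by per-user ranking, can be sketched in a few lines of Python. The three classes below are simplified stand-ins for the real services; their method names (prefix_matches, rank, suggest) and the usage-count ranking rule are assumptions made for illustration, not ThoughtSpot’s actual interfaces.

```python
class IndexServer:
    """Stand-in for the Index Server: returns phrases matching a prefix."""

    def __init__(self, phrases):
        self._phrases = sorted(phrases)

    def prefix_matches(self, prefix):
        return [p for p in self._phrases if p.startswith(prefix)]


class PersonalizedRankingServer:
    """Stand-in for the PRS: orders candidates by a user's past usage."""

    def __init__(self, usage_by_user):
        self._usage_by_user = usage_by_user

    def rank(self, user, candidates):
        usage = self._usage_by_user.get(user, {})
        # Most-used first; ties fall back to alphabetical order.
        return sorted(candidates, key=lambda c: (-usage.get(c, 0), c))


class AutoCompletionService:
    """Fetches prefix matches, then asks the PRS to rank them."""

    def __init__(self, index_server, prs):
        self._index_server = index_server
        self._prs = prs

    def suggest(self, user, text, limit=3):
        candidates = self._index_server.prefix_matches(text)
        return self._prs.rank(user, candidates)[:limit]


index_server = IndexServer(["revenue city", "region color", "revenue asia", "profit"])
prs = PersonalizedRankingServer({"alice": {"revenue asia": 5, "region color": 2}})
service = AutoCompletionService(index_server, prs)

print(service.suggest("alice", "re"))
# ['revenue asia', 'region color', 'revenue city']
```

Even in this toy version, the same text “re” yields a different order for a user with no history, which is exactly why a ranking complaint is hard to reproduce without the user’s data.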

From time to time, a customer reaches out to us with a search example where the suggestions they received were really not the ones they wanted. So, how can we go about dissecting the algorithms of the Auto-Completion service to diagnose and fix this sub-optimal ranking?

As one would suspect, the ranking of auto-completions is impacted by the results received from downstream services: the Index Server and PRS. Each ThoughtSpot customer’s environment is different in terms of data, metadata and search history, so debugging a ranking problem requires either logging into their system and getting a set of logs to comb through, or replicating their environment in-house with the exact same data. ThoughtSpot is typically deployed in customers’ data centers, where access by the ThoughtSpot team is limited or impossible. The logs are not retained forever in the system and have often been deleted by the time someone starts diagnosing. And bringing customer data in-house is almost always out of the question. All of this makes it very difficult to debug issues with suggestion ranking.

Our Bug-Squashing Solution

In order to solve these pesky ranking issues at ThoughtSpot we came up with the following architecture:

In the above diagram there are local request/response caches sitting between the Auto-Completion service and the downstream services: Index Server and PRS. Each request and response to these downstream services is cached by the Auto-Completion service. To debug a given request to the Auto-Completion service, a special flag is passed with the request to “snapshot” the relevant internal state, including these caches and any other data that the service depends on. This snapshot is serialized into a protocol buffer and stored locally in the logs directory of the service. Back at headquarters, we fire up a standalone version of the Auto-Completion service that can be initialized with the snapshot we just took from the customer system, as depicted below.
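The recording side of this idea can be sketched as follows. This is a rough Python illustration under stated assumptions, not the production implementation: RecordingClient and write_snapshot are hypothetical names, the snapshot layout is invented for the example, and JSON is swapped in for the protocol buffer so that the sketch needs only the standard library.

```python
import json
from pathlib import Path


class RecordingClient:
    """Wraps a downstream call and records every request/response pair."""

    def __init__(self, name, call):
        self.name = name
        self._call = call
        self.cache = {}

    def __call__(self, *args):
        # Requests and responses must be JSON-serializable in this sketch.
        key = json.dumps(args)
        response = self._call(*args)
        self.cache[key] = response
        return response


def write_snapshot(path, clients, state):
    """Serialize the recorded caches plus other service state to a file."""
    snapshot = {
        "state": state,
        "caches": {client.name: client.cache for client in clients},
    }
    Path(path).write_text(json.dumps(snapshot, indent=2))
    return snapshot


# Hypothetical downstream call standing in for the Index Server.
def index_lookup(prefix):
    phrases = ["revenue city", "region color", "profit"]
    return [p for p in phrases if p.startswith(prefix)]


index_client = RecordingClient("index_server", index_lookup)
index_client("re")  # the service makes its normal call; the pair is recorded

# When the "snapshot" flag is set on a request, dump everything recorded so far:
# write_snapshot("logs/snapshot.json", [index_client], {"user": "alice", "text": "re"})
```

Because only the requests and responses that the one search actually touched are captured, a snapshot of this kind stays small compared with a copy of the customer’s whole environment.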

The Index Server and PRS caches are hydrated from the snapshot and all the internal state of the Auto-Completion service is also restored. We have now replicated the exact environment in which the ranking issue occurred--it is identical to the customer environment with respect to this particular search request. This makes debugging much faster, as we can add more logging and even step through the code in a debugger to pinpoint the problem exactly.
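The replay side might look roughly like the Python sketch below. It assumes a snapshot stored as JSON with hypothetical “state” and “caches” keys (the real snapshot is a protocol buffer), and ReplayClient, load_replay_clients and suggest are names invented for this illustration.

```python
import json


class ReplayClient:
    """Answers downstream calls from recorded responses only."""

    def __init__(self, name, cache):
        self.name = name
        self._cache = dict(cache)

    def __call__(self, *args):
        key = json.dumps(args)
        if key not in self._cache:
            # A miss means the replay diverged from the recorded request.
            raise KeyError(f"{self.name}: no recorded response for {key}")
        return self._cache[key]


def load_replay_clients(snapshot_text):
    """Hydrate one replay client per recorded downstream service."""
    snapshot = json.loads(snapshot_text)
    clients = {
        name: ReplayClient(name, cache)
        for name, cache in snapshot["caches"].items()
    }
    return snapshot["state"], clients


def suggest(user, text, index_lookup, rank):
    """Hypothetical stand-in for the service's suggestion path."""
    return rank(user, index_lookup(text))


# An assumed snapshot, as it might arrive from a customer system.
matches = ["revenue city", "region color"]
snapshot_text = json.dumps({
    "state": {"user": "alice", "text": "re"},
    "caches": {
        "index_server": {json.dumps(["re"]): matches},
        "prs": {json.dumps(["alice", matches]): ["region color", "revenue city"]},
    },
})

state, clients = load_replay_clients(snapshot_text)
replayed = suggest(state["user"], state["text"], clients["index_server"], clients["prs"])
print(replayed)  # ['region color', 'revenue city']
```

Raising on a cache miss is a deliberate choice in this sketch: if the replayed code asks a downstream service something the original request never asked, the replay has already diverged, and it is better to learn that immediately than to debug against invented data.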

Our Bug Battle

Our nifty debugging solution might not solve every nasty bug in modern computing, but snapshotting the messages passed between services in order to re-hydrate an exact moment in the life of a program is a versatile tool, and a welcome one for tricky debugging. We hope this example inspires you to dream up powerful new bug-squashing mechanisms of your own, instead of wallowing in bug-hunting purgatory for countless hours. The truth is that our bugs are only getting tougher to chase down and their impact on society is only getting bigger, so we need to keep up in this battle to maintain order in our digital world.