They found that is common to incur massive maintenance costs in real-world machine learning systems. They looked at several ML-specific risk factors to account for in system design. These factors include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a number of system-level anti-patterns.

Boundary erosion in complex models

In traditional software engineering, setting strict abstractions boundaries helps in logical consistency among the inputs and outputs of a given component. It is difficult to set these boundaries in machine learning systems. Yet, machine learning is needed in areas where the desired behavior cannot be effectively expressed with traditional software logic without depending on data. This results in a boundary erosion in a couple of areas.

Entanglement

Machine learning systems mix signals together, entangle them and make isolated improvements impossible. Change to one input may change all the other inputs and an isolated improvement cannot be done. It is referred to as the CACE principle: Change Anything Changes Everything.

There are two possible ways to avoid this:

Isolate models and serve ensembles. Useful in situations where the sub-problems decompose naturally. In many cases, ensembles work well as the errors in the component models are not correlated. Relying on this combination creates a strong entanglement and improving an individual model may make the system less accurate.

Another strategy is to focus on detecting changes in the prediction behaviors as they occur.

Correction cascades

There are cases where a problem is only slightly different than another which already has a solution. It can be tempting to use the same model for the slightly different problem. A small correction is learned as a fast way to solve the newer problem. This correction model has created a new system dependency on the original model. This makes it significantly more expensive to analyze improvements to the models in the future. The cost increases when correction models are cascaded. A correction cascade can create an improvement deadlock.

Visibility debt caused by undeclared consumers

Many times a model is made widely accessible that may later be consumed by other systems. Without access controls, these consumers may be undeclared, silently using the output of a given model as an input to another system. These issues are referred to as visibility debt. These undeclared consumers may also create hidden feedback loops.

Data dependencies cost more than code dependencies

Data dependencies can carry a similar capacity as dependency debt for building debt, only more difficult to detect. Without proper tooling to identify them, data dependencies can form large chains that are difficult to untangle.

They are of two types.

Unstable data dependencies

For moving along the process quickly, it is often convenient to use signals from other systems as input to your own. But some input signals are unstable, they can qualitatively or quantitatively change behavior over time. This can happen as the other system updates over time or made explicitly. A mitigation strategy is to create versioned copies.

Underutilized data dependencies

Underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system vulnerable to change where it is not necessary. Underutilized data dependencies can come into a model in several ways—via legacy, bundled, epsilon or correlated features.

Feedback loops

Live ML systems often end up influencing their own behavior on being updated over time. This leads to analysis debt. It is difficult to predict the behavior of a given model before it is released in such a case. These feedback loops are difficult to detect and address if they occur gradually over time. This may be the case if the model is not updated frequently.

A direct feedback loop is one in which a model may directly influence the selection of its own data for future training. In a hidden feedback loop, two systems influence each other indirectly.

Machine learning system anti-patterns

It is common for systems that incorporate machine learning methods to end up with high-debt design patterns.

Glue code: Using generic packages results in a glue code system design pattern. In that, a massive amount of supporting code is typed to get data into and out of general-purpose packages.

Pipeline jungles: Pipeline jungles often appear in data preparation as a special case of glue code. This can evolve organically with new sources added. The result can become a jungle of scrapes, joins, and sampling steps.

Dead experimental codepaths: Glue code commonly becomes increasingly attractive in the short term. None of the surrounding structures need to be reworked. Over time, these accumulated codepaths create a growing debt due to the increasing difficulties of maintaining backward compatibility.

Abstraction debt: There is a lack of support for strong abstractions in ML systems.

Common smells: A smell may indicate an underlying problem in a component system. These can be data smells, multiple-language smell, or prototype smells.

Configuration debt

Debt can also accumulate when configuring a machine learning system. A large system has a wide number of configurations with respect to features, data selection, verification methods and so on. It is common that configuration is treated an afterthought. In a mature system, config lines can be larger than the code lines themselves and each configuration line has potential for mistakes.

Dealing with external world changes

ML systems interact directly with the external world and the external world is rarely stable. Some measures that can be taken to deal with the instability are:

Fixing thresholds in dynamic systems

It is necessary to pick a decision threshold for a given model to perform some action. Either to predict true or false, to mark an email as spam or not spam, to show or not show a given advertisement.

Monitoring and testing

Unit testing and end-to-end testing cannot ensure complete proper functioning of an ML system. For long-term system reliability, comprehensive live monitoring and automated response is critical. Now there is a question of what to monitor. The authors of the paper point out three areas as starting points—prediction bias, limits for actions, and upstream producers.

Other related areas in ML debt

In addition to the mentioned areas, an ML system may also face debts from other areas. These include data testing debt, reproducibility debt, process management debt, and cultural debt.

Conclusion

Moving quickly often introduces technical debt. The most important insight from this paper, according to the authors is that technical debt is an issue that both engineers and researchers need to be aware of.

Paying machine learning related technical debt requires commitment, which can often only be achieved by a shift in team culture. Prioritizing and rewarding this effort which needs to be recognized is important for the long-term health of successful machine learning teams.