Understanding Data Feeds to Optimization Models

For successful implementation, it is critical to understand how data feeds the optimization system.

There are two sources of data: automatic feeds and manual entry. Consider automatic feeds first. I have been involved in building systems where one database feeds another database, which feeds another database, and eventually the chain feeds your model. What happens if somebody changes a database in the middle of that chain? Suddenly some of your assumptions no longer hold: the data doesn't look the same or doesn't satisfy the same characteristics.

I once worked on a problem where the data used to be in minutes. Then someone changed the intermediate database so that the data was stored in seconds, but left the column name as “minutes.” It took us a while to trace the problem back to that change in the upstream database.
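A cheap defense is a sanity check on the incoming feed itself. The sketch below, with hypothetical column name and threshold (neither is from the actual system), flags the kind of silent minutes-to-seconds switch described above by noticing when most values are implausibly large for the unit the column name claims:

```python
# Hypothetical sketch: flag a likely unit change in an upstream feed.
# The column name "duration_minutes" and the 600-minute ceiling are
# illustrative assumptions, not from any real system.

def check_duration_units(values, expected_max_minutes=600):
    """Return the values too large to plausibly be minutes."""
    suspicious = [v for v in values if v > expected_max_minutes]
    # If most rows exceed a plausible minutes range, the feed has
    # probably switched to seconds despite the column name.
    if len(suspicious) > len(values) // 2:
        raise ValueError(
            f"Column 'duration_minutes' looks like seconds: "
            f"{len(suspicious)}/{len(values)} values exceed "
            f"{expected_max_minutes}"
        )
    return suspicious

check_duration_units([45, 90, 120])        # plausible minutes: passes
# check_duration_units([2700, 5400, 7200]) would raise ValueError
```

The check is deliberately crude; the point is that an automated assertion catches in seconds what otherwise takes days of debugging to find.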

The second source is manual entry. If your model relies on manually entered data, you must account for the errors users make while entering it. When I was at IBM, I worked on a system that allowed people to enter a sales budget. The budgets would typically be in the tens of millions of dollars, and we had built our optimization models on the assumption that the numbers would always be less than $100 million.

An IBM tech team tested our system, including the user interface, which our group at IBM Research was not responsible for. The testers entered whatever numbers they wanted, such as $50 trillion. IBM would have been delighted to have $50 trillion in annual sales, but the numbers were ridiculous and, of course, they broke our models.

Our team then had to discuss this problem. We concluded that either we would convince the implementation team to modify the user interface to disallow values outside business norms, or we would have to build something into our models that says, in effect, “This is not a valid value. We're not going forward with optimizing because the data is bad.” The lesson: watch where the data comes from when building optimization systems.
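The second option can be sketched as a guard that runs before any optimization. The $100 million ceiling mirrors the assumption described above; the function names are illustrative, not from the actual IBM system:

```python
# Hypothetical sketch: reject implausible manual inputs before the
# model ever sees them. The $100M ceiling matches the design
# assumption described in the text; names are illustrative.

MAX_PLAUSIBLE_BUDGET = 100_000_000  # models assumed budgets under $100M

def validate_budgets(budgets):
    """Return (name, value) pairs that fall outside business norms."""
    return [(name, value) for name, value in budgets.items()
            if value <= 0 or value > MAX_PLAUSIBLE_BUDGET]

def optimize(budgets):
    bad = validate_budgets(budgets)
    if bad:
        # Fail loudly instead of feeding garbage to the solver.
        raise ValueError(f"Not optimizing: implausible budget values {bad}")
    ...  # hand clean data to the optimization model

# A $50 trillion entry is caught here, not deep inside the solver:
# optimize({"region_east": 50_000_000_000_000}) raises ValueError
```

Rejecting bad data at the model boundary keeps the failure explainable: the message points at the data, not at some opaque solver error.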

The technical side of optimization is difficult. Make sure your models are robust across many different instances of data. Be sure to handle infeasibility, which can be caused by the data a user inputs. The algorithms can be fragile as the problem data evolves: a simple change to the data can change the time to solve a MIP from two minutes to two hours, so understanding that sensitivity is important.
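One way to handle data-driven infeasibility is a cheap pre-check before invoking the solver at all. The sketch below, with hypothetical names, tests a simple aggregate condition (total demand versus capacity) that, if violated, makes any solver run pointless, and reports the data problem in plain language rather than an opaque "infeasible" status:

```python
# Hypothetical sketch: a cheap feasibility pre-check run before a MIP
# solve. If user-entered demands already exceed total capacity, no
# solver can succeed, so report the data problem directly. The names
# (demands, capacity) are illustrative, not from a specific system.

def precheck_feasibility(demands, capacity):
    """Return None if the aggregate check passes, else a plain-language
    explanation that points at the data rather than the model."""
    total = sum(demands.values())
    if total > capacity:
        return (f"Total demand {total} exceeds capacity {capacity}; "
                "the input data is infeasible before any optimization.")
    return None

msg = precheck_feasibility({"plantA": 700, "plantB": 500}, capacity=1000)
# msg explains the data problem; None would mean the check passed
```

Pre-checks like this cannot catch every infeasibility, but each one they do catch comes with an explanation the user can act on, which matters for the acceptance problem discussed next.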

Even if a problem is caused by the data, and the users recognize that this is the case, they may well blame the model anyway. Make sure the data, and the explanations for the data, are very clear; otherwise the model won't get accepted by the users.