Machine learning is one of those hot technology categories that has lots of business and technology executives scrambling to see how their organizations can get in on the action. Done right, machine learning can help you create more effective sales and marketing campaigns, improve financial models, more easily detect fraud, and enhance predictive maintenance of equipment—to name a few.

But machine learning can also go terribly wrong, making you regret that enthusiastic rush to adopt. Here are five ways machine learning can go wrong, based on the actual experience of real companies that have adopted it. They’ve shared their lessons so you can avoid the same failures.

Lesson 1: Incorrect assumptions throw machine learning off track

Projector PSA, a US firm that designs and builds professional services automation software that helps consulting firms run their businesses, learned this lesson the hard way when it tried to use machine learning to forecast variances in staffing plans.

Because consulting firms depend on specialized, well-trained consultants and on deploying their talents efficiently, they often employ project managers to assess and forecast the staffing needs for their projects.

They then track the time consultants spend on each individual project to bill clients for that time. If the organization manages both activities in a single system, such as a professional services automation tool, there are distinct advantages, such as being able to compare forecasted hours with actual hours to see how accurately different project managers plan.

Projector PSA had started a study with one of its clients that employed hundreds of project managers, recalls COO Steve Chong. It built models that compared average actual hours worked with forecasted hours at ever-increasing planning horizons (variance). It also studied, over the course of many months, how consistent the project managers’ projections were (variability).

That is, if in one week the forecast was much too high and the next week the forecast was much too low (high variability), Projector PSA wanted to know if these canceled each other out such that on average there was little difference, or low variance.
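In code, the two metrics look like this (a minimal sketch with invented numbers, not Projector PSA's actual data or schema):

```python
from statistics import mean, stdev

# Weekly forecast vs. actual hours for one hypothetical project manager.
forecast = [100, 100, 100, 100]
actual = [130, 70, 125, 75]

# Signed difference each week: positive = under-forecast, negative = over.
diffs = [a - f for a, f in zip(actual, forecast)]

variance = mean(diffs)       # average difference: ~0, wild weeks cancel out
variability = stdev(diffs)   # week-to-week swing: large

print(variance, variability)  # low variance, high variability
```

This is exactly the profile described above: the big misses cancel out on average, so variance alone would make this forecaster look accurate.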

“The initial premise here was that low variance and low variability are good, high variance and high variability are bad,” Chong says. Based on that premise, Projector PSA taught a machine learning algorithm to categorize project managers into different groups, such as “hoarders” and “optimists,” based on this data using a sample of the firm’s project managers as a training set.
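Mechanically, that categorization step might look like the following nearest-centroid sketch (the group labels, numbers, and method are illustrative assumptions, not the company's actual algorithm):

```python
from statistics import mean

# Hypothetical training set: (variance, variability) pairs for labeled
# project managers, reflecting the initial premise described above.
training = {
    "hoarder": [(35.0, 10.0), (28.0, 12.0)],    # consistently over-forecasts
    "optimist": [(-30.0, 9.0), (-25.0, 14.0)],  # consistently under-forecasts
}

# One centroid per group in (variance, variability) space.
centroids = {
    label: (mean(v for v, _ in pts), mean(s for _, s in pts))
    for label, pts in training.items()
}

def classify(variance, variability):
    """Assign a project manager to the nearest labeled group."""
    return min(
        centroids,
        key=lambda label: (variance - centroids[label][0]) ** 2
        + (variability - centroids[label][1]) ** 2,
    )

print(classify(30.0, 11.0))  # lands in the "hoarder" group
```

Note that the math here is sound; as the story below shows, the failure came from the hand-labeled premise behind the training set, not from the classification step.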

The company then had the machine learning algorithm classify the remaining project managers based on what it had learned. It turned out that the algorithm classified some of the firm’s most experienced and best-trained project managers as some of the worst offenders, because they had high variance and high variability.

“In truth, these project managers were the ones that the firm pointed at projects that were already in trouble, under the expectation that they would be able to get the projects under control,” Chong says.

Similarly, the initial machine learning algorithm rated one project manager highly because she had almost zero variance and zero variability. But it turns out that she was sending the forecasted hours to her team with an implicit expectation that they would report those hours as what they had actually worked. This led to a situation in which she was never over or under budget, but by doing so was effectively encouraging her team to act in a manner that was detrimental to the bigger picture, Chong says.

“These mistakes were not caused by the machine learning algorithms themselves, but rather by our assumptions as we trained them initially,” Chong says. “They also stemmed from initially relying solely on the data without having a sufficient understanding of the reality that the data represented.”

Once the company trained its machine learning algorithm to identify these new profiles, it felt it had a much better reflection of reality.

Lesson 2: Unanticipated circumstances can trip up machine learning

Although machine learning can handle many tasks, circumstances not accounted for at the beginning of a project can trip up the results. That’s what happened to Mejor Trato, a financial services company in Brazil that was using machine learning as part of a digital transformation of its human resources department.

The project involved having prospective new employees answer a series of questions through live chat and calls using machine learning chatbots that the company developed internally.

Two key things went wrong with the initial use of chatbots. One was that job candidates were asked to complete the wrong forms for their profile or profession. The other was that interviews were scheduled at days and times that overlapped with HR staff meetings, meaning HR staff could not monitor the chatbots as needed.

During the first few weeks, it was essential that some people on the HR team monitor each conversation to correct the bots when necessary, says CTO Cristian Rennella. “We made the mistake of thinking that everything was solved and [left] the chatbot without supervision,” Rennella says. The lesson was “do not forget to monitor the chatbots full-time minimum for a couple of months.”

As a result of not fine-tuning the chatbots, the company determined that about 10 percent of the data gathered was incorrect.

“Machine learning will be useful at the beginning perhaps for 90 percent of the answers, but the remaining 10 percent should have human supervision that can correct the algorithm,” Rennella says. Over time, that 90 percent will increase up to as much as 99 percent, “but we must not stop paying attention to deviations or even new situations that may arise and that were unexpected when we started with the project,” she says.
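That 90/10 division of labor amounts to a confidence threshold with a human fallback. A minimal sketch (the threshold value and the idea of a numeric confidence score are assumptions for illustration, not Mejor Trato's implementation):

```python
# Route each chatbot response: below a confidence threshold, escalate
# to a human reviewer, whose correction can feed back into training.
CONFIDENCE_THRESHOLD = 0.90  # assumed value for illustration

def route_answer(confidence: float) -> str:
    """Decide whether the bot answers alone or a human reviews it."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "bot"
    return "human_review"

print(route_answer(0.97))  # bot
print(route_answer(0.55))  # human_review
```

Raising or lowering the threshold trades automation for safety; the point of Rennella's lesson is that the human-review path must stay staffed, especially early on.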

Lesson 3: Poorly labeled data hampers machine learning

One of the biggest problems companies have had with machine learning is poor data stemming from the difficulty of labeling, says Stanislav Ashmanov, CEO of two AI development companies. “It is virtually impossible to provide high-quality data labeling,” Ashmanov says. “Typically, people who work on data labeling are sloppy as they often work in a rush. What’s more, it is incredibly difficult to relay the tasks in a way that everyone comprehends them in the same way.”

As a result, the data contains many mislabeled samples, such as wrongly identified silhouettes in a picture, which degrades the performance of the trained neural network.
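One common mitigation for inconsistent labeling is to collect several labels per sample and keep only those with sufficient annotator agreement. A hedged sketch (the agreement threshold is an assumed choice, not Ashmanov's process):

```python
from collections import Counter

def consensus_label(labels, min_agreement=2 / 3):
    """Return the majority label if enough annotators agree, else None."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner if votes / len(labels) >= min_agreement else None

print(consensus_label(["glasses", "glasses", "no_glasses"]))  # glasses
print(consensus_label(["glasses", "no_glasses", "shadow"]))   # None: discard
```

Discarded samples cost data but keep systematic disagreement out of the training set.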

It’s also challenging to collect the large amounts of data needed in a short period of time. Data collection might take up to a few months, Ashmanov says. And the data collected from publicly available sources, such as those found on the internet, does not always accurately represent reality. For example, images taken at a studio or in a lab might be drastically different from real-life street views or factory production unit snapshots. As a result, the neural network’s performance will be low.

One example of what can go wrong occurred when the companies were training a neural network to identify glasses in selfies posted online, as part of a project for a client. They collected a selection of photos from social media and labeled them. The neural network performed poorly, Ashmanov says, because it mistook people with dark circles under their eyes for people wearing glasses.

Another client submitted two satellite images of a city. The task was to mark the cars in the images and teach the neural network to recognize them and count their approximate number. The problem in this case was that the neural network recognized ledges on building roofs as cars, because they were similar in appearance—small, rectangular, and mostly dark in color.

“It all comes down to careful work on margin cases, creating heuristics, and improving preliminary data processing and postprocessing proof check,” Ashmanov says.

Lesson 4: Machine learning can miss nuance

Casepoint, a US provider of e-discovery technology for the legal sector and other markets, has experienced the imperfections of machine learning firsthand. The company uses machine learning for document classification and predictive analytics. By using the technology, legal teams can dramatically reduce the hours spent reviewing and categorizing documents.

Using machine learning to classify documents is effective, but not flawless, says David Carns, the chief strategy officer. One weakness the company has seen is the overreliance on machine learning to solve subtle, more nuanced classification problems.

For example, in the legal field machine learning document classifiers are frequently used to identify documents that are responsive to a “request for production of documents.” Party A requests documents related to specific topics or content and Party B can use machine learning document classifiers to help sift through document repositories for responsive documents.

It works so well that attorneys have begun to use this technology-assisted review (TAR) of documents routinely, Carns says. “Such success leads to the desire to blindly use machine learning document classifiers for more subtle and nuanced classifications, such as identifying documents protected by attorney-client privilege,” he says.

Although it’s easy to train a machine learning document classifier on the content of privileged documents, what makes a document legally privileged depends strongly on the document’s audience, confidentiality, time of receipt, and relation to legal advice or litigation. Most machine learning document classifiers cannot adequately account for these additional context clues, Carns says.
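One way to frame the gap Carns describes, sketched here with invented names and thresholds (this is an illustration, not Casepoint's product), is that a privilege call needs supporting metadata in addition to a content score:

```python
# Hypothetical: combine a content classifier's score with metadata
# that content-only classifiers never see.
def likely_privileged(content_score, recipients, marked_confidential):
    """Require supporting context, not just text similarity."""
    has_counsel = any(r.endswith("@lawfirm.example") for r in recipients)
    return content_score > 0.8 and has_counsel and marked_confidential

# The same text can be privileged or not depending on its audience:
print(likely_privileged(0.9, ["counsel@lawfirm.example"], True))   # True
print(likely_privileged(0.9, ["everyone@company.example"], True))  # False
```

The second call fails precisely because a document broadcast widely generally loses its claim to privilege, a distinction no content-only model can see.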

Lesson 5: Test/train contamination can bedevil machine learning

US automation firm Indico has been providing enterprise artificial intelligence and deep learning services to customers for years, and one of the biggest issues it continues to run into is the contamination of testing and training data for machine learning.

One client was creating a model to determine whether a piece of news would impact its share price, says CTO Slater Victoroff. It was hard to determine exactly what the impact timing would be, so the client built the model to always predict the next day’s impact.

“What they didn’t realize was that they had neglected the data science fundamentals of ensuring a clean test/train split,” Victoroff says. “So they presented near-100-percent accuracy on the task of predicting next-day impacts, when in reality the model was no better than random chance.”
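For next-day prediction, the split must respect time: if some training examples postdate the test examples, information about "tomorrow" leaks into the model. A minimal sketch of a chronological split (field names and numbers are illustrative):

```python
# Each record: (date, features, next_day_impact). A random split would
# scatter future records into the training set; a chronological split
# keeps every training record strictly earlier than every test record.
records = sorted(
    ("2024-01-0%d" % d, {"headline_len": d}, d % 2) for d in range(1, 10)
)

cutoff = int(len(records) * 0.8)  # 80/20 split along the timeline
train, test = records[:cutoff], records[cutoff:]

# Sanity check: no training date falls inside the test window.
assert max(r[0] for r in train) < min(r[0] for r in test)
print(len(train), len(test))
```

The sanity-check assertion is the cheap guard that would have caught the near-100-percent "accuracy" before it reached a stakeholder deck.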

Another experience involved a client looking at its internal natural language processing (NLP) system. The client had a team that had been creating and updating features for machine learning models for years, and continuously testing them based on the same set of searches. The team also experienced the impact of test/train contamination. “Any time you look at your test error and change your algorithms to improve your test error, your numbers are no longer accurate,” Victoroff says.
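The standard guard against tuning into the test set, sketched here in generic form, is a three-way split: iterate against a validation set, and touch the held-out test set only once, after all changes are frozen:

```python
import random

random.seed(0)
data = list(range(100))  # stand-in for labeled examples
random.shuffle(data)

# Tune the model against `validation`; report `test` error exactly once.
train, validation, test = data[:70], data[70:85], data[85:]

# The three sets must not share examples, or the numbers are meaningless.
assert set(train).isdisjoint(validation)
assert set(train).isdisjoint(test)
assert set(validation).isdisjoint(test)
print(len(train), len(validation), len(test))
```

A search team evaluating years of feature changes against the same fixed set of queries, as above, is effectively tuning on its test set even though no single change looks like cheating.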

In this particular case, there was a poor understanding of the issue. Internally the model achieved near-100-percent accuracy for a particular task. “But in production, the system was borderline nonfunctional because they had unwittingly contaminated their results,” Victoroff says. “The most critical error that any organization will make in machine learning is this issue of test/train contamination.”