What They Don’t Teach You in Machine Learning Courses

Maksim Butsenko, Data Scientist at Taxify

Data science is an integral part of building an efficient ride-hailing platform. At Taxify, it took us just one year to build a strong and agile data science function which works on state-of-the-art solutions and deals with optimising millions of rides happening in real time.

While interviewing hundreds of candidates, we’ve realised that even those with a strong technical background were very often lacking some essential skills. In this article, we’re talking about things that they don’t teach you in Machine Learning courses.

Defining the role of a data scientist

The tech industry has (more or less) learned how to make engineers and business work together. We have several product development approaches to choose from: it is up to your team to set up a Scrum, Kanban, or XP workflow. People know how to fit those flows into an organisation, and you’ll find a lot of advice out there on how to apply them efficiently. What is often lacking, however, is an understanding of how data scientists should fit into this picture. Where do you place them: engineering or business? Do you unleash them to look for insights that nobody has? Or do you ask them to answer a very specific question and improve one narrow area of your business?

Asko Seeba explains very nicely how a business might look at a data science project and argues that it is mainly a research project and should be treated as such. Considering that even people in the industry are still trying to understand how to use a data scientist’s expertise in the most efficient way, how should new joiners know which skills to focus on?

Building a team

A year ago we started building a top data science team at Taxify. Considering our growth (the number of rides on Taxify grew tenfold last year and we’re serving more than 15 million passengers and 500,000 drivers globally) and the number of data-related challenges in the transportation field, simply hiring a few great data scientists would not be enough. We are looking at adding a few dozen more data scientists to our team in the next year.

But who exactly are we looking for? The proper academic degree programs in data science are just emerging and the industry itself is not yet completely sure how to define a nicely framed profile of a data scientist.

As a matter of fact, the pool of data scientists currently consists of individuals with various backgrounds. There are people with backgrounds in computer science and AI in our team, but also those who come from the fields of signal processing, econometrics, chemistry, complex systems, sociology etc. Our common denominator is usually a good understanding of the scientific method and of the design of experiments. Technical skills are much more straightforward to acquire. However, as we come from various fields, our understanding of the processes around delivering data-based products might differ. It takes some effort to integrate all these experiences and deliver strong team results, and here’s how we handle this.

Data Science Excellence

We started an internal initiative to improve the way the team delivers automated predictions focused on direct product impact. We call it Data Science Excellence. Its purpose is to gather the best practices that we have already established as a team, as well as those we are still striving towards, and to make them visible. It helps new team members to streamline the model development process and avoid common pitfalls. This way we can move forward faster. It’s not obligatory to check every box in this guide before we can produce something of value; however, most of the points are there to ensure the quality of the result and to avoid costly mistakes and slowdowns on the way.

The reason I speak about our Data Science Excellence guide here is that it can provide useful insight into what it takes to deliver a data product. While most of this is self-evident for an experienced data scientist, you won’t learn about it from ML courses or books, so it is useful for anyone starting their career or transferring to data science from other fields. When doing interviews and reviewing candidates’ test assignments, we constantly see that many beginners who have strong technical skills and a sufficient understanding of the ML pipeline fail at asking the right question or at knowing how to test their model in production. Our Data Science Excellence guide is here to help.

Problem statement

Start by defining the problem you are trying to solve. If possible, describe it in mathematical notation (nothing too strict, but it helps to ensure that the definitions are correct). Once the problem is defined, we can start outlining possible approaches to the solution. This is an iterative process. Come back to the problem statement and validate it: have we asked ourselves the correct question? Do a literature review and research the existing solutions on the market that we can reuse. Focus on impact.
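As an illustration only, the ETA problem discussed later in this article could be written down as follows. The symbols are assumptions made for this sketch and do not come from our internal guide: x_i stands for the features of a driver-rider pair, y_i for the observed arrival time, n for the number of historical rides, and F for the family of models under consideration.

```latex
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left| y_i - f(x_i) \right|
```

Even a loose statement like this raises useful questions early: what exactly goes into x_i, whether y_i is measured consistently, and whether absolute error is the loss that matters for the product.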

Goals and metrics

Consider the goals of your project. Come up with metrics which will show if you have achieved your goal. Before starting to work on a prototype, ask yourself:

What is the actual value in your model for the product and people using the product?

What is the impact? Does it impact 10% of your user-base or 40%?

How do you plan to measure the efficiency of the model on the product level?

Collecting and measuring all the KPIs and metrics you can think of is good practice. A positive impact on your main objective might come with a negative impact in other areas.
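A minimal sketch of this idea follows. The metric names and numbers are made up for illustration, and the helper `compare_metrics` is hypothetical. It compares several metrics between two variants and flags the ones that moved in the wrong direction, even when the main objective improved.

```python
def compare_metrics(control, treatment, higher_is_better):
    """Return the relative change per metric and whether it regressed."""
    report = {}
    for name, base in control.items():
        new = treatment[name]
        rel_change = (new - base) / base
        # A metric improves when it moves in its preferred direction.
        improved = rel_change >= 0 if higher_is_better[name] else rel_change <= 0
        report[name] = {"relative_change": rel_change, "regressed": not improved}
    return report


# Hypothetical numbers: the main objective (completed rides) goes up,
# but one secondary metric gets worse.
control = {"rides_completed": 1000.0, "mean_wait_minutes": 5.0, "driver_cancellations": 40.0}
treatment = {"rides_completed": 1050.0, "mean_wait_minutes": 4.8, "driver_cancellations": 46.0}
higher_is_better = {"rides_completed": True, "mean_wait_minutes": False, "driver_cancellations": False}

report = compare_metrics(control, treatment, higher_is_better)
regressions = [name for name, row in report.items() if row["regressed"]]
print(regressions)
```

In this made-up example only the cancellation metric is flagged, which is exactly the kind of side effect that tracking a single KPI would hide.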

Timeboxing

You can tinker forever with exploring data, trying different approaches to feature engineering, and building models. Spending a considerable amount of time on this might even score you a big win in a Kaggle competition. In a fast-growing company, however, there are many other data-related challenges to solve.

For example, predicting the time it takes for a driver to reach a rider (estimated time of arrival, or ETA) is a crucial element of our service. After delivering a successful ETA prediction model, which considerably improved the mean absolute prediction error compared to the existing solution, we had to ask ourselves whether it was reasonable to spend more effort right away on reducing the error by a further few percent, or to come back to it in a few iterations. We knew it would require considerable engineering effort, and the size of the possible improvement was impossible to estimate in advance.
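To make the comparison concrete, here is a small sketch of how mean absolute error for a model can be compared against a baseline. The arrival times and predictions are invented for illustration; they are not Taxify data, and the constant-prediction baseline is only an assumption about what an "existing solution" might look like.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed arrival times and two sets of predictions, in minutes.
actual_minutes = [4.0, 7.0, 3.0, 10.0]
baseline_eta = [6.0, 6.0, 6.0, 6.0]   # e.g. always predict a fixed average
model_eta = [5.0, 6.0, 3.5, 8.5]

baseline_mae = mean_absolute_error(actual_minutes, baseline_eta)
model_mae = mean_absolute_error(actual_minutes, model_eta)
relative_improvement = (baseline_mae - model_mae) / baseline_mae

print(baseline_mae, model_mae, relative_improvement)
```

Tracking the improvement relative to the baseline, rather than the raw error alone, makes it easier to judge whether another iteration is worth the engineering effort.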

Considering this, we strive to timebox our efforts, whether we are doing exploratory analysis or optimising an existing model. We set a limited amount of time for a particular task and try to deliver the results within this time slot, even if it means not choosing the fanciest of the models available or omitting some interesting feature engineering ideas. Timeboxing is also useful in the ideation stage. For example, to evaluate possible areas to work on, we spend one day per idea in quick pair-hackathon mode to understand what the possible outcomes of this path are, how quickly we can get to a deployable model, and how much it is going to improve on a simple baseline.

Tooling

Only a data scientist working alone can afford the luxury of not caring about the reusability of his or her code. Alone, you can get to results quickly and take whichever path is most comfortable for you. Working as a team requires thinking about the fastest way to move forward together, even if it means additional work for you personally.

For example, a method for comparing two heatmaps of geospatial data might be relevant to other people’s analyses as well, so it makes sense to spend some time generalising the function and making it part of our internal data-stack library. This ensures we move faster as a team. The team as a whole will also benefit from code and notebooks that are structured in a readable and easily reusable way.
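As a rough sketch of what such a generalised helper might look like, the function below compares two heatmaps given as equally shaped grids of counts. The choice of total variation distance and the names `normalise` and `heatmap_distance` are assumptions made for this example, not the method used in our internal library.

```python
def normalise(grid):
    """Scale a grid of non-negative counts so that all cells sum to 1."""
    total = sum(sum(row) for row in grid)
    if total <= 0:
        raise ValueError("grid must contain a positive total")
    return [[cell / total for cell in row] for row in grid]


def heatmap_distance(grid_a, grid_b):
    """Total variation distance between two heatmaps.

    Returns 0.0 when the normalised heatmaps are identical and 1.0 when
    they have no overlapping mass.
    """
    same_shape = len(grid_a) == len(grid_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(grid_a, grid_b)
    )
    if not same_shape:
        raise ValueError("heatmaps must have the same shape")
    a, b = normalise(grid_a), normalise(grid_b)
    return 0.5 * sum(
        abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    )


# Hypothetical 2x2 demand grids for two different days.
demand_monday = [[10, 0], [0, 10]]
demand_friday = [[0, 20], [20, 0]]
demand_monday_scaled = [[20, 0], [0, 20]]

print(heatmap_distance(demand_monday, demand_friday))
print(heatmap_distance(demand_monday, demand_monday_scaled))
```

Normalising first means the comparison reflects where demand is located rather than how much of it there is, which is a design choice a shared library should document.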

In general we aim to give our scientists and engineers the best tools available. We build what we must and buy what we can.

Code review

Code reviews are basic hygiene and standard practice in software development. They are less obvious for many data scientists, partly because many of them don’t have a CS degree and have only a vague understanding of software best practices. That is why it was important for us to establish a process for reviewing ML code that is pushed into production. It is important to note that code review for data scientists should also include checking the assumptions that were made when creating a model.

For example, in tasks related to our field you might want to discuss questions such as “How do you define demand from the customers?” or “Why were missing ride price fields imputed with the average price?” This means that a software developer without a good understanding of the data science process may not be able to evaluate the full functionality of the code or notice mistakes in data-related assumptions. Therefore, it makes sense either to split code review into separate stages (software/model) or to use people knowledgeable in both areas. For a great overview of software development for data scientists, I would recommend reading this blog post by Trey Causey.
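One way to make such an assumption easy to review is to keep it in a small, named function that also reports how much data it affected. The sketch below is hypothetical (the function name and the prices are invented) and simply shows mean imputation with the share of imputed values returned alongside the result.

```python
from statistics import mean


def impute_with_mean(prices):
    """Replace missing prices (None) with the mean of the observed prices.

    Returns the imputed list and the share of values that were imputed,
    so a reviewer can judge whether mean imputation is defensible.
    """
    observed = [p for p in prices if p is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill_value = mean(observed)
    imputed = [fill_value if p is None else p for p in prices]
    missing_share = (len(prices) - len(observed)) / len(prices)
    return imputed, missing_share


# Hypothetical ride prices with two missing values.
ride_prices = [4.0, None, 6.0, None, 8.0]
imputed_prices, missing_share = impute_with_mean(ride_prices)
print(imputed_prices, missing_share)
```

If the reported share is large, as it is in this toy example, that is a prompt for the reviewer to ask whether the missing values are random or whether a different treatment is needed.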

Code review is also a very good way to improve knowledge sharing inside the team, especially when different team members are working on separate projects.

AB testing

Your model is ready, it provides reasonable accuracy, and you expect it to deliver the necessary business impact. The model is code-reviewed and ready to go. How do you actually validate that it has the expected impact? This is one of the hardest areas in our work because of all the ambiguity and uncertainty in the data we are seeing. We are the fastest growing ride-hailing business in the world, competing in one of the toughest markets with a multitude of competitors. The data that we are seeing is influenced not only by us, but also by events in the city and by promotions that our competitors are running.

Considering this, AB testing is the tool that can reliably (if conducted properly) measure the impact of a feature. Here, by AB testing we mean an experimental setup with randomised assignment into control and treatment groups. However, some experiments influence the whole city at once (for example, improving the dispatching algorithm), making a proper AB test impossible. For these cases, our simulation engine comes in handy. Running several experiments at once is another complexity to account for.
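One common way to implement randomised assignment is to hash the user identifier together with the experiment name. This is an assumption for illustration, not a description of our engine, and the function `assign_group` and the experiment name are hypothetical. The assignment is stable for a given user, and different experiments split users independently of each other.

```python
import hashlib


def assign_group(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing the experiment name together with the user id gives every
    experiment its own independent, reproducible split.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical experiment over 10,000 user ids.
groups = [assign_group(uid, "eta_model_v2") for uid in range(10000)]
treatment_fraction = groups.count("treatment") / len(groups)
print(treatment_fraction)
```

Because the split depends only on the identifier and the experiment name, the same user always sees the same variant without any assignment table having to be stored.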

The solution is to build a sophisticated AB experimentation engine that tracks all the experiments, handles randomised allocation into treatment and control groups, collects observational statistics, and calculates the corresponding p-values. Nevertheless, even the most sophisticated engine cannot account for errors made in the setup of the actual test. Therefore, best practices have to be shared inside the company.
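As a simplified illustration of the last step, the sketch below computes a two-sided p-value for the difference between two conversion rates using a pooled two-proportion z-test. The choice of test, the function name, and the counts are assumptions made for this example; a real engine would need to handle many more cases.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the difference between two proportions.

    Uses the pooled standard error under the null hypothesis that both
    groups share the same underlying rate.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: completed rides out of ride requests per group.
p_value = two_proportion_p_value(480, 1000, 530, 1000)
print(p_value)
```

A small p-value only says that the observed difference would be unlikely under the null hypothesis; it does not protect against a badly designed test, which is why the setup itself still needs review.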

Visibility and communication

The larger your team grows, the more effort you have to put into ensuring that everything is communicated, expectations are managed, and every team member knows what others are doing. Visibility is extremely important for a multitude of reasons:

Everyone knows what others are doing and who to ask for advice or collaboration.

It reduces the probability of “double work”, with several people working on the same task.

It provides feedback from others on your work (especially important for models closely related to non-technical domains, where operations teams have much more domain expertise).

It may sound counterintuitive, but visible communication reduces the overhead of explaining the same things several times and helps to reduce miscommunication.

It was important for us to establish a constant flow of tracking and sharing the team’s progress across many channels, so we explicitly defined good practices for sharing the results of your work:

Slack: regular updates in corresponding channels about the state of the project.

Weekly and monthly meetings: discussing and prioritising our work inside the team as well as with stakeholders. Priorities are re-evaluated continuously, as even a weekly cycle is too slow.

Research Notes: each data scientist tracks the state of the project, important findings, and plans, mostly for his or her own use. This is a good place to gather the main findings, which are then easy to share in other channels, slides, meetings etc.

When it comes to visibility in the team, less is definitely not more. Oversharing is usually not a big issue; however, not sharing enough can seriously hinder your team’s progress.

Storytelling

Storytelling is as important to a data scientist working in a team as mastery of handling overfitting or knowing whether to use a chi-square test or a t-test. It covers the ability to present and explain results to stakeholders. You should not be spending 5 minutes explaining what is on the graph and why it matters.

Most of what you are trying to tell your audience should be self-evident from the plot itself, achieved through your choice of plot type, colors, legend, and axis labels. If you feel that you need to work on your storytelling, check this guide from AnalyticsVidhya and see some great examples from FlowingData here.

Conclusion

There is much more that goes into delivering an impactful data science project than just a working model, especially if we consider that building a product that impacts the lives of millions of people around the world requires a great team with expertise in various fields.

At Taxify, we’ve made this process more transparent, unified and efficient by having an initiative we call Data Science Excellence — it helps to build our work around the best practices established by the team. In addition, sharing the best practices with the new data science team members is beneficial both to the company and to the new joiners. Finally, we hope that these practices can be useful for anyone starting their path in data science.

Do you have your own version of a Data Science Excellence project in your company? What are the best practices you believe are worth sharing? Please let us know here in the comments, or come talk to us during the North Star AI conference. It would be awesome to know what you think.

About the author

Maksim Butsenko is a Data Scientist at Taxify. His main responsibilities include building data and ML products to ensure sustainable growth for the company, as well as helping to collect and promote the best data science practices inside the team and the company. Maksim transitioned to industry from academia and has a research background in statistical signal processing.