What is a Data Scientist and Do You Need One?

It’s true what they say: data science is a buzzword. But it’s a buzzword that has been around for a few years now, which gives weight to its importance. With buzz comes hype and often a lot of misunderstanding, in this case about what a data scientist is and what a data scientist can do. Hype doesn’t always manifest itself out of thin air (emphasis on not always; a lot of the time it does spawn from nothing, the internet can be funny like that). Sometimes there are good reasons for it, but it pays to understand what they are.

What is a data scientist?

Not too long ago I became a data scientist. Did I learn a new skill? Go back to university and get another degree? No, I just had a rebranding. Before that I was considered a statistician, with a mathematics and computer science degree and honours in statistics. This is really the crux of it: data scientists have evolved just as the data landscape has evolved.

Most data scientists are either statisticians/mathematicians who dabble in software development or developers who dabble in statistics. Data scientists tend to lie somewhere on this spectrum. A key component here is knowing how to program. It doesn’t really matter which language you use (as long as it’s not something too obscure), but a statistician needs to know how to use the tools of their trade, otherwise they’re really not going to get very far in today’s world. The other key traits for a data scientist are:

Statistics – In-depth knowledge of distributions, statistical inference, probability theory and how to apply the right test to the right problem. This is key. For the machine learning component, by contrast, a lot of the heavy lifting these days is done by packages in R and Python, and rarely would you need to write your own algorithm from scratch. Knowledge of statistics ensures your input is sound and not a dumpster fire.

Machine learning – Knowledge of machine learning algorithms, such as decision trees, SVMs or neural nets, and how to apply them to a problem. Also, knowledge of optimisation procedures, like gradient descent, and how to formulate a problem so it can be solved in this manner. What is considered machine learning can also be controversial. Some would say only reinforcement learning (RL) is machine learning, whereas others would include the vast expanse between simple linear regression and RL. I tend to take a middle ground and say anything that requires an iterative procedure for estimation, whether that’s classification or regression, can be considered machine learning.

Data visualisation and communication – Communicating the results is arguably the most important part of a data science role and, to be fair, often one of the hardest. To get support for data driven solutions there needs to be a clear message about the value they will generate, or a simple explanation of why a proposed method will not provide value. If this isn’t communicated well, valuable time will be lost to semantics and misunderstanding. There is a real art in simplifying complex information into easily digestible plots. Knowledge of tools such as Shiny can help with the delivery and distribution of many graphs. Letting subject matter areas click through and interrogate the output in their own time can be powerful, if only for transparency.

Software – As mentioned, if you don’t know how to use the tools you’re not going to get very far. R and Python are easily the most popular languages for small and large-scale systems. R in particular has support for database management such as Oracle R. Which one should you use? People like to get tribal about which is better, but in practice who cares? Both have a fantastic set of libraries and both are free. I use both, but primarily R. I use Python for more general purpose programming and often integrate solutions with both languages. My advice is to use the one you are more comfortable with and that allows you to work faster, but it’s good to know both. In a Big Data context more specialised tools such as SparkR and PySpark will come in handy.
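To make the "iterative procedure for estimation" idea from the machine learning trait a little more concrete, here is a minimal Python sketch of gradient descent fitting a straight line. It is only an illustration: the function name, the learning rate, the number of iterations and the toy data are all assumptions made up for this example, and in practice a package would do this work for you.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, n_iterations=5000):
    """Estimate a and b in y = a + b*x by minimising mean squared error.

    Hypothetical helper for illustration only; the defaults are
    assumptions chosen to suit the toy data below.
    """
    a, b = 0.0, 0.0  # start from an arbitrary guess
    n = len(xs)
    for _ in range(n_iterations):
        # Residuals of the current fit
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to a and b
        grad_a = 2.0 * sum(residuals) / n
        grad_b = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step downhill, scaled by the learning rate
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Made-up toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line_gradient_descent(xs, ys)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Simple linear regression has a closed-form solution, so nobody would fit it this way for real. The point is only that the same loop of "guess, measure the error, adjust" sits underneath far more complex models.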

As you’ve probably heard, data scientist is the sexiest job of the 21st century. Does that sound sexier than statistician? How about mathematician? Computer scientist? Yeah, it does! I wouldn’t be surprised if the clever folks in the marketing domain initiated this rebranding in order to meet the demand for the skills. A few years ago, you wouldn’t find university courses on data science. Now they are everywhere.

Do you need a data scientist?

To answer this question, a few other questions need to be answered.

What is our current data capability? It pays to assess the current capability of your business. Data has been around for a long time and there is a good chance you already have people with the skills to analyse data, build complex models, interpret the results and find applications to support the business direction. The difference now is that data comes in much larger volumes, so the computer science component becomes much more important. Wrangling such enormous volumes of unstructured data isn’t just difficult; it could be impossible to make any use of it without the right infrastructure and the technical knowledge of how data is stored and processed. Any team tasked with finding data driven solutions benefits from having both skill sets. They don’t necessarily need to sit with the same person, although it helps to have breadth of knowledge in this space.

Do we need data driven solutions? Not every business needs to have full, automated, data driven solutions. For example, local councils probably don’t need a predictive model to know when to mow their lawns and service their lawn mowers. If you’re a national lawn mowing conglomerate working to a tight schedule, where unexpected outages could cost the business substantially over time, then a data driven solution may help.

How much data do we have? Data scientists need data to find data driven solutions. If your business’s data maturity is low and very little data is collected, there may not be much value generated, if any. Data scientists aren’t magicians, so throwing a few into the game might not deliver the results you’re after. They can still help, though, by identifying data gaps and developing strategies for collecting useful data sources. It’s just important to keep in mind that this is the long game and three months of investment won’t do the job.

Do we have the right data and know what we want to do with it? This is an important question for businesses. Data scientists work best when leaders communicate their vision clearly and there is enough understanding of the data to head in the right direction. It is much harder when there is no clear vision and no understanding of the current state of the data. This is still ok if the data team is given free rein to develop innovative ideas. However, asking the team to solve a specific problem with only the data available will, more often than not, turn out to be underwhelming. Again, a data science team can help in developing a strategy to remove the gaps and increase your business’s data maturity, but it will take a longer time and more investment. If your motivation is simply “this Big Data thing is big, we gotta get on that”, it is worth taking a step back before pushing all your chips into the centre of the table.

Is our data usable? You’ve ticked the boxes: you need data driven solutions, you have a lot of data and you know what you want to do with it. But is it usable? This comes back to the question about data maturity. If a lot of data has been collected in the same way a hoarder hoards bottle caps and milk cartons, because one day they may come in handy, there is going to be a lot of work cleaning, preparing and getting the data moving down the right pipes before the data team can make the most of it. Otherwise they will spend a lot of time spinning their wheels and not developing real solutions. This is very common and there is often a huge disconnect between where management think they are and reality. It is crucial to be upfront and honest to avoid situations where teams are tasked with delivering the impossible, which I’ve seen all too often. At this stage it is best to invest more in data engineers and solution architects, with advice from data scientists, to get data in the right places.

The problem with data scientists

Data scientists have specific skills for making the most of data, drawing insights and aiding or automating decision making. It truly is a transformative time for businesses and governments. A good data scientist goes in with intuition, experience and domain knowledge but ultimately relies on the data to paint the picture. A common challenge is communicating results which contradict subject matter knowledge. This can cause much debate and wasted effort if the subject matter experts (SMEs) don’t trust their data scientists. All too often I’ve been in the situation where SMEs say, “this doesn’t align with what we’ve known for years, therefore your model is wrong”. Prior knowledge of how the business operates is important, but trusting objective insights driven by data is just as important.

The problem with data scientists is that they are seen as an oracle of knowledge that can be believed or ignored. They are often seen as a silver bullet for solving problems, but rarely can they walk in, fix the thing and walk out. They need to be integrated into the business like any other team. Trusting that your data scientists are working towards the same goal, and having realistic expectations, is the key to success.