Aspiring Data Scientist? Here Are Some At-Work Project Ideas

Do you find yourself wanting to move into Data Science but keep hearing "get some data, analyze it, and you'll be fine..."? Have you developed many of the base skills for data science, such as programming, data analysis, and/or visualization, but are unsure of how to apply them? Are you looking to differentiate yourself from the ever-growing pile of aspiring "data scientists" who have taken the usual Coursera classes and done Kaggle competitions?

You are not alone.

A few days ago, a member of our newsletter asked for our thoughts on a question that a reader had posted on reddit.com/r/datascience about project ideas for an aspiring data scientist using data from their work.

Here is the question:
"I would like to know if any of you have any ideas for any specific problems I can solve that would be relevant to my current position. At the moment, I have been entrusted with pretty much all of the company's data, including service records, material pricing, billing info, and even vehicle tracking data. My boss is willing to let me use the data to work on a project, provided the day-to-day tasks get done and I don't leak privileged information all over the internet."

The person asking the question has a PhD in physics and has some of the necessary base skills for data science - programming, data analysis, and visualization. Now all that this person needs is data and they're good to go, right? Well, not so fast - they already have the skills and all of the company's data that they could want. What they now need is an understanding of what data science can do for them, their data, and their company.

There are two ways to approach this:

A) Top Down - understand what kinds of problems the business has, and then look for specific solutions in the Data Science domain. (e.g. - pick a problem the business has, such as whether the marketing dollars are working as well as they can, and then figure out which Data Science techniques you can use to answer that question...)

B) Bottom Up - understand what kinds of problems can be solved with Data Science first, and then look for specific applications in the business domain. (e.g. - before even looking at the business, go out and learn Classification techniques, or Clustering techniques, or Dimensionality Reduction techniques, or Regression techniques, or Deep Learning techniques, or Optimization techniques, etc...)

The bottom up way reminds me of the following quote from "Zen and the Art of Motorcycle Maintenance: An Inquiry into Values" by Robert Pirsig:
"You want to know how to paint a perfect painting? ... It's easy. Make yourself perfect and then just paint naturally. That's the way all the experts do it. The making of a painting [...] isn't separate from the rest of your existence."

Which is great if you are still a PhD student or have all the time in the world to make yourself perfect. This approach has many positive attributes and is the long-term solution.

However, as this person is currently in a job and wants to build an independent project to showcase their data science skills in the short term, the best approach is the Top Down approach.

To that end, we can think of the business in the following terms: A business, in broad strokes, has one goal - make money. A secondary goal is to make more money (i.e., grow profit). The way to grow profit is to A) get more revenue, B) cut costs, or C) do both A and B at the same time.

The person asking the question has access to pretty much all of the company's data. So, taking the top down approach (work out what problems the business has and then figure out what data science techniques to apply), the goal should be to A) get more revenue and/or B) cut costs.

First, let's look at the different functional areas of a business:

Finance and Accounts

Human Resources

Customer Service

Marketing & Sales

Distribution

Research and Development

Administration Support

Production Operations

IT Support

Purchasing

Each of these functional areas will generate some type of data in normal daily operations. The question asker's goal, then, is to find a way to help the business grow its profits using only data from one or more of these functional areas and data science techniques.

Second, we can group the different functional areas into those that have a direct impact on the profit numbers and those that have a secondary impact:

Primary Impact:

Marketing & Sales

Distribution

Production Operations

Purchasing

Secondary Impact:

Finance and Accounts

Human Resources

Customer Service

Research and Development

Administration Support

IT Support

Note: of course R&D and the other areas that provide secondary impact are very important to the running of a business and how it generates profit. It's just that the effect is either longer term (happy customers eventually refer other customers) or harder to measure (a happy workforce helps to attract a better workforce, which makes the company better).

Since the question asker's boss is willing to let him/her use the data as long as the day-to-day tasks get done, we can safely assume that the boss will be happy to let this person work on a project for some time, though probably not forever. So it's important to have small, quick wins sooner rather than later so that the side project can continue, if not grow. To that end, the person asking the question should focus on the functional areas of the business that provide Primary Impact.

Third, we can further separate the functional areas of the Primary Impact list into A) get more revenue and B) cut costs:

A) Primary Impact & Get more revenue:

Marketing & Sales

B) Primary Impact & Cut costs:

Distribution

Production Operations

Purchasing

Fourth, we can now start to think about what projects a "Data Scientist" can do with datasets from these specific functional areas. A data scientist's job can be broken down at a high level into two areas: data engineering, and asking questions and trying to answer them with data.

For the data engineering part, the somewhat general consensus is that 80% of the time of a data science project is spent on the "data munging" part of it (e.g. Data Science & Online Retail - At Warby Parker and Beyond: Carl Anderson Interview). You can think of the "data munging" part of the project as getting the data from the way it's currently stored into a nicely shaped, cleaned, checked, formatted, easily available data set that you can then throw into your modeling/visualization system.
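As a rough illustration of what "data munging" can look like in practice, here is a minimal Python sketch that takes a messy export and produces a cleaned, de-duplicated data set. The column names, the sample rows, the accepted date formats, and the helper functions (`parse_date`, `parse_amount`, `clean`) are all assumptions made up for this example; real service or billing records will need their own rules.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: stray whitespace, mixed date formats, currency strings.
RAW = """customer,service_date,amount
 Acme Corp ,2015-03-01,"$1,200.50"
acme corp,03/01/2015,"$1,200.50"
Bolt Ltd,2015-03-04,300
Bolt Ltd,,n/a
"""

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats for this sketch


def parse_date(text):
    """Return a date for any known format, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Strip currency symbols and thousands separators; None if not numeric."""
    try:
        return float(text.replace("$", "").replace(",", "").strip())
    except ValueError:
        return None


def clean(raw_text):
    """Normalize rows, drop unparseable ones, and remove exact duplicates."""
    seen, rows = set(), []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        row = (
            " ".join(rec["customer"].split()).title(),  # tidy spacing and casing
            parse_date(rec["service_date"]),
            parse_amount(rec["amount"]),
        )
        if None in row[1:] or row in seen:
            continue  # skip rows with a bad date/amount, or already-seen rows
        seen.add(row)
        rows.append(row)
    return rows


cleaned = clean(RAW)
for row in cleaned:
    print(row)
```

Silently dropping bad rows keeps the sketch short; on real company data it is usually worth logging what was dropped and why, since those rows often point at problems in how the data is collected.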

For the asking questions and trying to answer them with data part, this is where the visualization, modeling, story-telling, statistics, math, and programming come in. When non-data scientists think of data science, this is the part they normally talk about, as it's usually what they see in Kaggle competitions, etc...

At this point, the project ideas for specific problems that the question asker can solve, and that are relevant to his/her current position, start to become clearer. For both the A and B functional areas of primary impact, we can come up with Data Engineering/Data Munging projects as well as asking questions and trying to answer them with data projects.

For Data Engineering/Data Munging projects, we can look for projects in the following areas:

For the asking questions and trying to answer them with data projects, we can look for projects in the following areas:

Visualization of Data

Modeling of Data

Story-telling of results

Fifth, and finally, once we get to this point we can combine the functional areas and the project areas to find wins that improve the profit of the company. Each of these project areas, when applied to the functional areas, can be a project on its own. As an example, let's look at A) marketing and sales and take a look at the full data science pipeline to see where things can be improved by asking the 5(6) W's and asking if each area can be improved:

For each W question in each section, it is important to remember that we are always looking to either get more revenue, cut costs, or do both. Also, remember that this example is just for the Primary Impact & Get more revenue functional area. We can perform this same exercise for the Primary Impact & Cut costs area.

Pulling it all together...

Some example projects in the data engineering aspect of sales and marketing could be the following:

Making all of the Sales and Marketing data available on an internal website where it is visible and can be broken down all the way to specific sales people and advertising campaigns...

Moving all of the disparate sales and marketing data sets that are perhaps siloed in spreadsheets on various people's computers into a centralized online data store that will scale to the size needed 5 years from now...

Given the company's goals and aspirations, revamp the data collection process to capture even more data about the sales and marketing functions (this involves figuring out what to capture as well as how to accurately and efficiently capture it)...
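To make the first two ideas a little more concrete, here is a minimal sketch of pulling siloed spreadsheet exports into one shared table and then breaking sales down by sales person or by campaign. Everything specific in it is an assumption for illustration: the file names, the `campaign` and `amount` columns, the `sales` table, and the helper functions. An in-memory SQLite database stands in for whatever centralized store the company would actually use.

```python
import csv
import io
import sqlite3

# Hypothetical per-person spreadsheet exports (stand-ins for files on laptops).
EXPORTS = {
    "alice.csv": "campaign,amount\nspring_mailer,1200\nweb_ads,800\n",
    "bob.csv": "campaign,amount\nspring_mailer,500\n",
}


def load_exports(conn, exports):
    """Copy every export into one shared `sales` table, tagged with its owner."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "salesperson TEXT, campaign TEXT, amount REAL)"
    )
    for filename, text in exports.items():
        salesperson = filename.rsplit(".", 1)[0]  # "alice.csv" -> "alice"
        rows = [
            (salesperson, rec["campaign"], float(rec["amount"]))
            for rec in csv.DictReader(io.StringIO(text))
        ]
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def totals_by(conn, column):
    """Total sales grouped by one of the allow-listed columns."""
    if column not in {"salesperson", "campaign"}:
        raise ValueError(f"unsupported breakdown: {column}")
    query = (
        f"SELECT {column}, SUM(amount) FROM sales "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")  # a real store would be a file or a server
load_exports(conn, EXPORTS)
by_person = totals_by(conn, "salesperson")
by_campaign = totals_by(conn, "campaign")
print(by_person)
print(by_campaign)
```

Once the data sits in one place, the internal website from the first idea is mostly a matter of presenting queries like `totals_by` to the people who need them.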

Some example projects in the asking questions and using data to answer them aspect of sales and marketing could be the following:

Use Classification techniques to classify which accounts are the most likely to upgrade their service contract (this helps the salesforce to know which leads / accounts to focus on to sell more)...

Use Regression techniques to improve how sales people pitch prospective clients and what features of the company's services they should highlight...

Use Optimization techniques to maximize the number of views of company promotional material a prospective customer sees for a given dollar amount of promotional spend...
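As one small example of the optimization idea above, here is a sketch that splits a fixed promotional budget across channels to maximize expected views. The channel names, the views-per-dollar rates, the spending caps, and the `allocate_budget` function are all hypothetical. The sketch also assumes that views grow linearly with spend up to each channel's cap; under that assumption, funding channels in order of views per dollar gives the best allocation, but real channels usually show diminishing returns and would need a richer model.

```python
def allocate_budget(channels, budget):
    """Greedy allocation: fund channels in order of views per dollar.

    channels: list of (name, views_per_dollar, max_spend) tuples.
    Assumes views scale linearly with spend up to each channel's cap.
    Returns ({name: spend}, total_expected_views).
    """
    spend, total_views, remaining = {}, 0.0, float(budget)
    ranked = sorted(channels, key=lambda c: c[1], reverse=True)
    for name, views_per_dollar, max_spend in ranked:
        if remaining <= 0:
            break  # budget exhausted
        amount = min(max_spend, remaining)
        spend[name] = amount
        total_views += amount * views_per_dollar
        remaining -= amount
    return spend, total_views


# Hypothetical channels and rates, for illustration only.
CHANNELS = [
    ("trade_magazine", 2.0, 1000),
    ("search_ads", 5.0, 400),
    ("email_newsletter", 8.0, 100),
]

plan, views = allocate_budget(CHANNELS, budget=1000)
print(plan)
print(views)
```

In practice, most of the work in a project like this is estimating the views-per-dollar rates from the company's own campaign data, which ties back to the data engineering projects listed earlier.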

Some of these improvements are relatively simple while others will require much more thought and awareness. Any one of them, given enough thought and energy, should be able to help the company as well as provide some skills to showcase in a more explicit data science context. It's also important to remember that 80% of data science is generally thought to be data munging, so while it may seem like the right thing to do is to jump into the modeling, it's much better to look at all the areas that can be improved and figure out where the highest value can be provided.

The great thing about the situation the person asking the question finds themselves in is that, though their boss might not have a great deal of understanding of how to use data to make the company better / bring in more profit, the boss is willing and really wants the question asker to turn data into dollars. And, because the boss is letting them try a project, the communication lines are open, and constant dialogue should happen to make sure that the right problems are being tackled and that they benefit everyone involved - the boss looks great and the person asking the question will have developed data science skills they can showcase in more explicit data science contexts.