Getting people excited about their data one visual at a time™

Many data scientists have years of higher education invested in becoming one. In their professional lives, they have probably had to pass through many trials by fire before being elevated to the title of data scientist in their organizations.

In a post on KDnuggets [5], they suggest that data scientists are highly educated, with 88% having at least a Master’s degree and 46% holding PhDs. They also state, “and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. To become a data scientist, you could earn a Bachelor’s degree in Computer Science, Social Sciences, Physical Sciences, and Statistics. The most common fields of study are Mathematics and Statistics (32%), followed by Computer Science (19%) and Engineering (16%). A degree in any of these courses will give you the skills you need to process and analyze big data.”

Regarding data warehousing, Stephen J. Smith, in his Eckerson Group article, The Demise of the Data Warehouse [12], discusses a future in which there is no data warehouse. Instead, we will have a data lake plus master data management (DL+MDM). In DL+MDM, the first principle is that you keep track of the data about the data (metadata) and move the data only when it is absolutely required.

The second important principle is that for the data warehouse to be useful to the business, it must also provide the ‘truth’. This is where MDM comes in.

Thus, the data lake provides access to the data and the MDM provides the truth.

But the third principle of the data warehouse is speed. In the past, speed had to come from physically moving the data closer together (still a good idea if you can; this is the notion of ‘data gravity’). But today, with the elasticity of the cloud and MPP (Massively Parallel Processing), speed and size are much less of an issue.
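The two halves of the DL+MDM idea can be sketched in a few lines: a catalog that answers "where does this data live?" from metadata alone, and a master-data store that answers "what is the truth about this entity?" All of the dataset names, locations, and records below are hypothetical, for illustration only.

```python
# Minimal sketch of DL+MDM: metadata tells you where data lives,
# MDM tells you the agreed-upon "truth" for key entities.
# (All names, paths, and records here are made up.)

catalog = {  # metadata: where each dataset physically resides in the lake
    "sales_2024": {"location": "s3://lake/sales/2024/", "format": "parquet"},
    "customers":  {"location": "s3://lake/crm/customers/", "format": "csv"},
}

master_data = {  # MDM: the "golden record" for each customer
    "CUST-001": {"name": "Acme Corp", "region": "West"},
}

def locate(dataset):
    """Answer 'where is this data?' from metadata alone, with no data movement."""
    return catalog[dataset]["location"]

def golden_record(customer_id):
    """Answer 'what is the truth about this customer?' from the MDM store."""
    return master_data[customer_id]

print(locate("sales_2024"))
print(golden_record("CUST-001")["name"])
```

The point of the sketch is the separation of concerns: the lake is queried through its metadata, and only the MDM lookup is treated as authoritative.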

Many new tools are now available to support this paradigm. As a data scientist, you want to make sure you are, at least, aware of these tools, their capabilities, and the part they play in this paradigm.

Self-Service Data Preparation Tools

Tools like Hadoop and the R programming language have steep learning curves. However, tools like Alteryx and Tableau Prep will help you easily connect to disparate data sources, clean and enhance the data, and then pass it along to other tools, such as Tableau, to perform your analytics and develop a deeper understanding of the data. The drag-and-drop assembly of tasks in a workflow makes these tools easy for non-technical people (such as citizen data scientists) to learn and use for self-service preparation of their own data.
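Behind the drag-and-drop interface, these tools are running clean-and-enhance steps much like the following sketch. The field names and cleaning rules here are hypothetical, chosen only to show the kind of work a prep workflow does: trimming whitespace, normalizing casing, and filling in missing values.

```python
# Illustrative clean-and-enhance step, of the kind a self-service prep
# tool performs for each record in a workflow. (Field names and rules
# are hypothetical.)

def clean_record(record):
    """Trim whitespace, normalize name casing, and fill a missing region."""
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}
    cleaned["name"] = cleaned["name"].title()
    cleaned["region"] = cleaned.get("region") or "Unknown"
    return cleaned

raw = [
    {"name": "  ada lovelace ", "region": "West"},
    {"name": "ALAN TURING", "region": None},
]
prepared = [clean_record(r) for r in raw]
print(prepared)
# → [{'name': 'Ada Lovelace', 'region': 'West'},
#    {'name': 'Alan Turing', 'region': 'Unknown'}]
```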

SQL

Personally, I think the most important technical skill for a data scientist to have is a solid understanding of the Structured Query Language, SQL. SQL allows you to perform the CRUD (Create, Read, Update, and Delete) operations you need to manipulate data in a database such as Oracle, SQL Server, or DB2. Also, its core set of built-in functions allows you to perform analytical calculations and transform your database structures.
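The four CRUD operations can be demonstrated end to end against an in-memory SQLite database (the table and column names below are hypothetical; the same statements work, with minor dialect differences, in Oracle, SQL Server, or DB2):

```python
import sqlite3

# CRUD in SQL, run against an in-memory SQLite database for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define a table and insert a row (parameterized, never string-built)
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.execute("INSERT INTO customer (name, region) VALUES (?, ?)", ("Acme", "West"))

# Read: query it back
cur.execute("SELECT name, region FROM customer WHERE region = ?", ("West",))
rows = cur.fetchall()
print(rows)            # → [('Acme', 'West')]

# Update: change the region
cur.execute("UPDATE customer SET region = ? WHERE name = ?", ("East", "Acme"))

# Delete: remove the row and confirm the table is empty
cur.execute("DELETE FROM customer WHERE name = ?", ("Acme",))
cur.execute("SELECT COUNT(*) FROM customer")
remaining = cur.fetchone()[0]
print(remaining)       # → 0
conn.close()
```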

In my experience, most employers want their data science personnel to have a solid foundation in SQL. In fact, in many technical interviews, candidates may be asked to write some pseudo-SQL code on a whiteboard based on a business question the interviewer poses.

In my work environment, we have many business-facing people with limited technical backgrounds who can do their own data analysis using SQL against our various databases and Tableau Extracts.

Machine Learning/AI

I personally think, right now, having a solid background in Machine Learning and Artificial Intelligence (AI) will get you the biggest bang for your buck. This is the area where there seems to be the biggest shortage of skilled workers.

So, what does this entail? Per Jeremie Harris [13], a Machine Learning Engineer’s job is to build, optimize, and deploy machine learning models to production. Generally, you will be treating machine learning models as APIs or components, which you’ll be plugging into a full-stack app or hardware of some kind, but you may also be called upon to design models yourself.

Some of the requirements to develop your Machine Learning skills include working with Python, JavaScript, scikit-learn, TensorFlow/PyTorch (and/or enterprise deep learning frameworks), and SQL or MongoDB (typically used for app DBs).
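The fit/predict pattern these libraries share can be shown with a deliberately tiny model. This is not how real machine-learning work is done (you would use scikit-learn or a deep learning framework, as noted above); it is just a pure-Python sketch of the workflow shape, using one-variable least squares.

```python
# Toy illustration of the fit/predict pattern used by libraries such as
# scikit-learn: a one-variable least-squares linear model in pure Python.
# (For illustration only; real work would use the libraries themselves.)

class TinyLinearModel:
    def fit(self, xs, ys):
        """Fit slope and intercept by ordinary least squares."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        self.slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                      / sum((x - mean_x) ** 2 for x in xs))
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        """Apply the fitted line to new inputs."""
        return [self.slope * x + self.intercept for x in xs]

# Train on data that follows y = 2x, then predict an unseen point.
model = TinyLinearModel().fit([1, 2, 3, 4], [2, 4, 6, 8])
print(model.predict([5]))   # → [10.0]
```

Once a model exposes a fit/predict interface like this, wrapping it as an API or component, as described above, becomes a packaging exercise.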

#1 – Data Ethics

I probably could have started with interpersonal or communication skills, but if you lack data ethics, then regardless of the number of degrees you have or the number of years you have worked in the profession, you may have the data scientist title, but you are no data scientist.

Hilary Mason, Data Scientist in Residence at Accel, has noted three key challenges facing the data science community:

Imprecise Ethics

No standards of practice

A lack of a consistent vocabulary

We work in a profession with a great deal of uncertainty. Too
often, our interactions with our business communities are determined by the
algorithms and machine learning developed by data scientists. [6]

We need to have that ethical understanding, we need to have that training, and we need to have something akin to a Hippocratic oath. We also need proper licensing, so that if you do something unethical there is some kind of penalty, disbarment, or other recourse, something to say this is not what we want to do as an industry, and then figure out ways to remediate people who go off the rails simply because they aren’t trained and don’t know better.

A recurring theme in the same interview is the serious, harmful, and unethical consequences that data science can have, such as the COMPAS Recidivism Risk Score, which, according to ProPublica, has been “used across the country to predict future criminals” and is “biased against blacks.”

Hugo Bowne-Anderson also notes that we are approaching a consensus that ethical standards need to come from within data science itself, as well as from legislators, grassroots movements, and other stakeholders. Part of this movement involves a reemphasis on interpretability in models, as opposed to black-box models. That is, we need to build models that can explain why they make the predictions they make. Deep learning models are great at a lot of things, but they are infamously uninterpretable. Many dedicated, intelligent researchers, developers, and data scientists are making headway here with work such as LIME, a project aimed at explaining what machine learning models are doing. [6]

#2 – You Really Need to Understand the Data

“I am drowning in
data, yet I am starving” – Unknown

Data scientists often deal with very large amounts of data. Oftentimes, we have a lot of data about a subject area, but it is in a form that is not consumable. As a data architect by profession, I am often surprised when I ask a business partner a question about their data and they are unsure or have no answer. This is not something I am just now experiencing, but something I have experienced over the past 40 years in this profession.

Referring back to Hilary Mason’s three major concerns, I have often walked into situations where no documentation exists: no data dictionary, no abbreviation lists (her version of a vocabulary), no data models, no data lineage, etc.

I often use telephone area codes as an example when discussing the importance of being able to understand your data. Back in the early 1980s, if I wanted to validate an area code (I coded in COBOL back then), I could depend on checking that the middle digit was a zero or a one. Arizona had only one area code back then, 602. Now Arizona has five area codes: 480, 520, 602, 623, and 928. My old validation rule will no longer work. I remember when we first had to convert full telephone numbers to the new nomenclature, and it was very messy.
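The area code story can be made concrete in a few lines: the old middle-digit rule rejects perfectly valid modern codes, while a lookup against an explicit list does not. The Arizona list below is just the five codes named above; a real validator would check against a maintained table of North American Numbering Plan assignments.

```python
# The brittle 1980s area-code rule vs. a lookup-based check.
# (The Arizona set below is the five codes from the text; a production
# validator would use a maintained NANP assignment table.)

def valid_area_code_1980s(code):
    """The old rule: three digits, and the middle digit must be 0 or 1."""
    return len(code) == 3 and code.isdigit() and code[1] in ("0", "1")

ARIZONA_AREA_CODES = {"480", "520", "602", "623", "928"}

def valid_arizona_area_code(code):
    """Today: validate against an explicit, maintained list."""
    return code in ARIZONA_AREA_CODES

print(valid_area_code_1980s("602"))    # → True: the old rule happens to pass
print(valid_area_code_1980s("480"))    # → False: the old rule rejects a real code
print(valid_arizona_area_code("480"))  # → True
```

The lesson is less about telephone numbers than about data understanding: a validation rule encodes an assumption about the data, and assumptions go stale.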

As a data scientist, you must be curious. I would even dare to say, you must be more curious about your business partner’s data than they are! It is O.K. to let your guard down and say, “I don’t understand this. Please explain it to me.” Oftentimes, they don’t know the answers to questions about their data either. Your non-functional requirements are just as important as your functional ones.

Curiosity breeds knowledge. I spend most of my time discovering and prepping data, though this is starting to change with what I refer to as self-service ETL tools, such as Alteryx and Tableau Prep.

#3 – You Have to Read!

I have always been a voracious reader. All the magazine subscriptions I have sometimes drive my wife crazy. I have always tried to stay aware of the latest technologies and trends related to data science.

It is difficult to stay on top of everything related to data science. It embarrasses me to say this, but I am a slow reader. When I was young, my father encouraged me to read a paragraph at a time. He told me once I understood what I just read, to go on to the next paragraph. Repeat. And continue. My life as a life-long reader had begun.

Today, we have the Internet at our disposal. There are a lot of excellent sites with articles on almost every subject imaginable. There is also a lot of “noise” on the Internet. You must train yourself to determine what is valuable to read and what is just noise. I don’t have any firm guidance to offer here on how to discern the two; it is a kind of gut feel for me when I find something of great value to read versus a vendor preening or someone airing a biased beef.

I recommend you settle on some key focus areas, such as data preparation, data visualization, key statistical concepts, machine learning, or Tableau. In my personal technical learning, the Tableau community has been prolific. You will find tips and tricks, how-to articles, deep discussions on data visualization philosophy, etc. As you read, visit the sites of some of the products being discussed and read about each product, its capabilities, etc. Even if you would probably not buy or use that product soon, you at least have the knowledge about it in your “bag of tricks.”

#4 – Know Your Business!

In a previous life, I used to do a lot of interviews, to help our HR department hire application developers and data modelers. Here is my favorite question from those interviews:

How does your current company make a profit?

Out of, let’s say, 50 of these interviews over the years, only one person was able to answer this question. Typically, candidates would respond by telling me about an application they had just developed or some key reports they had created for senior management. I would tell them that I understood what they did on the technical side, but that I wanted to know how their company makes money!

So, it is always important that you understand the business of your company. I recommend you read your company’s annual reports, their product descriptions, know the features and advantages of these products, competitive intelligence, etc.

When I am working with a department at work, I personally set a goal to know as much or more about their department than the people I am working with. Perhaps it is an unrealistic goal, but oftentimes I come pretty darn close.

In terms of the data science aspect of the
business, you should be able to discern which problems the business considers
critical. In addition, you should always be thinking of new ways for your
business partners to leverage their data to make actionable decisions.

To be able to do this, you must understand how
the problems you solve can impact the business. Therefore, as I mentioned
earlier, you need to know how your business operates, what it needs to do to
make a profit, your business partner’s “pain points”, and what you can do to
steer them in the right direction.

#5 – Communication Skills

First, check your technical jargon at the door. You are talking to your business partners,
and if you want to get their attention, you need to talk in their language. You
need to take all that fancy tech-speak and translate it to language your
business partners understand and care about. Think of the departments within
your company and the types of things they are interested in. For example, when
I talk to our Finance Department, they are interested in being able to quickly
access information about next year’s budget, expenditures by department, how much
revenue has come in so far this year, and which departments are bringing in the
most revenue. A data scientist must be able to provide quantified insights to
their business partners for them to be able to make actionable decisions on a
timely basis.

Storytelling is a highly desirable skill for data scientists. Being able to weave a compelling story around the business partner’s data will draw them in and make them want to know more. It will also help facilitate questions from them, which may bring new questions and data needs to the surface. Proper use of data visualization and infographics can convey large amounts of data in understandable, digestible visuals that your business partners can quickly consume, so they gain the knowledge they need to make actionable decisions.

#6 – Be a Team Player

A data scientist does not work in a vacuum, nor should they want to. In most cases, you will work with individuals across your organization, from company executives, product managers, and department heads to staff-level employees. Often, you will even have to work with external customers, or in my case, citizens within our City.

When I first entered the IT profession in the late 1970s, I literally worked in the cold, damp basement of the building. A separate group (non-IT) gathered requirements and told us what reports to develop. Once I completed coding a report, I turned it over to the person who had gathered the requirements to present to our business partners. If there were any changes, we would have to repeat the entire process (often several times) until they were satisfied with the report. Talking directly to the business partners was a no-no for coders. We have learned over the years that this Waterfall, coder-in-the-basement method of development does not work well, and that more dynamic, iterative methods better serve our business partners.

Fortunately, the methods used in IT have matured, as has the business partners’ perception of the IT professional. Iterative, collaborative processes now have IT personnel talking directly to the business partners, which helps them hear the needs of the business and ask questions in real time. This notion of not talking directly to the business may sound silly or antiquated to some of you, but 40 years ago, programmers were relegated to the basement and just coded. Back then, most coders would have loved to talk to the business and get the requirements right the first time, rather than using a “middleman” to go back and forth.

Times have changed, IT is more sophisticated, and the data scientist is now front and center with the business partners. Don’t take for granted the evolution of the IT profession that got us where we are today. These relationships with your business partners are critical to your very existence. Everyone around you is part of your team. Treat everyone with respect, and remember that the information they have is important and their subject matter knowledge is essential to how successful you will be as a data scientist. To borrow an old phrase, there is no “I” in “Team.”

Summary

Regarding what a data scientist really is, I see it a little more simply. To me, a data scientist is a person who can do the following:

Engage with their business partners to better understand the question(s) that need answering and make their data actionable. They need to be able to speak in business terms and have some level of subject matter expertise in what the business sector does.

Be able to determine the best sources of the data needed to answer the business question. These could be internal data sources, government data sources, etc. They also need to be able to determine whether a data source is reliable and unbiased.

Based on the toolsets they know, they should be able to consume the data sources and produce the required results in a visual, easily consumable format. They should know how to validate the results to ensure they are accurate and correct.

They need to have ethics. If the results do not agree with the outcome the business expects, they should not be altered or tweaked to provide a better narrative. Also, if there is confidential data in the results, they need to ensure they obfuscate or redact the information the business should not see based on internal or other data governance regulations.

They need to be collaborative. I like to run my results by trusted coworkers to ensure I did not miss something or that I was fair and objective in producing my results.

Finally, they need to be able to go back to the business and convey the results in language the business will understand. The data scientist needs to be open to criticism of the results, be able to explain the methodology used to create them, and be willing to stand by and support them even if they do not fit or satisfy the narrative the business wants.

Data science should be less about the title and more about the skills that title encompasses. Degrees, tenure, and titles do not necessarily make a good data scientist. The quality of the work, attention to detail, ethics, accuracy, and the ability to convey the results to the business are what matter most. If you have these attributes, I don’t care what you call yourself: you are a data scientist.