My mission: Find technology for Early Adopters. Follow me on Twitter (@danwoodsearly), on LinkedIn (www.linkedin.com/in/danwoodsearly/), and on my blog (http://www.CITOResearch.com). I am a CTO, writer, and consultant. For tech vendors, I help explain their technology. For users, I help find, select, and deploy new solutions that have explosive business value. I love to speak and share ideas.

LinkedIn's Daniel Tunkelang On "What Is a Data Scientist?"

In our continuing research into the emerging field of data science, we are interviewing experts who are leading the charge and delivering exciting innovations for business using this new set of skills. (See “Growing Your Own Data Scientists” on CITOResearch.com for links to other articles in this series and more about my research direction in this area.)

Daniel Tunkelang is the principal data scientist at LinkedIn, which contains more than 120 million professional resumes and suggests new ways for professionals to engage with each other.

Tunkelang joined LinkedIn in December 2010 and oversees the data science team, which analyzes terabytes of data to produce products and insights that serve LinkedIn’s members. The data science team has been instrumental in creating popular data-driven features of LinkedIn such as the “People You May Know” recommendation engine.

Prior to LinkedIn, Tunkelang led a local search quality team at Google. Tunkelang was also co-founder and Chief Scientist of Endeca (just purchased by Oracle), a leader in enterprise search and business intelligence that pioneered the use of guided navigation in search applications. He has co-authored eight patents, written a textbook, and participated in numerous academic conferences, such as SIGIR, CIKM, and SIGMOD. Daniel holds a PhD in Computer Science from Carnegie Mellon University, as well as BS and MS degrees in both Mathematics and Computer Science from the Massachusetts Institute of Technology.

Woods: What is a data scientist?

Tunkelang: I’m a big fan of Hilary Mason, chief scientist at bit.ly, so I’ll cite her definition: a data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product. At LinkedIn, products pioneered by data scientists, such as People You May Know, harness the power of data to create value for users.

What essential skills are necessary for a data scientist to be effective?

Strong analytical skills are a given: above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions. Communication is essential, because data scientists work in horizontal roles and partner with groups across the entire organization. At LinkedIn, data scientists collaborate with every other product group, as well as with sales and finance. Strong communication skills are a must-have.

How do data scientists add value to a company?

Data scientists add value in at least three ways.

The first is by performing offline analysis that informs mission-critical business decisions, e.g., identifying key user segments or activities.

The second is by improving products such as search and recommendations that rely on the quality of data and derived data.

The third is by creating data products: for example, LinkedIn Skills shows you the top locations, related companies, relevant jobs, and groups where you can interact with like-minded professionals.

How should today’s students prepare to be data scientists?

Most data scientists have a background in computer science, mathematics, statistics, or one of the natural or social sciences that relies heavily on quantitative methods. In fact, one of our data scientists was a practicing neurosurgeon before joining LinkedIn (yes, data science really is brain surgery!). I’m not aware of any university offering a major in data science, and I don’t expect any to do so in the foreseeable future.

Comments

Collecting data has never really been the problem, although we can clearly see an increase in the size of the data collected, because the internet allows us to serve so many users at the same time. The real problem has always been (and still is) the collection of annotations for that data. Annotation allows the use of supervised learning techniques and provides us with the ground truth needed for evaluation. The increase in the availability of unlabeled data has caused an increasing interest in unsupervised and semi-supervised learning techniques, which use the recurrent patterns in the data without the need for much annotation. But to what extent is LinkedIn gathering an increasing number of data samples? Taking the classic ‘apples and pears’ classification paradigm, is getting over 100 million users really equivalent to getting more and more data points of apples and pears? With so many different professions and so many different countries in which the service is offered, aren’t you just adding more fruits to the bowl?

statistics: the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements.

I’ll start taking data scientists more seriously when they call themselves statisticians — and apply the same rigorous experimental methodology.

The defining difference between data scientists (new) and statisticians (old) is that the former deal with terabytes or petabytes of data by writing computer programs that often execute in parallel. An old-school statistician would not be able to deal with the internet-age challenge.

dictionary.com can’t keep up with the pace of change in the information age. Only crowd-sourced sites like Wikipedia and Urban Dictionary can possibly pull that off.

One of the fastest-growing data collections is patterns of medical data, especially in radiographic and tomographic files. The classical problem is “too much data, too little time,” which demands cycles of Forward and Inverse Theory in addition to the pattern recognition and classification capability of artificial neural networks.

These areas seem not well understood, and are often misunderstood, by large numbers of professionals. Is this an educational problem, or are these just “hard concepts” for humans to grasp?