What Is a Data Scientist?: Michael O'Connell of TIBCO Spotfire

Financial services, retail and manufacturing have been large producers and consumers of data for many years. But now, every industry is becoming a data-driven industry, as more operations of a business are tied to a vast array of Web user actions and connected devices. To make use of all the data that now flows into the business, an analyst has to have empathy for real business problems - he or she has to ask, “what do we want to learn from the data that would be valuable for the business?” The data have to be mashed up and massaged, through software such as Extract, Transform and Load (ETL) and analytics platforms. Then, businesses have to create solutions that enable everyday business users to ask and answer questions on their own. The volume of data, and the need to respond quickly to that data, are simply too great to continue in the idiom of data warehouses and months-long analysis projects.

In short, that means the market is ripe for data scientists.

We continue our series, “What Is a Data Scientist?” by speaking with Michael O’Connell, senior director of analytics at TIBCO Spotfire, a division of TIBCO Software Inc. (See the CITO Research problem statement "Growing your Own Data Scientists" for other articles in the series.) O’Connell is working at TIBCO to create real-time data distribution and analysis platforms that help users get to the heart of the question more quickly, and with outcomes of greater value to the business. In this role, O’Connell has arrived at several maxims that can help the data scientist become most valuable to the business, and at practices that can best create a data-driven culture at an organization.

Find the Right Problems to Solve, or, Beginning With the End in Mind

One of the chief pitfalls of analysis is attempting to solve the wrong problem. The most marketable data scientist needs to approach solving business problems in an open-minded way.

“Whether that’s a ‘churn’ problem in telecom, or maximizing revenue in a casino or on-line gaming, being able to figure out the kernel of that problem enough to really address it, takes out-of-the-box thinking,” O’Connell says. “And that’s the key starting point for framing analytic problems and problem-solving activities.”

The first step out of the box is collaborating with the business to identify business problems that are worth the time and investment they will take to solve. The second is resisting the urge to wade into an undifferentiated sea of data before knowing what the problem is. The third is liberating oneself from the idea that a data warehouse will solve contemporary business problems.

“Data warehouses are dead,” O’Connell says. “There’s a lot of very important data that never really make it into a database.”

Case in point: the common problem of turnover or “churn” in the telecom industry. A common reason for people to leave a service is dropped calls. So, the appropriate business question is, “How many calls can we drop before we’re in trouble?” A typical database can’t answer this question, O’Connell says, because most companies don’t log dropped calls. It wasn’t until his telecom customer started looking at real-time data connected to wireless transmitting stations that the answer - between two and three calls per day - was revealed.
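The underlying detection logic is simple to sketch. The following is a minimal illustration, not the customer's actual system: the event records, field names, and the threshold of three dropped calls per day (drawn from the "between two and three" figure above) are all assumptions.

```python
from collections import defaultdict

# Hypothetical real-time call events: (subscriber_id, event_type).
# In production these would stream from the wireless transmitting stations.
DROP_THRESHOLD = 3  # dropped calls in a day before churn risk spikes

def flag_churn_risks(events):
    """Count dropped calls per subscriber and flag those at churn risk."""
    drops = defaultdict(int)
    at_risk = set()
    for subscriber_id, event_type in events:
        if event_type == "dropped_call":
            drops[subscriber_id] += 1
            if drops[subscriber_id] >= DROP_THRESHOLD:
                at_risk.add(subscriber_id)  # candidate for a retention offer
    return at_risk

events = [
    ("alice", "call_ok"), ("alice", "dropped_call"),
    ("bob", "dropped_call"), ("bob", "dropped_call"), ("bob", "dropped_call"),
]
print(flag_churn_risks(events))  # {'bob'}
```

The point of the sketch is that the signal lives in the event stream itself, not in a warehouse table - which is exactly why companies that only log billing data cannot answer the question.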

If the business problem had not been defined so specifically, it might not have been financially worthwhile for the client to invest in a real-time event-tracking system. Nor would it have been worthwhile to send an analyst into a field of data to separate the wheat from the chaff, without a question that had a true business outcome as its goal.

“When you approach a business problem, don’t just start bringing in data and start looking at data,” O’Connell says. “That’s valuable, but only after you have some sort of problem defined that creates value for the business.”

Having a carefully defined business problem enables interpretation of patterns in data at what TIBCO CEO Vivek Ranadivé calls “the moment of truth,” O’Connell says. “That moment of truth can be at the point of sale, or it can be at the point of disappointment for a person [who then leaves] a casino, or it can be at the point of frustration for a person using a cell phone [who then leaves] a phone plan.”

When companies have “moment of truth” information, they can implement compelling interventions – what Ranadivé calls the two-second advantage in his recent book, where a little bit of information at the right time is more valuable than all the data in the world after the fact. Key information can be leveraged just before customers disappear. When data are collected about the demographic profile of a person checking into a hotel casino, educated guesses can be made about their own moment of truth - the point at which they’ve lost enough money to prompt an exit. Magically, someone materializes and offers them a meal at the adjacent restaurant - and they stay. After three dropped calls, mobile phone providers can send text messages to a customer offering a new product or service, or a discount on next month’s bill.

Even though it’s essential to define the business problem up front, this doesn’t mean one should head into the yonder expecting a certain answer, or even knowing all the questions one will ask -- that is the stuff of the old-school, SQL- and data-warehouse world. O’Connell advocates exploratory data analysis. But there’s simply too much data out there to go in without a plan. The data scientist can help business users formulate that plan – to detect patterns amidst the torrent of big data and events that surround their business, and interpret what is happening at the “moments of truth” in order to gain the two-second advantage.

Create a Data Modeling Workflow

The advanced life form known as the “marketable data scientist” will be able to help an organization develop a hierarchy and workflow of data aggregation when it comes to putting its data to work for modeling scenarios. In a sentence, you don’t want to throw all the data you have on the table every time you have a question about anything.

For example, in consumer packaged goods (CPG), if you’re interested in high-level summaries of the business, you don’t need to look further than the delta in sales growth by product over time, or a comparison of your product line’s growth versus industry-wide category growth. But being able to figure out when to drill down deeper into root causes - “why is one retailer better than another at selling laundry products in rural areas?” - and expose wider universes of data, such as store-by-store context and attributes, is an important skill. Such dimension-free analysis on the drill-down provides visibility into the contextual data sources, to identify root causes and initiate corrective action based on the recognized patterns.

“When you are dealing with a torrent of data, being able to construct different levels of aggregation is really important, because you want to make sure you’ve used your software intelligently with respect to that large amount of data, and bring to bear the contextual data you need to answer the question, at the level of aggregation that you need to answer it,” says O’Connell.

A high-level aggregate view doesn’t use much memory, nor does drilling down to the product or market that you want to explore. But drilling down to the store-level attributes, or the demographics surrounding that region, goes down to the granular data at the point of sale, which eats up a lot of memory, which costs money. “So you want to make sure you bring that [level of data] in only to the extent that you need to answer the question,” he adds. “If you start off that whole problem by bringing in all the point-of-sale data, you would just be thrashing.”

The winning combination, therefore, is:

1. Assembling the appropriate data mash-up to build the model that will solve the problem at hand.

2. Ordering the sequence of aggregations of that data so that it can be efficiently explored.

3. Understanding how to use analytics to tell the story of the data, and address the issues that you uncovered with exploratory visual analysis and predictive modeling.
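The staged-aggregation idea behind steps 1 and 2 can be sketched in a few lines. This toy example uses invented store-level sales rows; the cheap high-level view is computed first, and the granular rows are pulled only for the product that warrants a drill-down.

```python
# Toy point-of-sale rows: (product, store, units). In practice the granular
# data would stay in the source system until a drill-down requires it.
sales = [
    ("soap", "store_1", 120), ("soap", "store_2", 30),
    ("shampoo", "store_1", 80), ("shampoo", "store_2", 85),
]

def totals_by_product(rows):
    """Cheap high-level aggregate: total units per product."""
    out = {}
    for product, _store, units in rows:
        out[product] = out.get(product, 0) + units
    return out

def drill_down(rows, product):
    """Pull granular store-level rows only for the product in question."""
    return [(store, units) for p, store, units in rows if p == product]

summary = totals_by_product(sales)   # {'soap': 150, 'shampoo': 165}
# Only soap shows a store-to-store gap worth investigating:
detail = drill_down(sales, "soap")   # [('store_1', 120), ('store_2', 30)]
```

Ordering the work this way keeps the expensive, memory-hungry granular scan confined to the one question that needs it - the "thrashing" O'Connell warns about is what happens when the last line runs first.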

The Data Scientist Skill Set

The data scientist able to solve these problems will need an impressive resume of skills, O’Connell says. These include the ability to:

“Think analytically, rigorously, and systematically about a business problem and come up with a solution that leverages the available data.”

Facilitate data discovery, which involves “exploring data in a dimension-free manner, informed by the business problem” - in other words, synthesizing and mashing up a variety of back-end data sources and examining data in the context of a question.

Visually and analytically explore a data set in order to find relevant data sources, and then decide when to dig down more deeply, with additional visual and predictive analytics.

Detect patterns in big data and events, allowing one to interpret what is happening at the “moment of truth,” a very important point in time that affects the running of a business.

Develop an analytic capability, embedded in software, so that you can enable the business people (or anyone) with self-service to interpret and understand the implications of those patterns detected at the “moment of truth.”

Building a Data-Science Capability

Of course, data scientists don’t come fresh off the production line ready to be inserted into a company like cogs. The combination of business leadership in the hiring, and high IQ and entrepreneurial instincts in the hired, is key to the successful construction of a data-science capability at a company, O’Connell says.

“We put a lot of effort into finding high-IQ people, even if they don’t have all the necessary tools and skills,” O’Connell says. “If you point them in the right direction, and give them the right tools and the right framework, they come up with very innovative solutions.”

Generally, college graduates with some propensity to learn new tools quickly tend to do better as corporate data scientists than lifelong academics do, he says. A PhD statistician, for example, may not necessarily be the best candidate to be a data scientist in a business context: such candidates tend to work on small and arcane data sets, may not have proper empathy for business problems, and may have been trained on archaic software, O’Connell says.

“You need to make sure your [data scientists know how to] work on problems that are not academic, but are really pointed at value creation for the business,” he adds.

Part of that entrepreneurial spirit extends not just to finding the right problems to solve, but deciding not to simply pick the low-hanging fruit. A classic false move in an immature data culture is “working on the problem where they have convenient data, without really thinking about the problem,” O’Connell says, drawing an analogy to the Michael Lewis book and subsequent film Moneyball, and Lewis’ New York Times article “The No-Stats All-Star,” both of which described how sports teams gained advantage by abandoning traditionally collected statistics, instead searching for characteristics that were more esoteric, but nevertheless statistically valid.

“If you’re driving a business, and you’re looking at your team, saying, ‘what is the valuable lineup that I can put on the floor?’ or ‘who are the valuable players that I can put on the floor?’ then you are asking, ‘How do I run this game to win this game?’” he says. “There’s lots of data, but only some of it really has information in it. And I think people who are not hip to that will naively explore the convenient data.”

“A Little Bit Now” vs. “Everything Too Late”

Based on the moments of truth it can reveal, and the interceptions of dissatisfaction it can provide, O’Connell predicts that the leading businesses of the world will soon adopt data virtualization over data warehousing. In other words, companies will constantly pull real-time data from transactional systems, rather than wait for it to be sifted and sorted into warehouses.

“By the time you build data warehouses, the landscape has changed,” O’Connell says. “So we’re seeing a move to data virtualization, where you keep the source systems in place, you have a real-time cache, you have a way of refreshing that cache, and then you connect your analytics tools to the cache layer, or you connect them directly to the real-time bus and ‘listen.’”
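One way to picture the cache layer O’Connell describes is a thin wrapper that serves reads from a time-stamped cache and refreshes from the source system only when the cached copy goes stale. This is an illustrative sketch, not TIBCO’s implementation; the source fetcher and the freshness window are invented.

```python
import time

class VirtualizedCache:
    """Serve analytics queries from a cache refreshed from the live source."""

    def __init__(self, fetch_from_source, max_age_seconds=2.0):
        self.fetch = fetch_from_source   # callable hitting the source system
        self.max_age = max_age_seconds
        self._data, self._stamp = None, 0.0

    def read(self):
        # Refresh only when the cached copy is older than the freshness window,
        # so the source systems stay in place and stay unburdened.
        if self._data is None or time.time() - self._stamp > self.max_age:
            self._data = self.fetch()
            self._stamp = time.time()
        return self._data

calls = {"n": 0}
def fetch():
    calls["n"] += 1
    return {"open_orders": 42}

cache = VirtualizedCache(fetch, max_age_seconds=60)
cache.read()
cache.read()
print(calls["n"])  # 1 -- the second read was served from the cache
```

The analytics tools connect to the cache layer (or, in the event-driven variant, listen directly to the bus), so fresh data is always a bounded number of seconds away without any warehouse load cycle.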

This is, of course, the lifeblood of TIBCO’s offering -- the “IB” in its name refers to “Information Bus” -- and with products like Spotfire, TIBCO now has a layer of analytics that can provide the visualization needed to make decisions in real time with a high -- if not infallible -- level of confidence.

The ability to listen to data in this way has cross-industry applications. In addition to the retail examples, more highly regulated industries, such as financial services, pharmaceuticals, airlines and utilities, can benefit. With real-time data visualization, a pharmaceutical company could pick up safety signals before a problem gets out of control. Some TIBCO pharma customers have actually discovered, through real-time data, that the drug they were testing for one disease actually works better to treat a different disease, and were able to re-file with the FDA to develop the drug as therapy for the new indication, O’Connell says.

No one pretends that simply installing software will totally automate or optimize the process of decision-making at companies. Though it will be far easier than sifting through an undifferentiated mass of data, the substantial skill of the right-stats all-star data scientist is still needed to use data visualization technology effectively. That special combination of statistical acuity and business savvy will need to come into play, O’Connell says, in order to execute on Ranadivé’s mantra: “a little bit of information ahead of time is better than all the information in the world six months after the fact.” The critical decisions lie in determining which snippets of information that surface represent the tails of whales and which are red herrings. The MVP data scientist will need the acumen to make that call, and help others do the same.

Dan Woods is CTO and editor of CITO Research, a firm focused on advancing the craft of technology leadership. He consults for many of the companies he writes about. For more stories about how CIOs and CTOs can grow visit www.CITOResearch.com.