The biggest lesson is to have a very clear set of customers that you’re going to serve, notwithstanding the fact you may be building something that can ultimately help many different types of customers.

Little by little, we’re giving sight to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes won’t be the only ones pondering & exploring our world.

How could a result be explained, especially a result of a machine learning model, without a versioned record of what data was input to generate the result and what data was output representing the result?

One of the benefits of cloud analytics and computing in general is the ability for small and mid-sized companies to take advantage of technology and applications that may have previously been out of reach.

Applications of the future will take advantage of the polyglot nature of the language world. … We should embrace this idea. … It’s all about choosing the right tool for the job and leveraging it correctly.

What’s better: A simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it, but the moment you move on to another problem no one knows what the hell it’s doing?

Without statistics, big data is never going to reach its full potential. And perhaps even more importantly, without informed consumers, big data can be used for misleading and downright dangerous conclusions.

Data scientists love to create interesting models and exciting data visualizations. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.

There are 18 million developers in the world, but only one in a thousand has expertise in artificial intelligence. To a lot of developers, AI is inscrutable and inaccessible. We’re trying to ease the burden.

We are in the Golden Age of Data. For those of us on the front-lines, it doesn’t feel that way. Every step forward this technology takes, the need for deeper analytics takes two. We’re constantly catching up.

A key differentiator between heterogeneous analytics and traditional BI is the ability to rapidly deploy ideas into solutions, adapt to changes in the environment and maintain flexibility of one’s various assets.

On a sequential computer, the fast algorithm is the best algorithm, but for new science areas, I believe we need more creative approaches to algorithm design in order to extract more valuable insight in real time.

We are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions).

Big Data and traditional data warehousing systems share the same goal of delivering business value through the analysis of data; however, they differ in their analytics methods and in the organization of the data.

While working with Big Data and planning to implement it for the benefit of the business, it is very important to explain the insights and valuable knowledge in a way that a non-technical business user can actually understand.

Data Scientists are people who can reason through data using inferential reasoning, think in terms of probabilities, be scientific and systematic, and make data work at scale using software engineering best practices.

Most of the big data investment focus to date has been on the underlying infrastructure, while development of the applications that make use of that infrastructure – and that deliver actual business value – has lagged.

Data integration features have gained prominence during the last year as companies struggled to incorporate new data sources in their analysis, a process that can consume a sizable percentage of the total project time.

We can define data science as a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.

To evaluate a person’s work or their productivity requires three things:
1. To be an expert in what they do
2. To have absolutely no reason to care whether they succeed or not
3. To have time available to evaluate them.

We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.

Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.

… the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only to give the facts a chance of disproving the null hypothesis.

A good data scientist in my mind is the person that takes the science part in data science very seriously; a person who is able to find problems and solve them using statistics, machine learning, and distributed computing.

Good data science is on the leading edge of scientific understanding of the world, and it is data scientists’ responsibility to avoid overfitting data and to educate the public and the media on the dangers of bad data analysis.

Many first-time users of predictive models are happy to have the benefit of a good model with which to target their marketing initiatives and don’t ask the equally important question: is this the best model we could be using?

One problem with machine learning is too much data. With today’s big data technology, we’re in a position where we can generate a large number of features. In such cases, fine-tuned feature engineering is even more important.
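
A hedged illustration of that point, assuming scikit-learn: when a pipeline can generate hundreds of candidate features cheaply, even a simple univariate filter (one of many possible feature-engineering steps) shows why deliberate pruning matters.

```python
# A minimal sketch: keep the 10 most informative of 500 machine-generated features.
# SelectKBest is only one of many reasonable feature-selection approaches.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=500, n_informative=10, random_state=0)
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_small = selector.transform(X)
print(X.shape, "->", X_small.shape)   # (200, 500) -> (200, 10)
```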

Developing predictive analytic models is a different process from responding to requests for charts and graphs based on existing information. It is much more intuitive and creative and requires the mastery of different tools.

Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time.

One robust way to determine if two time series, x_t and y_t, are related is to check whether there exists an equation like y_t = βx_t + u_t such that the residuals (u_t) are stationary (their mean and variance do not change when shifted in time).
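
A minimal sketch of that check in Python (an Engle-Granger-style test), assuming numpy and statsmodels are available: regress y_t on x_t, then run an augmented Dickey-Fuller test on the residuals.

```python
# Regress y_t on x_t, then test whether the residuals u_t are stationary.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))           # a random-walk series x_t
y = 2.0 * x + rng.normal(size=500)            # y_t = beta * x_t + stationary noise

fit = sm.OLS(y, sm.add_constant(x)).fit()
residuals = fit.resid                          # estimated u_t

adf_stat, p_value, *_ = adfuller(residuals)
print(f"ADF p-value: {p_value:.4f}")           # small p-value -> residuals look stationary
```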

Academic culture teaches you that you’re dumb and that you’re probably wrong because most things never work, nature is very hard, and the best you can hope for is working on interesting problems and making a tiny bit of progress.

Companies may not be able to precisely predict the mix of workloads they’ll need to run in the future. But investing in the right family of big data cloud platforms and applications will give them the right foundation for change.

Myths change with understanding. Misunderstandings behind some of the current myths surrounding big data will fade away: that big data is made for big business, that big data adoption is high, and that machine learning overcomes human bias.

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.

It is no longer sufficient for businesses to understand what has happened in the past; rather, it has become essential to ask what will happen in the future, to anticipate trends, and to take actions that optimize results for the business.

In the end, there is no truth, no ultimate ground truth, no lie-free utterances, as everything is contextual based on incomplete facts and knowledge. All world models are flawed, but Data Science has 2 power tools: Doubt and Verify.

The right thing to do is to not build a tool company but to build a consultancy based on the tools. Identify the company, identify the market, and build a consultancy. Later, if that works, you can then pivot to being a tool company.

Comment of a DeepLearning user: As a side-note, even though I’m good with pattern-based thinking, I do not have an academic background. I lack patience and feel the need to create, rather than to completely understand what I’m doing.

Data science can directly enable a strategic differentiator if the company’s core competency depends on its data and analytic capabilities. When this happens, the company becomes supportive of data science instead of the other way around.

I actually think a lot of the future is in small data …. As the big data hype cycle crests, we’re going to see more and more people recognizing that what they really want to be doing is asking interesting questions of smaller data sets.

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

I always try to look at the problem from the end. When you start from the beginning and everything is blue sky, there are hundreds of ideas to chase as well as thousands of ideas to try and, since everything is possible, nothing ever gets done.

Over time, more industries will fundamentally change or be disrupted as companies begin to leverage analytics, enhance efficiency, and allow data to drive decisions. Simply put, the competitive environment will necessitate data science capabilities.

There are a lot of different roles going under the name ‘data science’ right now, and there are also a lot of roles that are probably what you would think of as data science but don’t have a label yet because people aren’t necessarily using one.

Improving Visual Data Discovery:
1. Always have new data sources.
2. Always have new techniques.
3. Always have new tools and platforms.
Visual data discovery is not once and done. It is an iterative process that requires communication and exploration.

So what’s the idea behind backpropagation? We don’t know what the hidden units ought to be doing, but what we can do is compute how fast the error changes as we change a hidden activity. Essentially we’ll be trying to find the path of steepest descent!
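
A tiny, hedged sketch of that idea in numpy, assuming a single hidden layer of sigmoid units and a squared-error loss; the quantity d_h below is exactly “how fast the error changes as we change a hidden activity,” and the weight updates step along the negative gradient.

```python
# Minimal backpropagation for one hidden layer (sigmoid units, squared-error loss).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                 # 8 examples, 3 inputs
y = rng.normal(size=(8, 1))                 # targets
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

for _ in range(1000):
    h = sigmoid(X @ W1)                     # hidden activities
    y_hat = h @ W2                          # network output
    err = y_hat - y                         # d(loss)/d(y_hat) for 0.5 * ||y_hat - y||^2
    d_h = err @ W2.T                        # how fast the error changes with each hidden activity
    d_W2 = h.T @ err
    d_W1 = X.T @ (d_h * h * (1 - h))        # chain rule through the sigmoid
    W1 -= 0.01 * d_W1                       # step in the direction of steepest descent
    W2 -= 0.01 * d_W2
```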

In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data. For Big Data, it pays off to analyze the data upfront and then design the modeling pipeline accordingly.

It would be very nice to have a formal apparatus that gives us some ‘optimal’ way of recognizing unusual phenomena and inventing new classes of hypotheses that are most likely to contain the true one; but this remains an art for the creative human mind.

Most would agree that the single biggest bottleneck for all machine learning is software engineering. We all collectively in the tech industry are still figuring out the best practices, tools, abstractions, and systems that can enable large organizations …

Data Science has its own language. So, if you want to have at least a slight chance of surviving in the enterprise world of tomorrow (with its obsessive focus on collecting and analyzing data), you had better have started yesterday on learning this terminology.

This is the norm in statistical analysis. Every time you sit down to write something up, you notice additional nuances or nits. Sometimes the problem is severe enough that I have to re-run everything. Other times, you just decide to gloss over it and move on.

One of the legendary events in the history of analytics was the original Netflix prize. The event led to a terrific example of the need to focus on not only theoretical results, but also pragmatically achievable results, when developing analytic processes.

Before data science can build the solution to simplify your life or make you lots of money, you have to give it some high quality raw materials to work with. Just like making a pizza, the better the ingredients you start with, the better the final product.

Part of Hadoop’s appeal is that it is not specifically optimized for any specific solution or data type but rather a general framework for parallel processing, so your developers and data scientists can add any relevant data, whatever its format or source.

Making graphs is very basic to data analysis. Whether you use the leading edge of statistical methods, or whether you want to quickly see the main features of your data, graphs are a must. They are the single most powerful class of tools for analyzing data.
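
A small illustration of that “graphs first” habit, assuming matplotlib and synthetic data: two quick plots often surface the main features of the data before any formal method is applied.

```python
# Two quick exploratory plots: a distribution and a relationship.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3 * x + rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)             # shape, spread, and outliers of a single variable
ax2.scatter(x, y, s=8)           # relationship between two variables
plt.show()
```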

In reality, almost no one actually cares about predictive accuracy because in almost all the cases, their starting point is nothing. The number of industries where the difference between 85 versus 90 percent accuracy is the rate-limiting factor is very small.

Deep neural networks have demonstrated impressive performance in various machine learning tasks. However, they are notoriously sensitive to changes in data distribution. Often, even a slight change in the distribution can lead to drastic performance reduction.

Data Science done well tells you:
• what you didn’t already know about the data
• what an appropriate algorithm should be, given what you now know about the data
• what the measurable expectations of that algorithm should be when it is automated in production

Analytics plays a key role by helping to reduce the size and complexity of big data to a point where it can be effectively visualized and understood. In the best scenario, the visualization and analytics are integrated so that they work seamlessly with each other.

The standard claim is that the observed effect is so large as to obviate the need for having a representative sample. Sorry – the bad news is that a huge effect for a tiny non-random segment of a large population can coexist with no effect for the entire population.

What is good visualization? It is a representation of data that helps you see what you otherwise would have been blind to if you looked only at the naked source. It enables you to see trends, patterns and outliers that tell you about yourself and what surrounds you.

I think it’s the dawn of an exciting new era of info and computer science … It’s a new world in which the ability to understand the world and people and draw conclusions will be really quite remarkable… It’s a fundamentally different way of doing computer science.

After the success of the SVM in solving real-life problems, the interest in statistical learning theory significantly increased. For the first time, abstract mathematical results in statistical learning theory have a direct impact on algorithmic tools of data analysis.

Generally, the systems implementation of machine learning methodology and ongoing software maintenance challenges are an understudied area that will continue to grow in importance as machine learning systems become more commonplace in commercial and open source software.

Information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity. Consequently, when information overload occurs, it is likely that a reduction in decision quality will occur.

Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision making or to see data-oriented competitive threats.

If … we choose a group of social phenomena with no antecedent knowledge of the causation or absence of causation among them, then the calculation of correlation coefficients, total or partial, will not advance us a step toward evaluating the importance of the causes at work.

The truth is, that we need more, not less, data interpretation to deal with the onslaught of information that constitutes big data. The bottleneck in making sense of the world’s most intractable problems is not a lack of data, it is our inability to analyse and interpret it all.

Predictive analytics and cloud solutions are separately changing the way organizations do business, and bringing these two technologies together opens up new horizons. This timely research study will help organizations make investment decisions and effectively plan for the future.

To build successful teams and projects, I strongly believe in the Kaizen approach. Kaizen was made famous in part by Japanese car manufacturers involved in continuous improvement. I believe you should always be looking for ways to improve things, just small things. Just try it out.

We see that machine learning, data mining, data analysis and statistics are all highly ranking skills in the (Data Science Skill) network. This indicates that being able to understand and represent data mathematically, with statistical intuition, is a key skill for data scientists.

Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. From the perspective of this article, the ultimate goal of data science is improving decision making, as this generally is of paramount interest to business.

Analytics is the use of data and related business insights developed through applied analytical disciplines (e.g., statistical, contextual, quantitative, predictive, cognitive and other models) to drive fact-based planning, decisions, execution, management, measurement and learning.

Big Data technologies are being adopted widely for information exploitation, with the help of new analytics tools and large-scale computing infrastructure, to process a huge variety of multi-dimensional data in several areas ranging from business intelligence to scientific exploration.

The aim … is to provide a clear and rigorous basis for determining when a causal ordering can be said to hold between two variables or groups of variables in a model…. The concepts refer to a model (a system of equations) and not to the ‘real’ world the model purports to describe.

So consumers are happy to share personal information as long as they see a “value add” for themselves. And organisations with trust-based information sharing relationships with customers will have significant competitive advantage over those with traditional data gathering relationships.

The ability of deep learning to create features without being explicitly told means that data scientists can save sometimes months of work by relying on these networks. It also means that data scientists can work with more complex feature sets than they might have with machine learning tools.

This kind of mindset is not learned in a university program; it is part of the personality of the individual. Good predictive modelers need to have a forensic mindset and intellectual curiosity, whether or not they understand the mathematics enough to derive the equations for linear regression.

Predictive modeling can be a powerful tool to help businesses see problems and opportunities that are coming their way, but when done poorly, it can lead them down a path of error and uncertainty. Understanding where the pitfalls lie is a must for getting the most out of your analytical models.

There’s been a lot of talk about trying to make AI work on existing infrastructure. But the sad reality is that you’re always going to end up with something that’s far less than state-of-the-art. And I don’t mean it will be 30 or 40 percent slower. It’s more likely to be a thousand times slower.

There is often no need to build single models over immensely large datasets. Good performance can often be achieved by building models on (very) small random parts of the data and then combining them all in an ensemble, thereby avoiding all practical burdens of making large data fit into memory.
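
One hedged way to sketch that in Python, assuming scikit-learn: a bagging ensemble in which each base model is trained on a very small random fraction of the rows, so no single fit ever has to hold the full data set.

```python
# Many small models on tiny random subsets, combined into one ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
ensemble = BaggingClassifier(n_estimators=50, max_samples=0.02, random_state=0)
print(cross_val_score(ensemble, X, y, cv=3).mean())   # each base model sees ~2% of the rows
```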

Digital leaders know their data. They convert their information into actionable business insight. Considering that more data is shared online every second today than was stored in the entire Internet 20 years ago, it’s no wonder that differentiating products and services requires advanced tools.

Reason alone will not serve. Intuition alone can be improved by reason, but reason alone without intuition can easily lead the wrong way … both are necessary. For myself, that’s how my mind works, and that’s how I work … It’s this combination that must be recognized and acknowledged and valued.

As organizations adopt machine-learning techniques, they will see immediate competitive advantages in automation, workflow efficiency and human augmentation. Now is the golden time to consider how machine learning could help your organization and implement this science to improve overall efficiency.

Graphical Perception Experiments find that spatial position (as in a scatter plot or bar chart) leads to the most accurate decoding of numerical data and is generally preferable to visual variables such as angle, one-dimensional length, two-dimensional area, three-dimensional volume, and color saturation.

Interestingly, the number one concern for people in organizations that are just experimenting or planning for big data is the shortage of analytic professionals, which could indicate that organizations yet to take the big data plunge may be held back because of the lack of big data skills available to them.

Sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless.

Big data is not for the faint of heart; you and your team must be willing to master many disciplines in order to be successful. You’ll need an understanding of code, hardware, virtualization, networking, databases (SQL & NoSQL), ETL, cloud, and more. Don’t fool yourself, you’ll need some serious skills on board.

Remember that the most critical thing is not building the analytic solution but making sure that your organization starts using it: that means creating buy-in, working to build adoption, educating and training, and redesigning processes to include analytics. Give it time, be persistent, improve, and results will follow!

The SAP Real-Time Data Platform, with SAP HANA at its core combines Hadoop with SAP Sybase IQ and other SAP technologies to provide a single platform for OLTP and analytics, with common administration, operational management, and lifecycle management support for structured, unstructured, and semistructured data.

Pretty much every application in machine learning constantly deals with uncertainty, and that’s increasingly the trend in computing. Computing is increasingly moving away from computing with logic toward computing with uncertainty, and away from hand-crafted solutions toward solutions that are learned from data.

Algorithms are everywhere, starting with what you receive in your mail, to the ads that follow you online. Recommendation engines are just small evidences of algorithms at work – if you look around, you will realize that a complete transition of the world’s functioning is in full swing, from humans to algorithms.

The ‘information rush’ is producing a sense of urgency; a great deal of opportunity; and spectacular breakthroughs coming from everywhere. Meanwhile, the combination of low statistics literacy and overzealous promotional hype is facilitating dysfunctional data analysis, which is more detrimental than UFO sightings.

Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and problem we can not only make better predictions, but also fill in gaps in our knowledge, and even find answers to questions we hadn’t even thought of yet.

People refer to neural networks as just ‘another tool in your machine learning toolbox’…. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier; they represent the beginning of a fundamental shift in how we write software. They are Software 2.0.

Today, we live in an always-on digital world. We work online. We socialize online. We shop online. We bank online. We support causes online. Not to mention, we drive on toll roads with our EZPasses, go to Disney World with our MagicBands, and check our personal stats with our Fitbits. We are living in a big data world.

This study provides a clear illustration that larger data indeed can be more valuable assets for predictive analytics. This implies that institutions with larger data assets – plus the skill to take advantage of them – potentially can obtain substantial competitive advantage over institutions without such access or skill.

[In Neural Networks] It is not required that a neuron has its outlet connected to the inputs of every neuron in the next layer. In fact, selecting which neurons to connect to which other neurons in the next layer is an art that comes from experience. Allowing maximal connectivity will more often than not result in overfitting.

In real organizations, people need dead-simple story-telling: Which features are you using? How do your algorithms work? What is your strategy? etc. … If your models are not parsimonious enough, you risk losing the audience’s confidence. Convincing stakeholders is a key driver for success, and people trust what they understand.

While there are projects underway to help automate the data cleaning process and reduce the time it takes, the task of automation is made difficult by the fact that the process is as much art as science, and no two data preparation tasks are the same. That’s why flexible, high-level languages like R are a key part of the process.

Although such Business Intelligence is still quite common and does give you at least some insights, the fast-changing world of today requires a different approach. Organisations today should strive for a holistic overview of their internal and external data that is analysed on the spot and returned graphically via live storylines.

There is a very important reason we use math. Our intuitions are often trained from a lifetime of experience; throwing that out in the name of ‘objectivity’ is foolishly excluding important information. But intuition also likes to skip steps, making it very easy to make errors in our reasoning if we don’t pull apart our intuition.

George Box has [almost] said ‘The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.’ These words of caution about ‘natural experiments’ are uncomfortably strong. Yet in today’s world we see no alternative to accepting them as, if anything, too weak.

The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work.

Indeed, in neural networks, we almost always choose our model as the output of running stochastic gradient descent. Appealing to linear models, we analyze how SGD acts as an implicit regularizer. For linear models, SGD always converges to a solution with small norm. Hence, the algorithm itself is implicitly regularizing the solution.
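
A hedged sketch of the linear-model argument, assuming SGD is initialized at zero and the loss gradient for each example is a scalar multiple of that example’s input:

```latex
% Each SGD step moves w along one training example:
%   w_{t+1} = w_t - \eta_t \, g_t \, x_{i_t},
% so, starting from w_0 = 0, the iterate always stays in the span of the data:
w_T \;=\; \sum_{t} \alpha_t \, x_{i_t} \;=\; X^{\top}\alpha .
% If w_T also interpolates the training set, X w_T = y, then X X^{\top}\alpha = y and
w_T \;=\; X^{\top}\bigl(X X^{\top}\bigr)^{-1} y ,
% which is exactly the minimum-norm solution of X w = y: any other solution differs
% from w_T by a component orthogonal to the span of the data, which only adds norm.
```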

The human brain is comprised of 300 million pattern matchers fed with data from our five primary senses and memories. In this age of distributed computing and cheap storage in the cloud, “thinking” without a biological brain is possible for the first time in history. The sensory input into this new, extra-corporeal brain is big data.

The end of data scientists. Data science moves from the specialist to the everyman. Familiarity with data analysis becomes part of the skill set of ordinary business users, not experts with “analyst” in their titles. Organizations that use data to make decisions are more successful, and those that don’t use data begin to fall behind.

On a scale less grand, but probably more common, data-analytics projects reach into all business units. Employees throughout these units must interact with the data-science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business.

What data fusion brings to the table is the idea that end-users, whether they are humans or machines, are brought into the data processing loop as collaborators. By iteratively combining multiple data streams in new and interesting ways, driven by the changing needs of users, data fusion produces a wide variety of ways to aggregate data streams.

At one time we had wisdom, but little knowledge. Now we have a great deal of knowledge, but do we have enough wisdom to deal with that knowledge? I define wisdom as the capacity to make retrospective judgments prospectively. I think these are human qualities, human attributes that need to be brought out, need to be drawn upon, need to be valued.

Improvements in technology and big data trends have given rise to improvements in machine learning. The sheer volume of data is growing exponentially, and companies are looking for faster speeds and real-time analytics. Cognitive computing combines machine learning and artificial intelligence to go beyond data mining and provide actionable insights.

When you’re thinking about artificial intelligence and machine learning at the enterprise level, it’s very important to make a road map. We’re making the same mistakes that we made in the advent of CRM systems, ERP systems, in the introduction of client-server technology – even the introduction of the computer into businesses. We need to learn from that.

Understanding correlation, multivariate regression and all aspects of massaging data together to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge that’s really step one of revealing intelligence…. If you don’t have this, all the data collection and presentation polishing in the world is meaningless.

What is a Prediction Problem? A business problem which involves predicting future events by extracting patterns from the historical data. Prediction problems are solved using statistical techniques, mathematical models, or machine learning techniques. For example: forecasting the stock price for next week, predicting which football team wins the World Cup, etc.

Sciences are primarily defined by their questions rather than by their tools. We define astrophysics as the discipline that learns the composition of the stars, not as the discipline that uses the spectroscope. Similarly, data science is the discipline that describes, predicts, and makes causal inferences, not the discipline that uses machine learning algorithms.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts.

All these new Big Data applications require a new way of working. As a result, General Motors is currently undergoing a massive cultural change to become data-driven; hiring thousands of new employees will have a profound effect on the company culture, but in the end all existing and new employees must learn and adapt to this new, data-driven and information-centric culture.

If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data.

In many applications, particularly in the business domain, the data is not stationary, but rather changing and evolving. This changing data may make previously discovered patterns invalid and as a result, there is clearly a need for incremental methods that are able to update changing models, and for strategies to identify and manage patterns of temporal change in knowledge bases.
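
A minimal sketch of such an incremental method, assuming scikit-learn; the model is updated in place with partial_fit as new batches arrive (the batch loop and the drift term are purely illustrative).

```python
# Incremental learning: update an existing model batch by batch instead of refitting.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])
rng = np.random.default_rng(0)

for day in range(30):                               # pretend each batch is one day of data
    drift = 0.05 * day                              # the underlying pattern slowly changes
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    model.partial_fit(X, y, classes=classes)        # update the existing model in place
```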

Data science is a component of many sciences, including the health and social sciences. Therefore, the tasks of data science are the tasks of those sciences – description, prediction, causal inference. A sometimes overlooked point is that a successful data science requires not only good data and algorithms, but also domain knowledge (including causal knowledge) from its parent sciences.

R is certainly becoming more and more popular, and seems to have found widespread adoption within many statistical research communities. This is a great thing as it means as new statistical methods or practices come out of the research world, they are often implemented and available in R. In many cases they have been written by the person who “wrote the book” (or paper) on a given topic.

The question is no longer whether AI is going to fundamentally change the workplace. According to a recent survey, 85 percent of executives believe that AI will be transformative for their companies, enabling them to enter a new business or gain a competitive advantage. Now, the true question lies in how companies can successfully leverage AI in ways that join, not replace, the human workforce.

1. Think carefully about which projects you take on.
2. Use as much data as you can from as many places as possible.
3. Don’t just use internal customer data.
4. Have a clear sampling strategy.
5. Always use a holdout sample.
6. Spend time on ‘throwaway’ modelling.
7. Refresh your model regularly.
8. Make sure your insights are meaningful to other people.
9. Use your model in the real world.
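
A minimal sketch of item 5 in the list above (always use a holdout sample), assuming scikit-learn: part of the data is set aside before any modelling and scored only once, at the end.

```python
# Keep a holdout sample aside and score it only once, after modelling is done.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_hold, y_hold))
```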

A lot of existing Big Data techniques require you to really get your hands dirty; I don’t think that most Big Data software is as mature as it needs to be in order to be accessible to business users at most enterprises. So if you’re not Google or LinkedIn or Facebook, and you don’t have thousands of engineers to work with Big Data, it can be difficult to find business answers in the information.

Numerous changes and innovations have come to life recently. The pace of the digital revolution is hard to grasp, and it keeps increasing. There is no doubt that most of the approaching digital changes are potentially disruptive to older habits, businesses, and beliefs. They are unconditionally changing the former way of life across the globe, pushing all of humanity into something very new and completely unknown.

There is predictable data as far as the eye can see. Millions of variables quietly tracing the path we thought, and perhaps hoped, they would. Because there are so many, noticing when one of these variables does something unexpected is a task that is unsolvable by diligence alone. In order to spot these rare unexpected observations, we need an often-overlooked statistical analysis: anomaly detection.
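
A hedged sketch of such an anomaly-detection pass, assuming scikit-learn; IsolationForest is just one of several reasonable detectors for spotting the rare unexpected observations automatically.

```python
# Flag the few observations that do not trace the expected path.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 3))        # variables behaving as expected
odd = rng.normal(6, 1, size=(5, 3))              # a few unexpected observations
X = np.vstack([normal, odd])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = detector.predict(X)                     # -1 marks suspected anomalies
print("flagged:", int((labels == -1).sum()))
```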

We now have unsupervised techniques that actually work. The problem is that you can beat them by just collecting more data, and then using supervised learning. This is why in industry, the applications of Deep Learning are currently all supervised. I agree with you that for the search and advertising industry, supervised learning is used because of the vast amounts of data being generated and gathered.

Predictive analytics is an iterative process that begins with an understanding of the question the user wants to answer. By exploring the relationships among different variables using correlation analysis, users can build sophisticated mathematical models that can cut through the complexity of modern computing systems to uncover previously hidden patterns, identify classifications and make associations.
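
A minimal sketch of that “explore relationships first” step, assuming pandas; the column names (load, latency, errors) are illustrative, and a correlation matrix is often the quickest way to see which variables move together before any model is built.

```python
# Inspect pairwise correlations before any modelling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"load": rng.normal(size=500)})
df["latency"] = 2.0 * df["load"] + rng.normal(scale=0.5, size=500)   # related to load
df["errors"] = rng.normal(size=500)                                   # unrelated noise

print(df.corr().round(2))
```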

Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”. If a naive analyst can pick any of a range of methods and does not understand how they work, they will generally pick the one that gives them maximum benefit.

R, an open source programming language for computational statistics, visualization, and data analysis, is becoming a ubiquitous tool in advanced analytics offerings. Nearly every top vendor of advanced analytics has integrated R into its offering so that it can now import R models. This allows data scientists, statisticians, and other sophisticated enterprise users to leverage R within their analytics package.

Data scientists know that it is futile to impose raw math and statistics on people who are not adept at them. The goal is to get an analytics platform into the hands of people who can build the models for use all around the organization. Every analytics platform claims ease of use, but that is not enough. It must be sufficiently powerful to meet the needs of data scientists yet easy enough for LOB staff to use.

People share and put billions of connections into this big graph every day. We don’t want to just add incrementally to that. We want, over the next five or ten years, to take on a road map to try to understand everything in the world semantically and map everything out. These are the big themes for us and is what we are going to try and do over the next five or ten years. That is what I have tried to focus us on …

What do hiring companies consider requirements for being a data scientist? Here is a short list for an honest assessment:
– Are you really good at math, undeterred by calculus, differential equations, and linear algebra? Are you also strong in statistics and probability theory?
– Do you also know R and/or Python for developing machine learning algorithms?
– Do you have deep domain knowledge of a particular industry?

…in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

The Web is so vast … you need to extend categorization and make sense of the content and have a Web ordered for you … One of the key pieces is you have to understand and decide what the Ontology of entities is. Meaning how things are named and how are they organized into hierarchies … By mapping people’s search habits you pull all their content together and have a feed of information that is the web ordered for you.

Some decisions you need to make are big enough to change the course for your business. And your past experiences may not be good predictors of the future. More data are within your reach to understand what was previously unknown. Sophisticated analytical tools are available to you to ‘see’ a wider range of possibilities and evaluate them quickly. Now is a good time for an upgrade in your decision making capabilities.

Unlike a pure statistician, a data scientist is also expected to write code and understand business. Data science is a multi-disciplinary practice requiring a broad range of knowledge and insight. It’s not unusual for a data scientist to explore a fresh set of data in the morning, create a model before lunch, run a series of analytics in the afternoon and brief a team of digital marketers before heading home at night.

Big Data’s undeniable impact on companies’ goodwill and reputation has permeated the landscape of corporate valuation. Recent research confirms that companies need to face the new normal whereby corporate reputations suffer after mishaps with data under their control. Today’s companies must appreciate that their use, misuse and governance of Big Data can have a significant effect on their goodwill and resulting valuation.

The end goal of pervasive analytics is simple and will change the way in which the world operates today. By feeding individuals the right information, at the right time, analytics become invisible and embedded into every application and workflow of every user. It is a vision, a goal, a strategy that every individual across every industry can rally around in order to drive the business metrics that matter through the use of data.

The 4 Types of Data Analytics
• Descriptive: Answers the question, ‘What Happened?’.
• Diagnostic: Commonly used in engineering and sciences to diagnose ‘what went wrong?’.
• Predictive: Used to predict future trends and events based on statistical or mathematical modeling of current and historical data.
• Prescriptive: Used to tell you what to do to achieve a desired result. Based on the findings of predictive analytics.

There is a dizzying array of algorithms from which to choose, and just making the choice between them presupposes that you have sufficiently advanced mathematical background to understand the alternatives and make a rational choice. The options are also changing, evolving constantly as a result of the work of some very bright, very dedicated researchers who are continually refining existing algorithms and coming up with new ones.

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data.

Simply looking at big data (e.g., total offensive or defensive yards) will not provide the right information – and only focusing on the single data point of pass completion percentage will not provide the valuable intelligence to help reach the goal of improving pass completion percentage. Only integrating and analyzing a variety of smaller smart data points will provide the actionable knowledge to make the best possible decisions.

One effect antithetical to exchangeability is “concept drift.” Concept drift is when the meanings and distributions of variables, or the relations between variables, change over time. Concept drift is a killer: if the relations available to you during training are thought not to hold during later application, then you should not expect to build a useful model. This is one of the hard lessons that statistics tries so hard to quantify and teach.
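
One hedged way to watch for this, assuming scipy: compare a feature’s distribution at training time with its distribution at application time and flag large divergences (a two-sample Kolmogorov-Smirnov test is one simple choice among many drift checks).

```python
# Compare the training-time and application-time distributions of one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=5000)   # distribution seen during training
live_feature = rng.normal(loc=0.4, size=5000)    # the same feature later, shifted

result = ks_2samp(train_feature, live_feature)
print(f"KS p-value: {result.pvalue:.2e}")        # tiny p-value -> the distribution has drifted
```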

Current machine learning systems operate, almost exclusively, in a statistical, or model-free mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks.

You do not know how to model? Learn it, dude! There is no short-cut to learning. Your organization needs to learn it, and so do you. Leverage every single person in your organization who has any glimpse of experience in dealing with the data. Combine that quant dude with a domain expert, and let them fight and muddle through the journey. The organization needs it, and so do you, to learn exactly how it brings value to different internal clients.

Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. To monitor the deployment of the data mining result(s), the project needs a detailed plan on the monitoring process. This plan takes into account the specific type of deployment.

In any case, I come to the conclusion that Data Science is just another term in a long-line of terms. Whether called statistics or customer analytics or data mining or analytics or data science, the goal is the same. Computers have been and are gathering incredible amounts of data about people, businesses, markets, economies, needs, desires, and solutions – there will always be people who take up the challenge of transforming the data into solutions.

For a long time, Predictive Analytics has been primarily the responsibility of the Data Science and Analytics team, but this outlook is changing fast. While the Data Science team still remains the primary contributor, the responsibility is increasingly being shared with database management, BI, LOB (Line of Business) analysts, and others. This clearly demonstrates the need for better training and support for the non-technical users of Predictive Analytics.

Without effective data governance and data management, big data can mean big problems for many organizations already struggling with more data than they can handle. That “lake” they are building can very easily become a “cesspool” without appropriate data management practices that are adapted to this new platform. The solution? Firms need to actively adapt their data governance and data management capabilities – from implementing to ongoing maintenance.

Data science, surprisingly perhaps, is not about designing the most advanced machine learning algorithms and training them on all of the data (and then having Skynet). It’s about finding the right data, becoming a quasi-expert on the process, system, or event you are trying to model, and crafting features that will help quirky and sometimes frail statistical algorithms make accurate predictions. Very little time is actually spent on the algorithm itself.

Start small and go big: Analytical projects should not be planned across an entire company, or even division-wide. Initial pilots should focus on small, identifiable challenges and work to resolve those challenges. Once a project has been successfully piloted and measured, other teams within the organization will see the value in the new analytical technologies, and also understand the organizational changes required to adopt a new mindset and technology.

A machine isn’t a human. It’s not going to necessarily incorporate bias, even from biased training data, in the same way that a human would. Machine learning isn’t necessarily going to adopt (for lack of a better word) a clearly racist bias. It’s likely to have some kind of much more nuanced bias that is far more difficult to predict. It may, say, come up with very specific instances of people it doesn’t want to hire that may not even be related to human bias.

Data sources such as social media text, messaging, log files (such as clickstream data), and machine data from sensors in the physical world present the opportunity to pick up where transaction systems leave off regarding underlying sentiment driving customer interactions; external events or trends impacting institutional financial or security risk; or adding more detail regarding the environment in which supply chains, transport or utility networks operate.

We should expect a ‘Big Data 2.0’ phase to follow ‘Big Data 1.0’. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: ‘What can I do now that I couldn’t do before, or do better than I could before?’ This is likely to be the golden era of data science. The principles and techniques (introduced currently, e.g., due to ‘Predictive Analytics’ and HANA) will be applied far more broadly and deeply than they are today.

Hadoop is well suited for simple parallel problems but it comes up short for large-scale complex analytics. A growing number of complex analytics use cases are proving to be unworkable in Hadoop. Some examples include recommendation engines based on millions of customers and products, running massive correlations across giant arrays of genetic sequencing data and applying powerful noise reduction algorithms to finding actionable information in sensor and image data.

The intuition behind ‘latent semantic analysis’ (LSA) is to find the latent structure of ‘topics’ or ‘concepts’ in a text corpus, which captures the meaning of the text that is imagined to be obscured by “word choice” noise. The term ‘latent semantic analysis’ has been coined by Deerwester et al. who empirically showed that the co-occurrence structure of terms in text documents can be used to recover this latent topic structure, notably without any usage of background knowledge.
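
A minimal LSA sketch, assuming scikit-learn: a TF-IDF term-document matrix reduced with a truncated SVD, so each document becomes a mixture of a few latent topics. The toy corpus is purely illustrative.

```python
# LSA in two steps: term-document matrix, then a low-rank SVD of it.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are assets",
    "investors trade stocks",
]
X = TfidfVectorizer().fit_transform(docs)        # term co-occurrence structure
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)                # each document as a mix of 2 latent topics
print(doc_topics.round(2))
```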

Be the Google of Your Organization: Everybody understands that data is key to Google’s dominance. Companies across every industry are trying to establish themselves as the owners and users of the best data available. Watching this data gold rush, it can be easy to forget about the opportunities much closer to home. Within your company, some people are using data to create and advocate effective strategic arguments. How do you ensure that you are a Google within your company, instead of an Alta Vista?

Analysts will need a proper understanding of math, statistics, algorithms, and other related sciences in order to deliver meaningful results. They must pair that theoretical knowledge with a firm grasp of the modern-day tools that make the analyses possible. That means having an ability to express queries in terms of MapReduce or some other distributed system, an understanding of how to model data storage across different NoSQL-style systems, and familiarity with libraries that implement common algorithms.
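
A hedged, standard-library-only illustration of “expressing a query in terms of MapReduce”: word count written as a map step, a shuffle by key, and a reduce step. Real frameworks distribute these phases across machines; the structure of the computation is the same.

```python
# Word count in MapReduce shape: map, shuffle by key, reduce.
from collections import defaultdict
from functools import reduce

documents = ["big data is big", "data beats opinion"]

# map: emit (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce: sum the counts for each key
counts = {word: reduce(lambda a, b: a + b, values) for word, values in groups.items()}
print(counts)   # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinion': 1}
```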

You believe in a God who plays dice, and I in complete law and order in a world which objectively exists, and which I, in a wildly speculative way, am trying to capture. I firmly believe, but hope that someone will discover a more realistic way, or rather a more tangible basis than it has been my lot to do. Even the great initial success of the quantum theory does not make me believe in the fundamental dice game, although I am well aware that your younger colleagues interpret this as a consequence of senility.

Data Science forms the very substratum of an Analytics Practitioner’s work; it’s what sets us apart from Statisticians or Mathematicians. However, in some instances we cannot rely on it alone; we need to employ other measures to increase its definitiveness. In any event, I am sure many Data Scientists use math and other means to augment the potency of their Analytics, some not even scientific at all. It is undeniably prudent to do so where necessary, especially in fields that demand a higher standard of accuracy and care.

‘The end of the Data Scientist Bubble’. This was the subject of a provocative article posted on Oracle’s blog, two days ago. It certainly shows how far from the reality some big companies are. They confuse people who call themselves data scientists (or get assigned that job title), with those who are true data scientists, and might use a different job title. Many times, the issue is internal politics that create the confusion, and not recognizing a real data scientist with success stories to share, or not leveraging them.

Forecasting is hard, and even those who sometimes get it right, often fail on a continuous basis. But fear not, there are three steps you can take to drastically improve your forecast accuracy, but you’ll have to be willing to put in the work, and possibly put your ego aside to get there.
1) First, understand that domain knowledge of a particular area doesn’t necessarily mean you’ll see the future better than anyone else.
2) Second, if you want better forecasts, run your expert opinions by others.
3) Third, bring your data – in fact, bring all of them.

Once the often laborious task of data munging is complete, the next step in the data science process is to become intimately familiar with the data set by performing what’s called Exploratory Data Analysis (EDA). The way to gain this level of familiarity is to utilize the features of the statistical environment you’re using (R, Matlab, SAS, Python, etc.) that support this effort – numeric summaries, aggregations, distributions, densities, reviewing all the levels of factor variables, applying general statistical methods, exploratory plots, and expository plots.
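
A minimal EDA sketch along those lines, assuming pandas (and matplotlib for the plot); the file name orders.csv and the columns region and amount are hypothetical stand-ins for whatever data set you are getting to know.

```python
# A first EDA pass: numeric summaries, factor levels, a simple aggregation, a plot.
import pandas as pd

df = pd.read_csv("orders.csv")                   # hypothetical data set

print(df.describe(include="all"))                # numeric summaries and counts
print(df["region"].value_counts())               # levels of a factor variable (assumed column)
print(df.groupby("region")["amount"].mean())     # a simple aggregation (assumed columns)
df["amount"].hist(bins=30)                       # exploratory distribution plot
```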

In the fast moving world of today, data is being created at lightning speed. Data comes from an infinite variety of sources and all this data can be used to discover valuable business insights. Combining internal and external data can enable organisations to beat the competition, as the analysis will provide valuable insights. The more business users that work with such insights, the better your organisation will become. Organisations should therefore strive for a data-driven, information-centric culture, where every business user makes decisions based on data.

Analytics, no matter how advanced, do not remove the need for human insight. On the contrary, there is a compelling need for skilled people with the ability to understand data, think from the business point of view, and come up with insights. For this very reason, technology professionals with analytics skills are finding themselves in high demand as businesses look to harness the power of Big Data. A professional with analytical skills can master the ocean of Big Data and become a vital asset to an organization, boosting the business and their career.

I can see at least two options where methods from Data Science will benefit from Linked Data technologies and vice versa:

– Machine learning algorithms benefit from the linking of various data sets by using ontologies and common vocabularies as well as reasoning, which leads to a broader data basis with (sometimes) higher data quality

It is important to understand data science even if you never intend to do it yourself, because data analysis is now so critical to business strategy. Businesses increasingly are driven by data analytics, so there is great professional advantage in being able to interact competently with and within such businesses. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats.

The mechanical process by which data scientists and citizen data scientists make better use of data and analytics is underpinned by a deeper question about the organization as a whole: Does it have processes for sharing anything? This is not always a given in companies that have grown quickly, have grown through mergers and acquisitions or have begun to shrink. If the culture has never embraced or fostered the notion of transparency and sharing, then whatever process the company may put in place to use software to publish analytical models and the data they harvest is unlikely to succeed.

Robert Morison, lead faculty member for the International Institute for Analytics, provided three reasons businesses experience big data failures. Briefly, they are as follows:
1. As cited in the piece, clinging to a traditional IT project management style. Solution: Think R&D.
2. Businesses are taken in by the hype and make their first big data project a big deal. Solution: Businesses should start with a smaller project that will “move the proverbial needle.”
3. Reasonably good analytics are done, but they are not adopted. Solution: The business has to own the problem or the ambition to improve.

When data is locked in silos, organizations are unable to find and include all enterprise data for use with big data analytics tools. Planning to implement a data-centric data management strategy enables the distributed metadata repository to be a source for analytics tools, as it can be used to provide real-time insight without having to migrate data from silos to a separate analytics platform. It also enhances the quality of results, because having more relevant data often produces more accurate analysis. If organizations can harness all of their data, they will attain a greater competitive advantage.

Big data have the potential to improve or transform existing business operations and reshape entire economic sectors. Big data can pave the way for disruptive, entrepreneurial companies and allow new industries to emerge. The technological aspect is important, but insufficient to allow big data to show their full potential and to stop companies from feeling swamped by this information. What matters is to reshape internal decision-making culture so that executives base their judgments on data rather than hunches. Research already indicates that companies that have managed this are more likely to be productive and profitable than the competition.

Success in today's data-oriented business environment requires being able to think about how these fundamental concepts (Data Mining, Predictive Analytics) apply to particular business problems – to think data-analytically. Data should be thought of as a business asset, and once we are thinking in this direction we start to ask whether (and how much) we should invest in data. Thus, an understanding of these fundamental concepts is important not only for data scientists themselves, but for anyone working with data scientists, employing data scientists, investing in data-heavy ventures, or directing the application of analytics in an organization.

In the field of ‘big data’, Gartner identified five different types of data source used to ‘exploit big data’ in a company (Buytendijk et al., 2013): ‘Operational data comes from transaction systems, the monitoring of streaming data and sensor data; Dark data is data that you already own but don’t use: emails, contracts, written reports and so forth; Commercial data may be structured or unstructured, and is purchased from industry organisations, social media providers and so on; Social data comes from Twitter, Facebook and other interfaces; Public data can have numerous formats and topics, such as economic data, socio-demographic data and even weather data.’

Our Collective Data Science Duty: Here’s the thing, technology is empowering the public in never before seen ways, and data is the backbone of that shift. Between wearable tech and digital identity platforms, people are creating more data every day than has ever been created in decades, no, centuries past. Each of us is essentially our own personal data scientist, and those working in the digital space have very much been their own statisticians for quite some time. It’s why platforms like Google Analytics, Omniture and more are so popular across the industry. They put the power of analytics in the hands of users, requiring little training but returning lots of measurability.

An agile environment is one that’s adaptive and promotes evolutionary development and continuous improvement. It fosters flexibility and champions fast failures. Perhaps most importantly, it helps software development teams build and deliver optimal solutions as rapidly as possible. That’s because in today’s competitive market chock-full of tech-savvy customers used to new apps and app updates every day and copious amounts of data with which to work, IT teams can no longer respond to IT requests with months-long development cycles. It doesn’t matter if the request is from a product manager looking to map the next rev’s upgrade or a data scientist asking for a new analytics model.

What was once just a figment of the imagination of some of our most famous science fiction writers, artificial intelligence (AI) is taking root in our everyday lives. We’re still a few years away from having robots at our beck and call, but AI has already had a profound impact in more subtle ways. Weather forecasts, email spam filtering, Google’s search predictions, and voice recognition, such as Apple’s Siri, are all examples. What these technologies have in common are machine-learning algorithms that enable them to react and respond in real time. There will be growing pains as AI technology evolves, but the positive effect it will have on society in terms of efficiency is immeasurable.

Data science done well looks easy – and that is a big problem for data scientists. The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really ‘hard’ and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc. In the meantime, don’t be fooled by what looks like simple data science – it can often be pretty effective.

Data Scientists and automation (data products, algorithms, production code, whatever) are complementary functions. Good Data Science supports automation. It quickly adds value by investigating, testing, and quantifying hypotheses about existing data and potential new data. Simply switching on software ignores the reality of working with data, regardless of the claims of that software. Data is full of nuances, errors and unknown relationships that are best discovered and tested by an expert Data Scientist. This takes time and does not scale but it does not have to scale. It is the necessary prudent investment that you make before spending months in product development and automation of the wrong algorithm on the wrong or broken data.

One clear conclusion of this work is that, although the DL technology has achieved very promising results, there is still a significant need for further research into and development in how to easily and efficiently build high-quality production-ready DL systems. Traditional SE has high-quality tools and practices for reviewing, writing tests, and debugging code. However, they are rarely sufficient for building production-ready systems containing DL components. If the SE community, together with the DL community, could make an effort in finding solutions to these challenges, the power of the DL technology could not only be made available to researchers and large technology companies, but also to the vast majority of companies around the world.

Top takeaways from my interviews with experts from organizations offering AI products and services:
• AI is too big for any single device or system
• AI is a distributed phenomenon
• AI will deliver value to users through devices, but the heavy lifting will be performed in the cloud
• AI is a two-way street, with information passed back and forth between local devices and remote systems
• AI apps and interfaces will be designed and engineered increasingly for nontechnical users
• Companies will incorporate AI capabilities into new products and services routinely
• A new generation of AI-enriched products and services will be connected and supported through the cloud
• AI and the cloud will become a standard combination, like peanut butter and jelly

The component of prediction tasks that can be easily automated is the one that does not involve any expert knowledge. Prediction tasks require expert knowledge to specify the scientific question (what inputs and what outputs) and to identify/generate relevant data sources. (The extent of expert knowledge varies across different prediction tasks.) However, no expert knowledge is required for prediction after the inputs and outputs are specified and measured in a particular dataset. At this point, a machine learning algorithm can take over the data analysis to deliver a mapping and quantify its performance. The resulting mapping may be opaque, as in many deep learning applications, but its ability to map the inputs to the outputs with a known accuracy is not in question.

On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. This requires a close interaction between the data scientists and the business people responsible for decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.

If we think of training the model as a part of it, then even after you’ve trained a model and evaluated it and found it to be good by some evaluation metric standards, when you deploy it, where it actually goes and faces users, then there’s a different set of metrics that would impact the users. You might measure: how long do users actually interact with this model? Does it actually make a difference in the length of time? Did they used to interact less and now they’re more engaged, or vice versa? That’s different from whatever evaluation metric that you used, like AUC or per class accuracy or precision and recall. … It’s probably not enough to just say this model has a .85 F1 score and expect someone who has not done any data science to understand what that means. How good are the results? What does it actually mean to the end users of the product?
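A quick way to see the gap between the two kinds of metrics (a minimal sketch, with made-up labels and engagement numbers; scikit-learn assumed available):

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Hypothetical offline evaluation: true labels vs. model predictions.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))

    # The user-facing metric for the same model might be something else entirely,
    # e.g. average session length before vs. after deployment (invented numbers).
    before, after = 4.2, 5.1  # minutes per session
    print("engagement lift: %.0f%%" % (100 * (after - before) / before))

Reporting both, side by side, is one way to answer the question of what an F1 score actually means to the end users.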

The majority of organizations have barely moved beyond static BI reports and are unaware of the actual potential their data holds. Going from being ‘data unaware’ to investing in a big data architecture in one leap sets a company up for a bad ROI in analytics. A solid investment must begin with understanding what data is actually available and identifying the low hanging ‘data fruits’ that can lead to real value for the company. This can be used to build lightweight solutions that are imperfect but hugely beneficial. It can provide real-world tools for decision support via recommended actions or highlighted opportunities in real-time. It can offload much of the routine repetitive decision-making to algorithms so that professionals can operate with a more strategic view of the organization and bring their creative talents to their company’s challenges.

Managing big data for analytics is not the same as managing DW data for reporting. In fact, the two are almost opposites … . For example, reporting is about seeing the latest values of the numbers that you track over time via a report. Obviously, you know the report, the business entities it represents, and the data warehouse that feeds the report. An analysis is more about discovering variables you don’t know, based on data that you probably don’t know very well. Also, a report requires a solid audit trail, so its data must be managed with well-documented metadata and possibly master data, too. Since most analyses have no expectation of an audit trail, there’s no need to manage one. That’s just a sampling of the differences. The point is to embrace Big Data Management for analytics as a unique practice that doesn’t follow all the strict rules we’re taught for reporting and data warehousing.

A different perspective on what data scientists are capable of:
• Imagine dozens of scenarios and rank them by chance of occurring
• Get siloed data from various departments (finance, sales, marketing, product, IT)
• Analyze the data in connection with the scenarios (including checking data validity)
• Get external data (competitive intelligence) as needed
• Find the causes (not just correlations)
• Find the remedies
• Detect issues well before anyone else can see them, by looking at summary data
• Complete the analysis within a 48-hour turnaround
Such a data scientist, who could save a company billions, is usually not hired, for the following reasons:
• Companies are looking for coders, not business solvers, when they hire a data guru, despite claiming the contrary
• A data scientist without Python on his resume is unlikely to ever get hired
• Hard work gets rewarded, smart work does not.

Today’s information rush is exemplified by the great promise of overflowing observational data, hyper communications, and the approaching Internet of Things. The promotional hype initially comes from journals, self-glorifying books, and vendors, all with a certain perspective that is not informed by practical experience—publishers are unable to discern qualifications. This creates misinformation stampedes, with energized statistics deniers writing amplifying blogs, presentation decks, et al., which further mischaracterize and even adulterate statistics. The downstream echoes talk everyone into believing their own hyped fabrications. Two of the problems are that 1. Selling good statistics practice can be less lucrative than cutting some serious corners; and 2. Promoting services, workshops, data-analysis results, etc. is easier when not encumbered by competently wielding and accurately depicting statistics.

People like simple explanations for complex phenomena. If you work as a data scientist, or if you are planning to become/hire one, you’ve probably seen storytelling listed as one of the key skills that data scientists should have. Unlike “real” scientists that work in academia and have to explain their results mostly to peers who can handle technical complexities, data scientists in industry have to deal with non-technical stakeholders who want to understand how the models work. However, these stakeholders rarely have the time or patience to understand how things truly work. What they want is a simple hand-wavy explanation to make them feel as if they understand the matter – they want a story, not a technical report (an aside: don’t feel too smug, there is a lot of knowledge out there and in matters that fall outside of our main interests we are all non-technical stakeholders who get fed simple stories).

Descriptive Analytics: insight into the past

These use data aggregation and data mining techniques to provide insight into the past and answer: “What has happened?”
Use descriptive statistics when you need to understand, at an aggregate level, what is going on in your company, and when you want to summarize and describe different aspects of your business.

Predictive Analytics: understanding the future

These use statistical models and forecasting techniques to understand the future and answer: “What could happen?”
Use predictive analytics any time you need to know something about the future, or to fill in information that you do not have.

Prescriptive Analytics: advice on possible outcomes

These use optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?”
Use prescriptive analytics any time you need to provide users with advice on what action to take.
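A minimal sketch of the three levels on toy weekly sales figures (all numbers invented, and the “models” deliberately naive stand-ins):

    import statistics

    sales = [120, 135, 128, 150, 160, 155, 170]  # last seven weeks

    # Descriptive: what has happened?
    print("mean weekly sales:", statistics.mean(sales))
    print("best week:", max(sales))

    # Predictive: what could happen? (a naive trend line stands in for a real model)
    trend = (sales[-1] - sales[0]) / (len(sales) - 1)
    forecast = sales[-1] + trend
    print("naive forecast for next week:", round(forecast, 1))

    # Prescriptive: what should we do? (a toy rule stands in for an optimizer)
    capacity = 165
    print("action:", "add stock" if forecast > capacity else "hold stock")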

Two decades ago the folks who prepared our reports, graphs, and visualizations were ‘data analysts’ who knew how to extract data from relational data warehouses and run it through reporting and visualization tools like Crystal Reports. Ten years ago, predictive models were built by ‘predictive modelers’ who understood both the extraction and preparation of the data as well as the specialized predictive analytic tools like SAS and SPSS that allowed them to prepare predictive models. In the last few years, Gartner has declared that we need ‘data scientists’ who have all the above skills but also understand the complexities of the new NoSQL databases like Hadoop and can marry data from many sources and types together to produce useful and profitable predictive models. The requirement for broader and deeper skills is real and must factor into any business decision to build in-house capacity, as well as vetting potential consultants.

There is no general rule dictating how organizations should navigate the stages of big data maturity. They must each decide for themselves, based on their own situation – the competitive environment they are operating in, their business model, and their existing internal capabilities. In less-advanced sectors, with executives still grappling with existing data, making intelligent use of what they already possess may have a substantial impact on decision making.
The main priorities for executives are to:
• develop a clear (big) data strategy;
• prove the value of data in pilot schemes;
• identify the owner for “big data” in the organization and formally establish a “Chief Data Scientist” position (where applicable);
• recruit/train talent to ask the right questions and technical personnel to provide the systems and tools to allow data scientists to answer those questions;
• position big data as an integral element of the operating model; and establish a data-driven decision culture and launch a communication campaign around it.

The Last Mile of Analytics
1. Succeeding in Analytics and getting to the Last Mile of embedding Analytics in decision-making is not just about dealing with data. The journey from Data to Decisions requires one to look at how Analytics is operationalized in the business. And the missing piece that actually can drive or enable Analytics is not data but Culture. Can culture be enabled by Analytics?
2. To make Analytics actually a part of the Company culture, it’s not enough to have a set of people providing Analytics solutions. Analytics needs to be embedded in technology accelerators that can directly enable decisions at the point of action. This makes Analytics accountable for real business decisions rather than just providing more data.
3. Making Analytics work across the business requires collaboration and the right choice of Engagement Model which would vary based on the maturity of the organization and its decision-making, not its data needs. The business models could range from Products, Services, and Managed Solutions to Analytic Marketplaces. Unless the right business model is chosen, Analytics will remain a discrete project.

Pattern Analytics can be defined as a discipline of Big Data that enables business leaders to understand how different variables of the business interact and are linked with each other. Variables can be of any kind and within any data source, structured as well as unstructured. Such patterns can indicate opportunities for innovation or threats of disruption for your business and therefore require action. Finding patterns within the data and sifting them out is difficult. Machine learning can help us humans find patterns that are relevant, but too difficult for us to see. This enables organizations to find patterns they can act on. Business leaders can learn from these patterns and use them in their decision-making process. Business leaders therefore should rely less on their gut feeling and years of experience, and more on the data. Pattern Analytics does not require predefined models; the algorithms will do the work for you and find whatever is relevant in a combination of large sets of data. The key with pattern analytics is automatically revealing intelligence that is hidden in the data, and these insights will help you grow your business.
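As a much simpler stand-in for full pattern analytics, a plain correlation matrix already shows how business variables are linked (a minimal sketch, assuming pandas is available; the columns and numbers are invented):

    import pandas as pd

    df = pd.DataFrame({
        "ad_spend":    [10, 12, 9, 15, 14, 20, 18],
        "site_visits": [200, 230, 190, 280, 260, 360, 330],
        "returns":     [5, 4, 6, 4, 5, 3, 4],
    })

    # Pairwise linear relationships between the variables.
    print(df.corr().round(2))

Real pattern analytics goes further (nonlinear relationships, unstructured sources, automated discovery), but the reading habit is the same: surface the links first, then decide which ones deserve action.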

There’s structure in it, but it’s kind of a different form. … It’s spit out by machines and programs. There’s structure, but that structure is difficult to understand for humans. … So, you can’t just throw all of it into an algorithm and expect the algorithm to be able to make sense of it. You really have to process the features, do a lot of pre-processing, and first do things like extract out the frequent sequences, maybe, or figure out what’s the right way to represent IP addresses, for instance. Maybe you don’t want to represent latency by the actual latency number, which could have a very skewed distribution, with lots and lots of large numbers. You might want to assign them into bins or something. There are a lot of things that you need to do to get the data into a format that’s friendly to the model, and then you want to choose the right model. Maybe after you choose the model, you realize this model really is suitable for numeric data and not categorical data. Then you need to go back to the feature engineering part and figure out the best way to represent the data. … I hesitate to say anything critical because half of my friends are in machine learning, which is all about algorithms. I think we already have enough algorithms. It’s not that we don’t need more and better algorithms. I think a much, much bigger challenge is data itself, features, and feature engineering.
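The binning idea from the quote is easy to make concrete (a minimal sketch with invented latency values and bucket edges, assuming NumPy):

    import numpy as np

    latencies_ms = np.array([12, 35, 48, 90, 250, 1800, 40, 22, 5000, 75])

    edges = [0, 50, 200, 1000, np.inf]            # hypothetical bucket boundaries
    labels = ["fast", "ok", "slow", "very_slow"]

    # Replace the skewed raw latency with a coarse, model-friendly category.
    bucket_ids = np.digitize(latencies_ms, edges) - 1
    print([labels[i] for i in bucket_ids])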

It’s not enough to tell someone, ‘This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.’ As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. … It really comes back to this big divide, this bottleneck, between the domain expert and the machine learning expert. I saw that as the most challenging problem facing us when we try to really make machine learning widely applied in the world. I saw both machine learning experts and domain experts as being difficult to scale up. There’s only a few of each kind of expert produced every year. I thought, how can I scale up machine learning expertise? I thought the best thing that I could do is to build software that doesn’t take a machine learning expert to use, so that the domain experts can use them to build their own applications. That’s what prompted me to do research in automating machine learning while at MSR [Microsoft Research].

Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering — uncertainty and complexity — and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity — a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism — examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages — in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.
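In symbols, the modularity described above is just a factorization of the joint distribution: for a directed graphical model over variables x_1, ..., x_n,

    p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid \mathrm{pa}(x_i)\right),

where pa(x_i) denotes the parents of x_i in the graph. A hidden Markov model, for instance, is the special case of a chain of hidden states z_t, each emitting an observation x_t:

    p(x_{1:T}, z_{1:T}) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \prod_{t=1}^{T} p(x_t \mid z_t).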

Today, you are much less likely to face a scenario in which you cannot query data and get a response back in a brief period of time. Analytical processes that used to require months, days, or hours have been reduced to minutes, seconds, and fractions of seconds. But shorter processing times have led to higher expectations. Two years ago, many data analysts thought that generating a result from a query in less than 40 minutes was nothing short of miraculous. Today, they expect to see results in under a minute. That’s practically the speed of thought – you think of a query, you get a result, and you begin your experiment. “It’s about moving with greater speed toward previously unknown questions, defining new insights, and reducing the time between when an event happens somewhere in the world and someone responds or reacts to that event,” says Erickson. A rapidly emerging universe of newer technologies has dramatically reduced data processing cycle time, making it possible to explore and experiment with data in ways that would not have been practical or even possible a few years ago. Despite the availability of new tools and systems for handling massive amounts of data at incredible speeds, however, the real promise of advanced data analytics lies beyond the realm of pure technology. “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse,” says Michael Minelli, co-author of Big Data, Big Analytics. “It’s about the ability to make better decisions and take meaningful actions at the right time. It’s about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing on a checkout line, or placing an ad on a website while someone is reading a specific article. It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place.” For some, real-time big data analytics (RTBDA) is a ticket to improved sales, higher profits and lower marketing costs. To others, it signals the dawn of a new era in which machines begin to think and respond more like humans.

There are a few trends in the Big Data and Data Science world that can be of interest to market researchers:
• Visualization. There is a lot of interest in the Big Data and Data Science world for everything that has to do with Visualization. I’ll admit that sometimes it is Visualize to Impress rather than to Inform, but when it comes to informing clearly, communicating in a simple and understandable way, storytelling, and so on, we market researchers have a head start.
• Natural Language Processing. One of the 4 V’s of Big Data stands for Variety. Very often this refers to unstructured data, which sometimes refers to free text. Big Data and Data Science folks, for instance, are starting to analyze text that is entered in the free fields of production systems. This problem is not dissimilar to what we do when we analyse open questions. Again market research has an opportunity to play a role here. By the way, it goes beyond sentiment analysis. Techniques that I’ve seen successfully used in the Big Data / Data Science world are topic generation and document classification. Think about analysing customer complaints, for instance.
• Deep Learning. Deep learning risks becoming the next fad, largely because of the name Deep. But deep here does not refer to profound, but rather to the fact that you have multiple hidden layers in a neural network. And a neural network is basically a logistic regression (OK, I simplify a bit here). So absolutely no magic here, but absolutely great results. Deep learning is a machine learning technique that tries to model high-level abstractions by using so-called learning representations of data, where data is transformed to a representation that is easier to use with other Machine Learning techniques. A typical example is a picture that consists of pixels. These pixels can be represented by more abstract elements such as edges, shapes, and so on. These edges and shapes can in turn be further represented by simple objects, and so on. In the end, this leads to systems that can describe pictures reasonably well in broad terms, which is nonetheless useful for practical purposes, especially when processing by humans is not an option. How can this be applied in Market Research? Already today (shallow) neural networks are used in Market Research. One research company I know uses neural networks to classify products sold in stores into broad buckets such as petfood, clothing, and so on, based on the free-field descriptions that come with the barcode data that the stores deliver.
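The product-bucketing example at the end of that list maps to a few lines of code (a minimal sketch with invented product descriptions; scikit-learn assumed available, and a small neural network stands in for whatever that research company actually uses):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    descriptions = [
        "dry dog food chicken 2kg", "cat litter clumping 10l",
        "mens cotton t-shirt blue", "womens running shoes size 38",
    ]
    buckets = ["petfood", "petfood", "clothing", "clothing"]

    # Bag-of-words features feeding one small hidden layer.
    model = make_pipeline(
        TfidfVectorizer(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    )
    model.fit(descriptions, buckets)

    print(model.predict(["wet cat food tuna"]))  # expected: petfood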

Here are some rules of thumb for what makes a good metric: a number that will drive the changes you’re looking for.

A good metric is comparative.

Being able to compare a metric to other time periods, groups of users, or competitors helps you understand which way things are moving. “Increased conversion from last week” is more meaningful than “2% conversion”.
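In code, the same point is a line of arithmetic (a minimal sketch with invented traffic and signup counts):

    visits_last, signups_last = 12500, 250
    visits_now, signups_now = 14000, 322

    conv_last = signups_last / visits_last   # 2.0%
    conv_now = signups_now / visits_now      # 2.3%

    print("conversion this week: %.1f%%" % (100 * conv_now))
    print("change vs last week: %+.1f points" % (100 * (conv_now - conv_last)))

The second number is the one worth discussing in a meeting; the first means little on its own.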

A good metric is understandable.

If people can’t remember it and discuss it, it’s much harder to turn a change in the data into a change in the culture.

A good metric is a ratio or a rate.

Accountants and financial analysts have several ratios they look at to understand, at a glance, the fundamental health of a company. You need some, too.

There are several reasons ratios tend to be the best metrics:

1 Ratios are easier to act on. Think about driving a car. Distance travelled is informational. But speed (distance per hour) is something you can act on, because it tells you about your current state, and whether you need to go faster or slower to get to your destination on time.

2 Ratios are inherently comparative. If you compare a daily metric to the same metric over a month, you’ll see whether you’re looking at a sudden spike or a long-term trend. In a car, speed is one metric, but speed right now over average speed this hour shows you a lot about whether you’re accelerating or slowing down.

3 Ratios are also good for comparing factors that are somehow opposed, or for which there’s an inherent tension. In a car, this might be distance covered divided by traffic tickets. The faster you drive, the more distance you cover, but the more tickets you get. This ratio might suggest whether or not you should be breaking the speed limit.

A good metric changes the way you behave.

This is by far the most important criterion for a metric: what will you do differently based on changes in the metric?

1 “Accounting” metrics like daily sales revenue, when entered into your spreadsheet, need to make your predictions more accurate. These metrics form the basis of Lean Startup’s innovation accounting, showing you how close you are to an ideal model and whether your actual results are converging on your business plan.

2 “Experimental” metrics, like the results of a test, help you to optimize the product, pricing, or market. Changes in these metrics will significantly change your behavior. Agree on what that change will be before you collect the data: if the pink website generates more revenue than the alternative, you’re going pink; if more than half your respondents say they won’t pay for a feature, don’t build it; if your curated MVP doesn’t increase order size by 30%, try something else. Drawing a line in the sand is a great way to enforce a disciplined approach.

A good metric changes the way you behave precisely because it’s aligned to your goals of keeping users, encouraging word of mouth, acquiring customers efficiently, or generating revenue.

If you want to choose the right metrics, you need to keep five things in mind:

1 Qualitative versus quantitative metrics

Qualitative metrics are unstructured, anecdotal, revealing, and hard to aggregate; quantitative metrics involve numbers and statistics, and provide hard numbers but less insight.

2 Vanity versus actionable metrics

Vanity metrics might make you feel good, but they don’t change how you act. Actionable metrics change your behavior by helping you pick a course of action.

3 Exploratory versus reporting metrics

Exploratory metrics are speculative and try to find unknown insights to give you the upper hand, while reporting metrics keep you abreast of normal, managerial, day-to-day operations.

4 Leading versus lagging metrics

Leading metrics give you a predictive understanding of the future; lagging metrics explain the past. Leading metrics are better because you still have time to act on them: the horse hasn’t left the barn yet.

5 Correlated versus causal metrics

If two metrics change together, they’re correlated, but if one metric causes another metric to change, they’re causal. If you find a causal relationship between something you want (like revenue) and something you can control (like which ad you show), then you can change the future.

Analysts look at specific metrics that drive the business, called key performance indicators (KPIs). Every industry has KPIs: if you’re a restaurant owner, it’s the number of covers (tables) in a night; if you’re an investor, it’s the return on an investment; if you’re a media website, it’s ad clicks; and so on.

1. The nature of statistics
Statistics is the original computing with data. It is the field that deals with data with the most portability (it isn’t dependent on one type of physical model) and rigor. Statistics can be a pessimal field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data. Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.
I often say the best source of good statistical work is bad experiments. If all experiments were well conducted, we wouldn’t need a lot of statistics. However, we live in the real world; most experiments have significant shortcomings and statistics is incredibly valuable.
Another aspect of statistics is that it is the only field that really emphasizes the risks of small data. There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets. This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.
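Simpson's paradox, mentioned above, fits in a few lines (a minimal sketch with invented counts: option A has the higher success rate inside each group, yet B looks better once the lopsided groups are pooled):

    groups = {
        "group 1": {"A": (8, 10),   "B": (70, 100)},   # (successes, trials)
        "group 2": {"A": (20, 100), "B": (1, 10)},
    }

    totals = {"A": [0, 0], "B": [0, 0]}
    for name, arms in groups.items():
        for arm, (s, n) in arms.items():
            totals[arm][0] += s
            totals[arm][1] += n
            print(f"{name} {arm}: {s}/{n} = {s / n:.0%}")

    for arm, (s, n) in totals.items():
        print(f"overall {arm}: {s}/{n} = {s / n:.0%}")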
It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO). And in some cases big data is promoted as valuable only because it is the cheapest to produce. Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because it is a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross section of search engine users as to their actual intent).
If your business is directly producing truly valuable data (not just producing useful proxy data), you likely have small data issues. If you have any hint of a small data issue, you want to consult with a good statistician.
2. The nature of machine learning
In some sense machine learning rushes where statisticians fear to tread. Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.
The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).
My opinion is that the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265-1281). Good machine learning papers use good optimization techniques; bad machine learning papers (most of them, in fact) use bad, out-of-date, ad hoc optimization techniques.
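One way to make "prediction as optimization" concrete is to fit a logistic model by plain gradient descent on the average log-loss (a minimal sketch; the synthetic data, step size and iteration count are all made up):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # synthetic labels

    w = np.zeros(2)
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        grad = X.T @ (p - y) / len(y)       # gradient of the average log-loss
        w -= 0.5 * grad                     # one gradient step

    print("learned weights:", np.round(w, 2))
    print("training accuracy:", ((p > 0.5) == y).mean())

Whether the optimizer is this loop, a quasi-Newton method, or a purpose-built solver is exactly the quality difference the paragraph above is pointing at.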
3. The nature of data mining
Data mining is a term that was quite hyped and is now somewhat derided. One of the reasons more people use the term “data science” nowadays is that they are loath to say “data mining” (though in my opinion the two activities have different goals).
The goal of data mining is to find relations in data, not to necessarily make predictions or come up with explanations. Data mining is often what I call “an x’s only enterprise” (meaning you have many driver or “independent” variables but no pre-ordained outcome or “dependent” variables) and some of the typical goals are clustering, outlier detection and characterization.
There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy. Actual exploratory statistics (as defined by Tukey) is exciting and always an important “get your hands into the data” step of any predictive analytics project.
4. The nature of informatics
Informatics and in particular bioinformatics are very hot terms. A lot of good data scientists (a term I will explain later) come from the bioinformatics field.
Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology, we are left with data infrastructure and matching algorithms. We have the creation and management of data stores and databases, and the design of efficient matching and query algorithms. This isn’t meant to be a left-handed compliment: algorithms are a first love of mine, and some of the matching algorithms bioinformaticians use (like online suffix trees) are quite brilliant.
5. The nature of big data
Big data is a white-hot topic. The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, NoSQL and so on). It is the platform you perform modeling (or usually just report generation) on top of.
6. The nature of predictive analytics
The Wikipedia defines Predictive analytics as the variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. It is a set of goals and techniques emphasizing making models. It is very close to what is also meant by data science.
I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background. To my ear, analytics is more associated with visualization, reporting and summarization than with modeling. I also try to use the term modeling over prediction (when I remember), as prediction in non-technical English often implies something like forecasting into the future (which is but one modeling task).
7. The nature of data science
The Wikipedia defines data science as a field that incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.
Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.
8. Conclusion
Machine learning and statistics may be the stars, but data science is the whole show.

Projects include taxonomy creation (text mining, big data), clustering applied to big data sets, recommendation engines, simulations, rule systems for statistical scoring engines, root cause analysis, automated bidding, forensics, exo-planet detection, and early detection of terrorist activity or pandemics. An important component of data science is automation, machine-to-machine communications, as well as algorithms running non-stop in production mode (sometimes in real time), for instance to detect fraud, predict weather or predict home prices for each home (Zillow).

An example of a data science project is the creation of the fastest growing data science Twitter profile, for computational marketing. It leverages big data, and is part of a viral marketing / growth hacking strategy that also includes automated generation of high-quality, relevant, syndicated content (in short, digital publishing version 3.0).

Unlike most other analytic professions, data scientists are assumed to have great business acumen and domain expertise — one of the reasons why they tend to succeed as entrepreneurs. There are many types of data scientists, as data science is a broad discipline. Many senior data scientists master their art/craftsmanship and possess the whole spectrum of skills and knowledge; they really are the unicorns that recruiters can’t find. Hiring managers and uninformed executives favor narrow technical skills over combined deep, broad and specialized business domain expertise – a byproduct of the current education system that favors discipline silos, while true data science is a silo destructor. Unicorn data scientists (a misnomer, because they are not rare – some are famous VC’s) usually work as consultants, or as executives. Junior data scientists tend to be more specialized in one aspect of data science, possess more hot technical skills (Hadoop, Pig, Cassandra) and will have no problems finding a job if they received appropriate training and/or have work experience with companies such as Facebook, Google, eBay, Apple, Intel, Twitter, Amazon, Zillow etc. Data science projects for potential candidates can be found here.

Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling, model-free confidence intervals, but not p-values nor obscure tests of hypotheses that are subject to the curse of big data

Operations research: data science encompasses most of operations research as well as any techniques aimed at optimizing decisions based on analysing data.

Business intelligence: every BI aspect of designing/creating/identifying great metrics and KPI’s, creating database schemas (be it NoSQL or not), dashboard design and visuals, and data-driven strategies to optimize decisions and ROI, is data science.

Comparison with other analytic disciplines

Machine learning: Very popular computer science discipline, data-intensive, part of data science and closely related to data mining. Machine learning is about designing algorithms (like data mining), but emphasis is on prototyping algorithms for production mode, and designing automated systems (bidding algorithms, ad targeting algorithms) that automatically update themselves, constantly train/retrain/update training sets/cross-validate, and refine or discover new rules (fraud detection) on a daily basis. Python is now a popular language for ML development. Core algorithms include clustering and supervised classification.

Data mining: This discipline is about designing algorithms to extract insights from rather large and potentially unstructured data (text mining), sometimes called nugget discovery, for instance unearthing a massive botnet after looking at 50 million rows of data. Techniques include pattern recognition, feature selection, clustering, supervised classification and encompass a few statistical techniques (though without the p-values or confidence intervals attached to most statistical methods being used). Instead, emphasis is on robust, data-driven, scalable techniques, without much interest in discovering causes or interpretability. Data mining thus has some intersection with statistics, and it is a subset of data science. Data mining is applied computer engineering, rather than a mathematical science. Data miners use open source tools and software such as Rapid Miner.

Predictive modeling: Not a discipline per se. Predictive modeling projects occur in all industries across all disciplines. Predictive modeling applications aim at predicting the future based on past data, usually but not always based on statistical modeling. Predictions often come with confidence intervals. The roots of predictive modeling are in statistical science.

Statistics. Currently, statistics is mostly about surveys (typically performed with SPSS software), theoretical academic research, bank and insurance analytics (marketing mix optimization, cross-selling, fraud detection, usually with SAS and R), statistical programming, social sciences, global warming research (and space weather modeling), economic research, clinical trials (pharmaceutical industry), medical statistics, epidemiology, biostatistics, and government statistics. Agencies hiring statisticians include the Census Bureau, IRS, CDC, BLS, SEC, and EPA (environmental/spatial statistics). Jobs requiring a security clearance are well paid and relatively secure, but the well-paid jobs in the pharmaceutical industry (the golden goose for statisticians) are threatened by a number of factors – outsourcing, company mergers, and pressures to make healthcare affordable. Because of the big influence of the conservative, risk-averse pharmaceutical industry, statistics has become a narrow field, not adapting to new data and not innovating, losing ground to data science, industrial statistics, operations research, data mining, machine learning — where the same clustering, cross-validation and statistical training techniques are used, albeit in a more automated way and on bigger data. Many professionals who were called statisticians 10 years ago have seen their job title changed to data scientist or analyst in the last few years.

Industrial statistics. Statistics frequently performed by non-statisticians (engineers with good statistical training), working on engineering projects such as yield optimization or load balancing (system analysts). They use very applied statistics, and their framework is closer to six sigma, quality control and operations research, than to traditional statistics. Also found in oil and manufacturing industries. Techniques used include time series, ANOVA, experimental design, survival analysis, signal processing (filtering, noise removal, deconvolution), spatial models, risk and reliability models.

Mathematical optimization. Solves business optimization problems with techniques such as the simplex algorithm, Fourier transforms (signal processing), differential equations, and software such as Matlab. Practitioners are found in big companies such as IBM, in research labs, at the NSA, and in the finance industry (sometimes recruiting physics or engineering graduates). These professionals sometimes solve the exact same problems as statisticians do, using the exact same techniques, though they use different names. Mathematicians use least square optimization for interpolation or extrapolation; statisticians use linear regression for predictions and model fitting, but both concepts are identical, and rely on the exact same mathematical machinery: it’s just two names describing the same thing. Mathematical optimization is, however, closer to operations research than to statistics, and the choice of hiring a mathematician rather than another practitioner (a data scientist) is often dictated by historical reasons, especially for organizations such as the NSA or IBM.

Actuarial sciences. Just a subset of statistics focusing on insurance (car, health, etc.) using survival models: predicting when you will die, and what your health expenditures will be based on your health status (smoker, gender, previous diseases), to determine your insurance premiums. Also predicts extreme floods and weather events to determine premiums. These latter models have recently proved notoriously erroneous and have resulted in far bigger payouts than expected. For some reason, this is a very vibrant, secretive community of statisticians who do not call themselves statisticians anymore (the job title is actuary). They have seen their average salary increase nicely over time: access to the profession is restricted and regulated just like for lawyers, for no other reason than protectionism to boost salaries and reduce the number of qualified applicants to job openings. Actuarial sciences is indeed data science (a sub-domain).

HPC. High performance computing is not a discipline per se, but should be of concern to data scientists, big data practitioners, computer scientists and mathematicians, as it can redefine the computing paradigms in these fields. If quantum computers ever become successful, they will totally change the way algorithms are designed and implemented. HPC should not be confused with Hadoop and MapReduce: HPC is hardware-related, Hadoop is software-related (though heavily relying on Internet bandwidth and on server configuration and proximity).

Operations research. Abbreviated as OR. They separated from statistics a while back (about 20 years ago), but they are like twin brothers, and their respective organizations (INFORMS and ASA) partner together. OR is about decision science and optimizing traditional business projects: inventory management, supply chain, pricing. They heavily use Markov chain models, Monte Carlo simulations, queuing and graph theory, and software such as AIMS, Matlab or Informatica. Big, traditional old companies use OR; new and small ones (start-ups) use data science to handle pricing, inventory management or supply chain problems. Many operations research analysts are becoming data scientists, as there is far more innovation and thus growth prospect in data science, compared to OR. Also, OR problems can be solved by data science. OR has a significant overlap with six sigma (see below), also solves econometric problems, and has many practitioners/applications in the army and defense sectors.

Six sigma. It’s more a way of thinking (a business philosophy, if not a cult) than a discipline, and was heavily promoted by Motorola and GE a few decades ago. Used for quality control and to optimize engineering processes (see the entry on industrial statistics in this article), by large, traditional companies. They have a LinkedIn group with 270,000 members, twice as large as any other analytic LinkedIn group, including our data science group. Their motto is simple: focus your efforts on the 20% of your time that yields 80% of the value. Applied, simple statistics are used (simple stuff works most of the time, I agree), and the idea is to eliminate sources of variance in business processes, to make them more predictable and improve quality. Many people consider six sigma to be old stuff that will disappear. Perhaps, but the fundamental concepts are solid and will remain: these are also fundamental concepts for all data scientists. You could say that six sigma is a much simpler, if not simplistic, version of operations research (see the entry above), where statistical modeling is kept to a minimum. Risks: unqualified people using non-robust, black-box statistical tools to solve problems can produce disasters. In some ways, six sigma is a discipline more suited to business analysts (see the business intelligence entry below) than to serious statisticians.

Quant. Quant people are just data scientists working for Wall Street on problems such as high frequency trading or stock market arbitraging. They use C++, Matlab, and come from prestigious universities, earn big bucks but lose their jobs right away when ROI goes south too quickly. They can also be employed in energy trading. Many who were fired during the great recession now work on problems such as click arbitraging, ad optimization and keyword bidding. Quants have backgrounds in statistics (few of them), mathematical optimization, and industrial statistics.

Artificial intelligence. It’s coming back. The intersection with data science is pattern recognition (image analysis) and the design of automated (some would say intelligent) systems to perform various tasks, in machine-to-machine communication mode, such as identifying the right keywords (and right bid) on Google AdWords (pay-per-click campaigns involving millions of keywords per day). I also consider smart search (creating a search engine returning the results that you expect and being much broader than Google) one of the greatest problems in data science, arguably also an AI and machine learning problem.

Econometrics. Why it became separated from statistics is unclear. So many branches disconnected themselves from statistics as they became less generic and started developing their own ad hoc tools. In short, econometrics is heavily statistical in nature, using time series models such as auto-regressive processes. It also overlaps with operations research (itself overlapping with statistics!) and mathematical optimization (the simplex algorithm). Econometricians like ROC and efficiency curves (so do six sigma practitioners; see the corresponding entry in this article). Many do not have a strong statistical background, and Excel is their main or only tool.

Data engineering. Performed by software engineers (developers) or architects (designers) in large organizations (and sometimes by data scientists in tiny companies), this is the applied part of computer science (see the entry in this article), powering systems that allow all sorts of data to be easily processed in-memory or near-memory, and to flow nicely to (and between) end-users, including heavy data consumers such as data scientists. A sub-domain currently under attack is data warehousing, as this term is associated with static, siloed, conventional databases, data architectures, and data flows, threatened by the rise of NoSQL, NewSQL and graph databases. Transforming these old architectures into new ones (only when needed), or making them compatible with new ones, is a lucrative business.

Business intelligence. Abbreviated as BI. Focuses on dashboard creation, metric selection, producing and scheduling data reports (statistical summaries) sent by email or delivered/presented to executives, competitive intelligence (analyzing third party data), as well as involvement in database schema design (working with data architects) to collect useful, actionable business data efficiently. The typical job title is business analyst, but some are more involved with marketing, product or finance (forecasting sales and revenue). They typically have an MBA degree. Some have learned advanced statistics such as time series, but most only use (and need) basic stats and light analytics, relying on IT to maintain databases and harvest data. They use tools such as Excel (including cubes and pivot tables, but not advanced analytics), Brio (Oracle browser client), Birt, MicroStrategy or Business Objects (as end-users to run queries), though some of these tools are increasingly equipped with better analytic capabilities. Unless they learn how to code, they are competing with some polyvalent data scientists who excel in decision science, insights extraction and presentation (visualization), KPI design, business consulting, and ROI/yield/business/process optimization. BI and market research (but not competitive intelligence) are currently experiencing a decline, while AI is experiencing a come-back. This could be cyclical. Part of the decline is due to not adapting to new types of data (e.g. unstructured text) that require engineering or data science techniques to process and extract value from.

Data analysis. This is the new term for business statistics since at least 1995, and it covers a large spectrum of applications including fraud detection, advertising mix modeling, attribution modeling, sales forecasts, cross-selling optimization (retail), user segmentation, churn analysis, computing the lifetime value of a customer and the cost of acquisition, and so on. Except in big companies, data analyst is a junior role; these practitioners have much narrower knowledge and experience than data scientists, and they lack (and don’t need) business vision. They are detail-oriented and report to managers such as data scientists or directors of analytics. In big companies, someone with a job title such as data analyst III might be very senior, yet they usually are specialized and lack the broad knowledge gained by data scientists working in a variety of companies large and small.

Business analytics. Same as data analysis, but restricted to business problems only. Tends to have a bit more of a financial, marketing or ROI flavor. Popular job titles include data analyst and data scientist, but not business analyst (see the business intelligence entry above; that is a different domain).

Finally, there are more specialized analytic disciplines that recently emerged: health analytics, computational chemistry and bioinformatics (genome research), for instance.
