Demystifying “big data” part 4: Machine learning

This piece is the final of four I will publish this spring in which I describe particular techniques used to make sense of or mine large data sets. This post covers machine learning.

With increasingly massive amounts of complex data available to us, the ways in which we try to make sense of it necessarily become more complex. Machine learning has emerged as a tool both honed by and capable of making sense of large sets of data. Machine learning is, simply put, a system created by a developer in which computers “learn” how to perform a complex task. The word “learning” in this context anthropomorphizes the computer, which can lead to confusion: machine learning is not the same as human learning. It is a complex mathematical structure built from vast quantities of data. Machine learning focuses on two interrelated questions: (1) how can one construct computer systems that automatically improve through experience? and (2) what are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations?[i]

To understand machine learning, it is necessary to understand how a basic computer program works. In a computer program, humans write code (lines of commands) based on logic. These lines of commands are a set of instructions. They dictate what a computer does or does not do. There is no intuition and no guesswork on the part of the computer; it simply follows exactly what the commands dictate. Machine learning is a set of instructions that creates a sort of “self-modification” process within the lines of code. The lines of commands (what we call algorithms) are ordered to change themselves based on the input. Computers are made of billions of tiny switches called transistors, and algorithms are a sequence of instructions that turn those switches on and off billions of times per second.[ii] The computer isn’t using will or initiative to learn (computers have neither will nor initiative); rather, it is following the algorithm that was created for it. The algorithm commands that the code modify itself, or create new code, based on input, which is usually very, very large sets of data.
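The contrast can be made concrete in a few lines. The sketch below is purely illustrative (the function names and numbers are invented for this example, not taken from any library): an ordinary program whose rule is hard-coded by the developer, next to a “learning” program whose internal parameter is derived from example data.

```python
# Illustrative sketch: a hard-coded rule vs. a rule whose parameter
# is derived from data. All names and numbers here are invented.

def fixed_rule(x):
    # Ordinary program: the developer wrote the rule (double the input),
    # and no amount of data will ever change it.
    return 2 * x

def learn_slope(pairs):
    # "Learning" program: the slope is not written by hand; it is
    # estimated from example (input, output) pairs by least squares.
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

examples = [(1, 2.1), (2, 3.9), (3, 6.2)]   # noisy observations of y ≈ 2x
slope = learn_slope(examples)               # ≈ 2.04, set by the data

print(fixed_rule(3))          # 6 — always, regardless of the data
print(round(slope * 3, 1))    # 6.1 — a rule the program derived itself
```

Feed `learn_slope` different examples and it produces a different rule; that data-driven self-adjustment, scaled up enormously, is the “self-modification” described above.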

This can be done in several ways, ranging from learning by example or analogy to autonomous learning of concepts. Today, there are several commonly adopted types of machine learning. One is “incremental learning,” whereby continuous improvements are made as new data arrives and is input into the program. Another is “one shot” or “batch learning,” which inputs all the data initially and distinguishes a training phase from an application phase. There is also “supervised learning,” where the training input is explicitly labeled with the classes to be learned.[iii]
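The batch-versus-incremental distinction can be sketched with the simplest possible “model” – an estimate of the mean of a stream of numbers. This is an illustrative toy, not any library’s API:

```python
# Toy sketch of batch vs. incremental learning, using a running mean
# as the "model". Names are invented for illustration.

def batch_fit(data):
    # Batch ("one shot") learning: see all the data at once in a
    # training phase, then apply the finished model.
    return sum(data) / len(data)

class IncrementalMean:
    # Incremental learning: the model improves continuously as each
    # new observation arrives.
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # nudge toward the new point

data = [4.0, 7.0, 1.0, 8.0]
model = IncrementalMean()
for x in data:
    model.update(x)          # arrives one observation at a time

print(batch_fit(data))       # 5.0 — computed from the whole set at once
print(model.mean)            # 5.0 — same answer, reached example by example
```

Supervised learning is orthogonal to this split: in the example above, the numbers would additionally carry explicit labels for the classes to be learned.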

One of the problems that can occur in machine learning is “overfitting.” A learning algorithm is first trained on a set of “training data”: a large quantity of data is input into the algorithm in order to train it to recognize the desired patterns. The algorithm is then applied to a new set of data points and asked to make predictions. The goal, under ideal circumstances, is to maximize its predictive accuracy on these new data points – not necessarily its accuracy on the training data. If a developer works too hard to make the algorithm fit the training data, it runs the risk of memorizing peculiarities rather than finding general predictive rules. This is referred to as “overfitting.”[iv]
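A toy example makes the trade-off visible. In the sketch below (all data and names invented for illustration), a “memorizer” scores perfectly on the training data by storing it verbatim, yet fails on new points, while a simpler model with a general rule predicts well:

```python
# Toy illustration of overfitting: memorizing training data vs. learning
# a general rule. All numbers are invented for illustration.

train = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2)]   # noisy y ≈ 2x
test  = [(5, 10.1), (6, 11.8)]                      # new, unseen points

def memorizer(x, table=dict(train)):
    # "Overfit" model: perfect recall of training peculiarities,
    # but no rule to apply to inputs it has never seen.
    return table.get(x, 0.0)

def linear_model(x):
    # Simpler model: one general rule (a slope fit by least squares).
    num = sum(a * b for a, b in train)
    den = sum(a * a for a, _ in train)
    return (num / den) * x

def mse(model, data):
    # Mean squared error: lower means better predictions.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(memorizer, train))              # 0.0 — perfect on training data
print(mse(memorizer, test) > 100)         # True — fails to generalize
print(mse(linear_model, test) < 1.0)      # True — the general rule holds up
```

The memorizer is the better model by the training-data yardstick and the worse one by the yardstick that matters – exactly the trap the paragraph describes.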

There are a number of reasons, some more obvious than others, why someone might choose machine learning over a more traditional research method. There may be far too much data to make sense of without it, and machine learning uses statistical tools to find patterns in the data that reveal new and relevant information. Yet there are less obvious reasons, such as in the case of electoral fraud. Under normal circumstances, applying a simple distribution test would reveal manipulation; the problem with such simple analytical tools is that, because they are widely known, they are easily foiled. Methodologists and statisticians need increasingly sophisticated tools, such as machine learning, to keep up with those trying to manipulate elections.[v]
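One widely known example of the kind of “simple distribution test” described above is Benford’s law, which predicts how often each leading digit should appear in many naturally occurring counts. The sketch below is a toy version of such a check (the data are invented, and real election forensics is far more careful); its very simplicity is the point – anyone who knows the test can fabricate numbers that pass it, which is why analysts reach for subtler tools:

```python
import math
from collections import Counter

# Toy "simple distribution test": Benford's law says the leading digit d
# of many naturally occurring counts appears with frequency log10(1 + 1/d).
# Fabricated figures often violate this. Illustrative only.

def leading_digit(value):
    return int(str(abs(value))[0])

def benford_chi2(values):
    # Chi-squared distance between observed leading-digit counts and the
    # counts Benford's law predicts. Larger = more anomalous.
    n = len(values)
    observed = Counter(leading_digit(v) for v in values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2

natural = [2 ** k for k in range(1, 60)]     # powers of 2 track Benford closely
suspect = [500 + 7 * k for k in range(59)]   # tightly clustered "invented" totals

print(benford_chi2(natural) < benford_chi2(suspect))   # True — clustered data stand out
```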

Researchers use machine learning across an array of disciplines. It is commonly used for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. It has gained increasing attention in computer science as well as in areas concerned with data-intensive issues, such as consumer services and anything involving complex systems or logistics chains. There has been a similarly broad range of effects across the empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways.[vi]

In one well-known example, a group of researchers used satellite images and machine learning to predict degrees of poverty, which is hard to measure in poorer countries.[vii] To do this, the authors used high-resolution satellite imagery that contains an abundance of information about landscape features that could be correlated with economic activity. Because such data are highly unstructured, it is difficult to extract meaningful insights from them even with intensive manual analysis. To manage this, the researchers used a particular machine learning approach, a multistep “transfer learning” process,[viii] in which they used an easily obtained proxy for poverty to train a deep learning model.
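The shape of that multistep idea can be caricatured in a few lines. The sketch below is a drastic simplification – the actual study trained deep networks on imagery, whereas here a single invented “image feature” stands in for an image and all numbers are made up. Step one learns from a plentiful proxy signal; step two reuses what was learned, fitting only a tiny calibration on the few direct labels available:

```python
# Caricature of multistep transfer learning. All data and names are
# invented for illustration; the real study used deep networks on
# satellite imagery.

def fit_slope(pairs):
    # Least-squares slope through the origin.
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

# Step 1: learn from the plentiful proxy task (standing in for an easily
# obtained signal such as nighttime lights), where labels are abundant.
proxy_data = [(1, 3.1), (2, 5.9), (3, 9.2), (4, 11.8), (5, 15.1)]
w = fit_slope(proxy_data)            # learned "representation": z = w * x

# Step 2: reuse that representation for the scarce target task (poverty),
# fitting only a small calibration (a, b) on very few labelled examples.
target_data = [(1, 0.9), (4, 3.1)]   # only two direct poverty labels
z = [w * x for x, _ in target_data]
y = [t for _, t in target_data]
a = (y[1] - y[0]) / (z[1] - z[0])    # two points determine the line
b = y[0] - a * z[0]

def predict_poverty(x):
    return a * (w * x) + b           # proxy-learned feature, target-fitted head

print(round(predict_poverty(2), 2))  # ≈ 1.63
```

The division of labour is the point: the heavy learning happens where data is cheap, so that the expensive, scarce labels only have to pin down a small correction.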

On another front, with the advent of new technologies in the field of medicine, large amounts of cancer data have been collected. As a result, machine learning methods have become a popular tool for medical researchers. They can assist in discovering and identifying patterns and relationships in complex datasets while effectively predicting future outcomes for a cancer type. In a review of studies on machine learning and cancer prediction and prognosis, the researchers concluded that predictive models using supervised machine learning methods can provide promising tools for inference in the cancer domain.[ix]

Machine learning has also appeared in agriculture. To meet demands for increased crop production, scientists are looking for ways to improve yields. One way to do this is plant phenotyping, which means measuring a specific plant trait – at the cellular level, the whole-plant level, or even the canopy level – and then relating it to the plant’s structure and function.[x] Machine learning is typically used where large amounts of data are available, and the enormous volume of remote-sensing data on crops generated by real-time platforms represents a “big data” problem. One of the major advantages of machine learning is the opportunity to search large datasets to discover patterns and to look at combinations of factors instead of analyzing each feature individually. As the authors of a review of machine learning and plant phenotyping explain, machine learning may open up tremendous opportunities to accelerate breeding and to solve foundational problems in genomics and predictive phenomics.[xi]

While the opportunities for machine learning are as abundant as our many sources of big data, a challenge is that understanding what machine learning means and does requires a basic grasp of computer science and programming. This can be daunting to many, but it need not be. Today, a simple online search can produce answers to basic questions such as: How do computers work? What is programming? There is also an abundance of free tutorials online that people can take advantage of to learn the necessary basics. The key is not to be intimidated. No one holds the master key to this knowledge, and everyone is capable of getting a grasp on it.

A second challenge to learning about this concept is the overly charged use of the word “intelligence” to describe computers. Machine learning is considered a branch of artificial intelligence. Yet every decade or so, the term artificial intelligence changes in meaning, and the language describing it and its potential tends to be dramatic and nebulous. There is no real consensus on what “intelligence” means as it relates to computers. Using the word “intelligence” implies that it is the same form of problem-solving inherent to a human mind, but it is not. This can make these concepts feel unnecessarily threatening and/or unfathomable. Machine learning, at least for the foreseeable future, is not a replacement for the human mind; it is a tool.

While most of us likely won’t be writing the algorithms ourselves, nor conducting the complex statistical modeling, that doesn’t mean we can’t understand machine learning or even become adept at applying it to questions we may have about the world. Machine learning may be largely invisible to most of us, yet it is pervasive, already at work in many parts of our daily lives, such as Google’s search engine and online services that recommend movies. It may or may not live up to the hype of becoming the most influential component of current and future innovation, but it certainly warrants our attention in the meantime.