Data-ism - Steve Lohr

1

HOW BIG IS BIG DATA?

Just outside Memphis, an industrial symphony of machines and humans shuttles goods to and fro, their carefully orchestrated movements and identifying marks tracked by bar-code scanners and chips emitting radio waves. Mechanical arms snatch the plastic shrink-wrapped bundles off a conveyor belt, as forklifts ferry the packages onto trucks for long-distance travel. Flesh-and-blood humans guide and monitor the flow of goods and drive the forklifts and trucks.

McKesson, which distributes about a third of all of the pharmaceutical products in America, runs this sprawling showcase of efficiency. Its buildings span the equivalent of more than eight football fields, forming the hub of McKesson’s national distribution network—a feat of logistics that sends goods to 26,000 customer locations, from neighborhood pharmacies to Walmart. The main cargo is drugs, roughly 240 million pills a day. The pharmaceutical distribution business is one of high volumes and razor-thin profit margins. So, understandably, efficiency has been all but a religion for McKesson for decades.

Yet in the last few years, McKesson has taken a striking step further by cutting the inventory flowing through its network at any given time by $1 billion. The payoff came from insights gleaned from harvesting all the product, location, and transport data, from scanners and sensors, and then mining that data with clever software to identify potential time-saving and cost-cutting opportunities. The technology-enhanced view of the business was a breakthrough that Donald Walker, a senior McKesson executive, calls "making the invisible visible."

In Atlanta, I stand outside one of the glassed-in rooms in the fifth-floor intensive care unit at the Emory University Hospital. Inside, a dense thicket of electronic devices, a veritable forest of medical computing, crowds the room: a respirator, a kidney machine, infusion machines pumping antibiotics and painkilling opiates, and gadgets monitoring heart rate, breathing, blood pressure, oxygen saturation, and other vital signs. Nearly every machine has its own computer monitor, each emitting an electronic cacophony of beeps and alerts. I count a dozen screens, larger flat panels and smaller ones, smartphone-sized.

A typical twenty-bed intensive care unit generates an estimated 160,000 data points a second. Amid all that data, informed and distracted by it, doctors and nurses make decisions at a rapid clip, about 100 decisions a day per patient, according to research at Emory. Or more than 9.3 million decisions about care during a year in an ICU. So there is ample room for error. The overwhelmed people need help. And Emory is one of a handful of medical research centers that are working to transform critical care with data, in both adult and neonatal intensive care wards. The data streams from the medical devices monitoring patients are parsed by software that has been trained to spot early warning signals that a patient’s condition is worsening.

Digesting vast amounts of data and spotting seemingly subtle patterns is where computers and software algorithms excel, more so than humans. Dr. Timothy Buchman heads up such an effort at Emory. A surgeon, scientist, and experienced pilot, Buchman uses a flight analogy to explain his goal. GPS (Global Positioning System) location data on planes is translated to screen images that show air-traffic controllers when a flight is going astray—"off trajectory," as he puts it—well before a plane crashes. Buchman wants the same sort of early warning system for patients whose pattern of vital signs is off trajectory, before they crash, in medical terms. "That’s where big data is taking us," he says.

Big data is coming of age, moving well beyond Internet incubators in Silicon Valley, such as Google and Facebook. It began in the digital-only world of bits, and is rapidly marching into the physical world of atoms, into the mainstream. The McKesson distribution center and the Emory intensive care unit show the way—big data saving money and saving lives. Indeed, the long view of the technology is that it will become a layer of data-driven artificial intelligence that resides on top of both the digital and the physical realms. Today, we’re seeing the early steps toward that vision. Big-data technology is ushering in a revolution in measurement that promises to be the basis for the next wave of efficiency and innovation across the economy. But more than technology is at work here. Big data is also the vehicle for a point of view, or philosophy, about how decisions will be—and perhaps should be—made in the future. David Brooks, my colleague at the New York Times, has referred to this rising mind-set as data-ism—a term I’ve adopted as well because it suggests the breadth of the phenomenon. The tools of innovation matter, as we’ve often seen in the past, not only for economic growth but because they can reshape how we see the world and make decisions about it.

A bundle of technologies flies under the banner of big data. The first is all the old and new sources of data—Web pages, browsing habits, sensor signals, social media, GPS location data from smartphones, genomic information, and surveillance videos. The data surge just keeps rising, about doubling in volume every two years. But I would argue that the most exaggerated—and often least important—aspect of big data is the big. The global data count becomes a kind of nerd’s parlor game of estimates and projections, an excursion into the linguistic backwater of zettabytes, yottabytes, and brontobytes. The numbers and their equivalents are impressive. Ninety percent of all of the data in history, by one estimate, was created in the last two years. In 2014, International Data Corporation estimated the data universe at 4.4 zettabytes, which is 4.4 trillion gigabytes. That volume of information, the research firm said, straining for perspective, would fill enough slender iPad Air tablets to create a stack more than 157,000 miles high, or two thirds of the way to the moon.
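The arithmetic behind these figures can be checked in a few lines. The sketch below is illustrative only. It assumes decimal (SI) unit sizes and an average Earth-to-moon distance of about 238,855 miles, neither of which is stated in the text, and the function name `projected_zettabytes` is hypothetical.

```python
# Back-of-the-envelope check of the data-volume figures quoted above.
GIGABYTE = 10**9               # bytes, decimal (SI) convention (assumption)
ZETTABYTE = 10**21             # bytes, decimal (SI) convention (assumption)
MOON_DISTANCE_MILES = 238_855  # average Earth-moon distance (assumption, not from the text)

data_universe_zb = 4.4         # IDC's 2014 estimate, as quoted in the text
stack_height_miles = 157_000   # height of the iPad Air stack, as quoted in the text

# 4.4 zettabytes expressed in trillions of gigabytes.
trillions_of_gb = data_universe_zb * ZETTABYTE / GIGABYTE / 10**12

# How far the stack reaches toward the moon.
fraction_to_moon = stack_height_miles / MOON_DISTANCE_MILES


def projected_zettabytes(start_zb, years, doubling_period_years=2.0):
    """Project volume forward, assuming it doubles every `doubling_period_years`."""
    return start_zb * 2 ** (years / doubling_period_years)


print(f"{trillions_of_gb:.1f} trillion gigabytes")
print(f"{fraction_to_moon:.0%} of the way to the moon")
print(f"{projected_zettabytes(data_universe_zb, 6):.1f} zettabytes after six more years")
```

The projection is simply the "doubling every two years" claim applied three times. It is a way to feel the growth rate, not a forecast.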

But not all data is created equal, or is equally valuable. The mind-numbing data totals are inflated by the rise in the production of digital images and video. Just think of all of the smartphone friend and family pictures and video clips taken and sent. It is said that a picture is worth a thousand words. Yet in the arithmetic of digital measurement, that is a considerable understatement, because images are bit gluttons. Text, by contrast, is a bit-sipping medium. There are eight bits in a byte. A letter of text consumes one byte, while a standard, high-resolution picture is measured in megabytes, millions of bytes. And video, in its appetite for bits, dwarfs still pictures. And forty-eight hours of video are uploaded onto YouTube every minute, as I write this, with the pace likely to only increase.
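To put rough numbers on the gap between bit-sipping text and bit-gluttonous images, here is a small sketch. The one-byte-per-letter figure comes from the text. The average word length and the 3-megabyte photo size are assumptions chosen only for illustration.

```python
# Rough comparison of text and image sizes in bytes.
BITS_PER_BYTE = 8                # as stated in the text
BYTES_PER_LETTER = 1             # as stated in the text
AVG_CHARS_PER_WORD = 6           # assumption: about five letters plus a space
IMAGE_BYTES = 3 * 10**6          # assumption: a 3-megabyte photo

bytes_per_word = AVG_CHARS_PER_WORD * BYTES_PER_LETTER

# Size of the proverbial thousand words.
thousand_words_bytes = 1000 * bytes_per_word
thousand_words_bits = thousand_words_bytes * BITS_PER_BYTE

# How many words of plain text fit in the space of one photo.
words_per_image = IMAGE_BYTES // bytes_per_word

print(f"A thousand words: {thousand_words_bytes:,} bytes ({thousand_words_bits:,} bits)")
print(f"One photo holds the bytes of about {words_per_image:,} words")
```

Under these assumptions a single photo takes up as much storage as several hundred thousand words, which is why "a thousand words" understates the case so badly.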

The big in big data matters, but a lot less than many people think. There’s a lot of water in the ocean, too, but you can’t drink it. The more pressing issue is being able to use and make sense of data. The success stories in this book involve lots of data, but typically not in volumes that would impress engineers at Google. And while advances in computer processing, storage, and memory are helping with the data challenge, the biggest step ahead is in software. The crucial code comes largely from the steadily evolving toolkit of artificial intelligence, like machine-learning software.

Data and smart technology are opening the door to new horizons of measurement, both from afar and close-up. Big-data technology is the digital-age equivalent of the telescope or the microscope. Both of those made it possible to see and measure things as never before—with the telescope, it was the heavens and new galaxies; with the microscope, it was the mysteries of life down to the cellular level.

Just as modern telescopes transformed astronomy and modern microscopes did the same for biology, big data holds a similar promise, but more broadly, in every field and every discipline. Far-reaching advances in technology are engines of economic change. The Internet transformed the economics of communication. Then other technologies, like the Web, were built on top of the Internet, which has become a platform for innovation and new businesses. Similarly big data, though still a young technology, is transforming the economics of discovery—becoming a platform, if you will, for human decision making.

Decisions of all kinds will increasingly be made based on data and analysis rather than on experience and intuition—more science and less gut feel.

Throughout history, technological change has challenged traditional practices, ways of educating people, and even ways of understanding the world. In 1959, at the dawn of the modern computer age, the English chemist and novelist C. P. Snow delivered a lecture at Cambridge University, The Two Cultures. In it, Snow dissected the differences and observed the widening gap between two camps, the sciences and the humanities. The schism between scientific and literary intellectuals, he warned, threatened to stymie economic and social progress, if those in the humanities remained ignorant of the advances in science and their implications. The lecture was widely read in America, and among those influenced were two professors at Dartmouth College, John Kemeny and Thomas Kurtz. Kemeny, a mathematician and a former research assistant to Albert Einstein, would go on to become the president of Dartmouth. Kurtz was a young math professor in the early 1960s when he approached Kemeny with the idea of giving nearly all students at Dartmouth a taste of programming on a computer.

Kemeny and Kurtz saw the rise of computing as a major technological force that would sweep across the economy and society. But only a quarter of Dartmouth students majored in science or engineering, the group most likely to be interested in computing. Yet most of the decision makers of business and government typically came from the less technically inclined 75 percent of the student population, Kurtz explained. So Kurtz and Kemeny devised a simple programming language, BASIC (Beginner’s All-purpose Symbolic Instruction Code), intended to be accessible to non-engineers. In 1964, they began teaching Dartmouth students to write programs in BASIC. And variants of Dartmouth’s BASIC would eventually be used by millions of people to write software. Bill Gates wrote a stripped-down BASIC to run on early personal computers, and Microsoft BASIC was the company’s founding product. Years later, Gates fondly recalled the feat of writing a shrunken version of BASIC to work on the primitive personal computers of the mid-1970s. "Of all the programming I’ve done," Gates told me, "it’s the thing I’m most proud of."

Back in the 1960s, Kemeny and Kurtz had no intention of making Dartmouth a training ground for professional programmers. They wanted to give their students a feel for interacting with these digital machines and for computational thinking, which involves analyzing and logically organizing data in ways so that computers can help solve problems. The Dartmouth professors weren’t really teaching programming. They were trying to change minds, to encourage their students to see things differently. Today, when people talk about the need to retool education and training for the data age, it is often a fairly narrow discussion of specific skills. But the larger picture has less to do with a wizard’s mastery of data than with a fundamental curiosity about data. The bigger goal is to foster a mind-set, so that thinking about data becomes an intellectual first principle, the starting point of inquiry. It’s a mentality that can be summed up in a question: What story does the data tell you?

The promise of big data is that the story is far richer and more detailed than ever before, making it suddenly possible to see more and learn faster—or in the McKesson executive’s words, to make the invisible visible. And the improvement is not a little bit better, but fundamentally different. I think of this as the deeper meaning of Moore’s Law. In a technical sense, the law, formulated by Intel’s cofounder Gordon Moore in 1965, is the observation that transistor density on computer chips doubles about every two years and that computing power improves at that exponential pace. But in a practical sense, it also means that seemingly quantitative changes become qualitative, opening the door to new possibilities and doing new things. In computing, you start by calculating the flight trajectory of artillery shells, the task assigned the ENIAC (Electronic Numerical Integrator and Computer) in 1946. And by 2011, you have IBM’s Watson beating the best humans in the question-and-answer game Jeopardy!
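A quick sketch shows how a steady doubling turns a quantitative change into something else entirely. It treats Moore's Law as an idealized doubling every two years, starting from its 1965 formulation and running to Watson's 2011 win. The function names are hypothetical, and the result is an illustration of the exponential, not a measurement of real chips.

```python
# Idealized Moore's Law: density doubles once per doubling period.
def doublings(years, doubling_period_years=2.0):
    """Number of doublings that fit into the given span of years."""
    return years / doubling_period_years


def growth_factor(years, doubling_period_years=2.0):
    """Overall multiple after doubling repeatedly over the given span."""
    return 2 ** doublings(years, doubling_period_years)


span_years = 2011 - 1965  # from Moore's observation to Watson on Jeopardy!

print(f"{doublings(span_years):.0f} doublings")
print(f"growth factor of about {growth_factor(span_years):,.0f}")
```

Twenty-three doublings compound to a multiple in the millions, which is the sense in which "a little bit better" every two years eventually becomes "fundamentally different."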

To a computer, it’s all just the 1’s and 0’s of digital code. Yet the massive quantitative improvement in performance over time drastically changes what can be done. Trained physicists in the data world often compare the quantitative-to-qualitative transformation to a phase change, or change of state, as when a gas becomes a liquid or a liquid becomes a solid. It is an apt, descriptive comparison. But I prefer the Moore’s Law reference, and here’s why. When the temperature drops below thirty-two degrees Fahrenheit or zero degrees Celsius, water freezes. It happens naturally, a law of nature. Moore’s Law is an observation about what had happened for years, and what could well happen in the future. But it is not a law of nature. Moore’s Law has held for so many years because of human ingenuity, endeavor, and investment. Scientists, companies, and investors made it happen.

The same is true of big data. It has become technically possible thanks to a bounty of improvements in computing, sensing, and communications. But the steady advance in software and hardware, and the rise of data-ism, will happen because of brains, energy, and money. The big-data revolution requires both trailblazing individuals and institutional commitment. The narrative of this book is built around one of each—a young man, and an old company. The young man is Jeffrey Hammerbacher, thirty-two, who personifies the mind-set of data-ism and whose career traces the widening horizons of data technology and methods. Hammerbacher grew up in Indiana, went to Harvard University, and then briefly was a quant at a Wall Street investment bank, before building the first team of so-called data scientists at Facebook. He left to be cofounder and chief scientist of Cloudera, a start-up that makes software for data scientists. Then, beginning in the summer of 2012, he embarked on a very different professional path. He joined the Icahn School of Medicine at Mount Sinai in New York, where he is leading a data group that is exploring genetic and other medical information in search of breakthroughs in disease modeling and treatment. Medical research, he figures, is the best use of his skills today.

At the other pole of the modern data world is IBM, a century-old technology giant known for its research prowess and its mainstream corporate clientele. Its customers provide a window into the progress data techniques are making, as well as the challenges, across a spectrum of industries. IBM itself has lined up its research, its strategy, and its investment behind the big-data business. "We are betting the company on this," Virginia Rometty, the chief executive, told me in an interview.

But for IBM, big data is a threat as well as an opportunity. The new, low-cost hardware and software that power many big-data applications—cloud computing and open-source code—will supplant some of IBM’s traditional products. The company must expand in the new data markets faster than its old-line businesses wither. No company can match IBM’s history in the data field; the founding technology of the company that became IBM, punched cards, developed by Herman Hollerith, triumphed in counting and tabulating the 1890 census, when the American population grew to sixty-three million—the big data of its day. Today, IBM researchers are at the forefront of big-data technology. The projects at McKesson and Emory, which will be examined in greater detail later, are collaborations with IBM scientists. And IBM’s Watson, that engine of data-driven artificial intelligence, is no longer merely a game-playing science experiment but a full-fledged business unit within IBM, supported by an investment of $1 billion, as it applies its smarts to medicine, customer service, financial services, and elsewhere. The Watson technology is now a cloud service, delivered over the Internet from distant data centers, and IBM is encouraging software engineers to write applications that run on Watson, as if an operating system for the future.

The new and the old, the individual and the institution are at times conflicting forces but also complementary. It is hard to imagine that Hammerbacher and IBM would ever be a comfortable fit, but they are heading in the same direction—and both are big-data enthusiasts.

Another conflicting yet complementary subject runs through this book, and it centers on decision making. Big data can be a powerful tool indeed, but it has its limits. So much depends on context—what is being measured and how it is measured. Data can always be gathered, and patterns can be observed—but is the pattern significant, and are you measuring what you really want to know? Or are you measuring what is most easily measured rather than what is most meaningful? There is a natural tension between the measurement imperative and measurement myopia. Two quotes frame the issue succinctly. The first: "You can’t manage what you can’t measure." For this one, there appear to be twin claims of attribution, either W. Edwards Deming, the statistician and quality control expert, or Peter Drucker, the management consultant. Who said it first doesn’t matter so much. It’s a mantra in business and it has the ring of commonsense truth.

The second quote is not as well known, but there is a lot of truth in it as well: "Not everything that can be counted counts, and not everything that counts can be counted." Albert Einstein usually gets credit for this one, but the stronger claim of origin belongs to the sociologist William Bruce Cameron—though again, who said it first matters far less than what it says. Big data represents the next frontier in management by measurement. The technologies of data science are here, they are improving, and they will be used. And that’s a good thing, in general. Still, the enthusiasm for big-data decision making would surely benefit from a healthy dose of the humility found in that second quote.

For more than a decade at the New York Times, I have covered the technology ingredients and issues that now carry the big data label—well before the term entered the vernacular and became yet another unavoidable buzzword. And I still do. But this book is an effort to go both deeper and wider by surveying the projects and ideas on this frontier across the broader economy—and by talking to the individual scientists, entrepreneurs, and business executives who are confronting the technological and human challenges that data-ism inevitably creates. My reporting has been guided by the belief that if modern data technology is going to be a big deal economically, it has to go mainstream; it has to be deployed in almost every industry. The early triumphs of the consumer Internet—personalized search, targeted online ads, tailored movie recommendations, and the like—are impressive. But applying these technologies and techniques to huge industries of the physical world, like medicine, energy, and agriculture, is a more difficult challenge—and ultimately a more significant achievement, affecting far more people in far more ways. In the pages that follow, we will take a look at the progress of big data across the broader economy. We will be looking for the substance behind the salesmanship. Where is data-ism taking us? Where does big data shine, and where does it stumble?

2

POTENTIAL. POTENTIAL. POTENTIAL.

Jeffrey Hammerbacher is trying to win converts. He stands beside a lectern, pacing back and forth, addressing about a hundred people in an auditorium at the Mount Sinai medical school on the Upper East Side of Manhattan. Many in the audience wear the white lab coats of physicians. Hammerbacher has deep-set piercing eyes, an angular nose, a close-cropped beard, and a head of thick dark-brown hair. Brushing it into place is not always a priority. His title at the medical school is assistant professor of genetics and genomic sciences, but white lab coats are not his style. His shirts of choice are tight-fitting pullover jerseys or T-shirts. Both show off his brawny shoulders, thick biceps and forearms—the physique of the star baseball pitcher he was in high school; he still does his pitcher’s weight workout a couple of times a week.

To this gathering of physicians and medical researchers, Hammerbacher delivers a brisk overview of his data tactics and philosophy. He runs through some of the basics of data handling: instrument everything you can with data-generating sensors; store all the data you can immediately, and figure out what to ask it later; make your data open to others in your organization, and let them experiment with it.

The practice of data-driven discovery, Hammerbacher observes, is just getting under way in most fields. Observation rather than prediction should be the near-term goal. "Before you can predict the