The future belongs to the companies and people that turn data into products.

We’ve all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O’Reilly said that “data is the next Intel Inside.” But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science — the technologies, the companies and the unique skill sets.

The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. There’s a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track titles, artists, album titles). If you’ve ever used iTunes to rip a CD, you’ve taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that’s not in the database (including a CD you’ve made yourself), you can create an entry for an unknown album. While this sounds simple enough, it’s revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be “data products”). CDDB arises entirely from viewing a musical problem as a data problem.
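The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CDDB's actual algorithm: the hashing scheme in `disc_signature`, the in-memory `album_db` dictionary, and the sample track lengths and metadata are all assumptions made for the example.

```python
import hashlib

def disc_signature(track_lengths):
    """Derive a signature from the exact length (in samples) of each track.

    Hypothetical scheme: hash the ordered list of lengths. The real CDDB
    disc ID is computed differently.
    """
    joined = ",".join(str(n) for n in track_lengths)
    return hashlib.sha1(joined.encode("ascii")).hexdigest()

# Hypothetical database: signature -> album metadata.
album_db = {}

def register_album(track_lengths, metadata):
    """Create an entry for an album that is not in the database yet."""
    album_db[disc_signature(track_lengths)] = metadata

def lookup_album(track_lengths):
    """Return the stored metadata, or None for an unknown disc."""
    return album_db.get(disc_signature(track_lengths))

# Made-up track lengths and titles, for illustration only.
register_album(
    [10584000, 9525600, 12171600],
    {"artist": "Example Artist", "album": "Example Album",
     "tracks": ["One", "Two", "Three"]},
)
```

Nothing here ever touches the audio. The only inputs are track lengths, which is what makes the problem a data problem rather than a signal-processing one.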

Google is a master at creating data products. Here are a few examples:

Google’s breakthrough was realizing that a search engine could use input other than the text on the page. Google’s PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient to the company’s success.

Spell checking isn’t a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They’ve built a dictionary of common misspellings, their corrections, and the contexts in which they occur.

Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they’ve collected, and has been able to integrate voice search into their core search engine.

Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control by analyzing searches that people were making in different regions of the country.

Google isn’t the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are “data products” that help to drive Amazon’s more traditional retail business. They come about because Amazon understands that a book isn’t just a book, a camera isn’t just a camera, and a customer isn’t just a customer; customers generate a trail of “data exhaust” that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers’ behavior, the data they leave every time they visit the site.

The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science.

In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it. And it’s not just companies using their own data, or the data contributed by their users. It’s increasingly common to mash up data from a number of sources. “Data Mashups in R” analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff’s office, extracting addresses and using Yahoo to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and grouping them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.

The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively — not just their own data, but all the data that’s available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

To get a sense for what skills are required, let’s look at the data lifecycle: where it comes from, how you use it, and where it goes.

Where data comes from

Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can be (or has been) instrumented. At O’Reilly, we frequently combine publishing industry data from Nielsen BookScan with our own sales data, publicly available Amazon data, and even job data to see what’s happening in the publishing industry. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics as diverse as endocrinologists and hiking trails.

1956 disk drive

One of the first commercial disk drives from IBM. It has a 5 MB capacity and it’s stored in a cabinet roughly the size of a luxury refrigerator. In contrast, a 32 GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram.

Much of the data we currently work with is the direct consequence of Web 2.0, and of Moore’s Law applied to data. The web has people spending more time online, and leaving a trail of data wherever they go. Mobile applications leave an even richer data trail, since many of them are annotated with geolocation, or involve video or audio, all of which can be mined. Point-of-sale devices and frequent-shopper’s cards make it possible to capture all of your retail transactions, not just the ones you make online. All of this data would be useless if we couldn’t store it, and that’s where Moore’s Law comes in. Since the early ’80s, processor speed has increased from 10 MHz to 3.6 GHz, an increase by a factor of 360 (not counting increases in word length and number of cores). But we’ve seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB, a price reduction by a factor of about 40,000, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.
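The two ratios quoted above are easy to check. The short calculation below uses only the figures from the paragraph; the one assumption is that 1 GB is treated as 1,000 MB, which is what gives the round number for RAM.

```python
# Figures quoted in the paragraph above.
cpu_then_hz = 10e6   # 10 MHz
cpu_now_hz = 3.6e9   # 3.6 GHz
cpu_factor = cpu_now_hz / cpu_then_hz

# Assumption: 1 GB = 1,000 MB, so $1,000/MB is $1,000,000/GB.
ram_then_per_gb = 1000 * 1000
ram_now_per_gb = 25
ram_factor = ram_then_per_gb / ram_now_per_gb

print(f"CPU clock speed: {cpu_factor:.0f}x faster")
print(f"RAM price: {ram_factor:,.0f}x cheaper")
```

By this measure, memory prices improved roughly a hundred times more than clock speed did over the same period.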

The importance of Moore’s law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. The data exhaust you leave behind whenever you surf the web, friend someone on Facebook, or make a purchase in your local supermarket, is all carefully collected and analyzed. Increased storage capacity demands increased sophistication in the analysis and use of that data. That’s the foundation of data science.

So, how do we make that data useful? The first step of any data analysis project is “data conditioning,” or getting data into a state where it’s usable. We are seeing more data in formats that are easier to consume: Atom data feeds, web services, microformats, and other newer technologies provide data in formats that are directly machine-consumable. But old-style screen scraping hasn’t died, and isn’t going to die. Many sources of “wild data” are extremely messy. They aren’t well-behaved XML files with all the metadata nicely in place. The foreclosure data used in “Data Mashups in R” was posted on a public website by the Philadelphia county sheriff’s office. This data was presented as an HTML file that was probably generated automatically from a spreadsheet. If you’ve ever seen the HTML that’s generated by Excel, you know that’s going to be fun to process.

Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work. You’re likely to be dealing with an array of data sources, all in different forms. It would be nice if there were a standard set of tools to do the job, but there isn’t. To do data conditioning, you have to be ready for whatever comes, and be willing to use anything from ancient Unix utilities such as awk to XML parsers and machine learning libraries. Scripting languages, such as Perl and Python, are essential.
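To give a flavor of what this looks like in practice, here is a minimal sketch that pulls table rows out of spreadsheet-style HTML with unclosed tags and stray formatting. To keep it self-contained it uses Python's standard `html.parser` module rather than Beautiful Soup. The `TableExtractor` class, the `extract_rows` helper, and the sample markup (with made-up addresses and values) are assumptions for illustration, not the actual foreclosure data.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()     # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse non-breaking spaces and stray whitespace.
            raw = "".join(self._cell).replace("\xa0", " ")
            self._row.append(" ".join(raw.split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

def extract_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows

# Made-up sample in the style of spreadsheet-generated HTML.
messy = """
<table><tr><td class=xl24><font face="Arial">123 Example&nbsp;St
</font><td>$85,000
<tr><td>45 Sample Ave</td><td>  $120,500 </td></tr>
</table>
"""
rows = extract_rows(messy)
```

Real pages will break a parser like this in ways the sample does not, which is why you end up reaching for more forgiving tools and a pile of special cases.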

Once you’ve parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low [1]. In data science, what you have is frequently all you’re going to get. It’s usually impossible to get “better” data, and you have no alternative but to work with the data at hand.
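One way to avoid the trap described above is to set suspicious readings aside for review instead of silently dropping them. The sketch below is a minimal illustration under that assumption. The function name, the plausibility bounds, and the sample readings are all hypothetical and are not taken from any real instrument.

```python
def split_readings(readings, low, high):
    """Separate in-range readings from suspicious ones without deleting either.

    `low` and `high` are plausibility bounds chosen by the analyst.
    Missing values (None) are tracked separately by position.
    """
    accepted, flagged, missing = [], [], []
    for index, value in enumerate(readings):
        if value is None:
            missing.append(index)
        elif low <= value <= high:
            accepted.append((index, value))
        else:
            flagged.append((index, value))
    return accepted, flagged, missing

# Made-up readings: two gaps and two values below the expected range.
readings = [301, 298, None, 172, 305, 168, None, 299]
accepted, flagged, missing = split_readings(readings, low=200, high=500)
```

Keeping `flagged` around costs almost nothing, and it leaves open the possibility that the odd values are the interesting ones.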

If the problem involves human language, understanding the data adds another dimension to the problem. Roger Magoulas, who runs the data analysis group at O’Reilly, was recently searching a database for Apple job listings requiring geolocation skills. While that sounds like a simple task, the trick was disambiguating “Apple” from many job postings in the growing Apple industry. To do it well you need to understand the grammatical structure of a job posting; you need to be able to parse the English. And that problem is showing up more and more frequently. Try using Google Trends to figure out what’s happening with the Cassandra database or the Python language, and you’ll get a sense of the problem. Google has indexed many, many websites about large snakes. Disambiguation is never an easy task, but tools like the Natural Language Toolkit library can make it simpler.

When natural language processing fails, you can replace artificial intelligence with human intelligence. That’s where services like Amazon’s Mechanical Turk come in. If you can split your task up into a large number of subtasks that are easily described, you can use Mechanical Turk’s marketplace for cheap labor. For example, if you’re looking at job listings, and want to know which originated with Apple, you can have real people do the classification for roughly $0.01 each. If you have already reduced the set to 10,000 postings with the word “Apple,” paying humans $0.01 to classify them only costs $100.

Working with data at scale

We’ve all heard a lot about “big data,” but “big” is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.” The most meaningful definition I’ve heard: “big data” is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

What are we trying to do with data that’s different? According to Jeff Hammerbacher [2] (@hackingdata), we’re trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with the reality of multiple, unstructured data sources, in which you may not know what’s important until after you’ve analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you’re asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren’t concerned about the difference between 5.92 percent annual growth and 5.93 percent.

To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schema. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.

HBase: Part of the Apache Hadoop project, and modelled on Google’s BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.
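The shape of the approach is easier to see in code. The following is a single-process sketch of the map, group, and reduce steps using the classic word-count problem. It is not Google's implementation or Hadoop's API; the function names are assumptions, and on a real cluster each `map_task` call would run on a different machine.

```python
from collections import defaultdict

def map_task(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_outputs):
    """Group intermediate values by key, as a framework would between stages."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    # Each document is an independent subtask, so this loop is the part
    # that can be spread across many processors.
    mapped = [map_task(doc) for doc in documents]
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in grouped.items())

counts = word_count(["data is the new oil", "Data science is new"])
```

Because the map tasks share no state, adding machines speeds up the first stage almost linearly. The grouping step in the middle is where a real framework does most of its hard work.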

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo’s claim that they had built the world’s largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon’s Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop has been instrumental in enabling “agile” data analysis. In software development, “agile practices” are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) makes it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It’s easier to consult with clients to figure out whether you’re asking the right questions, and it’s possible to pursue intriguing possibilities that you’d otherwise have to drop for lack of time.

Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. HOP processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don’t require millisecond accuracy. As with the number of followers on Twitter, a “trending topics” report only needs to be current to within five minutes, or even an hour. According to Hilary Mason (@hmason), data scientist at bit.ly, it’s possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.
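To make "soft real-time" concrete, here is a small sketch of a trending-topics counter that only considers events from the last five minutes. It runs in one process and has nothing to do with HOP or with how Twitter actually computes trends. The `TrendingTopics` class, the window size, and the sample events are assumptions, and the sketch also assumes events arrive in timestamp order.

```python
from collections import Counter, deque

class TrendingTopics:
    """Count topics seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, timestamp, topic):
        self.events.append((timestamp, topic))
        self.counts[topic] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def top(self, n=3):
        return self.counts.most_common(n)

# Made-up events: (seconds since start, topic).
trends = TrendingTopics(window_seconds=300)
for ts, topic in [(0, "#hadoop"), (10, "#rstats"), (20, "#hadoop"),
                  (400, "#rstats"), (410, "#nosql"), (420, "#rstats")]:
    trends.add(ts, topic)
```

A report built from `top()` is always slightly stale, and that is fine: nobody needs a trending list that is accurate to the millisecond.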

Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don’t have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell) and even face detection — an ill-advised mobile application lets you take someone’s picture with a cell phone, and look up that person’s identity using photos available online. Andrew Ng’s Machine Learning course is one of the most popular courses in computer science at Stanford, with hundreds of students (this video is highly recommended).

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de-facto standard.

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a “training set,” or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you’ve collected your training data (perhaps a large collection of public photos from Twitter), you can have humans classify them inexpensively — possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It’s an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.

While I haven’t stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the “grammar of data science.” It is crucial to “making data speak coherently.” We’ve all heard the joke that eating pickles causes death, because everyone who dies has eaten pickles. That joke doesn’t work if you understand what correlation means. More to the point, it’s easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google’s ad auctions work. Statistics has become a basic skill. It isn’t superseded by newer techniques from machine learning and other disciplines; it complements them.
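As a concrete illustration of the advertising example, the sketch below applies a standard two-proportion z-test to two conversion rates. The view and conversion counts are hypothetical, chosen so that one ad has 2 percent more conversions than the other. The function name is an assumption, and a real analysis might well call for a different test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both ads perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: ad A converts 102 of 2,000 views, ad B 100 of 2,000.
z, p = two_proportion_z_test(102, 2000, 100, 2000)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With samples this small, the p-value is far above the usual 0.05 threshold, so a difference of two conversions is indistinguishable from noise. The same relative difference on a much larger sample could tell a different story.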

While there are many commercial statistical packages, the open source R language — and its comprehensive package library, CRAN — is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing “one stop shopping” for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into distributed computing. If there’s a single tool that provides an end-to-end solution for statistics work, R is it.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte’s Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that’s not really what concerns us here. Visualization is crucial to each stage of the data scientist’s work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you’ve gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Casey Reas’ and Ben Fry’s Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM’s Many Eyes, many of the visualizations are full-fledged interactive applications.

Nathan Yau’s FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where “art” comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn’t just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That’s not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It’s the kind of question we now ask routinely.

Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization [3]

Where do you find people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background and computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.

Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. It asked things like: did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended, and then at books members had in their libraries. The result was a valuable data product that analyzed a huge database, but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.
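The incremental style is simple to picture in code. The sketch below is purely illustrative and is not LinkedIn's system. The rule functions, the member fields, and the group names are all hypothetical. The point is only that each new signal is one more small function added to a list.

```python
def by_school(member):
    # First version: look only at the profile.
    return {f"{school} Alumni" for school in member.get("schools", [])}

def by_events(member):
    # Added later: events the member attended.
    return {f"{event} Attendees" for event in member.get("events", [])}

# New signals are appended here one at a time.
RULES = [by_school, by_events]

def recommend_groups(member, rules=RULES):
    """Union the suggestions from every rule, minus groups already joined."""
    suggestions = set()
    for rule in rules:
        suggestions |= rule(member)
    return sorted(suggestions - set(member.get("groups", [])))

# Hypothetical member record.
member = {"schools": ["Cornell"], "events": ["Example Conf"], "groups": []}
first_version = recommend_groups(member, rules=[by_school])
later_version = recommend_groups(member)
```

The first version is useful on its own, and each later rule adds value without requiring the earlier ones to be rewritten.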

This is the heart of what Patil calls “data jiujitsu” — using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable — see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.

Hiring trends for data science

It’s not easy to get a handle on jobs in data science. However, data from O’Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the “data science” market as a whole. This graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.

Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they’re all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they’re entrepreneurs.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”

The future belongs to the companies that figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:

The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.

Data is indeed the new Intel Inside.

O’Reilly publications related to data science

R in a Nutshell
A quick and practical reference to learn what is becoming the standard for developing statistical software.

Beautiful Data
Learn from the best data practitioners in the field about how wide-ranging — and beautiful — working with data can be.

Beautiful Visualization
This book demonstrates why visualizations are beautiful not only for their aesthetic design, but also for elegant layers of detail.

Head First Statistics
This book teaches statistics through puzzles, stories, visual aids, and real-world examples.

Head First Data Analysis
Learn how to collect your data, sort the distractions from the truth, and find meaningful patterns.

[1] The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the 70s) were “real.” Whether humans or software decided to ignore anomalous data, it appears that data was ignored.

Thanks for this post. It might be useful to note that data science has been recognized as a discipline since at least 2001; that one of the first scholars to use “data science” in the sense used in this post was Dr. William S. Cleveland in his 2001 article, “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” http://j.mp/a8mQeP ; that people working in this discipline have organized professional associations, including CODATA, the International Council for Science: Committee on Data for Science and Technology http://www.codata.org/ ; that there are conferences devoted to this discipline, including the International CODATA Conference http://www.codata2010.com/ ; and that there are journals that publish data science research, including Journal of Data Science http://www.jds-online.com/ (founded in 2003), and Data Science Journal http://j.mp/9YYh5r (founded in 2002).

Michael F. Martin

Accountants are a kind of data scientist too.

Andrew Walkingshaw

As one of the three founders of Timetric (http://timetric.com), all of whom have a background in the physical sciences, I’m biased: but we couldn’t agree more.

Alex Tolley

Good article. I do question the term data ‘science’, as in “Data science enables the creation of data products.” Science is a process for discovering truth (or at least discarding falsity). Extracting information from data is more like cooking, an art.

I also am a bit skeptical about the fetish-izing of ‘big data’. Certainly sometimes data sets are necessarily large, but arguably better data, acquired through better questions, is preferable. A second best can be data sampling to reduce the analyzable data set to more tractable proportions. Mining large data sets to death is very similar to the “spreadsheet-itis” that spread quickly in the 1980s once spreadsheets became widely available. Nothing wrong with using spreadsheets, except that they tended to focus attention on the questions where data was available, rather than the questions that really needed answers.

Brandyn

“In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers.”

You wouldn’t use MapReduce like this in a low-latency query situation. It is made for high throughput batch processing, not quick turn around.

Joel

…and still, to date, the best tool to mess around with large data sets is SAS – been around for ages. Seems that as the web matures, so too do the tool sets and methodologies it uses to extract value.

I’ve only started trying it out, but Protovis is just Javascript and SVG, and that’s a big plus for me.

Kirk

Thanks Mike, “terabyte drives are consumer equipment” are the cord wood to my campfire. I want to know more about how video data is being analyzed. Does Google’s new machine transcription service change the game?

Brian Ahier

This is the best post on data I have ever read. I am stunned…

Ken

Aw heck, I hate to say this Mike, but “data are”, not “data is”. But despite my nitpicking, this is a good article.

First (or most recently, at least) yall made up Web 2.0. Now you make up Data Science.

If you want to create new information from existing, read Codd. That’s what relations do.

As to R, yes, it is useful, but SAS still rules the job listings.

The explosion of text data (distinct from audio, video, etc.) is wholly due to Kiddies such as Google who didn’t want to build BCNF datastores because they’re (the Kiddies, not the data) just too dense. So they build, using xml to compound the insult, massively redundant flat files, which in turn demand vast amounts of HDD and CPU to process. And they (and you) pat them on the back as being oh so clever. Yikes.

The errors in the R and Stat books are such to make it obvious your authors (you, too) have an understanding of inferential statistics (the only variety of any value) about as deep as the ISO-9000 and Six Sigma folks. No, that isn’t an AttaBoy.

Michael R. Bernstein

“The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success.”

Also important is the ability to figure out how to tap into existing sources of data, rather than collecting your own.

Eric Christensen

Data is cool. Data Science is cool. Put out the feelers and collect more data (figure out what to do with it later).

Muhammad Mudssar

Good article about data and its importance, especially the importance of massive unstructured data. The point is how one can benefit from massive amounts of unstructured data. I think by using MapReduce one can find the hidden patterns in unstructured data.
I think data along with better processing techniques is the future.

marc

Your graph of job listings vs time displays no indication of its vertical scale. Does the maximum vertical point on the graph represent 3 listings or 300,000? In an article on the importance of data, that leaves me wondering how much of the content is hype and how much is well thought out.

Alan Howlett aka @Technontologist

I see lots of commonality with Richard Saul Wurman’s definition of Information Architect[ure] (not the co-opted Web developer version), “information architect. 1) the individual who organizes the patterns inherent in data, making the complex clear. 2) a person who creates the structure or map of information which allows others to find their personal paths to knowledge. 3) the emerging 21st century professional occupation addressing the needs of the age focused upon clarity, human understanding, and the science of the organization of information.”

@Technontologist

Jorn Bettin

Very nice article. However there is still a long way to go for data products. The value of data analysis and representation critically depends on access to trustworthy source data. Quantification of trustworthiness is one of the harder problems.

For example, the entire financial system hinges on trust, and the level of trust in financial instruments is not known to be particularly stable. Manufacturing and publicising statements that are designed to influence the level of trust in financial instruments is an entire industry in its own right.

The article would be even better if it didn’t contain simple calculation errors like “RAM has moved from $1,000/MB to roughly $25/GB — a price reduction of about 4000″. Errors like this one have a direct impact on trustworthiness.
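The figure is easy to check. A minimal sketch, assuming the decimal convention of 1,000 MB per GB (with 1,024 MB per GB the ratio comes out slightly higher); the prices are the ones quoted from the article, and the function and variable names are illustrative only:

```python
# Prices as quoted from the article; the MB-per-GB factor is an assumption.
OLD_PRICE_PER_MB = 1_000  # dollars per MB
NEW_PRICE_PER_GB = 25     # dollars per GB
MB_PER_GB = 1_000         # assumed decimal convention; pass 1_024 for binary


def price_reduction_factor(old_per_mb, new_per_gb, mb_per_gb=MB_PER_GB):
    """Return how many times cheaper one unit of RAM has become."""
    old_per_gb = old_per_mb * mb_per_gb  # convert the old price to dollars per GB
    return old_per_gb / new_per_gb


factor = price_reduction_factor(OLD_PRICE_PER_MB, NEW_PRICE_PER_GB)
print(f"RAM is roughly {factor:,.0f} times cheaper")  # prints 40,000
```

Under either convention the reduction is on the order of 40,000, not 4,000.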

http://radar.oreilly.com/mikel Mike Loukides

Ouch. That arithmetic error really hurts. Thanks for pointing it out–it’s been corrected.

The problem of trust is interesting. It depends to a large extent on the nature of the product: we demand a different level of trust for financial products (and that trust has obviously been shaken badly over the past two years) than we do for Facebook or LinkedIn recommendations. And you’re right, in finance there’s a marketing industry that’s designed to manufacture trust entirely aside from the quality of the data or the analysis.

Brian Shaler

I’m going to cite the hell out of this article. While a little scatterbrained, it covers an amazingly broad array of subjects within the realm of data science. The number of references & links to languages, platforms/frameworks, tools, and projects was very impressive! Thanks!

While all these hip new projects have been cropping up during the last five years, we’re still at the beginning of this conversation. I’m looking forward to seeing where the industry goes in the next five years and hope to play a part!

Brian Shaler

I want to clarify on my last comment, when I said “all these” I was not referring to all data-related projects in general. I’m aware that tools have been available to manipulate data for decades. I probably should have phrased it more like, “With all these hip new projects cropping up […]”

The two physical phenomena are often confused, but please don’t further this problem!

They share the fact that their principal drivers are forms of pollution from human activities, but they are not the same in their impacts or the actions needed to mitigate them.

Recently a meteorological link was discovered whereby the increased UVB radiation due to the depleted ozone layer does affect air circulation patterns in the Antarctic, but this is basically a reflection of the fact that almost any aspects of the earth system are linked if you look hard enough.

http://radar.oreilly.com/mikel Mike Loukides

I’m glad you liked the article. I’ve just fixed the reference to the Ozone Layer. Thanks for pointing that out.

Megan

Kudos for encouraging thoughtfulness about data, and interdisciplinary approaches to data analysis problems. However, I must make two points:

1) actuarial science != Statistics

2) For those commenting that this is the best article they’ve read about data, please allow me to introduce you to W. Edwards Deming (http://en.wikipedia.org/wiki/W._Edwards_Deming). Statisticians have been thinking hard about data for many, many years. Let’s not discard them with a flippant comment about actuaries and the false belief that anyone can run a few programs in R and correctly analyze data.

G. Boyd

Data Science exists in order to usurp an individual’s OODA loop without their knowledge, as described by former Air Force Pilot John Boyd. Of course data science is described as a sexy gig when your bosses are control freaks.

It doesn’t matter whether we get the image of the golden ring, or the network centric system, as both images represent identical ends.

Ben Hoyle

Great article.

In many ways “data science” and “information engineering” are synonymous; both deal with a large amount of unstructured data and form conclusions, inferences and predictions. Both can also be thought of as overlapping with AI and brain science; what does the brain (human, fly etc.) do if not process lots of (sensory) data extremely cleverly?

We are reaching a stage where the great limiting factor to significant progress is our intelligent processing systems: we are beginning to have the data, but we don’t yet have suitably developed systems to intelligently process that data (both in a human and non-human manner).

G. Boyd

@Ben Hoyle,

First, information engineering is a means to an end, an end which can be referred to as “social engineering”, or the designing of society and its participants. Usurp one’s “Observe” and “Orient” functions in their closed-loop decision processes, and you can then predict an individual’s “Decide” and “Act” part of the loop (as described by John Boyd).

Second, of course we have, or will soon have, the algorithms available to process this data. What do you think “intent harvesting” and “intent generation” are about? This is Facebook’s play in the display advertising game, with their launch of the “I Like” feature. We’re talking about generating individual profiles on the ‘feedback’ side of the loop such that ‘controls’ can be implemented on the other half of the loop.

Loss of privacy and the simultaneous rise of data engineering is no accident. All these moves are driven by the goal of usurping one’s OODA loop. Do that across enough people in the public domain, and guess what you have.

Now, why is O’Reilly et al. not discussing this critical aspect of the technologies and strategies that are being pushed on to the tech community for deployment?

Telling us it’s not relevant doesn’t cut it.

M S Prasad

Great knowledge-filled post and good coverage of technology.

Data science, as I understand it, has long been a forte of statisticians and mathematicians. We converted it into a formidable tool and gave it a face so that users can understand it better (something like putting relativity theory in simple words, e.g. LINUX for Dummies).

Information engineering is a process to extract useful knowledge or results from a set of information, which can be data and sometimes also intangible or perceived information.
As far as image data sets are concerned, being multidimensional, they have a fully mathematical basis for their processing and result retrieval.

Just my thoughts.
I have posted this article on the Cloud Security Alliance group on LinkedIn.
Thanks.

Torrance S

An incredibly well written article…

martin king

There are many aspects to science; much of it consists of the daily grind with data and the tools of science.

I think we need to be careful not to lose sight of the drivers of science and where much of science comes from – intuition and great ideas .. after which we create experiments to test hypotheses which inevitably involve data of some kind.

Jewel Ward

I would add to this list of aspects of Data Science the ability to determine the best way to migrate, store, & archive data, as well as whether or not to keep it. Storage is, indeed, cheap, and getting cheaper. But given petabytes or yottabytes of data, if you aren’t using 90% of what you are storing, does it make sense to keep paying to store, migrate, etc., the data (barring legal requirements to do so)? Think in terms of the larger scale of it: the electricity required (both in generating the electricity itself & paying for use), plus the machines that must be purchased, and the humans who must be paid to manage the data through its lifecycle. Cost-benefit analysis is an important component.

As I keep reading and reading the article, the academic inside me keeps repeating the same question: do we (academics) do a good, or even adequate, job at educating future data scientists? Should we consider creating data science programs? What can we do to offer better, more relevant education to individuals who may be pursuing or using data science? And who can possibly advise us about the practical realities of data science?

Vassilis Nikolopoulos

Excellent article on new data management trends… traditional approaches to DB and storage / analytics have to be changed, in order to cope with this huge evolution on data and stream decision analysis…

OLAP, BI and traditional Data mining have to be “updated” with new R&D tips…

David Alan Christopher

Fascinating article Mike. I’m a librarian and our lot faces a handful of data as well. We’ve been writing some Java and Ruby code lately for in-house data analysis and even some basic segmentation of our huge inventory.

We often need to pull content from external public websites (usually government and NPO’s), which don’t have any form of exportable data formats. We’ve started using a tool called Feedity – http://feedity.com for that purpose. So far it’s worked out pretty well as we’ve been able to “scrape” raw data as structured information from hundreds of public sources.

I bet this space will only grow as data becomes more and more important for organizations world-wide.

Ellen O'Neal

This stuff fascinates me! I love the examples of Facebook, LinkedIn and Amazon using patterns to suggest other relationships and cross sell. Another example that comes to mind is Pandora. A friend of mine wrote an article that I think may be interesting to fellow readers:

This is a great article about how getting information from data is going to be so important in the future. I’ve been writing about this for years. I love how “data science” is really taking on a life of its own now.

Let me however add that much of the data that is collected is irrelevant. CDDB is a great success because their data model is simple and the data are really very limited, a few million rows. Filtering the relevant stuff is not easy and mostly an intuitive human skill as long as the model can be grasped.

Statistical processing does however not replicate reality, but is always built on a model illusion. Don’t forget that all models are wrong but some are useful (George Box). Further it depends how the data is collected: how it is timed, how accurate it is, how it is filtered, which questions were asked or points that were measured, and a few more elements that rely on model assumptions. Statistics further only produce correlation but not causation.

Humans tend to misinterpret many effects as causes because the effect can be seen while THE CAUSE doesn’t exist as a singular event but is a complex web of feeding stimuli that need a complex web of receptive context to actually make something happen. I.e. bringing a pot of water to boil is not caused by switching on the stove. Think of all the other causal elements that are needed including air pressure and saline content.

All human activity cannot be causally interpreted because social activity must be seen as complex adaptive systems that undergo continuous change that is hardly ever reflected in the data model and processing. Which is why managing a business with BPM workflows and predictive analytics is for idiots who don’t understand nature and science. They look for predictability where there is none.

The models of global warming and climate change are complete rubbish and so will be many of the data models that try to make sense of all the random data collected. Yes, some of it will be useful as an AHA moment, but it certainly won’t predict the future or allow us to control things better. And if that isn’t possible, why bother.

I won’t even go into the already mentioned issues of privacy and misuse of the collected data by governments and big business.

http://attorneydirectoryofamerica.com Attorney

While data science is incredibly beneficial and helping to improve the standard of living of humanity, I’m not so sure that I go so far as to agree with Hal Varian that statistics is the next sexy job. That’s a bit of a stretch! Important? Absolutely. Sexy? Well….

Great post by the way.

Jim

http://www.revitolstretchmarkcream.com Revitol

A very interesting and thorough article! I do feel a little uneasy, though, reading about the ability to extract all the types of information from raw data. It looks like the big companies such as Google may know a little bit too much about us…

Rissy

Very interesting post..I get a lot of information and insights here..I just stumbled here in your blog and thanks a lot for this great post.:D

http://www.ardalahmet.com Ahmet Ardal

Thanks for this excellent article. It really encouraged and motivated me a lot as a newbie data scientist.

None

I like this information, thank you!

wfk3

Good article. Does it seem to anyone else that we are on our way toward creating a new “reality” with all this data? The data may soon be more “real” than the real world phenomena it describes (data is still conventionally seen as something that describes something else, and so not the atomic thing itself) but that may change so that the data itself is the atomic thing, it need not describe anything but exist for its own sake.
