Five big data predictions for 2012

Edd Dumbill looks at the hot topics in data for the coming year.

As the “coming out” year for big data and data science draws to a close, what can we expect over the next 12 months?

More powerful and expressive tools for analysis

This year has seen consolidation and engineering around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as we see the unruly menagerie of the Hadoop universe increasingly packaged into distributions, appliances and on-demand cloud services. Hopefully it won’t be long before that’s dull, yet necessary, infrastructure.

Looking up the stack, there’s already an early cohort of tools directed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there’s a way to go in making big data more powerful: that is, to decrease the cost of creating experiments.

Here are two ways in which big data can be made more powerful.

Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idioms that let us focus on the data, rather than on abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we’re doing with the data. These abstractions will in turn lend themselves to the creation of better tools for non-programmers.

Better support for interactivity. If Hadoop has any weakness, it’s in the batch-oriented nature of computation it fosters. The agile nature of data science will favor any tool that permits more interactivity.
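To make the first of these two points concrete, here is a minimal sketch of what a data-focused idiom can look like. It is plain Python rather than any Hadoop API, and the click log and the `views_per_page` function are hypothetical names invented for illustration. The point is only that the whole "job" reads as a statement about the data, with no mapper or reducer plumbing in sight.

```python
from collections import Counter

# Hypothetical click log of (user, page) pairs; illustrative values only.
clicks = [
    ("ann", "/home"),
    ("bob", "/home"),
    ("ann", "/pricing"),
    ("ann", "/home"),
]


def views_per_page(events):
    """Count views per page as a single expression over the data."""
    return Counter(page for _user, page in events)


if __name__ == "__main__":
    for page, views in views_per_page(clicks).most_common():
        print(page, views)
```

A distributed framework would have to do far more behind the scenes, but the program a data scientist writes could ideally stay about this short.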

Streaming data processing

Hadoop’s batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn’t need to be up-to-the-minute. However, batch processing isn’t always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.

Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop was born out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.

For some applications, there just isn’t enough storage in the world to store every piece of data your business might receive: at some point you need to make a decision to throw things away. Having streaming computation abilities enables you to analyze data or make decisions about discarding it without having to go through the store-compute loop of map/reduce.
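As a rough sketch of that idea, the following Python keeps only a running summary of a stream and decides, reading by reading, what is worth retaining. It is not modeled on Storm or S4; the `RunningStats` class, the `filter_stream` function and the fixed deviation threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Count and mean maintained incrementally, using constant memory."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.mean += (value - self.mean) / self.count


def filter_stream(
    readings: Iterable[float], threshold: float, stats: RunningStats
) -> Iterator[float]:
    """Yield only readings that differ from the running mean by more than
    `threshold`. Every reading is folded into `stats`; the rest is discarded."""
    for value in readings:
        if stats.count > 0 and abs(value - stats.mean) > threshold:
            yield value  # unusual enough to be worth storing
        stats.update(value)


if __name__ == "__main__":
    stats = RunningStats()
    kept = list(filter_stream([10.0, 10.0, 10.0, 50.0, 10.0], 15.0, stats))
    print(kept, stats.count, stats.mean)
```

Each reading is seen once and never stored unless it is flagged, so memory use stays constant however long the stream runs. A real deployment would need a more careful notion of "unusual" than a fixed threshold.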

Emerging contenders in the real-time framework category include Storm, from Twitter, and S4, from Yahoo.

Rise of data marketplaces

Your own data can become that much more potent when mixed with other datasets. For instance, add weather conditions to your customer data, and discover whether there are weather-related patterns in your customers’ purchasing. Acquiring these datasets can be a pain, especially if you want to do it outside of the IT department, and with some exactness. The value of data marketplaces is in providing a directory to this data, as well as streamlined, standardized methods of delivering it. Microsoft’s direction of integrating its Azure marketplace right into analytical tools foreshadows the coming convenience of access to data.
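The weather example can be sketched in a few lines of Python. Everything here is hypothetical: the daily sales figures, the weather records and the `average_sales_by_condition` function are made up for illustration and do not come from any marketplace API. The sketch assumes both datasets are keyed by the same ISO date strings.

```python
from collections import defaultdict

# Hypothetical daily sales totals keyed by ISO date; illustrative values only.
daily_sales = {
    "2011-12-01": 1200.0,
    "2011-12-02": 800.0,
    "2011-12-03": 1000.0,
    "2011-12-04": 600.0,
    "2011-12-05": 900.0,  # no weather record for this day
}

# Hypothetical third-party weather data keyed by the same dates.
weather_by_date = {
    "2011-12-01": "sunny",
    "2011-12-02": "rain",
    "2011-12-03": "sunny",
    "2011-12-04": "rain",
}


def average_sales_by_condition(sales, weather):
    """Join sales to weather on date and average the sales per condition."""
    totals = defaultdict(lambda: [0.0, 0])  # condition -> [sum, days]
    for date, amount in sales.items():
        condition = weather.get(date)
        if condition is None:
            continue  # skip days the external dataset does not cover
        totals[condition][0] += amount
        totals[condition][1] += 1
    return {cond: total / days for cond, (total, days) in totals.items()}


if __name__ == "__main__":
    print(average_sales_by_condition(daily_sales, weather_by_date))
```

The join itself is trivial once both datasets share a key. The hard part, and the part a marketplace promises to ease, is getting the second dataset in a clean, consistently keyed form in the first place.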

Development of data science workflows and tools

As data science teams become a recognized part of companies, we’ll see a more regularized expectation of their roles and processes. One of the driving attributes of a successful data science team is its level of integration into a company’s business operations, as opposed to being a sidecar analysis team.

Software developers already have a wealth of infrastructure that is both logistical and social, including wikis and source control, along with tools that expose their process and requirements to business owners. Integrated data science teams will need their own versions of these tools to collaborate effectively. One example of this is EMC Greenplum’s Chorus, which provides a social software platform for data science. In turn, use of these tools will support the emergence of data science process within organizations.

Data science teams will start to evolve repeatable processes, hopefully agile ones. They could do worse than to look at the ground-breaking work newspaper data teams are doing at news organizations such as The Guardian and The New York Times: given short timescales, these teams take data from raw form to a finished product, working hand-in-hand with the journalist.

Increased understanding of and demand for visualization

Visualization fulfills two purposes in a data workflow: explanation and exploration. While business people might think of a visualization as the end result, data scientists also use visualization as a way of looking for questions to ask and discovering new features of a dataset.

If becoming a data-driven organization is about fostering a better feel for data among all employees, visualization plays a vital role in delivering data manipulation abilities to those without direct programming or statistical skills.

Throughout a year dominated by business’s constant demand for data scientists, I’ve repeatedly heard from data scientists about what they want most: people who know how to create visualizations.


Edd,
Great comments, particularly about visualization. As data sets become more complex, particularly in cloud computing and virtualisation, infrastructure managers are demanding better visibility and a holistic view of their estate.
That’s why we have developed HyperGlance, the world’s first infrastructure manager that uses the power of computer gaming software to show really complex data in real time. You can show tens of thousands of devices on a single pane of glass and fly straight to the issue with one click.
HyperGlance can be used as either a planning or operational tool to show complex VM estates and to solve problems by seeing what’s changed with the powerful correlation engine.
You can see the latest software release at http://www.real-status.com

Let me submit a shameless but relevant plug about RHadoop, an open source project by Revolution Analytics. Instead of creating a new language or new software, we just connected statisticians’ favorite language, R, with Hadoop. We had your points 1 and 2 on elegance and interactivity in mind as we designed this collection of packages. It’s just one example, but in the case of k-means we achieved a 5-fold code length reduction compared to a Pig/Python/Java implementation by the fine people at Hortonworks. Please check it out: https://github.com/RevolutionAnalytics/RHadoop/wiki/Comparison-of-high-level-languages-for-mapreduce:-k-means

Great post, Edd! Very prescient. In the vein of data marketplaces, I’d add a 6th bullet point around data provenance. As data is increasingly opened up I see more and more people asking where the data come from, who mucked with them, how they were collected, and why. Maybe this is more of a 2013 item, but I think we’ll see an explosion in efforts to standardize data formats or at least add some kind of paper trail (bit trail?) describing what adventures the data’s been on before it’s made its way onto your machine.

Jake, yes, I’ve been seeing exactly the same trend too. Metadata support is going to become a big deal for data-driven processes to actually be credible and governable. And that’s a really hard problem. I think we’ll see awareness of the need crystallize in 2012 and perhaps some solutions in the year following.

Interesting concept on the data marketplace, Edd. This one is particularly interesting, because we know the data is there, and at the same time we know the many obstacles that need to be overcome so that people feel comfortable with their data being exposed.

For example, in the field of healthcare, a compelling argument for patient safety can be made around appropriate sharing of data, but the reality is that almost all organizations are unwilling to share data with each other out of concern for privacy. At some point the value of the information will hopefully outweigh the risk concerns, but this is yet to be seen.

Hello Edd – thanks for the interesting article. I’m especially interested in your comments about streaming data processing. We are doing a lot of work that is characterized by the following 2 patterns –
– the ability to use rich analytics that allow incoming or streaming data to be filtered by organizational context. This approach can be used to determine relevance before you go through the cleanse and load process, which can provide better time to market for operational or analytical use cases.

– applying SAS analytical processing using the streaming API of Hadoop so that one can combine the power of the Hadoop parallel processing with world class analytics.