5 trends that are changing how we do big data

It’s time to rethink the who, what, where, why and how of big data. After a surge of important news in the past couple weeks, we’re approaching a period of relative calm and can finally assess how the space has evolved in the past year. Here are five trends shaping up that should change almost everything about big data in the near future, including how it’s done, who’s doing it and where it’s consumed. Feel free to share the trends you’re seeing in the comments.

The democratization of data science

The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning. Elsewhere, it’s products such as 0xdata that aim to simplify and add scale to well-known statistical-analysis tools such as R, or products such as Quid that try to mask the finer points of concepts such as machine learning and artificial intelligence behind well-designed user interfaces and slick visualizations. Platforms such as Kaggle have opened the door to crowdsourcing answers to tough predictive-modeling problems.

Whatever the avenue, though, the end result is that individuals who have a little imagination, some basic computer science skills and a lot of business acumen can now do more with their data. A few steps down the ladder, companies such as Datahero, Infogram and Statwing are trying to make analytics accessible even to laypersons. Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.
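As a rough illustration of how low the barrier has become, here is a minimal sketch, using only the Python standard library and invented sample figures, of the kind of descriptive analysis these point-and-click tools automate:

```python
import statistics

# Made-up monthly revenue figures for a small business.
revenue = [12300, 11800, 13100, 12900, 14200, 13700]

mean = statistics.mean(revenue)    # central tendency
stdev = statistics.stdev(revenue)  # month-to-month variability

# A naive trend check: is the second half of the period beating the first?
first_half = statistics.mean(revenue[:3])
second_half = statistics.mean(revenue[3:])
trend = "up" if second_half > first_half else "flat or down"

print(f"mean={mean:.0f} stdev={stdev:.0f} trend={trend}")
```

Products aimed at laypersons essentially wrap this sort of computation in a friendly interface; the point is that the arithmetic itself is no longer the hard part.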

Hadoop is growing beyond MapReduce

Much as Google (s goog) has moved past the MapReduce framework on which Hadoop’s version was modeled, it seems likely we’ll see Hadoop’s MapReduce grow less important from this point on. Presumably, the Hadoop community will focus more on using the platform’s distributed nature to support real-time processing and other new capabilities that make Hadoop a better fit in next-generation data applications. If Hadoop can’t fill the void, there are plenty of people working on other technologies — Storm and Druid, for example — that will gladly do so.

The HBase NoSQL database that’s built atop the Hadoop Distributed File System is a good example of what’s possible when Hadoop is freed from MapReduce’s constraints. Large web companies such as Facebook (s fb) and eBay (s ebay) already use HBase to power transactional applications, and startups such as Drawn to Scale and Splice Machine have used HBase as the foundation for transactional SQL databases. More new products and projects, such as the graph-processing system Giraph, will look for ways to leverage HDFS because it gives them a file system that’s scalable, free, relatively mature and, perhaps most important, tied into the ever-growing Hadoop ecosystem.

Coming soon to an app near you

Of course, all of this technological improvement is nothing without applications to take advantage of it, so it’s good news that we’re seeing a wide range of approaches for making this happen. One of these approaches is making big data accessible to developers, which is where startups such as Continuuity, Infochimps and even Precog (at heart a big data BI engine) come into play. They make it relatively easy for developers to tie at least some of an application’s functions into a big data backend, sometimes via a process as simple as writing a script or dropping a generated snippet directly into the application’s code.
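The "snippet you drop into your code" pattern can be sketched in a few lines. This is a hypothetical example, not any of the named vendors' actual APIs: the collector URL and event schema are invented for illustration.

```python
import json
import time

# Hypothetical tracking snippet of the kind these platforms generate for
# developers to paste into an application. COLLECTOR_URL and the event
# schema are invented placeholders, not a real vendor endpoint.
COLLECTOR_URL = "https://collector.example.com/v1/events"

def track(event_name, properties):
    """Build the JSON body an app would POST to its big data backend."""
    payload = {
        "event": event_name,
        "ts": int(time.time()),
        "props": properties,
    }
    # A real snippet would POST this to COLLECTOR_URL; returning the
    # serialized body keeps the sketch self-contained.
    return json.dumps(payload)

body = track("checkout_completed", {"cart_value": 59.99, "items": 3})
```

The appeal for developers is that everything downstream of that one call, storage, aggregation, analysis, is someone else’s problem.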

Machine learning is everywhere

Machine learning has had something of a coming-out party in the past year and is now so prevalent it might be easy to mistake it for something that’s not difficult to do well. It’s easy to see why machine learning is so popular, though: In an age where consumers (and advertisers) want more personalization, and where computer systems are overwhelmed with data flying at them from all different directions, the prospect of writing models that continuously discover patterns among potentially countless data points has to be appealing.

Now, it’s difficult to imagine a new tech company launching that doesn’t at least consider using machine learning models to make its product or service more intelligent. Heck, even Microsoft (s msft) appears to be making a big bet on machine learning as the foundation of a new revenue stream. The technology to store and process lots of data is out there, and the brainpower looks to be coming along as well. Soon, there will be few excuses for building applications that don’t learn as they go: what users want to see, how systems fail or when customers are about to cancel a service.
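To make "learn as they go" concrete, here is a minimal sketch of online learning: a perceptron updated one observation at a time, with made-up churn-style features. It stands in for the far richer models described above, not for any particular product.

```python
# A minimal online learner: a perceptron adjusted one observation at a
# time, with no batch retraining. Features and labels are invented for
# illustration (a toy churn-prediction setting).

def predict(weights, bias, features):
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def update(weights, bias, features, label, lr=0.1):
    """Nudge the model only when it was wrong."""
    error = label - predict(weights, bias, features)
    if error != 0:
        weights = [w + lr * error * x for w, x in zip(weights, features)]
        bias = bias + lr * error
    return weights, bias

# Each row: (support_tickets, months_inactive) -> churned?
stream = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.0, 1.0), 0),
          ((1.0, 0.5), 1), ((0.2, 0.1), 0), ((0.9, 0.9), 1)]

weights, bias = [0.0, 0.0], 0.0
for features, label in stream * 20:   # revisit the stream a few times
    weights, bias = update(weights, bias, features, label)

# After streaming updates, the model flags the risky-looking customer.
risky = predict(weights, bias, (1.0, 1.0))
calm = predict(weights, bias, (0.0, 0.0))
```

The same shape of loop, observe, predict, correct, underlies far more sophisticated systems; the model simply keeps improving as data arrives.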

Our phones know everything about us

Our phones know where we go, who our friends are, what’s on our calendars and what we look at online. Thanks to a new generation of applications such as Siri (s aapl), Saga and Google Now that try to serve as personal assistants, they can understand what we say, know the businesses we frequent, the foods we eat and the hours we’re at home, at work or out on the town. Already, their developers claim such apps can augment our limited vantage point by automatically suggesting the best route to an upcoming appointment, or the best place to get our favorite foods in a city the app knows we haven’t visited before.

Comments

I think the most interesting thing in big data right now is using it for predictive analytics. The ability to pull predictions out of massive amounts of data, divided into huge numbers of columns, in a matter of minutes is a priceless advantage for any business, and Pervasive Software’s Rush Analytics is doing just that. On top of that, they are currently offering a month-long free trial of the $1,000/month desktop version on their website: http://bigdata.pervasive.com/Products/Download-Center.aspx

Right on: big data without tools that help users understand and leverage it in a meaningful way, and realistically in performance terms, will remain just that: big data. Users have the right to expect at least information, if not knowledge, out of big data. Analytics tools working closely with big data should become tools for the masses, not just for a few so-called data scientists. That should be the next trend in big data. SQL-like querying on top of Hadoop/MapReduce in an online, interactive fashion is a good start.

Someday, with big data and popularized analytics tools, one should be able to ask “What question should I ask of this big data?” instead of “How many times does that name appear in this corpus?”

Don’t forget Predixion Software. The team is the former data mining team from Microsoft and has developed a cloud-architected big data machine learning platform. They can push their Excel-generated models (MS or Mahout) directly into Hadoop, SQL, Greenplum and so on, which makes things much easier.

What about data creation? Who is managing that? What’s needed is not more “analysts” who simply react to existing data. Instead, we need “researchers” who take an active role in determining what data are required to answer our critical business questions and how to best obtain that information.

Companies who use informed hypotheses to dictate what information is gathered are much less likely to find themselves drowning in “big data.”

Bringing features enabled by big data down to the hands of small businesses and consumers is absolutely the next step. At SRCH2 (http://srch2.com), we’re working on enabling small and mid-sized e-commerce retailers to offer high-end full-text search and the many features they are not yet tapping. These include fuzzy search, rapid geo-search, real-time updates and much more. There’s still a whole lot left to do.

As for opportunists, that is clearly a risk. One of the things we find is that many stretch the definition of “big data,” and many also offer tools which are warmed over and modified versions of existing search software. These approaches are limited, as backwards integration of existing solutions leads to sub-optimal performance. If you have gone through the trouble of creating a big data stack, with several new elements built for speed and size, the last thing you want to do is have your search be the new bottleneck.

The angle I am researching is not “big data” per se but how the data are generated in the first place, e.g., by government agencies that, in the course of their legislatively mandated programs, produce data that can be used by their target users as well as by others. More here: “A Framework for Transparency Program Planning and Assessment” http://www.ddmcd.com/outline.html

As mentioned in some of the discussion above, I think a key issue is that big data is no longer just the domain of large companies. Companies of all sizes now face large amounts, and importantly a large variety, of data, yet many are not in a position to hire data scientists to help deal with their data problem.

At BIME, we are very excited about big data analytics in the cloud going forward: Google BigQuery offers an analytical database as a service that scales to petabytes of data. It means companies that previously would have needed very large infrastructure and an operational team can now analyze their data with only a web browser. http://bigquery.bimeanalytics.com/

Derrick, we are seeing an increase in businesses seeking specialized skills to help address challenges that arose with the era of big data. The HPCC Systems platform from LexisNexis helps to fill this gap by allowing data analysts themselves to own the complete data lifecycle. Designed by data scientists, ECL is a declarative programming language used to express data algorithms across the entire HPCC platform. The platform’s built-in libraries for machine learning and BI integration provide a complete integrated solution, from data ingestion and processing to data delivery. More at http://hpccsystems.com

The ongoing data science revolution is a promising development, but one that has been somewhat shortsighted so far. As this article indicates, there is a strong push to teach people about various analytical techniques and tools, which is good. We need people who understand statistics and can use software like R.

But how much attention has been given to critical thinking, problem formulation, content knowledge, and research methods? Excellent quantitative and programming skills are not very useful without the ability to ask the right questions and design the kind of study needed to answer those questions.

Going forward, I think organizations that view data science as a complete, end-to-end research process will be much more successful than those that think of it merely as the analysis of large, pre-existing bodies of data.

I agree that it’s important to know the right questions to ask. We have so much information at our fingertips, even without some of these new tools, that it’s always a good idea to take a step back and think: “What decisions can I actually make with this knowledge? Is this important?”

And of course… the eternal favorite question of service providers and consultants (at least the good ones) – “Why?”

Fascinating look into the variety of products and services that are helping companies to understand or use big data. The area that we are focused on, a few steps down the ladder, hasn’t been covered.

We are working with marketers to quickly uncover actionable insights and to do this we have to speak their language. Big data is rarely in their vocabulary. Any software needs to be incredibly easy to use and no data, programming or statistical knowledge can be required.

While data scientists and marketers can work together, we won’t see wholesale changes in consumer engagement until marketers are ubiquitously using data for themselves. This needs to happen in all companies, not just the digitally native ones. Here’s a blog article written for the ESCP Europe Creativity Marketing Centre (European Business School): http://bit.ly/PL9FbW

I think the biggest trend not mentioned is the term “big data” becoming as meaningless as “cloud computing”. Those vendors that can go beyond buzzwords and solve specific, high-value pain points will win over the long term…

Well said. I prefer to say ubiquitous data as that’s the real challenge. Sure, volume, velocity, variety are on the rise, but that’s an incremental problem solved by technology. The challenge is to make systems that are harmonious with data’s ubiquity…systems that can find information and react to it. Now that’s interesting.

Interesting write-up and very connected to the small-vendor world. That’s a good thing since many small vendors have a hard time finding their voice with the shouting going on by the largest. Your piece brings up a very good point without stating it outright…that Big Data is not a theory anymore and is very much a given for success in 2012.