Has the term "big data" completely lost meaning yet?

There are some terms in IT that make their way straight into the hype stratosphere, and unfortunately "big data" is one of them. I see very few systems that I'd contend are actually "big data", yet I endlessly see the term applied to data stores that are trivial by today's standards. That might help the marketing teams, but it's sad nonetheless. There are some technological challenges that really do start to bite as data volumes become large and as the proportion of unstructured and semi-structured data increases. There are also some very interesting new tools that let us process larger volumes of data faster, particularly for analytics, and a large market building around Hadoop and its derivatives.

I also see entire teams that claim to focus on big data, yet whenever I discuss their projects with them, none of them are working with databases that are even vaguely in the ballpark of what anyone would have considered big data ten years ago, let alone today. None of the people involved have ever dealt with even really large tables by today's standards. It's interesting that so many data-related jobs have suddenly become "big data" jobs. I'd love to know what these teams think they mean when they say that they "focus" on these areas. It simply isn't possible for so many of them to do so.

For a more serious take on this subject, there is some interesting material in Stephen Few's recent blog post, "Big Data, Big Deal". Stephen argues that big data is simply a marketing campaign. As always, the comments on the post make for reading that's as interesting as the post itself. I don't totally agree with Stephen on this, as there really has been quite a shift in the available tooling in recent years, but much of his discussion is right on target.

Ironically, just yesterday I was working with a team whose project I would consider "big data", yet they had never thought to call it that.

I suspect we as an industry need to start quantifying what the term "big data" really means at a given point in time; it's clearly a relative term that changes over time. Otherwise, we should drop the term entirely, as there is currently a great deal of confusion around it.

The whole discussion reminded me of this wonderful xkcd cartoon that compared production levels in the energy industry: http://xkcd.com/1162/

One of the more amusing calls I had last year was with a US-based fast food chain. They told me that they were OK using SQL Server for their analytic work, but that they'd decided they needed Oracle for the main database engine, based on the volume of data they needed to handle efficiently. The Oracle sales guy had done a good job. I was intrigued about what volume of data they thought would justify this. Later, it became apparent that it was about 30GB...

Without triggering a "that's not a knife, that's a knife" moment, I'd love to hear what others currently consider "big data". I don't consider "using Hadoop (or HDInsight)" to be a synonym for "working with big data".

Sounds very similar to the UK. I've been meeting with a big data user group for the last six months, but I'm finding very few people who actually have an appropriate data set; frequently their "big data" would fit in main memory on a reasonable laptop. I think for many this is more an aspiration than a reality.

Part of the issue appears to be in the common definition. Volume, velocity, and variety may cause you to have a big data problem, but very few are ready to stick their neck out and quantify what counts. A year ago I'd have loosely said it was any data analysis task that was necessary or more economic to handle through scale-out database systems rather than scale-up, but the marketplace is now too polluted with very small "big data" solutions for this to stick.

I like the big data "three vectors" definition: volume (a huge amount of data), variety (lots of data with different schemas in a single database), and velocity (dramatic growth of data). If your data has even one of these vectors, then you have a potential "big data" problem.

For example, you could have almost no data initially, yet if you plan on storing every stock transaction going forward, you will have a "big data" problem because of data velocity.
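To put a rough number on that, here's a back-of-envelope sketch; the transaction rate and record size are made-up illustrative figures, not real market data, so treat it as an order-of-magnitude exercise only:

```python
# Back-of-envelope sketch of the velocity problem described above.
# The rate and record size are assumed, illustrative numbers.
TRANSACTIONS_PER_DAY = 50_000_000   # assumed feed volume
BYTES_PER_RECORD = 200              # assumed average record size
TRADING_DAYS_PER_YEAR = 252

bytes_per_year = TRANSACTIONS_PER_DAY * BYTES_PER_RECORD * TRADING_DAYS_PER_YEAR
print(f"~{bytes_per_year / 1e12:.2f} TB per year")  # ~2.52 TB/year at these rates
```

At those assumed rates the store grows by roughly 10GB every trading day, so a system that starts empty stops being "small data" well within its first year.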

"Big Data" is well defined, although few are willing to openly admit what that definition is. To wit: Big Data is the excuse to dump standard RDBMS/SQL datastores with their (nearly) transparent, and client language agnostic, syntax in favour of bespoke file storage tied to a specific client language. The amount of data needed to meet "Big" threshold moves down as the Kiddie Koders flummox yet more Suits. Yet another attempt to get Back to the Future of COBOL/VSAM applications.

Ironically, those systems are finding that writing a TPM (transaction processing monitor) for each and every application is a pain, so some are setting out to reinvent CICS. Such folks are blind to the irony, but that shouldn't be surprising; they've already demonstrated their blindness to data management.

Big data is mostly big JUNK data when we look at what's really being stored in these BD solutions. Most "big data" platforms are used as containers for social network blogs, comments, ratings and so on. Such data isn't ideal for RDBMSs, so BD comes in to help. So far so good.

Problems come up when we try to make use of such data. It isn't really useful data to begin with (how much value is there when an anonymous poster gives some article 4 stars, anyway?). It's hard to efficiently analyse data that's stored without structure, and there's no short-cut here: if we don't impose a data structure when storing it, we pay the price later. Low efficiency + vast volume of data = analysis headache. To make matters worse, such "social" data decays. If we don't analyse it fast enough, its value rots away and we end up with a big pile of worthless data (junk) wasting hard drives.

That "big data big deal" guy is right. Lots of the big data advocates are merely selling the perception of value to gullible CIOs / CEOs. Big data is a hype.

Like codepro said, what we're really talking about is storing and analyzing semi-structured data in ways that can scale to many terabytes, but that can also be applied to much smaller data sets. It's a question of the right tool for the job. Log files, social graph data, etc. don't fit well into an RDBMS, regardless of size.
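To make that concrete, here's a minimal sketch (the event records and field names are invented for illustration) of why this kind of semi-structured data resists a fixed relational schema: each record shares a couple of core fields but carries its own event-specific attributes, so the natural approach is to store it as-is and apply structure at analysis time.

```python
import json

# Invented, illustrative event records: each shares "ts" and "type",
# but the remaining attributes vary per event. A fixed-schema table
# would need sparse columns or constant ALTER TABLE churn to hold these.
events = [
    {"ts": "2013-02-20T10:01:00Z", "type": "page_view", "url": "/home"},
    {"ts": "2013-02-20T10:01:02Z", "type": "rating", "item_id": 42, "stars": 4},
    {"ts": "2013-02-20T10:01:05Z", "type": "comment", "item_id": 42, "length_chars": 180},
]

# Store as newline-delimited JSON (schema-on-read) rather than as rows
# in a fixed-schema table.
log = "\n".join(json.dumps(e) for e in events)

# Structure is applied at analysis time, e.g. counting events per type.
counts = {}
for line in log.splitlines():
    event = json.loads(line)
    counts[event["type"]] = counts.get(event["type"], 0) + 1
print(counts)  # {'page_view': 1, 'rating': 1, 'comment': 1}
```

Whether that store holds 30GB or 30TB, the choice of tool here is driven by the shape of the data rather than its size, which is exactly the "right tool for the job" point.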