Big data: An overview

Data is being generated about the activities of people and inanimate objects on a massive and increasing scale. We examine how much data is involved, how much might be useful, what tools and techniques are available to analyse it, and whether businesses are actually getting to grips with big data.

Computing devices and networks have been storing and processing data in increasingly large amounts for decades, but the rate of expansion of the 'digital universe' has accelerated massively in recent years, and now exhibits exponential growth.

Big Bang

Colossus Mk 2 review

Computing's 'Big Bang' moment came during World War 2, in the shape of Colossus, widely regarded as the world's first programmable electronic digital computer. Operated at the UK's Bletchley Park codebreaking centre to help break the German High Command's Lorenz cipher, Colossus could store 20,000 5-bit characters (~12.5KB) and input data at 5,000 characters per second via paper tape (~25Kbps). Small data in today's terms perhaps, but Colossus decrypts made a vital contribution to Allied planning, for D-Day in particular.

The Digital Universe

In December 2012, IDC and EMC estimated the size of the digital universe (that is, all the digital data created, replicated and consumed in that year) to be 2,837 exabytes (EB) and forecast this to grow to 40,000EB by 2020 — a doubling time of roughly two years. One exabyte equals a thousand petabytes (PB), or a million terabytes (TB), or a billion gigabytes (GB). So by 2020, according to IDC and EMC, the digital universe will amount to over 5,200GB per person on the planet.
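As a rough check on these figures, the short Python sketch below derives the doubling time implied by the 2012 and 2020 estimates, and the per-person share in 2020. It assumes constant exponential growth between the two data points and a 2020 world population of about 7.6 billion; that population figure, like the function names, is our own assumption rather than a number taken from the IDC/EMC study.

```python
import math

EB_2012 = 2_837        # exabytes, IDC/EMC estimate for 2012
EB_2020 = 40_000       # exabytes, IDC/EMC forecast for 2020
YEARS = 2020 - 2012
GB_PER_EB = 10**9      # decimal units: 1EB = a billion GB
WORLD_POP_2020 = 7.6e9 # ASSUMPTION: not a figure quoted from the study


def doubling_time(start, end, years):
    """Years needed to double, assuming constant exponential growth."""
    return years * math.log(2) / math.log(end / start)


def gb_per_person(total_eb, population):
    """Average share of the digital universe per person, in GB."""
    return total_eb * GB_PER_EB / population


print(f"doubling time: {doubling_time(EB_2012, EB_2020, YEARS):.1f} years")
print(f"per person in 2020: {gb_per_person(EB_2020, WORLD_POP_2020):,.0f}GB")
```

Under those assumptions the doubling time works out at a little over two years and the per-person figure at a little over 5,200GB, in line with the estimates quoted above.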

Source: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East (IDC & EMC, December 2012)

In 2012 the US and Western Europe still accounted for over half (51%) of the digital universe (see diagram above right), but by 2020 IDC and EMC estimate that 62 percent will be attributable to emerging markets, with China alone accounting for 21 percent.

Big data

Not all of the myriad streams of data generated by and about people (and, increasingly, things) in this digital universe will be actually or even potentially useful. According to IDC and EMC, some 33 percent of 2020's 40,000EB (13,200EB) total might be valuable if analysed. In 2012, the figure is 23 percent of the 2,837EB total (652EB) — with only 3 percent (85EB) suitably tagged and just half a percent actually analysed. That still amounts to 14.185EB (14,185 petabytes, or 14.185 million terabytes) — 'big data' in anyone's book, but a mere footprint on a vast and largely unexplored cosmos of information.
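The percentages translate into volumes as follows. This is a minimal sketch using only the totals and shares quoted above; the dictionary labels and the helper name are illustrative.

```python
# Totals from the IDC/EMC study, in exabytes (EB).
TOTAL_2012_EB = 2_837
TOTAL_2020_EB = 40_000

# Shares of the 2012 digital universe, as quoted in the text.
SHARES_2012 = {
    "potentially valuable": 0.23,
    "suitably tagged": 0.03,
    "actually analysed": 0.005,
}


def share_in_eb(total_eb, fraction):
    """Return the volume, in EB, represented by a fraction of a total."""
    return total_eb * fraction


for label, fraction in SHARES_2012.items():
    eb = share_in_eb(TOTAL_2012_EB, fraction)
    # 1EB = 1,000PB (decimal units)
    print(f"{label}: {eb:,.3f}EB = {eb * 1_000:,.0f}PB")

print(f"valuable in 2020: {share_in_eb(TOTAL_2020_EB, 0.33):,.0f}EB")
```

Even the smallest slice, the half a percent actually analysed, comes out at more than 14,000 petabytes.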

Big picture

While we're still examining the big picture, it's worth looking at how Big Data has progressed along Gartner's Hype Cycle in recent years:

Gartner's Hype Cycles for Emerging Technologies, 2011-2013.

In 2011, the analyst firm placed Big Data (along with 'Extreme Information Processing and Management') in the Technology Trigger phase (since renamed Innovation Trigger), with mainstream adoption envisaged in 2-5 years. In 2012 it was approaching the Peak of Inflated Expectations, which it has all but scaled in 2013. Gartner also revised its outlook for Big Data in 2013, placing mainstream adoption 5-10 years in the future, with the Trough of Disillusionment opening up before it.

Big data: definitions and applications

Big data is commonly characterised by three vectors: volume, variety and velocity. Volume refers to the sheer amount of data; variety refers to its 'polystructured' nature (i.e. a mixture of structured, semi-structured and unstructured data such as text, audio and video); and velocity refers to the rate at which it is generated and analysed (which in some applications needs to be in real time, or near real time).

Big data is not generally amenable to analysis in traditional SQL-queried relational database management systems (RDBMSs), which are primarily designed to handle smaller and more predictable flows of structured data. In particular, performance can suffer as the size or user population of an RDBMS grows. A variety of scalable database tools and techniques have therefore evolved, the best known being Apache's open-source Hadoop distributed data processing system, whose ecosystem includes the HBase database and Hive data warehouse system. A related set of non-relational databases goes under the NoSQL banner, leading examples being DynamoDB (Amazon), MongoDB, Neo4j, Couchbase and Cassandra (Apache).

Hadoop: the elephant in the Big Data room

There is also a relatively new job description, that of the data scientist, whose role is to orchestrate often disparate big data sources, perform analyses using the most appropriate tools, and present the results in digestible form (as dashboards, for example), to decision-makers. Data scientists are currently in short supply, however — a skills gap that leaves many organisations with few options other than to pay expensive consultancy rates or remain data-rich but information-poor. Consequently, there is much activity and interest in the area of 'self-service' big data analysis tools that can be used by non-specialists, and in converging the two strands of the database world: internet-centric Hadoop/NoSQL and enterprise-centric SQL/RDBMS.

There are myriad kinds of big data that could deliver value if properly orchestrated. In the EMC/IDC study mentioned earlier, four classes are highlighted in addition to traditional transactional data in enterprise data warehouses: surveillance footage (useful in crime, retail and military applications, for example); data from embedded and medical devices (for real-time epidemiological studies, for example); information from entertainment and social media (mining the wisdom — or otherwise — of the crowds on multiple topics); and consumer images (if tagged and analysed when uploaded to public websites). To these we would add the increasing amounts of data generated by all manner of sensors in the fast-developing Internet of Things.

Big data in business today

If, as IDC and EMC estimate, there are millions of terabytes of usable data available for big data analysis today, has it actually become part of the everyday fabric of business? A recent survey from Steria's Business Intelligence Maturity Audit (biMA), entitled Are European Companies Ready for Big Data?, gives a clue as to the current state of play in Europe.

Note that third in the list of challenges is 'internal competencies insufficient': that's a skills gap in the well-established field of business intelligence, not to mention the relatively new and less familiar area of big data analytics.

The BI data volumes in Steria's survey also suggest a low prevalence of big data activity, with only 16 percent of companies reporting volumes of more than 50TB in their analytical databases:

Source: Steria/biMA, 2013

When asked to rank the relevance of big data, only 23 percent of respondents scored it positively (marked in red, below), compared to 51 percent who were cool on the idea (marked in blue):

Source: Steria/biMA, 2013

Despite this moderate showing, Steria's respondents saw a wide range of potential benefits from big data, even if no single 'killer application' is apparent in this survey:

Source: Steria/biMA, 2013

Although it's only one survey (see our own ZDNet/TechRepublic big data survey for another take), this Steria/biMA report tends to support the overall picture described earlier: there are plenty of potential benefits in big data, but it's not yet delivering value day in, day out, to ordinary businesses.

The big data market

The chances are, though, that big data will take its place in the mainstream of IT activities in due course. That's certainly the view of analyst firm IDC, which in March 2012 forecast big data to become a $17 billion business by 2015 with a CAGR of 39.4 percent over the preceding five years (since updated to $23.8bn by 2016 with a CAGR of 31.7%):

Not surprisingly, the storage sector — servicing large-scale Hadoop clusters and other similar systems — shows the biggest forecast growth rate (61.4%), with servers bringing up the rear (27.3%). According to IDC, big data storage will account for 6.8 percent of the entire worldwide storage market by 2015.
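For readers unfamiliar with compound annual growth rate (CAGR), the sketch below shows how the rate relates a starting value to an end value. It works backwards from the March 2012 forecast quoted above ($17 billion in 2015, 39.4 percent CAGR over five years) to the 2010 market size that forecast implies. The implied starting figure is derived here for illustration and is not taken from IDC's report; the function names are our own.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


def implied_start(end_value, rate, years):
    """Starting value implied by an end value and a constant annual rate."""
    return end_value / (1 + rate) ** years


FORECAST_2015_BN = 17.0  # $bn, IDC forecast as quoted in the text
RATE = 0.394             # 39.4 percent CAGR
YEARS = 5                # "the preceding five years"

base_2010_bn = implied_start(FORECAST_2015_BN, RATE, YEARS)
print(f"implied 2010 market: ${base_2010_bn:.1f}bn")
print(f"round-trip CAGR: {cagr(base_2010_bn, FORECAST_2015_BN, YEARS):.1%}")
```

On these numbers the implied 2010 base is a little over $3 billion, so the forecast amounts to the market growing roughly fivefold in five years.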

Big data vendors

If you're looking to exploit big data in your business, who are the vendors you should be considering? As might be expected, there's a great deal of activity in this area, with many startups, a few emerging 'star' companies, and established database vendors working hard to adapt to the latest developments in data management, analysis and visualisation.

Current and future 'stars' of the big-data world are likely to be found among the 'pure play' vendors who derive 100 percent of their revenue from this market. These are graphed below, along with MarkLogic, whose big data revenue Wikibon estimates to be 88 percent of its total. This established company (founded in 2001) is the leader (in revenue terms) among those that specialise in Hadoop or NoSQL solutions (highlighted in red). Also prominent in the Hadoop/NoSQL community are Cloudera, MongoDB (formerly 10gen), MapR and Hortonworks:

None of Wikibon's top four pure-play big data vendors are Hadoop/NoSQL specialists: CIA-funded Palantir initially concentrated on data mining for US intelligence and law enforcement agencies, but its software is increasingly widely used in mainstream business; fast-growing Splunk majors on searching for, capturing, indexing, analysing and visualising machine-generated data; Opera Solutions offers big data analytics as a service in a number of business sectors; and Mu Sigma integrates a variety of commercial and open-source tools and technologies into a 'decision support ecosystem', placing much emphasis on training data scientists in its own internal 'university'.

When we look at all big data vendors in Wikibon's analysis (excluding those in which hardware accounts for more than 50% of their big data revenue), we find several classes of company heading the revenue chart: broad-portfolio tech giants (IBM, HP, Oracle, EMC); leading software houses (Teradata, SAP, Microsoft); and professional services companies (PwC, Accenture):

Also represented on the all-vendors chart are web behemoths like Amazon and Google. Big data analytics is part of these companies' internal DNA, and they have turned their expertise and infrastructure into products and services such as Elastic MapReduce and Redshift (Amazon), and BigQuery (Google).

The sheer number of companies involved in big data and the revenues being generated show that it's definitely not all hype. As ever in a developing market, we can expect plenty of future merger-and-acquisition activity as established companies cherry-pick the startups and growing firms jostle for position.

Outlook: Big, and getting Bigger

The size of the 'digital universe' is growing apace, as is the number of companies involved in developing tools and techniques for managing, analysing and visualising big data. Many companies (especially large enterprises, which by definition routinely deal with 'big' data) are already exploiting big data, but despite widespread awareness of the potential benefits, it has yet to achieve mainstream adoption.

The database world now has two camps: the internet-centric, open-source-based world of scalable distributed databases, where much of the recent big data innovation has occurred; and the enterprise-centric world of traditional, heavily siloed, relational database management systems, where much of the expertise needed to actually run businesses resides. Finding ways to get the best from both worlds, and creating a new generation of 'data scientists', will be key to big data's journey from hype to the mainstream.

Big data may well spend some time in Gartner's 'Trough of Disillusionment' as the various barriers to mainstream adoption are dismantled, but there's just too much valuable data out there for it to remain there for long.