Going Big Data: What IT Managers Need to Know

The traditional SQL database is under siege from modern, cloud-era information stores that can process larger volumes of data more rapidly. Here's your guide to what's going on in the big data world and the major players to watch.

Unless you've been living under a rock, you know one of the biggest drivers of IT and cloud computing initiatives is the need to gather, process, store and analyze -- often in real time -- "big data."

Businesses and government agencies alike are stepping up initiatives to mine everything from their CRM systems and data feeds to tweets mentioning their organizations, which can alert them to anything from a sudden problem with a product to a market opportunity spawned by an event. Online and big-box retailers are using big data to automate their supply chains on the fly. Law enforcement is analyzing huge amounts of data to thwart potential crime and terror attacks.

Big data drove an estimated $28 billion in IT spending last year, according to market researcher Gartner Inc. That figure will rise to $34 billion this year, Gartner estimates. In addition to pulling data in from social networks, a growing number of big data applications involve machine data from sensors, telemetry systems and other non-human interfaces -- as well as large volumes of unstructured content -- in order to determine trends to deliver insights and intelligence in near-real-time. Gartner noted 10 percent of new IT spending on application infrastructure and middleware is in some way influenced by big data.

Hadoop Proliferation
Many of the big data initiatives under way now are the result of the growing proliferation of Apache Hadoop-based data stores built on the Hadoop Distributed File System (HDFS). Hadoop works with analytics engines such as Apache Hive, originally developed by Facebook before the company contributed it to the open source community.

Hadoop-based repositories let users store terabytes of unstructured information in massively distributed clusters based on commodity servers. Using a rapidly growing market of query tools from traditional suppliers and a slew of startups, users can find and access that content faster than ever before.
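The divide-and-conquer model behind those queries can be sketched with Hadoop Streaming-style mapper and reducer functions. This is a toy simulation in plain Python -- the sample log lines and word-count job are invented for illustration; on a real cluster the framework distributes the map work across nodes and shuffles intermediate pairs to the reducers:

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for each word, much as a
    Hadoop Streaming mapper writes key/value pairs to stdout."""
    for word in line.lower().split():
        yield word, 1

def reducer(pairs):
    """Reduce step: sum the counts for each key. On a cluster, the
    shuffle phase routes all pairs for one key to the same reducer."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Sample "unstructured" log lines standing in for files stored on HDFS.
lines = ["error disk full", "warning disk slow", "error network down"]

pairs = [pair for line in lines for pair in mapper(line)]
counts = reducer(pairs)
print(counts["error"])  # 2
print(counts["disk"])   # 2
```

The same two functions, written against stdin/stdout, could be handed to Hadoop Streaming unchanged; the cluster supplies the parallelism.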

Microsoft Previews HDInsight
Microsoft took an important step forward in its quest to bring big data to the cloud in late March when it released the public preview of its Windows Azure HDInsight offering. The service, first made available on a limited basis last fall, aims to let enterprises process big data using Microsoft SQL Server and the Hortonworks Inc. distribution of the Hadoop file store, which the companies emphasize as 100 percent Apache-compatible. Spun out of Yahoo! Inc. with the help of Benchmark Capital in 2011, Hortonworks formed a partnership with Microsoft to enable SQL Server to use its Hadoop distribution.

The HDInsight Service in Windows Azure lets organizations spin up Hadoop clusters in Windows Azure in a matter of minutes, noted Eron Kelly, general manager for the Microsoft SQL Server group, in a March 20 blog post.

"These clusters are scaled to fit specific demands and integrate with simple Web-based tools and APIs to ensure customers can easily deploy, monitor and shut down their cloud-based cluster," Kelly noted. "In addition, [the] Windows Azure HDInsight Service integrates with our business intelligence tools including Excel, PowerPivot and Power View, allowing customers to easily analyze and interpret their data to garner valuable insights for their organization."

Among the first to test HDInsight is Ascribe Ltd., a U.K.-based Microsoft partner that provides health care management systems for hospitals and large medical practices. Its solution handles the lifecycle of patient care using key new components of the Microsoft portfolio -- including Windows 8-based tablets, SQL Server 2012 and HDInsight -- to perform trending analysis using anonymized patient data.

Paul Henderson, Ascribe head of business intelligence, demonstrated the application at the GigaOM Structure:Data conference in New York in March. "Rather than building our own server farm or incurring huge capital costs, HDInsight provides us with the ability to process that volume of stuff at scale, and that's a huge benefit," said Henderson in an interview.

Scores of players are now talking up new ways of capturing, analyzing and processing huge amounts of data. While the biggest alternatives to SQL Server were once RDBMSes from Oracle, IBM, Teradata Corp. and, more recently, MySQL, a vast number of players are now looking to offer modern alternatives to traditional SQL database stores.

EMC-VMware's Pivotal Move
One major entry emerged last month when VMware Inc. and its corporate parent EMC Corp. spun out a key application infrastructure, big data and analytics portfolio to a new venture called Pivotal Inc., which is now headed by former VMware CEO Paul Maritz. Just as EMC saw the opportunity to create VMware as an independent entity, the company is taking a similar strategy with Pivotal, Maritz said at a presentation for investors in New York on March 13.

22 Companies Targeting Big Data

By some estimates, there are 100 or more suppliers of technology that allow organizations to store, analyze or provide connectivity to disparate data sources. Here are some that have gained a foothold or appear poised to do so.

10gen: The lead distributor of the popular open source MongoDB, a choice of many companies looking to build cloud-based databases.

Alteryx Inc.: Designed to run on-premises and in the cloud, Alteryx Strategic Analytics lets analysts use dashboards to identify trends using big data. It provides connectors to numerous in-house and cloud-based applications and data sources.

Amazon Web Services (AWS) Inc.: The company recently released its Redshift data-warehousing service that lets customers build and run their own data queries right in the public cloud.

DataStax: A leading distributor of the highly available database architecture based on Apache Cassandra.

GigaSpaces: Its in-memory data repository is targeted at high-performance, real-time business analytics using big data. Its most recent release, eXtreme Application Platform (XAP) 9.5, enables integration with applications built on the Microsoft .NET Framework, as well as with NoSQL data repositories such as those based on Cassandra.

Hadapt: The company's namesake database is unique in that it combines SQL and Hadoop, allowing customers to analyze structured and unstructured data without requiring connectors. The Cambridge, Mass.-based company was founded by Yale students and received a $17 million round of funding in 2011 from Atlas Venture, Bessemer Venture Partners and Norwest Venture Partners.

Hortonworks Inc.: Spun out of Yahoo! Inc. in 2011, the company is Microsoft's Hadoop distributor of choice.

IBM: Big Blue's Hadoop-based BigInsights and Streams software, modeled after Watson, the computer that famously appeared on Jeopardy in 2011, analyzes structured and unstructured data at high speeds. The company has spent $16 billion to acquire 35 companies associated with big data analytics over the past seven years.

MapR Technologies Inc.: Another key Hadoop developer, its distribution is used by AWS for the Amazon Elastic MapReduce service and also used by Google Inc. for its Hadoop clusters available on the Google Compute Engine. Cisco and EMC Corp. are also key partners.

MemSQL: This high-speed database provider's namesake database keeps data in memory and compiles SQL into C++, which the company says provides optimized query execution by eliminating custom code. This allows it to read and write data at high speeds via a relational interface, replacing a temporary cache with the database and letting users analyze data faster than alternative offerings, the company says.

Microsoft: While SQL Server is Microsoft's flagship database, the company has a number of other stores including SQL Azure and its Windows Azure Table Storage Service. Through its Hortonworks partnership, SQL Server can serve big data applications through the Hadoop Distributed File System (HDFS). Looking to bring big data to the cloud, Microsoft released the public preview of its Windows Azure HDInsight offering. The next generation of SQL Server, code-named "Hekaton," will gain in-memory capabilities.

MicroStrategy: One of the last major independent business intelligence (BI) platform providers, MicroStrategy's new 9.3 platform has a connector that lets users pull and analyze data from Hadoop stores. The new platform, which is designed to let business analysts create their own BI dashboards, also connects to SQL Server 2012, IBM DB2 10, Teradata V14, ParAccel 3.5 (the technology used for Amazon Redshift) and SAP HANA.

Neo Technology Inc.: The company's engineers developed a graph database, which the company believes is the optimal way to model and query connected data. Its Neo4j database is used by the likes of Adobe Systems Inc., Cisco and Deutsche Telekom AG.

NuoDB Inc.: Founded by veterans of the relational database industry, this startup recently launched Starlings, which is optimized to scale out while also supporting traditional SQL commands and Atomicity, Consistency, Isolation and Durability (ACID) transactions in both structured and non-SQL models.

Oracle: The company every rival wants a piece of; its database is still the leading platform for business-critical and transaction-oriented applications. Oracle has targeted big data with its own Oracle Data Integrator, which provides connectivity to Hadoop repositories. And, like Microsoft, Oracle is advancing an in-memory database with its TimesTen platform and its Exalytics In-Memory Machine.

Pivotal Inc.: EMC is hoping it can catch lightning in a bottle twice. Just as it spun off VMware Inc. into a virtualization giant, it's doing the same with Pivotal -- and tapping former VMware CEO Paul Maritz to create a $1 billion big data business by 2017. The company's first deliverable is Pivotal HD, its Hadoop-based distribution that will take on Cloudera, Hortonworks and MapR.

SnapLogic Inc.: The SnapLogic cloud integration platform is designed to pull data from numerous data sources into a dashboard using Representational State Transfer (REST)-based Web services. Its online SnapStore has more than 150 so-called "Snaps," which provide connectors to offerings from the likes of Salesforce.com Inc., Oracle, SAP AG, NetSuite Inc., Box and Microsoft (Access and SQL Server).

SiSense: A startup touting an alternative to conventional in-memory databases that offers the same advantages without requiring huge amounts of RAM; many applications can scale by using CPU power instead, the company argues. "The technology we have is in-memory and columnar, so it gives the best [performance]," says Bruno Aziza, the company's VP of worldwide marketing. "Today most of the in-memory technology uses RAM, which is 50 times faster than disk. We use CPU-based, which is 50 times faster than RAM, so we're two generations faster. We can scale." While most of SiSense's customers are startups, Merck & Co. Inc. and Target Brands Inc. also are using its technology.

TempoDB Inc.: The company's time-series Database as a Service offers a store to analyze time-series data from sensors, meters, servers and other machine-generated systems. The service provides real-time and historical reporting and uses a REST API for data access. It's designed to support billions of time-series inputs.

-- J.S.
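The "connected data" case Neo Technology makes above comes down to treating relationships as first-class data: multi-hop questions become simple traversals rather than chains of relational joins. A toy sketch in plain Python (the people and "knows" edges are invented for illustration; Neo4j itself queries graphs with its own query language, not like this):

```python
from collections import deque

# A toy graph: nodes are people, edges are "knows" relationships.
knows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first traversal: each hop follows stored edges directly,
    which is the access pattern graph databases are built to optimize."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection between start and goal

print(shortest_path(knows, "alice", "erin"))  # ['alice', 'bob', 'dave', 'erin']
```

In an RDBMS, the same three-hop question would typically mean three self-joins on a relationship table; in a graph store the cost tracks the path length, not the table size.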

"There's a large market to go after here," Maritz said, noting the core assets brought into Pivotal from EMC and VMware, including its Greenplum analytics platform that's now Hadoop-based; the Cetas real-time analytics engine; GemFire, a cloud-based data management platform for high-speed transaction processing that's often used in trading applications; Cloud Foundry; the Java-based Spring Framework; and Pivotal Labs, the destination of many customers looking to take business problems from concept to a deliverable application.

Pivotal is now a $300 million business, and Maritz believes it can grow to $1 billion in revenues by 2017. EMC and VMware are arming the venture with a $400 million investment and technology that more than 100 engineers have had under development for several years. "We're moving to where the puck is going," Maritz said.

The first key deliverable, Pivotal HD, surfaced last month. Based on its own Hadoop distribution, Pivotal HD aims to expand the capabilities of the store with HAWQ, a high-performance Hadoop-based RDBMS. It offers a Command Center to manage HDFS, HAWQ and MapReduce; an Integrated Configuration Management (ICM) tool to administer Hadoop clusters; and Spring Hadoop, which ties it into the company's Java-based Spring Framework. It also includes Spring Batch, which simplifies job management and execution.

Experts say Pivotal HD could put pressure on the leading Hadoop distributors Cloudera Inc., MapR Technologies Inc. and Hortonworks "because you have this very robust, proven MPP [massively parallel processing] SQL-compliant engine suddenly sitting on top of Hadoop and HDFS," says George Mathew, president and COO of Alteryx Inc., a San Mateo, Calif.-based provider of connectors that enable organizations to create dashboards for big data analysis.

One of the reasons EMC and VMware are spinning out Pivotal is to give the company the freedom to focus on all cloud environments, including the widely used Amazon Web Services (AWS).

For its part, AWS recently launched its own cloud-based, data-warehousing platform called Redshift. Early indications are that many customers are considering Redshift because it offers a much lower cost of entry than incumbent data-warehouse providers, says Darren Cunningham, VP of cloud marketing at Informatica Corp., which itself recently released a connector that links Redshift to existing data stores.

NoSQL Enables Cloud Alternatives
Along with Redshift, a growing number of customers are using various NoSQL databases -- those that can store and process both structured and unstructured data with much higher levels of availability than traditional ones -- which lend themselves to cloud deployments. "The majors, as we may call them -- Amazon, Google and Microsoft -- all have multiple plays going on in the cloud database world," noted Blue Badge Insights analyst Andrew Brust, during a cloud database panel at the Structure:Data conference.

Many customers are running their cloud-based apps on the open source MongoDB, whose most popular purveyor is 10gen Inc. Alternative highly available databases include Riak, the open source, distributed NoSQL database from Basho Technologies Inc., which also offers the Amazon Simple Storage Service (S3)-compatible Riak Cloud Storage (CS) platform; Starlings from NuoDB Inc., launched in January and based on what the company describes as unique, patented technology that scales out while also supporting traditional SQL commands and reliable Atomicity, Consistency, Isolation and Durability (ACID) transactions across both structured and non-SQL models; and Apache Cassandra, the highly available database architecture whose leading distributor is DataStax.
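Part of MongoDB's appeal is its document model: records are schemaless documents, and queries are documents too. The sketch below is a deliberately simplified re-implementation of exact-match filtering in plain Python dicts -- the order records are invented, and this is not MongoDB's actual query engine:

```python
# Schemaless documents: records in one collection need not share a structure.
orders = [
    {"_id": 1, "customer": "acme", "status": "shipped", "items": 3},
    {"_id": 2, "customer": "acme", "status": "pending"},
    {"_id": 3, "customer": "globex", "status": "shipped", "rush": True},
]

def find(collection, query):
    """Exact-match filtering in the style of a document-store query:
    a document matches when every field in the query document matches."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in query.items())]

shipped = find(orders, {"status": "shipped"})
print([doc["_id"] for doc in shipped])  # [1, 3]
```

Because the query is itself data, applications can compose filters at runtime without schema migrations -- one reason document stores fit fast-changing cloud apps.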

Cassandra Targets High Availability
The appeal of Cassandra is that it's fully distributed. There's no reliance on a master replica that can go down or create a bottleneck. That allows for continuous availability, where every single node is available for full reads and writes directly. By not requiring a master, Cassandra can fail over much faster. "That means [users] can hit any node at any time, with response times of 50 ms or less," says DataStax CEO Billy Bosworth.
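The masterless design Bosworth describes can be sketched with a toy consistent-hash ring: every node owns a slice of the key space, any node can coordinate a request, and each key is replicated to the next several nodes around the ring, so no single replica's failure blocks reads or writes. The node names and replication factor below are invented for illustration; Cassandra's real partitioners and replication strategies are more sophisticated:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION_FACTOR = 3  # each key is stored on 3 of the 4 nodes

def ring_position(value):
    """Hash a value onto the ring; Cassandra's partitioners play this role."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def replicas_for(key):
    """Walk the ring from the key's position and take the next
    REPLICATION_FACTOR nodes -- no master is consulted."""
    ring = sorted(NODES, key=ring_position)
    start = ring_position(key)
    # First node at or after the key's position, wrapping around the ring.
    index = next(
        (i for i, node in enumerate(ring) if ring_position(node) >= start), 0)
    return [ring[(index + i) % len(ring)] for i in range(REPLICATION_FACTOR)]

owners = replicas_for("user:42")
print(len(owners))                        # 3
print(owners == replicas_for("user:42"))  # True: routing is deterministic
```

Any node can run this routing locally, which is why a client "can hit any node at any time": the node that receives a request simply forwards it to the key's replicas.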

"When people come to us it's because they want a database that's always available, meaning it's not tied to any master-replication strategy," Bosworth adds. "Users who come to us have an online application that they never want to think about being down." Does that mean Cassandra is the death knell for Oracle, IBM DB2 and Microsoft SQL Server, among others?

"I don't see the role of the relational database going away," Bosworth says. "It's a $16 billion market, [and those] don't just fall off cliffs -- but they will be used for a smaller percentage of the workload in the application architecture."

Incumbents vs. New Players
Indeed, despite the growing number of players and approaches, Blue Badge Insights' Brust believes many customers will look for the mainstream providers to embrace them. "We're seeing specialized products from specialized companies doing things that the major databases have glossed over," Brust said. "That's great, but when it's going to really become actionable for companies is when the mega-vendors either implement this stuff themselves or do some acquisitions and bring these capabilities into their mainstream databases that have the huge installed bases. Then it becomes a lot more approachable to enterprise companies."

Noted cloud and database analyst David Linthicum, also on the Structure:Data cloud database panel, was more skeptical. "It pushes them to be more innovative, but I haven't seen much innovativeness come out of these larger database organizations in the last couple of years," Linthicum said.

Microsoft's In-Memory Plan
Microsoft isn't ceding the market to upstarts. Its Windows Azure Table Storage Service is designed to support large volumes of data while offering more-efficient access and persistence. Microsoft is also addressing growing demand for in-memory databases, made popular last year by SAP with HANA. In-memory databases can perform queries much faster than those written to disk. Microsoft revealed its plans to add in-memory capabilities to the next release of SQL Server, code-named "Hekaton," at the SQL PASS Summit back in November 2012.
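The speed argument for in-memory engines comes down to the access path: a lookup against an in-memory hash index is a single probe, while a disk-based engine must read stored pages. The toy below illustrates only that contrast -- it is emphatically not how Hekaton works internally (Hekaton also rethinks locking and latching), and the table data is invented:

```python
import csv
import os
import tempfile

# Write a small "table" to disk, standing in for a conventional engine's pages.
rows = [(i, f"customer-{i}") for i in range(1000)]
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    csv.writer(f).writerows(rows)

def disk_lookup(key):
    """Disk-based access: read the stored table to find the key."""
    with open(path, newline="") as f:
        for row_id, name in csv.reader(f):
            if int(row_id) == key:
                return name
    return None

# In-memory access: the whole table lives in a hash index, so a
# lookup is a single probe with no file I/O at all.
memory_table = {row_id: name for row_id, name in rows}

result = disk_lookup(500)
print(result == memory_table[500])  # True: same answer, very different cost
os.remove(path)
```

Real in-memory engines go further, redesigning data structures and concurrency control around the assumption that data never leaves RAM, rather than just caching disk pages.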

"This is a separate engine that's in the same product in a single database and will have tables optimized for either the conventional engine or the in-memory engine," Brust said of Hekaton. "You can join between them so you're going more toward an abstraction."

But with a growing number of players looking to offer new types of data repositories, Microsoft is now in a more crowded field. While Microsoft has broadened its data-management portfolio with SQL Azure and now HDInsight, the requirement to find, process and analyze new types of information is greater than ever. Looking forward, all eyes will be on Hekaton and Microsoft's ability to deliver new levels of performance to SQL Server.