Hashing Out an Architecture for Advanced Analytics

There are a number of reasons why customers are adopting analytic database technologies (see http://www.tdwi.org/articles/2010/03/31/Advanced-Analytics-Architecture.aspx). One big driver, according to Philip Russom, senior manager at TDWI Research, is the growing complexity of analytic workloads -- and, particularly, of the kinds of queries associated with what Russom and other experts term "advanced analytic" technologies.

Analytic database players see this trend as especially salient because it both exposes the limitations of conventional DW architectures and showcases the putative benefits of next-generation data warehousing platforms. Such systems are almost always based on massively parallel processing (MPP) technology. In many (but not all) cases, they use a column-based (or "columnar") architecture, too.

Without exception, analytic database players tout MPP as a sine qua non for advanced analytics. Conventional DW systems, they allege, just can't get the job done. What's more, Russom concedes, there's a sense in which they're right on the merits: he cites a TDWI survey in which fully 40 percent of respondents expressed misgivings about the analytic capabilities of their existing DW platforms. (In the same survey, 51 percent of respondents said that they planned to adopt an analytic database platform at some point over the next five years.) Conventional data warehouse implementations are designed chiefly to address reporting or basic OLAP, Russom explains.

"There are multiple forms of advanced analytics, including those based on data mining or statistics and those based on complex ad hoc SQL statements. The former may or may not run in a DBMS -- depending on the vendor's analysis tool capabilities -- which is a problem when it forces users to move data out of the data warehouse for the sake of analysis, then back in," Russom explains.

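The data-movement problem Russom describes can be sketched in a few lines of Python. This is only an illustration: the standard-library sqlite3 module stands in for a data warehouse, and the `sales` table, its columns, and the sample rows are hypothetical. The first approach pulls every row out of the database for external analysis; the second pushes the same calculation into the database and returns only the result.

```python
import sqlite3
import statistics

# Hypothetical stand-in for a warehouse table (names and data are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 300.0), ("west", 50.0), ("west", 150.0)],
)

# Approach 1: extract every row, then analyze outside the database.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
extracted = {region: statistics.mean(vals) for region, vals in by_region.items()}

# Approach 2: run the analysis in the database and fetch only the summary.
in_db = dict(
    conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
)

rows_moved_extract = len(rows)   # every detail row leaves the database
rows_moved_in_db = len(in_db)    # one summary row per region leaves the database
```

Both approaches produce the same averages, but the extract approach moves every detail row while the in-database approach moves one row per group. On a toy table the difference is trivial; on a multi-terabyte warehouse it is the cost Russom is pointing to.
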
Bringing Brawn to Bear

The upshot, Russom observes, is that advanced analytic approaches that rely chiefly on complex or ad hoc SQL statements are particularly hamstrung by poor query performance. Almost half (45 percent) of analytic adopters cited "poor query response" as a decisive factor in their deployment decisions.

It's in this respect, especially, that analytic database specialists like to target conventional data warehouse platforms, such as out-of-the-box Oracle, SQL Server, or DB2. "Our sweet spot is where you've got queries where you need the response in a matter of seconds, or sometimes in sub-seconds," comments Barry Zane, CTO with columnar database specialist ParAccel.

Zane claims that his company's ParAccel Analytic Database (PAD) is an "extremely mature, extremely full-featured" platform, but concedes that -- for many prospects -- PAD's primary selling point is its columnar MPP brawn.

"You're talking about a class of querying -- whether it's interactive or whether it's just many, many users -- where you're using these extremely complex [SQL] queries and you need responses in seconds. You can't wait hours. That's where we're seeing the most interest, to be honest," he continues.

ParAccel, like other specialty analytic players, takes aim at all of the entrenched heavies -- including, significantly, high-end data warehousing stalwart Teradata Corp. Zane, for example, articulates a variation on a common theme: that MPP brawn deployed in combination with columnar technology can whip most query performance issues. Most other analytic database players echo it, with a vendor-specific emphasis on the importance or unimportance of a columnar architecture. It's an intriguing message that, in ParAccel's case and in others, seems tailored to counter Teradata's pitch in particular.

"I will say that without a doubt, Teradata has absolutely the best controls for setting up priority lists and managing concurrency, but -- it's really simple enough -- when you have something that's blazingly fast, people can coexist and share the machine without setting up priority lists," Zane says. "If [users are] getting their responses in several seconds or at most a minute, concurrency becomes a smaller issue."

The same can be said for Vertica Inc., which -- like ParAccel -- markets a columnar MPP database system. "We see columnar becoming the de facto standard [for analytic requirements]," comments industry veteran Dave Menninger, vice president of marketing with Vertica. "You see even the row-oriented vendors attempting to retrofit or shoehorn some columnar capabilities into their products. It's like that with MPP, too. No one seriously disputes the performance benefits of using [either technology] in analytic [applications]."

"What we have is a [strong] MPP engine. It's very scalable. You can add a blade or two blades or five blades, and scale from there. In fact, we announced … the largest Oracle database in the world," said Foster Hinshaw, CEO of Dataupia, during a sit-down interview at last month's TDWI Winter World Conference in Las Vegas.

Hinshaw was alluding to one of Dataupia's most prominent reference customers: Subex Ltd., a billing and operations-support specialist based in Bangalore, India. Subex maintains a 510 TB data warehouse that supports its revenue operations center.

Randy Lea, vice president of product and services marketing with Teradata, disputes the claim that raw speed makes workload management a secondary concern, dismissing it as the stuff of oversimplification or caricature.

"Workload management continues to be a huge focus for us, a huge differentiation," he avers, arguing that -- for all of their burgeoning strategic chic -- most analytic database platforms are still deployed in tactical implementations, e.g., as data marts. In such a scheme, Lea says, workload management might not seem to matter; at most, you have a limited number of user classes accessing the system. The shift to an enterprise data warehouse (EDW) topology drastically complicates this arrangement, however. Teradata, Lea concludes, is still a big believer in the virtues of the EDW, notwithstanding its recent concessions in the data mart arena (see http://esj.com/articles/2009/03/11/teradata-appliances-big-way.aspx).

"Even if I have a data mart, I still have business rules and requests that I would like to implement. [For example,] the CEO gets high priority on his requests. That's probably a good decision," he explains. "We have the ability based upon time, based upon query execution, [or] based upon user [or] application, to [enforce] some of these business rules [so] that you are best utilizing the warehouse."

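The kind of rule Lea describes, with priority keyed to user, application, time of day, or expected query cost, might be sketched as follows. This is a hypothetical illustration in Python, not Teradata's workload management interface: the `Query` fields, the role and application names, and the thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Query:
    # All fields are hypothetical inputs a workload manager might inspect.
    user_role: str
    application: str
    submitted_at: time
    estimated_seconds: float


def assign_priority(q: Query) -> str:
    """Return 'high', 'medium', or 'low' using illustrative business rules."""
    # Rule based on user: executive requests always go first.
    if q.user_role == "executive":
        return "high"
    # Rule based on query execution: very long queries yield to others.
    if q.estimated_seconds > 600:
        return "low"
    # Rule based on application and time: batch work waits during business hours.
    in_business_hours = time(8, 0) <= q.submitted_at < time(18, 0)
    if q.application == "nightly_batch" and in_business_hours:
        return "low"
    return "medium"
```

Under these assumed rules, a 15-minute query from the CEO still runs at high priority, while the same query from an analyst drops to low. The ordering of the rules is itself a business decision, which is the point Lea is making.
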
Who's right? Is the MPP pitch championed by the likes of Netezza, Dataupia, ParAccel, Vertica and others chiefly a function of what Teradata's Lea likes to describe as a "non-existent" workload management strategy?

Yes and no. Experts say Teradata has refined its WLM-or-bust pitch in response to the bottom-feeding incursion of Netezza, Dataupia, and other vendors into its bread-and-butter, high-end data warehousing market. The truth is that both sides have merit.

"WLM is useful when you're following a unified platform model and you need to guarantee a real-time SLA for some portion of the workload from the data warehouse," comments veteran DW architect Mark Madsen, a principal with consultancy Third Nature Inc. Now as ever, Madsen says, DW practitioners must choose between what might be called All-Encompassing, Top-Down and Loosely-Federated, Bottom-Up approaches.

"One choice is a big unified platform and WLM to fit a heavily centralized architecture. The other choice is to construct marts that are designed with high availability and response time to meet those operational needs, and leave the more heavy analytical queries on the main platform," he points out. "Then [there's] the big but … what if the heavy queries also need current up-to-date data?"

All the same, Madsen isn't persuaded by a brawn-beats-all pitch.

"I don't think more brawn deals with the problem, because it's one of concurrency and light versus heavy work. If a system is designed for throughput of big things, small ones will still get stepped on, just faster and more frequently," he says.

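Madsen's light-versus-heavy argument can be illustrated with a deliberately simple toy model. The Python sketch below assumes a system that serves queries one at a time in arrival order, which is far cruder than any real MPP database; the job names, durations, and the 10x speedup are hypothetical. It compares a short lookup stuck behind an hour-long report on baseline hardware, on hardware ten times faster, and on baseline hardware with the short query scheduled first.

```python
def fifo_response_times(jobs, speedup=1.0):
    """Serve jobs one at a time, in list order, all arriving at time zero.

    jobs: list of (name, work_seconds) pairs.
    speedup: factor by which faster hardware divides each job's run time.
    Returns {name: seconds from submission to completion}.
    """
    clock = 0.0
    finished = {}
    for name, work in jobs:
        clock += work / speedup
        finished[name] = clock
    return finished


# Hypothetical workload: a heavy report submitted just ahead of a light lookup.
jobs = [("heavy_report", 3600.0), ("light_lookup", 2.0)]

baseline = fifo_response_times(jobs)                # no scheduling, baseline hardware
faster = fifo_response_times(jobs, speedup=10.0)    # no scheduling, 10x the brawn
# Crude stand-in for workload management: run the shortest job first.
prioritized = fifo_response_times(sorted(jobs, key=lambda job: job[1]))
```

In this model the light lookup waits roughly an hour on baseline hardware and about six minutes on the 10x machine, yet finishes in two seconds on the original hardware once it is allowed to run first. Extra brawn shrinks the wait; only scheduling removes it.
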
"WLM as a focal point seems to be driven more from a centralized bottleneck-inducing architecture for data management. Still, I've wanted better features to do it in my own centralized, bottleneck-inducing architectures. Sometimes you don't have an alternative."