Posts by Paul Johnson (vldbsolutions.com)

What's not to like?

“This cost model-based optimization is really something rare, and in fact, no one else in the industry working with Hadoop, working with query engines on top of Hadoop, has anything like this.”

A very important point, methinks.

To quote Stephen Brobst, Teradata’s CTO, “you are the optimiser” when you write a Hadoop application. Business users are (usually) capable of specifying the ‘what’ via SQL but can’t be expected to figure out the ‘how’.

Most IT folks would struggle to figure out the optimal execution plan when developing an app to run in parallel on a large cluster. Even understanding the explain output from a parallel query optimiser can be tricky…especially for those pesky 20 way joins users like to concoct.

1 billion rows on a 60 node cluster is a measly ~17m rows/node. Assuming multi-socket and multi-core nodes, with each core running a separate database segment, that’s not a lot of rows per segment - maybe ~1.5m each for a 2 x 6 core node. There was probably nothing else running on there, so why did it take 13 seconds?!?!?
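A quick sanity check of that arithmetic, as a Python sketch. The 2 x 6 core layout and the one-segment-per-core mapping are the assumptions stated above, not details of the actual benchmark cluster.

```python
# Back-of-envelope check of the rows-per-node and rows-per-segment figures.
TOTAL_ROWS = 1_000_000_000
NODES = 60
SEGMENTS_PER_NODE = 2 * 6  # assumption: 2 sockets x 6 cores, one segment per core

rows_per_node = TOTAL_ROWS / NODES
rows_per_segment = rows_per_node / SEGMENTS_PER_NODE

print(f"rows per node:    {rows_per_node / 1e6:.1f}m")     # 16.7m
print(f"rows per segment: {rows_per_segment / 1e6:.1f}m")  # 1.4m
```

Either way, each segment has well under 2m rows to deal with.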

It’s no surprise that the Greenplum DBMS outperforms Impala, given that Greenplum was a standalone parallel database company before being acquired by EMC. Greenplum is also relatively mature compared to Impala. They also made some very smart hires early in the development process about 7-8 years ago to make sure they got things right.

“if you double the nodes you should be able to do the job twice as quickly”

While this should be true for any MPP architecture, whether it is always observed or not depends on the data demographics and query in question i.e. “your mileage may vary”.
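A minimal sketch of why data demographics matter, assuming rows are hash-distributed on a key and that elapsed time is governed by the busiest node. The data, the function names and the use of modulo as a stand-in hash are all hypothetical.

```python
def busiest_node_rows(keys, nodes):
    """Row count on the most heavily loaded node.

    Modulo stands in for the hash function a real MPP DBMS would use.
    """
    load = [0] * nodes
    for key in keys:
        load[key % nodes] += 1
    return max(load)


def speedup_from_doubling(keys, nodes):
    """Elapsed time is assumed proportional to the busiest node's row count."""
    return busiest_node_rows(keys, nodes) / busiest_node_rows(keys, 2 * nodes)


# Hypothetical data: 120,000 rows in each case.
even_keys = list(range(120_000))                     # every key distinct
skewed_keys = [0] * 60_000 + list(range(1, 60_001))  # half the rows share one key

print(speedup_from_doubling(even_keys, 10))    # 2.0
print(speedup_from_doubling(skewed_keys, 10))  # about 1.05
```

With evenly spread keys, doubling the nodes halves the work on the busiest node. With half the rows sharing one key value, those rows all land on the same node however many nodes there are, so doubling the cluster barely helps.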

Mind The Flying Pigs...

...having worked in what is now called the 'big data' world for nearly 25 years I've still yet to meet anyone that would come close to possessing "machine learning techniques, data mining, statistics, maths, algorithm development, code development, data visualisation and multi-dimensional database design and implementation."...and I have worked with some *very* skilled individuals over the years.

Should this superhuman being ever be found, how useful would their output be to the typical blue-chip company if it is not repeatable, understandable and supportable by mere mortals?

A jack-of-all-trades who is also expected to be a master-of-all-trades is surely a recipe for disaster.

Yet Another Big Data Article...

There have been, and will no doubt continue to be, many attempts to arrive at a succinct description of what constitutes 'big data'. For me, it's any dataset that can't easily be manipulated given the compute resources on hand. For many decades the IT industry has seen the world through the narrow lens of structured data held in a DBMS and managed by SQL. And mighty lucrative that's been for some.

Two of the main three V's (volume and velocity) that have been used to describe 'big data' are not enough on their own to compromise large parallel, SQL-based DBMS systems such as Teradata, Netezza and Greenplum etc. The real challenge comes with the addition of variety to the mix - those use cases where the data does not lend itself to tabularisation, which in turn makes life impossible when SQL is the main/only/preferred data manipulation tool available.

With the advent of Hadoop, the new paradigm of parallel processing outside of a DBMS using procedural languages (i.e. not SQL) has opened up new data processing possibilities. This has allowed variety - not volume or velocity - to be handled economically for the first time at scale.

The main issue for me with this new paradigm is that the rest of the data processing world is highly SQL-centric. Just like Cobol on the IBM mainframe, that's not going to change any time soon. The 'new' (Hadoop/no-SQL) and 'old' (DBMS/SQL) data processing worlds will have to learn to play nicely for the former to enter the mainstream, no matter how cheap or fashionable it becomes.

Big data...

...is apparently defined by volume (how much), variety (what types) and velocity (how fast), or some combination of all three.

The term is in vogue due to the likes of Google, Yahoo and Facebook introducing the world to new analytic paradigms based on the MapReduce framework, open source software (Linux, Hadoop etc), commodity hardware and the notion of 'noSQL'...and also because the IT industry needs new buzzwords du jour. At the moment it's the turn of 'big data' and 'cloud'.

In theory, 'big data' as done by the likes of Google is all about unstructured data. In reality, there's a lot of structured data still out there, and I'd argue that all data has some structure anyway, so 'semi-structured' may be a better term.

eBay has a multi-petabyte 256 node Teradata system chock full of structured data, in addition to a large Hadoop stack for web analytics, so there's clearly life in the old structured dog yet.

There's nothing new in 'doing analytics' - a lot of companies have regarded analytics as a competitive differentiator for a long time. There are companies out there, even in the lil' ol' UK, that have been using Teradata, which only does analytics, since the 1980's. I started my career at one of them.

For the typical mid-market company, if there is such a thing, all we ever tend to see is SQL Server on top of SAN/NAS. It's cheap, feature-rich, easy to tame and works OK until data volumes increase beyond a few hundred GB or so. The pain threshold is obviously dependent on the hardware, DBA/developer skill, schema and application complexity.

All SMP based databases suffer the same scaling issues, hence Microsoft's attempt to build an MPP version of SQL Server (Madison/PDW), Oracle's Exadata and HP’s Neoview.

IBM in the BI mid-market is not something we see very often. Netezza Skimmer has never been sold as a production system before, as far as I know. IBM's own web site describes it as for 'test and development'. A proprietary IBM blade based system running Postgres on Linux is hardly a good fit for the Windows/SQL Server/SAN/NAS/COTS hardware crowd.

Having said that, we did deploy a pre-IBM Netezza system as far back as 2003 for a small telco with only 100,000 customers, but they did have several billion rows of data and complex queries to support.

Teradata is the only database built from day 1 (in the 1980's) to support parallel query execution using an MPP architecture across an arbitrary number of SMP nodes all acting in tandem as a single coherent system. That is very, very hard to do - ask Microsoft, Oracle, HP or IBM.

Overall, Teradata 'just works'. All those big name users can't be wrong.

The Teradata secret sauce for me is the scalable 'bynet' inter-node interconnect. This is used for data shipping between SMP nodes in support of join/aggregation/sort processing. The bynet is scalable and resilient and 'just works'. It also performs merge processing for final results preparation.

Other MPP systems typically have a non-scalable interconnect bandwidth consisting of a dumb bit-pipe. Even worse, those that ship intermediate results to a single node for final aggregation/sort/merge processing can hardly claim to be linearly scalable. Some Exadata clusters run tens of TBs of RAM on the master node to address this issue.

Teradata's bynet has processing capability that enables final merge operations to be executed in parallel in the bynet interconnect fabric without landing intermediate results in any single place for collation. Cool eh?
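To illustrate the difference in general terms, here is a Python sketch contrasting 'land everything on one node and sort it there' with a streaming merge of result sets that each node has already sorted locally. It makes no claim about how the bynet is actually implemented; the data and function names are hypothetical.

```python
import heapq
from itertools import islice

# Hypothetical per-node result sets, each already sorted locally.
node_results = [
    [1, 4, 9, 12],
    [2, 3, 10],
    [5, 6, 7, 8, 11],
]


def collate_on_master(partials):
    """Land every partial result on one node, then sort the lot there."""
    landed = [row for partial in partials for row in partial]
    return sorted(landed)


def streaming_merge(partials):
    """Lazily k-way merge the sorted streams, one row at a time."""
    return heapq.merge(*partials)


# The first rows of the final answer are available without collecting
# the complete result anywhere.
first_five = list(islice(streaming_merge(node_results), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```

Both approaches return the same ordered result. The first needs a single node big enough to hold and sort all of it, while the second only ever looks at the next row from each stream.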

See here for more info: http://it.toolbox.com/wiki/index.php/BYNET

Teradata consists of OEMd Dell servers running SUSE Linux and dedicated storage from LSI or EMC. Teradata was historically regarded, quite rightly, as 'reassuringly expensive', but the launch of the new line of Teradata 'appliances' a few years ago has made Teradata price-competitive with the likes of Netezza, thus eroding Netezza's disruptive pricing model. Competition is a healthy thing etc.

Appliance adoption has been a key feature of Teradata's strong performance over the last few years, as reported several times on El Reg.

Have you ever run an Oracle query across a 20 node system running hundreds of virtual processors all working together? I did a few minutes ago - a 250m row count(*) in under 1 second with no caching, no metadata, no indexes, no tuning, no partitions and no concern for what else is running.

I can't remember when I last submitted a query to Teradata that either didn't finish or caused the system to barf. That happens a lot on Oracle/SQL Server.

The last project I worked on was a 20TB Teradata system that supports a very wide range of applications, including real-time loading of web data and several tables of over a billion rows. Total downtime for the year, including planned maintenance, is measured in single hours.

“But I could do all that with X, Y and Z”, we often hear. Off you go then. If you can get it to work, and that’s a big ‘if’, your boss won’t bet the farm on it. That’s another reason the likes of Teradata win business – it’s a safe bet for the decision makers.

Hub and Spoke <> Analytics

The DW should contain application-neutral atomic data, with only the dependent delivery data marts (if they exist) containing application-specific summary/aggregate data to support query performance.

Queries don't all have to be answered quickly. The real DW value often comes from serving a small community of explorers asking iterative/complex questions, not from serving the larger community of farmers running KPI reports. Query elapsed time is far less important for the explorers – “data scientists” in the current parlance. A good DW serves the needs of both communities, plus others.

For those that use a general purpose DBMS, on a standard “SMP plus SAN/NAS” platform, I would agree that "a single data warehouse to gather up data and do queries also happens to be impossible". For those not bound by such restrictions, politics and economics are the issue, not technology. This was discussed a few months ago on Curt Monash's excellent DBMS2 blog: http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/.

Appliances, such as those offered by Teradata and Netezza, are not an admission that the EDW concept is "no longer sufficient". Teradata has offered appliances for over 20 years; it's just that they've started marketing some of their offerings as appliances relatively recently, mainly in response to the rise of Netezza.

The phrase "data warehouse appliance" dates back 8-9 years and is attributed to Netezza co-founder and former CTO Foster Hinshaw: http://bi-insider.com/portfolio/overview-of-a-dw-appliance/.

A conventional DW is more than capable of loading and querying web log data in a good old-fashioned relational schema. The challenge is the complex transformation process between the web server and the back-end data warehouse. This is precisely the capability provided by Celebrus: http://www.celebrus.com/productindex.aspx.

Perhaps the most suitable technology for covering in a DW hub-and-spoke article might have been Microsoft’s PDW?

Teradata bynet

Teradata has had a proprietary interconnect since the 1980's:

http://it.toolbox.com/wiki/index.php/BYNET

The current bynet is scalable, resilient and load balanced. As the system is scaled out through the deployment of more SMP nodes the bynet bandwidth increases. This is essential in order to maintain overall system performance.

For my money though, the real 'smarts' lie in the bynet's ability to perform final aggregation and sort processing on-the-fly within the interconnect fabric, without impacting the DBMS nodes.

Parallel databases lacking this capability ship partial result sets to a master node to perform the same function - there is still an interconnect, it's just not smart.
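The master node approach can be sketched in Python as a two-phase aggregation: every node sums its own rows, then a single node folds the partial results together. The data and function names are hypothetical, and this is only the general pattern, not any particular vendor's implementation.

```python
from collections import Counter

# Hypothetical rows held on each node: (region, sale_amount).
node_rows = [
    [("north", 10), ("south", 5), ("north", 7)],
    [("south", 3), ("east", 8)],
    [("north", 1), ("east", 2), ("east", 4)],
]


def local_aggregate(rows):
    """Phase 1, run on every node in parallel: sum amounts per group."""
    partial = Counter()
    for group, amount in rows:
        partial[group] += amount
    return partial


def master_combine(partials):
    """Phase 2, run on one node only: fold partial sums into final totals."""
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds to existing totals
    return dict(final)


totals = master_combine(local_aggregate(rows) for rows in node_rows)
print(totals)  # {'north': 18, 'south': 8, 'east': 14}
```

Phase 1 scales with the number of nodes. Phase 2 does not, which is why the master node becomes the bottleneck as the number of groups and nodes grows.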

$10bn 'niche'

Being picky, according to their own timeline, Teradata wasn't acquired by NCR until 1991: http://www.teradata.com/history/.

The NCR years can be 'blamed' for Teradata being wedded to NCR's servers and 32bit Unix MP-RAS for too long. The relatively recent move to OEMd Dell servers running 64bit SUSE Linux with lots of RAM was very welcome.

In addition to the ability to support a "large number of nodes" - tightly coupled to support the scale-out/MPP architecture - Teradata's other differentiators are intra-node parallelism through the use of virtual processors and the 'bynet' interconnect to enable high speed, scalable and reliable inter-node data movement. See http://it.toolbox.com/wiki/index.php/BYNET.

As the author says, Teradata's "tight focus on data warehousing and data analysis" and the fact that the DBMS is "designed from the ground up for data warehousing and decision support" are clear differentiators.

General purpose databases used for decision support a) generally don't scale out and b) were never designed to support high-speed, high-volume join, aggregate, sort and scan operations. Add complex/ad hoc queries and high concurrency to the mix and general purpose databases soon struggle.

IBM with Netezza, EMC with Greenplum, Microsoft with Datallegro/PDW and Oracle with Exadata are clearly aiming for a slice of the decision support action. After ditching Neoview, HP bought Vertica so they are still theoretically in the race.

They can't all be wrong...can they?

Let's not forget that most of the open-source-based massively parallel processing (MPP) database offerings that have appeared over the last 7-8 years chose to use Postgres - think Netezza, Greenplum, Aster, Dataupia. They can't all be wrong, can they?

I think the biggest table I've built using Postgres was 125 billion rows running on a 75 node MPP system.

Doh!

HP, Oracle, EnterpriseDB, Red Hat musings...

I tend to agree with the comments on this piece. It would seem to make no sense for HP to buy either EDB or RHEL. HP being the server 'whore' is a good place to be, surely? While Orasun and IBM try to force the world into single vendor lock-in, HP offering choice on Windows and Linux/Unix is *a good thing*. I have bought HP servers for all sorts of OS/DBMS/app combinations. People like choice.

Oracle compatibility is a very big challenge indeed. I've worked for a vendor that offers this to a good degree of coverage and I know first hand how hard it is. Someone always wants that next little piece of capability that just isn't available yet. HP offering to deal with the missing functionality on a case-by-case basis is not even near to Oracle compatibility. No sir.

When it comes to Oracle, like the man said, in most instances folks will go with "better the devil you know". Oracle know that. As do HP.

A Pint To Anyone...

What problem are we solving with NUMA?

In the late 90's I worked for a bank where we were tasked with migrating 2 years of transaction data from a very expensive Sequent Symmetry NUMA-Q system running Oracle. I think it cost £20m and made the front page of Computer Weekly when it went live.

In a nutshell, it didn't work - there was one DBA per user (4 of each), only a small percentage of the fact table could be included in a query and no joins were possible, unless you were prepared to wait forever and didn't mind your queries crashing the system.

The billions of rows were dropped onto Teradata, an MPP system, where everything worked fine, and where the table and app sit to this day...only a lot bigger.