Sunday, June 7, 2009

Is betting on the "MySQL mass market for data warehousing" a good idea?

I came across a podcast the other day where host Ken Hess interviewed the CEO of Kickfire, Bruce Armstrong (http://www.blogtalkradio.com/FrugalFriday/2009/05/22/Frugal-Friday-with-guest-Bruce-Armstrong-CEO-KickFire --- note: Armstrong doesn't come on until the 30 minute mark and I suggest skipping to that since the discussion at the 17 minute mark made me a little uncomfortable). Kickfire intrigues me since they are currently at the top of TPC-H for price performance (http://www.tpc.org/tpch/results/tpch_price_perf_results.asp) at the 100 and 300GB data warehouse sizes (admittedly these are pretty small warehouses these days, but Kickfire feels that the market for small data warehouses is nothing to sneeze at). Although TPC-H has many faults, it is the best benchmark we have (as far as I know), and I've used it as the benchmark in several of my research papers.

In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker's voice in my head telling me that DBMS hardware companies have been launched many times in the past and ALWAYS fail (the main reasoning is that Moore's law allows commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore's law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past (Netezza is a good example of a business succeeding in selling proprietary DBMS hardware, though of course they will tell you that they use all commodity components in their hardware).

Anyway, the main sales pitch for Kickfire is that they want to do for data warehousing what Nvidia did for graphics processing: sell a specialized chip for data analysis applications currently running MySQL. The basic idea is that you would switch out your Dell box running MySQL with the Kickfire box, and everything else would stay the same, since Kickfire looks to the application like a simple storage engine for MySQL. You would get 100-1000X the performance of MySQL (assuming a standard storage engine like MyISAM or InnoDB) at only about twice the price of the Dell box. And, by the way, they are a column-store, which I'm a huge fan of.

But at the 50 minute mark of the above-mentioned podcast, Armstrong started talking about potentially being acquired by Oracle. Although he did use the term "down the road", it struck me as a little weird to start talking about acquisition at such a young startup (it seems to me that if you want to maximize the purchase price, you need to establish yourself in the market before being acquired). It made me start wondering whether things aren't going as well as the Kickfire CEO makes it seem. If I remember correctly, they burst onto the scene in 2007 by topping the TPC-H rankings, launched at a MySQL conference somewhere around the middle of 2008, didn't make any customer win announcements for the whole year (as far as I recall), and then relaunched at another MySQL conference in the middle of 2009 (last month) along with (finally) a customer win announcement (Mamasource).

Maybe Kickfire is doing just fine and I'm reading way too much into the words of the Kickfire CEO. But if not, why would a company with what seems to be a high quality product be struggling? The conclusion might be: the go-to-market strategy. Kickfire has decided to target the "MySQL data warehousing mass market" and their whole strategy depends on there really being such a market. But do people really use MySQL for their data warehousing needs? My research group's experience with using MySQL to run a data warehousing benchmark for our HadoopDB project (I'll post about that later) was very negative. It didn't seem capable of high performance for the complex joins we needed in our benchmark.

Meanwhile, Infobright and Calpont have chosen similar go-to-market strategies. I don't have much more knowledge about Calpont than can be found in Curt Monash's blog (e.g., http://www.dbms2.com/2009/04/20/calpont-update-you-read-it-here-first/), but I've been hearing about them for years (since they are also a column-store), and I haven't heard about any customer wins from them either. Meanwhile Infobright (another column-store that I like; their technical team --- led by VP of Engineering Victoria Eastwood --- is high quality and was very helpful when my research group played around with Infobright for one of our projects) recently open sourced their software, which is either an act of desperation or their plan all along, depending on who you ask.

The bottom line is that I'm having doubts about whether there really is a MySQL data warehousing mass market. I know this blog is still very young and does not have many readers, so there are unlikely to be any comments, but if you do have thoughts on this subject, I'd be interested to hear them.

13 comments:

I have spoken to Kickfire on a number of occasions so let me give you my opinion on them. Firstly I don’t think they are going after the pure play data warehousing market. There are plenty of other great companies out there for that.

Instead, they are targeting the thousands of MySQL shops that have started out building on native MySQL (typically web, clickstream, etc.) and have reached a point where either their data volumes or query load are exceeding the capabilities of their current solution. In this scenario the organization could:

1 – Invest time and money into scaling their current solution in some manner

2 – Redesign, recode and redeploy with an alternative architecture on an alternative platform

3 – Plug in a Kickfire appliance and get quick benefit with low investment and low re-working

The nice thing about having the MySQL compatibility is that it gives you a “plug and play” quality. The intention is that if you have built your solution on MySQL today and you have performance issues (in the area of analytical style queries), then you can essentially take it “as is”, put it on Kickfire, and get immediate performance benefit. A problem with being a “start-up” with an appliance without any existing vendor compatibility is that your sell is much harder. You are selling the DBMS, the platform, the development model, the investment in developer training, etc. And you have to convince the customer to buy it before they start developing against it, meaning the time to benefit is much longer. Kickfire, on the other hand, is just selling a hardware solution to optimize an organization's existing investment.

I like Kickfire. I don’t know anything about how they are doing financially, but I think it is a great gap-filling solution between existing MySQL solutions and pure data warehousing.

Is there a mass market for "small" datawarehousing? Probably yes. The question is whether MySQL is a good vector. Again, probably yes. The concern is that MySQL is rather perceived as a toy (look at what Curt Monash says about this RDBMS). So when it comes to datawarehousing, there are two points:

- for small footprints (below 1 TB), why would a company not use its preferred RDBMS rather than MySQL? Then why would it use an appliance model such as Kickfire instead of commodity hardware? The Netezza success story can help here, but the answer is not obvious.

- for bigger footprints (Infobright already claims 50+ TB, and has announced the 100TB range), can we speak about mass market? I do not think so. The opensource model is here a bright idea, but from 10TB upwards there is a face to face confrontation with every individual datawarehouse vendor, and without market education, I am not sure MySQL can match the comparison.

I am based in Europe, and over here, datawarehouses are rather smaller than in the US. So, a "MySQL mass market" is certainly more meaningful in Europe.

Unfortunately, it seems that Kickfire does not (yet?) sell outside the US. As for Infobright, can we say that since they are MySQL based, they may be perceived as mass market? I am not sure, although, again, opensource is a very good idea yet to be improved...

I don't know if there is a market, but there is a need and maybe the market won't develop until MySQL is better at supporting large data warehouse workloads. With the exception of the custom storage-engine vendors, there is more focus now on improving OLTP performance for MySQL.

I am not sure how much Infobright and Kickfire get from their MySQL association. They replace most of the code in the server (optimizer, execution) in order to get great performance. Maybe they will announce support for customers from other RDBMS vendors.

Keep on blogging. Curt Monash referenced your blog so you may get a few more visitors.

Thanks for your comments. Pretty cool to have someone from across the Atlantic participating in the discussion so soon. Not that many years ago, this would be unheard of.

Bernard: I agree with most of what you say. Though I think that even if Infobright can go up to 50+TB, I would rather have an MPP database at more than 1TB scale if I'm doing a lot of table scans; otherwise I/O will be a severe bottleneck. That said, the HadoopDB project we're working on at Yale will take any non-MPP database (like Infobright) and turn it into an MPP database. But that's a story for another day.

mysqlha: Interesting prediction --- you seem to be insinuating that some subset of Kickfire/Infobright/Calpont will announce transparency to another DBMS in addition to MySQL (e.g. PostgreSQL) since all they really use from MySQL are the drivers/parser/interface. I wouldn't be surprised if you were right, especially given that people are worried about MySQL under Oracle's stewardship.

"We have noticed a reasonably sized MySQL market for DW in theweb/clickstream space. What's curiousis that anyone with big data has already left the safe territory of plugin compatibility. Anyone running more than 100gb (that may even behigh) seems to have gone the sharding route. I've talked to a fewpeople running systems like these and they admit it's incrediblypainful. I don't see how kickfire would necessarily help there sincethey'd either have to buy many kickfire boxes (at about 10x the cost ofa Dell box) or have to reverse engineer back to their original design."

1) We actually came out of stealth mode as a company at the April 2008 MySQL User Conference, where we announced our world records for TPC-H at 100GB and 300GB;

2) Our product went GA at the end of 2008, which we formally announced at the April 2009 MySQL User Conference along with one of our production reference customers, Mamasource, a Web 2.0 online community doing clickstream analysis that hit performance and scalability limitations with MySQL at 50GB;

3) Our focus is on the data warehouse "mass market" with databases ranging from gigabytes to low terabytes, where over 75% of deployments are today according to IDC/Computerworld survey 2008;

4) We chose MySQL as a key component because it has emerged as a standard (12 million deployments) and 3rd-most deployed database for data warehousing according to IDC;

5) While we do take over much of the processing with our column-store pluggable storage engine and parallel-processing SQL chip, we feel it's important to minimize any changes to a customer's database schema and/or application and to allow transparent interoperability with third-party tools;

6) Having come from 15 years at Teradata (and after that Sybase and Broadbase), I know that the high-end of data warehousing is very, very different from the mass market - both are technically challenging in their own right and require very different product and go-to-market approaches;

7) Finally, regarding Oracle and whether they would "embrace" Kickfire (the question I was asked by Jason on the Frugal Friday show), we believe the data warehouse mass market could create several winners - and having recently raised $20M from top-tier Silicon Valley investors, we believe we have the resources to be one of them.

Thanks again for the post - we look forward to more from you and the community!

Thanks so much for the comment, and for taking the time to respond in your own words. I'll give your response more prominence in a separate blog posting later today when I get a chance and include my commentary on your response ...

Daniel, thanks for the kind words about the support you received from Infobright. Much appreciated. We do take pride in providing excellent support to our community users (see www.infobright.org).

On the question about MySQL and the use of MySQL for data warehousing applications, market data shows that the vast majority of “data warehousing” applications are between tens of GB and the low TB range. (I put data warehousing in quotes because it has evolved well beyond the traditional enterprise data warehouse and products like Infobright’s are used for many analytic use cases that don’t necessarily resemble an EDW.)

That’s not to say that there are not many data warehouses well above that range, but from a market percentage perspective, a significant portion of the DW space has relatively small volume. Although there are conflicting reports on how much data MySQL can handle, there are still plenty of DW applications that can successfully use MySQL. However, there are also a growing number of MySQL users that need to implement large volume, read-dominant applications. That is a great fit for Infobright, as such applications perform much better on a column-oriented database designed for analytics. About half of our 50+ enterprise customers are MySQL users with this requirement.

The MySQL community (and the open source community in general) expects to be able to download and trial the software with no special requirements (up and running in less than 15 minutes). This is something Infobright offers to customers with applications pushing the limits of MySQL and in need of an alternative. We don’t meet the needs of every application but the ability to figure out the fit within a matter of minutes is key to success for many customers struggling with limited resources and time. Personally, I think the growth of analytic applications and expanded data warehousing use cases has been severely limited by the proprietary nature and overall cost of most solutions. MySQL offers an alternative for the smaller volume applications, while Infobright enables higher volumes, faster load times and excellent performance.

In regard to why we moved to open source, we believed there was a great market opportunity for us. Our product was perfectly suited to it as it takes just minutes to download and implement, requires no special hardware, and doesn’t require the complex administration and tuning that most products do. Today, we have had over 10,000 downloads of our open source product and grew our enterprise customer base from about 8 in the beginning of 2008 to well over 50 today. The amount of community feedback and input has contributed enormously to the growth of the company and the advancement of our products. It has also allowed us to become a key part of the open source BI community – as we work closely with companies like Pentaho, Jaspersoft and Talend and have joint downloads so users can easily download an end to end BI/DW stack in minutes. That is hard to duplicate in traditional commercial companies.

I definitely agree that moving to open source was the right move for Infobright. However, a key advantage of this move in my opinion, which you also alluded to, is that by being open source, it's really easy for people to download and check if Infobright is well-suited for their (new) application. I'd argue that open source Infobright is much more geared to *new* applications (rather than the preexisting MySQL market). I continue to maintain that the market for upgrading analytical applications already running MySQL is much smaller than people think. I further hypothesize that this is the reason why Infobright failed as a proprietary product.

I do not buy your argument that becoming open source helped you partner with Pentaho/Jaspersoft/Talend. Pentaho partners with Aster Data, Greenplum, Kickfire, Netezza, ParAccel, and Vertica, all of whom compete with Infobright with proprietary software. Jaspersoft partners with Aster Data, Greenplum, Kickfire, Oracle, ParAccel, and Vertica. Talend partners with Dataupia, Greenplum, Kickfire, ParAccel, Teradata, and Vertica. And Talend has lots of proprietary vendors listed as "featured" and Infobright is not in the featured vendor list.

MySQL community is certainly a very good vector for datawarehousing, as far as installed base is concerned. This is why Infobright went opensource, and it was obviously the right move.

That being said, MySQL is not the alpha and the omega of datawarehousing. Victoria is right to remind us that most datawarehouses are below 1TB. Then, if a product addresses volumes far beyond that ceiling (Infobright already gets to 50TB and is said to soon address the 100/130TB range), a frontal fight against non-MySQL competitors is unavoidable. So is the *migrating-my-current-datawarehouse* syndrome. Exclusively addressing *new* applications is obviously getting to a limit.

Without getting into the detail, Kickfire was 10-20x faster than #1. Kickfire was 2-4x as fast as #3.

Thus, Kickfire provided a substantial performance improvement over MySQL on the same hardware in loading large amounts of data into a Star Schema model and then loading from the Star Schema into a reporting table that did a full table scan applying statistical algorithms, aggregations and derivations for 2 years of history.

The issues we had with Kickfire were space recovery/table space reorg. Also, deleting records is a non-standard method, so many of our processes for loading and dupe checking would need to be rewritten.

The platform is actually 2 blade servers. One for disk drives, the 2nd is the database server.

The entire appliance cost about $32k. A similarly equipped Dell configuration costs around $18k. Of course you still need a database on the Dell box.
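To put the quoted numbers side by side, here is a rough back-of-the-envelope sketch of the performance-per-dollar arithmetic. It assumes the speedups versus "#1" and "#3" were measured against hardware comparable to the $18k Dell configuration, which the comment does not state explicitly, and it ignores the cost of the database software on the Dell box. The function and variable names are illustrative only.

```python
# Back-of-the-envelope price/performance arithmetic from the figures quoted above.
# Assumption (not stated in the comment): the baseline systems ran on hardware
# comparable in price to the $18k Dell configuration, with no database license cost.

def perf_per_dollar_gain(speedup: float, new_price: float, base_price: float) -> float:
    """Return how many times more performance per dollar the new system delivers."""
    if speedup <= 0 or new_price <= 0 or base_price <= 0:
        raise ValueError("speedup and prices must be positive")
    return speedup * base_price / new_price

KICKFIRE_PRICE = 32_000  # appliance cost quoted in the comment (USD)
DELL_PRICE = 18_000      # Dell configuration quoted in the comment (USD)

# Speedup ranges quoted in the comment for the configurations labeled "#1" and "#3".
SPEEDUPS = {"vs #1": (10, 20), "vs #3": (2, 4)}

def summarize() -> dict:
    """Map each baseline label to its (low, high) performance-per-dollar gain."""
    return {
        label: tuple(perf_per_dollar_gain(s, KICKFIRE_PRICE, DELL_PRICE) for s in rng)
        for label, rng in SPEEDUPS.items()
    }

if __name__ == "__main__":
    for label, (low, high) in summarize().items():
        print(f"{label}: {low:.2f}x to {high:.2f}x performance per dollar")
```

Under those assumptions the appliance costs about 1.8 times as much as the Dell box, so a raw speedup has to exceed roughly 1.8x before it pays for itself on hardware cost alone. Any database license needed on the Dell side would shift the comparison further in the appliance's favor.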

Daniel Abadi

About Me

Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park, doing research primarily in database system architecture and implementation. He received a Ph.D. from MIT and a M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, which was commercialized by VoltDB, and the Calvin project, which inspired FaunaDB), and Hadoop (the HadoopDB project, which was commercialized by Hadapt). Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, a VLDB best paper award, a VLDB 10 year best paper award, the 2013-2014 Yale Provost's Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He blogs at http://dbmsmusings.blogspot.com and tweets at http://twitter.com/#!/daniel_abadi.