Thoughts on Open Source, Analytics

Category Archives: Open Source

It is my unfortunate duty to inform all of you that DynamoBI is ceasing commercial operations October 31, 2012; we are immensely grateful for all the support that you all have shown our company, in so many different ways, over the past 3 years and we hope to make this shutdown as painless as possible for all involved. We know that we are not the only people who are invested in LucidDB, so we wanted to explain our rationale for shutting down along with the implications for the entire LucidDB community (not just our customers).

We started DynamoBI 3 years ago when we saw our favorite open source project, LucidDB, finding limited prospects for adoption without growing into the full commercial support that many (most!) companies need in order to adopt open source software. We had been actively working with LucidDB for a long while, and knew that it is a fantastic piece of database/analytic software; to say that it’s a gem and provides some amazing capabilities in an open source package is an understatement.

However, markets and businesses are not quite as simple as having a great open source project and community. I may blog separately about the lessons learned from this startup (the entrepreneurial badge of honor #fail blog), but the community deserves to know that, for the most part, the failure was more about the market and selling environment (and our results in it) than any innate defects in LucidDB.

In short, we were not successful in the marketplace for two primary reasons:

1) In a crowded, loud market of more than 40 analytic data storage solutions, raw single-query speed remains the singular priority. LucidDB was often a big improvement over MySQL/Oracle but was not as fast as our analytic peers. All of our other very interesting and compelling features (versioning of data, EII type connectors, pluggable/extensible systems) were often not even evaluated, because we were eliminated from evaluation on raw single-query speed alone. LucidDB performs as advertised (great BI database, much faster than what you’re currently using), but that wound up not being enough.

2) Open Source price points are compelling for customers, but work only if you can build a high volume business. It became clear earlier this year (even with building enough cash flow to pay full time staff, etc!) that the size of our “funnel” was not large enough to support a high growth, interesting business. We determined that if we had X number of downloads we ended up with Y prospects that converted to Z customers at price A. We experimented with price, offering, prospect development, etc. We improved our conversion rates over time, but ultimately found that unless we could find some way to increase the mouth of the funnel by more than 100x we wouldn’t have a growing business that would allow us to continue/further our investment in LucidDB.
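The funnel arithmetic in point 2 can be sketched in a few lines of Python. Every number below is hypothetical and chosen only to illustrate the shape of the problem; none of them are DynamoBI’s actual download, conversion, or pricing figures.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    downloads: int        # X: size of the "mouth" of the funnel
    prospect_rate: float  # share of downloads that become prospects (Y / X)
    customer_rate: float  # share of prospects that become customers (Z / Y)
    price: float          # A: revenue per customer

    def prospects(self) -> float:
        return self.downloads * self.prospect_rate

    def customers(self) -> float:
        return self.prospects() * self.customer_rate

    def revenue(self) -> float:
        return self.customers() * self.price


def downloads_needed(target_revenue: float, funnel: Funnel) -> float:
    """Downloads required to hit a revenue target at the funnel's current rates."""
    revenue_per_download = funnel.prospect_rate * funnel.customer_rate * funnel.price
    return target_revenue / revenue_per_download


# Hypothetical inputs, for illustration only.
today = Funnel(downloads=10_000, prospect_rate=0.02, customer_rate=0.10, price=5_000.0)
print(today.prospects(), today.customers(), today.revenue())
print(downloads_needed(100 * today.revenue(), today))
```

Revenue is linear in the number of downloads, while the two conversion rates can never exceed 100%. Once conversion has been tuned, the only lever big enough to change the business by orders of magnitude is the mouth of the funnel.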

There are other reasons as well, many of which are missteps or mistakes by me personally. That could fill an entire other blog (and likely will at some point).

We’ve been working with our customers over the past few months to help them prepare for the future with us no longer providing the customer support. We’ve been communicating this message to them, and now we’re bringing it to the greater community about our future participation in LucidDB.

DynamoBI will:
1) Host the git repositories and continue to provide a legal contribution framework so that the IP for the project remains clean for all. The Apache license means that LucidDB remains free and accessible for anyone/everyone wishing to use it (or parts of it).
2) Contribute any “interesting” pieces of the amazing framework to projects that can use it. In particular, we’re thrilled to see the Optiq project leveraging, as a starting point, some of the LucidDB components.
3) Host the forums, and wiki, and issue tracking for the LucidDB community as we have been for the past few years (http://luciddb.org).
4) Continue to participate as active users in the community; we are still fond of LucidDB and hope to see the community/project be successful.

However, DynamoBI will no longer:
1) Provide releases or builds. We’ve shut down our continuous integration server and do not plan on making any release after 0.9.4.
2) Offer any commercial services for LucidDB (consulting, services, sponsored development, etc).
3) Provide active development on the core project, or ancillary projects.

Once again, thank you for your support over the past few years and we encourage you to continue to look at LucidDB, even though we were unable to make it a commercial success. It has some very unique features that are a perfect fit for some use cases (Big Data access via BI tools, etc) that make it a great open source project.

This has been a long time in the making, but the LucidDB project is leaving the Eigenbase foundation to continue our development outside that organization’s IP sharing, framework, and governance.

Community members will notice (or have already):

We are no longer using Perforce (YAAY!) and are now doing our primary LucidDB, Farrago, Fennel, and relevant extensions/test/build development work at github: https://github.com/dynamobi/luciddb/

The Wiki is now hosted at http://www.luciddb.org/wiki. We will, over time, remove references to Eigenbase in that project documentation/etc.

Issue tracking is now ALSO over at github, and we have migrated all issues (historical and outstanding) over to the github project.

Part of the impetus for leaving Eigenbase was our desire for a more inclusive license, to permit additional use/collaboration by other companies in the spirit of open source. We initiated this process, in the good company of like-minded individuals, early last year. Long story short, this plight and the political battles cost Eigenbase the resignations of its two most critical participants: Julian Hyde and John V. Sichi. I join them now, as I resigned from the Eigenbase Board March 26.

Today I’m announcing that DynamoBI has released the entirety of the codebase under the Apache Software License 2.0. We welcome our community members’ ongoing contributions, and hope that companies looking to leverage such a great framework and technology take a look. We welcome, wholeheartedly, your participation in the project under its new permissive license.

We continue to serve our existing customers with annual subscriptions to DynamoDB, our QA’ed and prepackaged commercial version of LucidDB.

For those who weren’t able to attend the fantastic NoSQL Now Conference in San Jose last week, but are still interested in how people are doing Ad Hoc analytics on top of NoSQL data systems, here are the slides from my presentation:

We obviously continue to hear from our community that LucidDB is a great solution sitting in front of a Big Data/NoSQL system. Allowing easy SQL access (including super fast, analytic database cached views) is a big win for reducing load *AND* increasing usability of data in NoSQL systems.

By far, the most popular way for PDI users to load data into LucidDB is to use the PDI Streaming Loader. The streaming loader is a native PDI step that:

Enables high performance loading, directly over the network without the need for intermediate IO and shipping of data files.

Lets users choose more interesting (from a DW perspective) loading types into tables. In particular, in addition to simple INSERTs it allows for MERGE (aka UPSERT) and also UPDATE, all done in the same bulk loader.

Enables the metadata for the load to be managed, scheduled, and run in PDI.

However, we’ve had some known issues. In fact, until PDI 4.2 GA and LucidDB 0.9.4 GA it’s pretty problematic unless you run through the process of patching LucidDB outlined on this page: Known Issues.

In some ways, we have to admit that we released this piece of software too soon. “Early and often” comes with some risk, and many have felt the pain of some of the issues that have been discovered with the streaming loader.

In some ways, we’ve built an unnatural approach to loading for PDI: PDI wants to PUSH data into a database. LucidDB wants to PULL data from remote sources, with its integrated ELT and DML based approach (with connectors to databases, salesforce, etc). Our streaming loader “fakes” a pull data source, and allows PDI to “push” into it.
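To make the push/pull mismatch concrete, here is a minimal in-process Python sketch of the general idea: one side pushes rows into a bounded buffer while the other side pulls them out as if it were reading an ordinary source. This is an illustration only, not the streaming loader’s actual implementation (which works over the network); the class and function names are made up for the example.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


class PushToPullBridge:
    """Lets a pushing producer feed a consumer that only knows how to pull."""

    def __init__(self, maxsize=1000):
        # A bounded queue gives back-pressure: push() blocks when the puller lags.
        self._rows = queue.Queue(maxsize=maxsize)

    def push(self, row):
        self._rows.put(row)

    def close(self):
        self._rows.put(_DONE)

    def __iter__(self):
        # The pulling side simply iterates until it sees the sentinel.
        while True:
            row = self._rows.get()
            if row is _DONE:
                return
            yield row


def run_demo(n_rows=5):
    bridge = PushToPullBridge(maxsize=2)

    def producer():  # plays the role of the tool that wants to PUSH
        for i in range(n_rows):
            bridge.push((i, f"row-{i}"))
        bridge.close()

    worker = threading.Thread(target=producer)
    worker.start()
    pulled = list(bridge)  # plays the role of the database that wants to PULL
    worker.join()
    return pulled


print(run_demo())
```

Even in this toy version there are two threads and a shared buffer, which hints at why a failure on one side can surface as an unhelpful error on the other.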

There are multiple threads involved, and when exceptions happen users have received cruddy error messages such as “Broken Pipe” that are unhelpful at best, frustrating at worst. Almost all of these contortions will have sorted themselves out, and by the time 4.2 GA of PDI and 0.9.4 GA of LucidDB are released the streaming loader should be working A-OK. Some users would just as soon avoid the patch instructions above and have posed the question: In a general sense, if not the streaming loader, how would I load data into LucidDB?

Again, LucidDB likes to “pull” data from remote sources. One of those is CSV files. Here’s a nice, easy, quick (30k r/s on my MacBook) method to load a million rows using PDI and LucidDB:

This transformation outputs 1 million rows to a Text File, waits for that to complete, then proceeds to load that data into a new table in LucidDB. Step by step, the LucidDB statements:

-- Points LucidDB to the directory with the just generated flat file
-- LucidDB has some defaults, and we can "guess" the datatypes by scanning the file
CREATE OR REPLACE SERVER csv_file_server FOREIGN DATA WRAPPER SYS_FILE_WRAPPER OPTIONS ( DIRECTORY '?' );
-- Let's create a foreign table for the data file ("DATA.txt") that was output by PDI
CREATE FOREIGN TABLE applib.data SERVER csv_file_server;
-- Create a staging table, and load the data from the flat file (select * from applib.data)
CALL APPLIB.CREATE_TABLE_AS ('APPLIB', 'STAGING_TABLE', 'select * from applib.data', true);

We hope to have the streaming loader ready to go in 0.9.4 (LucidDB) and 4.2 (PDI). Until then, consider this easy, straightforward method of loading data from PDI into LucidDB: it’s high performance, proven, and stable.

Use Hive for its massive scalability and distributed data processing capabilities.
Hive is great at processing huge amounts of data. It scales to hundreds of servers, and has a bunch of fantastic features for structured and semi structured data processing, fault tolerance, etc. Hive is a great way to do the heavy lifting of sorting through petabytes of data to arrive at some interesting, pre-aggregated datasets.

Cache the output of Hive views into LucidDB.
Now when we’re talking about taking the output of Hive views into LucidDB, we’re not talking about SMALL datasets (10k rows) we’re talking about 50 or 100, or 500 million rows. Some might think that number is small (by Hive standards it often is) and others might think that’s big (our entire DW is only 200 million rows). However, LucidDB has provided the ability to draw in data from Hive via easy MERGE/INSERT statements.

We’ve done some work on LucidDB (yet to be committed, and will be a POST 0.9.4 commit) that allows the use of Hive’s, well, somewhat unique driver. Hive’s driver has a bunch of quirks in terms of metadata, etc that we’re now recognizing and using properly over in LucidDB.

We’ve updated to the 0.7.0 release. We’re now ready to go with the latest and greatest Hive features.

We’ve enabled some pushdowns to work to allow for easier day to day loading of LucidDB tables from Hive, along with a limited workload of Ad Hoc SQL access.

Live, real time, per query access to the Big Data system that is useful and practical (ie, filters, etc).
This means that you need to be able to give the user access to the live data, via a simple parameter or simply by hitting a different schema or table.

Easy, full ANSI SQL access to high performance, low latency, aggregated data.
Dashboards need results that come back in seconds, not minutes. LucidDB and the data cached there provide a great “front end” for easily doing real BI work on top of data that sits inside Hive.

We’ve updated our connectors to allow some filtering/projection pushdowns to work with Hive.

Here’s a simple example. We have a report or dashboard which is looking for only a subset of data in Hive. We want to allow the filtering of data to occur and for Hive to receive the filtering from our OLAP/Dashboard.

By default LucidDB will read the entire table and do all SQL functions over in our

However, pulling over the entire table is really not going to work well for us. This would really be the worst of both worlds; you’d be better off just querying Hive directly. Luckily, we’ve enabled some filters to be pushed down to Hive.

Notice that the condition IN( values ) is being pushed down to the remote Hive server.

Let’s try something a bit more complex!

Currently, we’re able to push down most filters and projections.

Let’s now take the use case where we’re trying to ONLY UPDATE records that have been updated since the last time we checked (ID > 97). More likely the key that we’d use to do this push down filter would be a date, but you can simply use your imagination.

Consider the following SQL:

merge into hive_test."local_POKES" l
using (select * from hive_test.pokes where "foo" > 97) p
ON l."foo" = p."foo"
when matched then update set "bar" = p."bar"
when not matched then insert ("foo", "bar") values (p."foo", p."bar");

This SQL is a typical “incremental” load from a remote system. Syntactically a little dense, but it’s actually a VERY high performance method to load directly into LucidDB often eliminating the need entirely to draw the data through an intermediate server and process (ETL Tool).

Our enhancements allow the Hive portion to be pushed down. Hive will ONLY return values greater than 97 and we’ll simply intelligently keep any changed records “up to date” in LucidDB for reporting.
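For readers who find the MERGE syntax dense, its semantics can be sketched in plain Python. This is only a toy model under stated assumptions: the remote table is a list of dicts, the local table is a dict keyed on "foo", and `remote_scan` stands in for the filter being evaluated on the remote side. None of these names come from LucidDB or Hive.

```python
def remote_scan(remote_rows, min_foo):
    """Stand-in for the pushed-down filter: the remote side returns only foo > min_foo."""
    return [row for row in remote_rows if row["foo"] > min_foo]


def merge_into(local, incoming):
    """Apply MERGE semantics to `local` (a dict mapping foo -> bar).

    Matched keys are updated, unmatched keys are inserted.
    Returns the (updated, inserted) counts.
    """
    updated = inserted = 0
    for row in incoming:
        if row["foo"] in local:
            updated += 1   # WHEN MATCHED THEN UPDATE
        else:
            inserted += 1  # WHEN NOT MATCHED THEN INSERT
        local[row["foo"]] = row["bar"]
    return updated, inserted


# Hypothetical data: the row with foo = 96 never leaves the remote side.
remote = [
    {"foo": 96, "bar": "filtered out remotely"},
    {"foo": 98, "bar": "new value"},
    {"foo": 99, "bar": "brand new"},
]
local = {96: "unchanged", 98: "old value"}

counts = merge_into(local, remote_scan(remote, 97))
print(counts, local)
```

The point of the pushdown is that the filter runs where the data lives, so only the rows that can actually change the local table are shipped across.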

Many of these changes will be in a patched version of LucidDB; we’ll make this patched release available immediately to any customers who want these optimizations for use with Hive. Let us know what you think by joining this conversation at the LucidDB forums: Hive Connector Pushdown

In a subsequent blog we’ll cover how to now match up data coming from Hive (or CouchDB) with data in other systems for reporting.

Our last LucidDB release was just over 12 months ago, on June 16, 2010. We were really, really trying to beat the 1 year mark for our 0.9.4 release but we just couldn’t. A tenet of good open source development is “early and often” and we need to do better. Since the 0.9.3 release we’ve:

Julian Hyde and his cohorts on the Mondrian project have been busy at work for nearly 5 years (spec 0.5 done in 2006!) working on the difficult, but worthwhile effort of standardizing client side access to OLAP in Java.

They just released version 1.0! This is a big deal; bigger players have attempted and failed at this before (ahem JOLAP). Kudos to Julian, Luc and the others involved for getting such a *real* standard in place!

There are a few reasons why this matters to everyone in Business Intelligence, not just Java devs and Open Source BI fans.

Only existing “de facto” standards are owned by MSFT

XML/A was touted as **the** industry standard for OLAP client server communications. You can think of XML/A as the SOAP equivalent of OLAP client libraries. There are a few problems with this. First is that MSFT always treated this like they do all “open” standards: just open enough to get what they need out of it (SQUASH JOLAP) but never really open. For instance, reading the spec, notice that the companies involved specifically note that they all absolutely reserve the right to enforce their patent rights on their technology, EVEN IF it’s part of the spec. ie, it’s open, but if you actually IMPLEMENT it you might have to pay MSFT for it.

Second is that XML/A is now a fragmented standard. Similar to SQL, MDX support and other line protocol extensions (ahem, Binary Secured XML/A) means that there’s no one really making any sort of technology toolkit, collection of drivers, etc. Simba does much of this in their lab in Vancouver, but they’re the exact opposite of open. In fact, when the XML/A council vanished they pounced and picked up the site which is now a simple shill for their products. A couple of guys at a single company without any open publication on variations in MDX/implementations is counterproductive to real interoperability.

Third is that SOAP is soooooo 1999. SOAP is fundamental in XML/A, and there are many more interesting ways (see Saiku) of doing client/server: REST, direct sockets, in memory, etc.

Helps keep Mondrian from being fused to Pentaho Analyzer

Mondrian is a very successful open source project and serves as the basis (server part) of Pentaho’s Analyzer (acquired from LucidEra). Pentaho has clearly signaled their (lack of) commitment to upkeep of their open source frontends; Analyzer is proprietary software that Pentaho has committed all their OLAP UI efforts behind, leaving the community with an aging JPivot front end to Mondrian. Pentaho clearly underestimated what the community has to offer: the community has delivered a replacement project, Saiku, to address this.

OFF TOPIC: I’ve made several Open Source BI predictions and with the exception of Pentaho Spreadsheet Services (which technically wasn’t OSS) I’ve been right every time. Here’s one for ya: Saiku will outshine Analyzer in the next 18mos and both technologies will be worse off because Pentaho, ironically and increasingly, chose proprietary instead of community. Ahhh… I feel better having said it.

Keeping Mondrian’s primary exterior API as a standard helps ensure that Mondrian cannot be subsumed (entirely) by Pentaho and that innovation can continue with multiple community projects doing shiny UI work on top of Mondrian.

A single, pragmatically useful API enables binding to other languages as well (ie, non Java)

Saiku, basing their open source RESTful server on top of OLAP4J, has now enabled cool mashable OLAP access to not JUST Mondrian (which was already available via SOAP/.xactions) but anyone else who creates a driver (SAP, SSAS, etc). Actually having a real project that can collect real, open driver implementations is what enables projects like Saiku (which actually has client APIs for C, Obj-C, Ruby, ActionScript, etc).

I wouldn’t be surprised if there are other layers (ADOMD?) that leverage OLAP4J as well.

At yesterday’s Eigenbase Developer Meetup at SQLstream’s offices in San Francisco we arrived at a new logo for LucidDB. DynamoBI is thrilled to have supported and funded the design contest to arrive at our new mascot. Over the coming months you’ll see the logo make its way out to the existing luciddb.org sites, wiki sites, etc. I’m really happy to have a logo that matches the nature of our database: BAD ASS!

DynamoDB (aka LucidDB) is not just another column store database. Our goal is being the best database for actually doing Business Intelligence; while that means being fast and handling large amounts of data there’s a lot of other things BI consultant/developers need. I’ll continue to post about some of the great BI features that DynamoDB has for the modern datasmiths.

The first feature to cover, and it’s dead easy, is the built in ability to generate a time dimension, including Fiscal Calendar attributes. If you’re using Mondrian (or, come to that, your own custom SQL on a star schema) you need to have a time dimension. Time is the most important dimension! Every OLAP model I’ve ever built uses one! It’s something that you, as a datasmith, will need to do with every project; that’s why we’ve built it right into our database.

Here’s a dead simple way to create a fully baked, ready to use Time Dimension to use with Mondrian.

-- Create a view that is our time dimension for 10 years, with our
-- Fiscal calendar starting in March (3)
create view dim_time as select * from
table(applib.fiscal_time_dimension (2000, 1, 1, 2009, 12, 31, 3));

OK, that’s it. You’ve created a Time Dimension! * see NOTE at end of post.

So, we’ve created our time dimension, complete with a Fiscal calendar for 10 years in a single statement! Awesome – but what does it contain?

Generating the Time Dimension is accomplished using DynamoDB’s ability to include Java based UDF Table Functions. Table functions are really powerful: they allow a BI developer to write custom functions that output a “table” that can be queried like ANY OTHER TABLE (mostly). Check out the wiki page FarragoUdx if you’re interested.
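If the idea of a function that returns a table is new to you, here is a rough Python analogy: a generator that emits one row per day with a few fiscal attributes. To be clear, this is not the applib implementation. The column names are made up for illustration, and the sketch assumes a fiscal year is labelled by the calendar year in which it starts, which may differ from what `applib.fiscal_time_dimension` actually does.

```python
from datetime import date, timedelta


def fiscal_time_dimension(start, end, fiscal_start_month):
    """Yield one dict ("row") per day from start to end, inclusive."""
    day = start
    while day <= end:
        # Months are renumbered so the fiscal start month becomes fiscal month 1.
        fiscal_month = (day.month - fiscal_start_month) % 12 + 1
        # Assumption: the fiscal year is named after the calendar year it starts in.
        fiscal_year = day.year if day.month >= fiscal_start_month else day.year - 1
        yield {
            "time_key": day.isoformat(),
            "calendar_year": day.year,
            "calendar_month": day.month,
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "fiscal_year": fiscal_year,
            "fiscal_month": fiscal_month,
            "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
        }
        day += timedelta(days=1)


# Same range and fiscal start month (March) as the SQL example above.
rows = list(fiscal_time_dimension(date(2000, 1, 1), date(2009, 12, 31), 3))
print(len(rows))
print(rows[0])
```

Because the rows are produced on demand, the caller can filter or aggregate them like any other row source, which is the same property that makes table functions convenient inside the database.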

And of course: download LucidDB and give it a whirl!

NOTE: To be candid, doing it as a view isn’t the best approach. For anything beyond tiny (5 million +) we should actually create the table, and do an INSERT INTO SELECT * FROM TABLE(fiscal_time_dimension).

The book is certainly geared towards Java developers, and specifically, developers who are trying to simply use the Pentaho reporting library. I’d venture to say that MOST customers should be using Pentaho; in this case, the book is useful as a reference, but the HOWTO past Chapter 3 would probably be lost on many users, except for Chapter 11 (see below).

However, for people trying to embed Pentaho reporting, WOW: THIS IS THE DEFINITIVE RESOURCE. Buy it, RIGHT NOW! The information it contains was locked in just a few people’s minds (Thomas, a bunch of people sitting at the “citadel” in Orlando aka Pentaho Employees, a handful of consultants). Will has unlocked it and I’m glad he did.

Will taught me something new in this book. In fact, I hope this is “new” in 3.5, which was released just a few weeks back. If it’s been around longer, then I’m a total dolt. Chapter 11 covers how to add your own custom Expressions/Formulas to Pentaho (including the PRD).

At customer engagements, or when I put on my Pentaho hat and teach their public courses or custom onsite training, I’m asked all the time: Can I make my own Reporting Functions and plug them into Pentaho Report Designer? Up until Will showed me how to do it on page 281, I thought this was only possible for Pentaho (the company). Will gives us a step by step guide to add our own “DoMyCustomThing” to the Pentaho Report Designer. Customers can now create their own corporate expressions/functions they can leverage across hundreds of reports.

I’ll keep several copies on my shelf, and give it away to any current/future “embedded Pentaho Reporting” customers. Thanks Will for such a great book!