Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:

Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.

Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.

In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.

Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.

Consolidating data marts onto one common technological platform has important benefits.

In essence, Greenplum is pitching the story:

Thesis: Enterprise Data Warehouses (EDWs)

Antithesis: Data Warehouse Appliances

Synthesis: Greenplum’s Enterprise Data Cloud vision

When put that starkly, it’s overstated, not least because

Specialized Analytic DBMS != Data Warehouse Appliance

But basically it makes sense, for two main reasons:

Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.

On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do interoperate in part with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.

Of course, the EDC vision isn’t quite as new or differentiated as Greenplum ideally would wish one to believe.

Something like EDC can also be presumed to be implicit in the strategies of the other one-size-fits-all vendors — i.e., Oracle and IBM.

Greenplum has only implemented a little more of the EDC vision so far than have other firms, unless you give it credit for being cheap/fast/MPP/running on commodity hardware, but deny that credit to Teradata (specialized hardware, and not cheap in its most popular configurations), Oracle (ditto for Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not released yet).

Specifically: In Greenplum Release 3.3, which is being announced today, Greenplum is introducing the (enhanced?) ability for data marts to be spun out as a background operation, while the database otherwise remains functional. As of 3.3, spinning out a data mart is a command-line operation. But in Release 3.4, Greenplum plans to offer a web-based interface for same, at which point the “self-service data mart creation” discussion will become operative. Otherwise, EDC is a roadmap/vision/statement-of-direction much more than it is a fully-baked technical project.

One particular source of potential confusion is Greenplum’s emphasis on the buzzphrase self-service (data mart). This seems to be a conflation of two related concepts:

End users should be able to create new data marts themselves. Strictly speaking, I view this ability as useless at most enterprises, and important at very few, because of logistical issues. (Who gives the permissions? Who decides which hardware is used?) That said, useless “end user” tools often wind up being important productivity aids for IT professionals, and this kind of “self-service” would surely be another example. Edit: Hmm. Doug Henschen inspired me to think that over again, and I’m beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.

End users should be able to manage data marts themselves, once created. That’s a great idea, full of agility and don’t-make-IT-a-roadblock goodness. Data miners and similar analytic professionals commonly have the technical ability to manage a simple database, and should be allowed to do so if it’s ensured that they don’t break anything for anybody else.
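The "order up a data mart, test it at low priority, then send it to IT for approval" workflow suggested above can be sketched as a simple state machine. This is purely illustrative; the class and method names are my own invention, not any Greenplum API.

```python
# Hypothetical sketch of the "order, test, then approve" self-service flow.
# All names here are illustrative assumptions, not real product interfaces.
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "draft"                # analyst is still composing the request
    TESTING = "testing"            # running at a very low processing priority
    PENDING = "pending"            # submitted to IT for approval
    PROVISIONED = "provisioned"    # IT has assigned permissions and hardware


@dataclass
class MartRequest:
    owner: str
    tables: list
    status: Status = Status.DRAFT

    def test_run(self):
        # Trial the mart at minimal priority so it cannot disturb
        # production workloads.
        self.status = Status.TESTING

    def submit_to_it(self):
        # Hand the completed request to IT, which decides on
        # permissions and which hardware is used.
        self.status = Status.PENDING

    def approve(self):
        self.status = Status.PROVISIONED


req = MartRequest(owner="analyst1", tables=["orders", "clickstream"])
req.test_run()
req.submit_to_it()
req.approve()
print(req.status)  # Status.PROVISIONED
```

The point of the sketch is the division of labor: the analyst specifies and trials the mart, while permissioning and provisioning stay with IT.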

One thing that’s needed for this technology to come to full fruition is sophisticated data movement and synchronization. Ideally, some tables in a data mart could be virtual — views against a central database. But others would be physically recopied from the center, with all the ETL/ELT/ETLT/replication issues that entails. Meanwhile, it’s not obvious that the ideal architecture is a simpleminded hub-spoke — perhaps one should be able to spin data marts out of other marts, perhaps at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g. by starting a table out as a view and changing over to making it a physical copy as usage profiles change.
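The view-versus-physical-copy choice above can be modeled in a few lines. The sketch below is a toy, under the assumption that a mart table starts life as a virtual view against the central warehouse and is later materialized as a physical copy when usage grows; the names are illustrative only.

```python
# Toy model of the deployment choice discussed above: a mart table can
# start as a view against the central database and later be switched to
# a physical copy. Class names are hypothetical, not any vendor's API.

class CentralWarehouse:
    def __init__(self):
        self.tables = {}

    def read(self, name):
        return self.tables[name]


class MartTable:
    def __init__(self, name, center):
        self.name = name
        self.center = center
        self.physical = False
        self._copy = None

    def materialize(self):
        # Switch from view to physical copy: a one-time recopy, after
        # which real systems would need ongoing ETL/replication to sync.
        self._copy = list(self.center.read(self.name))
        self.physical = True

    def rows(self):
        # A view always reflects the center; a physical copy may lag.
        return self._copy if self.physical else self.center.read(self.name)


center = CentralWarehouse()
center.tables["orders"] = [1, 2, 3]

t = MartTable("orders", center)     # starts out as a view
center.tables["orders"].append(4)
print(t.rows())                     # [1, 2, 3, 4] -- view tracks center

t.materialize()                     # administrator changes strategy
center.tables["orders"].append(5)
print(t.rows())                     # [1, 2, 3, 4] -- copy now lags
```

The last two lines show exactly the synchronization problem the paragraph raises: once a table is physical, keeping it current is a replication job, not a free lunch.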

Oliver Ratzesberger of eBay also argues that workload management — not a current Greenplum strength — can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements). More generally, if you’re setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There’s a big virtualization opportunity here — but virtualization requires a lot of system management infrastructure.
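To make the SLA point concrete, here is a minimal priority-queue sketch of workload management: queries from marts with tighter SLAs are dispatched first. This is a conceptual illustration only, not how Greenplum or eBay actually implement it.

```python
# Minimal illustration of SLA-driven workload management: lower priority
# number = tighter SLA = dispatched sooner. Purely a conceptual sketch.
import heapq

queue = []  # entries are (priority, sequence, query)

def submit(priority, seq, query):
    heapq.heappush(queue, (priority, seq, query))

submit(5, 0, "weekly churn report")       # routine mart workload
submit(1, 1, "CFO answer for the CEO")    # virtual mart, very favorable SLA
submit(3, 2, "ad-hoc clickstream scan")

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)
# ['CFO answer for the CEO', 'ad-hoc clickstream scan', 'weekly churn report']
```

A real implementation would of course preempt and throttle running queries rather than merely reorder a queue, which is where the system-management infrastructure mentioned above comes in.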

Well, it looks like they’re basically saying: look, go cobble something together using metal/virtual/cloud, and we will fit on top of that. But it’s still your IT ops handling all the provisioning (for their private cloud). In essence, what it seems like to me at this point is a set of best practices, really relatively in line with their MAD EDW philosophy. Unless I’m missing something, which is a distinct possibility; I mean, geezus, I initially assumed they were providing the cloud infrastructure.

The EDC initiative is about three things:
– Platform technology that allows business analysts to self-provision warehouses/sandboxes via a web console and access/replicate data into their warehouse from anywhere in the EDC (i.e., a ‘private cloud’ approach applied to scale-out data warehousing). This is not just about spinning up a database in virtual machines. We’re building a new layer of services that really allows business and IT to each focus on what they do best and reduces the areas of friction that exist today — e.g. self-serve cluster provisioning from server pools, local or geographically remote data replication, data lineage and cross-warehouse metadata, and more.
– A new data warehousing methodology that challenges the formal ‘everything in one database and one data model’ approach that has been prevalent over the past 25 years. This isn’t something that Greenplum has cooked up — it is simply a reflection of what our customers are putting into practice today.
– An ecosystem of customers and partners that believe in the vision and are working with us to shape and deliver on it.

Note that most enterprises we work with aren’t looking to the public cloud for data warehousing, largely because the data is generated in-house and they don’t want to push TBs over the Internet daily. But they do want to achieve many of the touted ‘cloud’ benefits in-house. That is, they want to empower business analysts to serve themselves without lots of process or IT delays in the way. And they want IT to consolidate infrastructure, get their arms around data mart proliferation, and improve service levels, but without some heavy-handed approach that requires unifying all the data models.



For example:
“7. It appears that the only part of the EDC initiative that Greenplum’s new version (3.3) has implemented is online data warehouse expansion (you can add a new node and the data warehouse/data mart can incorporate it into the parallel storage/processing without having to go down). All this means is that Greenplum has finally caught up to Aster Data along this dimension. I’d argue that since Aster Data also has a public cloud version and has customers using it there, they’re actually farther along the EDC initiative than Greenplum is …”

We’ve been running Greenplum internally on EC2 for almost 2 years now, and use both EC2 and internal VMware pools for a range of QA and scale testing work.

Making Greenplum run on EC2 is almost zero work — we just haven’t seen material demand from large enterprises wanting to put their production, mission critical data warehouses in the public cloud yet. There’s no doubt it’ll come over time, and we’re supportive of the direction, but it just isn’t here yet.

Matt Aslett of The 451 Group wrote a nice analysis on this topic (unfortunately available only through paid subscription), where he reinforced this point:

“Enabling cloud-computing deployments is about more than simply offering a version of your product running on Amazon . . . Adoption of data warehousing on public clouds has so far been limited to proofs-of-concept evaluations and trials rather than production deployments, we believe, and Greenplum’s focus on datacenter platforms could serve it well as enterprises look to private cloud architecture as a method of improving datacenter efficiencies before identifying workloads that could be migrated to public clouds.”

We’re encouraged by folks like Aster, Vertica and others that find interest in public cloud offerings to serve the current market of Web 2.0 companies, which is definitely a good use case. If anyone is seeing that large enterprises are ready today for meaningful adoption of public cloud services for data warehousing, we’re ready to serve.

[…] (1) Greenplum themselves promote this offering as part of their Enterprise Data Cloud. They have a vision of self service data marts. Based on this, data analysts can go to the Enterprise Data Warehouse and via interfaces create their own data marts for in depth analysis outside the EDW. Have a look at Curt Monash’s excellent article on the future of data marts. […]



[…] Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year’s EDC (Enterprise Data Cloud) vision statement and marketing campaign. […]

Big Data Analytics is not only for retail Business Intelligence, even though that is where some of the greatest advancements are currently occurring. Big Data Analytics is also the future of Infrastructure Asset Management. Each industry has core infrastructure that must be asset-managed over its life-cycle through predictive modeling. Big Data Analytics will evolve too rapidly (with increasing volume, variety, velocity and complexity of available classes of data) for any industry organization to standardize and maintain THE method of doing Infrastructure Asset Management through Big Data Analytics.

Those wishing to take a leadership role in the Big Data Analytics required for successful Integrated Asset Management of infrastructure need to establish the standards for the backbone of Big Data in their industry sector. That is, industry associations should establish the standards for data governance, management, control, and compliance through a Central Data Warehouse. Then let consultants, utilities, software companies, and academics knock themselves out, do the analytics they want any way they want to do it, and provide competitive differentiation to the businesses. If you want to work with our data scientists, great; if you have your own data scientists or a third party that helps you, fabulous. But there is only one place that you come to get the data, and that’s [the industry association’s Central Data Warehouse]. From the CDW, infrastructure asset design and performance can be independently validated and verified (iV&V), and benchmarked against peers (as is done in the software sector). As a civil engineer, I believe this is the future of the engineering standard of care in all sectors and will change engineering practice as we know it.
