
It's been a few years, but I've run into a problem set that reminds me of a schema design pattern, and I'm trying to recall the details.

Essentially, the pattern consists of one database in third normal form (3NF, or some other normalized form) that services all inserts, updates, and deletes. On some periodic or event-driven basis, it publishes all (or some) of its state to a second database optimized for high-performance, read-only access. The publish generally targets a completely different, denormalized schema.
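As a rough sketch of the pattern described above (all table and column names here are invented for illustration, and in-memory SQLite stands in for what would really be two separate servers), a periodic publish from a normalized source into a denormalized read copy might look like:

```python
import sqlite3

# In-memory stand-ins for the two databases; in practice these would be
# separate servers (a normalized OLTP source and a read-optimized target).
oltp = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Normalized (3NF-style) source schema: customers and orders kept apart.
oltp.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(id),
                         amount REAL, placed_on TEXT);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 100.0, '2012-04-01'),
                              (2, 1, 250.0, '2012-04-02'),
                              (3, 2, 75.0,  '2012-04-02');
""")

# Denormalized, read-only target: one wide row per order, with customer
# attributes flattened in so that reads need no joins at all.
warehouse.execute("""
    CREATE TABLE order_flat (order_id INTEGER, customer_name TEXT,
                             amount REAL, placed_on TEXT)
""")

def publish():
    """Periodic (or event-driven) load: join once at load time so that
    every later read is a simple scan of one table."""
    rows = oltp.execute("""
        SELECT o.id, c.name, o.amount, o.placed_on
        FROM orders o JOIN customer c ON c.id = o.customer_id
    """).fetchall()
    warehouse.execute("DELETE FROM order_flat")  # full refresh, for simplicity
    warehouse.executemany("INSERT INTO order_flat VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()

publish()
total = warehouse.execute(
    "SELECT SUM(amount) FROM order_flat WHERE customer_name = 'Acme'"
).fetchone()[0]
print(total)  # 350.0
```

A full refresh is the simplest publish strategy; real systems often use incremental loads, which trade simplicity for the consistency hazards the answers below describe.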

The question is: is this a well-known pattern? If so, what is it called, and what are the pitfalls aside from storing the same state twice?

2 Answers

What you are describing is a data warehouse. The live, normalized, read-write system is OLTP (online transaction processing) and the denormalized read-only snapshot is a data warehouse. The structure of the data warehouse could be a Star Schema, especially if it's highly denormalized. Data warehouses often have summarization in addition to denormalization. There can be many copies of the same data summarized over various dimensions and/or timeframes.
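To make the star schema and summarization ideas concrete, here is a minimal sketch (all names invented, SQLite used for convenience) of a central fact table keyed to small dimension tables, with one summary rolled up over a time dimension:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Star schema: a central fact table surrounded by dimension tables.
db.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INT, month INT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_id INT, product_id INT, qty INT, revenue REAL);

    INSERT INTO dim_date VALUES (1, 2012, 3), (2, 2012, 4);
    INSERT INTO dim_product VALUES (10, 'widgets'), (11, 'gadgets');
    INSERT INTO fact_sales VALUES (1, 10, 5, 50.0), (2, 10, 3, 30.0),
                                  (2, 11, 1, 99.0);
""")

# A pre-computed summary over the (year, month) dimension -- the kind of
# rollup the answer mentions. Many such summaries, over different
# dimensions or timeframes, may coexist alongside the detail rows.
db.execute("""
    CREATE TABLE sales_by_month AS
    SELECT d.year, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales f JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY d.year, d.month
""")

for row in db.execute("SELECT * FROM sales_by_month ORDER BY month"):
    print(row)  # (2012, 3, 50.0) then (2012, 4, 129.0)
```

Note that once details are rolled up this way, queries that need a dimension dropped from the summary (say, per-product revenue by month) must go back to the fact table, which is exactly the reporting limitation described below.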

The disadvantages of this technique are that the snapshot is generally not 100% up to date, and you have to be very careful about how the snapshot is taken or you can introduce discrepancies beyond mere staleness into your data warehouse. Another possible issue is that some kinds of reporting become difficult because of the choices made when rolling details up into your summary tables. Also, if your data warehouse keeps multiple summaries over different timeframes, for example, you have to be careful to keep these consistent with one another.

The timeliness issue in particular is one you have to be careful about. I've seen users make a change to their online system and then get angry that it didn't show up right away in a report that is run against the data warehouse. Users tend not to know or care about the vagaries of reporting systems.

Joel Brown has summed up the nature of a data warehouse. I'll add something here about reporting requirements.

What is a data warehouse good for

A data warehouse is good for analytical reports where you want to calculate aggregates, trends or other statistical or financial metrics over a large volume of data. Generally a periodic load is best for this sort of application. Near-realtime loads for a data warehouse have two major drawbacks:

They're much, much more complex to implement.

Having financial data that changes from moment to moment can generate a whole class of spurious reconciliation bunfights. Unless you actually have a requirement for real-time data (see below), it's often better to have a system that can report a fixed as-at position that doesn't generate any ambiguity.

Operational vs. Analytic reports

If you have genuine realtime requirements then you have a requirement for an operational report. Generally, reports with genuine realtime requirements are not analytic in nature. Typically they are logs of work done, exception reports, or to-do lists, where the user needs to be able to do some work on the system and periodically re-run the report to see whether anything outstanding still needs attention. Usually, they are also tied to a specific process on a specific operational system.

Reports of this type can usually be run off the base operational system, or a replicated copy of it, and tend not to involve large volumes of data. In most cases that is where they are best run.

Real time aggregate reporting

Occasionally people do have genuine requirements for analytics on data with low latency (near-realtime). Some examples of such requirements are:

Market data analytics where you have ticker data loaded directly to a trading floor system.

Accounts reporting systems where the users need to journal an entry and then report on it straight away.

Web server log analytics.

For a typical enterprise data warehouse application, near-realtime feeds are almost always an anti-pattern as they greatly increase the complexity, generate moving targets for reconciliation and don't actually fulfil a genuine requirement. In this case batched loads are generally the way to go.

One characteristic of realtime reporting systems is that the data being reported on is almost always quite simple (market data feeds, accounting journals, web server log entries). This simplicity is what makes realtime reporting feasible. Trying to get clever and do it on an enterprise-wide scale is going to land you in a world of pain to achieve something for which there is almost certainly not a genuine requirement.
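Because the records in such feeds are flat and self-contained, a low-latency aggregate can be maintained incrementally in memory rather than by re-querying a warehouse. A sketch (the record shape is invented) over web-server-log-style events:

```python
from collections import Counter

class HitCounter:
    """Near-realtime aggregation over a simple event stream: each record
    is flat and self-contained, which is what makes incremental rollup
    cheap -- no joins, no history, just a running tally."""

    def __init__(self):
        self.by_status = Counter()

    def ingest(self, record):
        # record: a dict parsed from one web server log line (hypothetical shape)
        self.by_status[record["status"]] += 1

    def report(self):
        # Current totals, available immediately after each ingest.
        return dict(self.by_status)

stream = [
    {"path": "/", "status": 200},
    {"path": "/missing", "status": 404},
    {"path": "/", "status": 200},
]

agg = HitCounter()
for rec in stream:
    agg.ingest(rec)

print(agg.report())  # {200: 2, 404: 1}
```

Contrast this with an enterprise warehouse fact table: there, a single figure may depend on joins across many slowly changing sources, so an incremental always-current rollup like this is no longer simple.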

Separate operational and analytic reporting requirements

Operational and analytic reporting requirements are almost always in direct conflict with each other, and you're almost always better off managing them separately. Run your operational reports off a replicated copy of your production database and use a data warehouse for your analytic reports.

Unless you have a genuine requirement for near-realtime analytics, this is by far the best approach.

+1 for nice description of the application of data warehouses and the distinction between operational and analytic reporting. "Spurious reconciliation bunfights" is worth +1 by itself. – Joel Brown, Apr 14 '12 at 17:49