Optimal Data Architecture for Clinical Data Warehouses

While electronic data capture brought modern computer front ends to clinical trials, back-end processing of clinical data remains relatively medieval. Trial designs for new products evolve rapidly, forcing downstream analysis teams to stovepipe each uniquely structured source data set into separate schemas and hand-adapt the extract, transform and load (ETL) code. Because this tactical approach is cumbersome, time-consuming and non-repeatable, it does not scale well, and many clinical data warehouse initiatives by even the biggest pharma/biomed companies implode under the pressure of mushrooming costs and elusive ROI.

The solution to warehousing the ever-changing structures of clinical data will require 1) a single, multi-trial repository based upon abstract rather than directly representational data models, and 2) adaptive, meta data-driven ETL programming.

Such an architectural shift will streamline the acquisition process for new trials and greatly simplify validation. By providing multi-trial data integration by design, this adaptive repository will also allow R&D companies to shorten the length of many clinical trials, even eliminating some altogether, greatly decreasing time-to-market in an industry where new products can generate millions of dollars of revenue per day.

If trial designs were ever finalized, this mission might be achievable; however, clinical research evolves constantly. Figure 1 depicts the typical clinical trial process and highlights four common types of revised requirements that force IT to repeatedly reengineer even a single-trial data repository.

Significant change occurs between trials as well, as study teams invent new case report form (CRF) elements and technological advances introduce new ways to measure product safety and efficacy. Figure 2 depicts the pervasive impact that any one of these changes can wreak upon nearly every component of a CDW.

Figure 2: Pervasive Impact Change Has on CDW

With such erratic source data, low-maintenance, multi-trial clinical data warehouses are beyond the capabilities of traditional database techniques, yet life science firms must have such nimble CDWs to compete. Developing the target data structures and transformation scripts for a single trial can take weeks, delaying the revenue stream that new products promise. Even for a moderately sized life science company, a CDW whose structures and code can be simply reused could generate more than $20 million in revenue by eliminating this delay.

Additionally, if the CDW held multiple trials with accurate business meta data, biostatisticians could aggregate representative patient populations from previous trials and substitute these aggregates for real patient data in regulatory submissions. At more than $3,000 per patient, reducing a Phase III trial population from 5,000 to 2,500 represents more than $7 million in savings on one trial alone.

Furthermore, data mining with a multi-trial CDW could generate statistical evidence of a product's superior safety or efficacy for specific subpopulations of patients, enabling its marketing department to better define medical niches to dominate, thus enhancing margins and prolonging product life.

FIRST-GENERATION CDWs FALL SHORT

Seeking such high potential ROI, clinical IT organizations have typically tackled the challenges of CDW through what can be called a first-generation (1G) approach. Born out of classic relational database techniques, these 1G repositories include a separate table for every CRF in the actual trial.

The disadvantages of this approach are many. First, without some form of target schema automation, every column in the source requires at least one handcrafted target column and a concomitant block of ETL code. Designers must also know of these required columns in advance, leading to a painstaking requirements gathering effort which greatly delays the repository's go-live date.

Second, because new CRF elements can be accommodated only with considerable effort once the trial commences, IT will be slow to respond and might even "push back" when study teams wish to redesign their forms. Third, the differences in target schemas for non-equivalent trials make cross-trial data integration difficult, impairing valuable data mining. Finally, so much labor is required to build and validate each trial's repository that IT teams forego many of the advanced features CDWs should have, such as recording scientific meta data and tracking data subsets used in publications.

LEAPING HURDLES WITH EAV

Given such disadvantages of repositories based upon directly representational data models, IT should refocus its efforts to base the CDW upon an abstraction of the trial data. Figure 3 presents how an "adverse events" (AE) record would appear once stored in such a meta model. Whereas in a traditional third normal form (3NF) model each attribute of the AE entity spawns a distinct column, in the meta model, each attribute has been pivoted down into a separate row. This technique is called "row modeling," and the resulting schemas are known as "EAV" models because their tables consist mostly of three columns: entity, attribute and value.

Figure 3: How an AE Record Appears When Stored in a Meta Model

EAV schemas are in fact only a slight extension of the name-value pair paradigm used in many Internet applications today. With this in mind, the function of the attribute and value columns is easy to understand. The entity column is added to record the source table (CRF) from which each observation originates. Accordingly, a trial with scores of separate source tables may very well land in an EAV repository with only a handful of target tables.
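The pivot described above can be sketched in a few lines of Python. This is a minimal illustration of row modeling, not a production loader; the CRF field names and key columns are assumptions chosen for the example.

```python
# A minimal sketch of "row modeling": pivoting one 3NF adverse-events
# record into EAV (entity, attribute, value) rows. Field names are
# illustrative, not a real CRF layout.

def to_eav(entity, record, keys):
    """Shred one source record into EAV rows, carrying the key columns along."""
    key_part = {k: record[k] for k in keys}
    return [
        {**key_part, "entity": entity, "attribute": attr, "value": str(val)}
        for attr, val in record.items()
        if attr not in keys
    ]

# One row from a hypothetical AE case report form (3NF shape).
ae_record = {
    "trial": "T-001", "site": 12, "patient": 3047,   # stable business keys
    "event": "headache", "severity": "mild", "onset": "2004-02-10",
}

rows = to_eav("AE", ae_record, keys=("trial", "site", "patient"))
for r in rows:
    print(r)
```

Each non-key column becomes its own row, which is why scores of source tables can share a single narrow target table.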

Veteran database designers will immediately suspect an EAV model's scalability and performance because each end-user query must unfold many EAV rows back into every record in the result set. Circumventing such performance issues for clinical data entails architectural adaptations in three areas - data, applications and process.

In data architecture, clinical EAV models stop short of row modeling all columns. The keys that would have linked the repository's tables if modeled in third normal form are left in 3NF to facilitate quick subsetting of the records. Unlike the attributes representing the observations in the patient data, these keys are dictated by high-level business considerations that rarely change. For example, every trial has sites, investigators, patients and visits.

By preserving the 3NF model's keys, the EAV data warehouse can also employ range and hash partitioning (common features of a contemporary DBMS) to accelerate loading and retrieval. Even a clinical trial with 5,000 patients and 1,000 observations each amounts to less than 100MB when loaded into a CDW; therefore, range partitioning on trial number and hash partitioning on entity works exceedingly well.

The EAV data architecture also pays off handsomely by permitting automated, meta data-driven ETL designs. A procedure can be written to scan the tables of a new trial, confirming that it can find the static key attributes that designers have recorded in a meta data table. After this scan, all columns listed in the meta data table will be transferred to a distinct column in the meta model - all other columns get "shredded" into the rows of the EAV model.
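The scan-then-shred procedure just described can be sketched as follows. This is a hedged illustration under assumed names: the key list stands in for the meta data configuration table, and the source rows are hypothetical.

```python
# A sketch of the meta data-driven load: the loader confirms the expected
# key columns exist in each new source table, transfers them as-is, and
# shreds every remaining column into EAV rows. Names are illustrative.

KEY_METADATA = {"trial", "site", "patient", "visit"}   # the "configuration table"

def load_table(entity, source_rows):
    """Validate the static keys, then shred all other columns into EAV rows."""
    if not source_rows:
        return []
    missing = KEY_METADATA - set(source_rows[0])
    if missing:
        raise ValueError(f"source table {entity!r} lacks key columns: {missing}")
    target = []
    for rec in source_rows:
        keys = {k: rec[k] for k in KEY_METADATA}
        for attr, val in rec.items():
            if attr not in KEY_METADATA:
                target.append({**keys, "entity": entity,
                               "attribute": attr, "value": str(val)})
    return target

vitals = [{"trial": "T-001", "site": 12, "patient": 3047, "visit": 2,
           "pulse": 72, "systolic": 118}]
print(load_table("VITALS", vitals))
```

Note that a new source column (say, a new vital sign) requires no code change at all; it simply becomes another EAV row, while a new business key would be handled by updating `KEY_METADATA`.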

As new CRF items arise during or between trials, the EAV data warehouse no longer has to be retooled - new columns simply become new rows. If a new business key column should appear (perhaps due to a merger or acquisition), it will need to be added to the EAV tables with a default value provided; however, going forward, IT must only add the new key column to the CDW's meta data tables to have the ETL routines automatically adapt.

With such a meta data-driven EAV warehouse, maintenance no longer involves the laborious redesign and reloading of multiple tables, nor the re-coding and re-validation of intricate ETL code. Instead, the data architecture of the "second-generation" CDW has reduced maintenance to simply updating a configuration table.

The validation work for subsequent trials then shrinks to simply confirming that untouched CDW code still responds to the meta data tables as expected, further improving the time-to-market for the company's new products.

BETTER PROCESSES, LOWER RISK

One note of caution: end users will need to unfold the data stored in an EAV repository back into 3NF before they can use it with their typical analysis tools such as SAS. This unfolding proves a difficult task for ordinary SQL. Though one could employ the DBMS's procedural language, a better solution would be to use a business intelligence (BI) front-end tool to a) subset the data (using the key columns that CDW leaves in 3NF), and then b) pivot out the desired columns of trial data.
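The pivot a BI tool performs in step b can be sketched like this. The sketch assumes in-memory rows with illustrative key and attribute names; a real tool would, of course, operate on a subset already filtered via the 3NF key columns.

```python
# A minimal sketch of unfolding EAV rows back into wide, 3NF-style
# records for analysis. Keys and attribute names are illustrative.

from collections import defaultdict

def unfold(eav_rows, keys):
    """Group EAV rows by their key columns and pivot the attributes back out."""
    wide = defaultdict(dict)
    for row in eav_rows:
        key = tuple(row[k] for k in keys)
        wide[key][row["attribute"]] = row["value"]
    return [dict(zip(keys, key), **attrs) for key, attrs in wide.items()]

eav_rows = [
    {"trial": "T-001", "patient": 3047, "entity": "AE",
     "attribute": "event", "value": "headache"},
    {"trial": "T-001", "patient": 3047, "entity": "AE",
     "attribute": "severity", "value": "mild"},
]
print(unfold(eav_rows, keys=("trial", "patient")))
# → [{'trial': 'T-001', 'patient': 3047, 'event': 'headache', 'severity': 'mild'}]
```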

This alternative approach is more practical because BI tools are user friendly, having been designed for business analysts. Employing these tools improves the business architecture because journeyman BI developers can now prepare the trial's analytical data sets - a vexing task formerly performed by biostatisticians. Freeing up these senior staff members (often Ph.D.s) greatly benefits the organization by allowing them to focus on their far more valuable data analysis duties.

Additionally, IT can employ the same repository and BI tool to provide extremely valuable trial progress metrics for trial managers, eliminating altogether what is commonly an entirely separate database application.

Perhaps most importantly, advancing to an EAV data architecture also drastically reduces the risks of clinical data warehousing by minimizing the application's complexity. Scores of tables and volumes of handcrafted code are eliminated by an adaptive repository coupled with a GUI-based BI tool. Because CDW costs and labor requirements grow exponentially with complexity, simplifying the design avoids the exhaustion of the organization's finances and schedule, the two most common reasons CDW projects get cancelled.

BUYER BEWARE

Taken together, the concepts presented form a powerful framework for evaluating commercially offered CDW applications. The CDW marketplace is in its infancy, and the applications by even some of the biggest vendors provide little more than a SAS workspace with an audit trail.

While this minimal platform may facilitate compliance with the FDA's 21 CFR Part 11 regulations, it does little to accelerate the clinical data value chain or offload data management work from biostatisticians. One might then call these products CDW's "zero generation," simply to give them a place within the framework.

In the middle of the spectrum are vendors offering 1G CDW applications with true clinical target schemas for patient data along with 21 CFR Part 11 support. While these products do allow for some process and organizational improvements, they fall short because the separate data schemas and ETL procedures must be adapted and validated for each trial. At the frontier, we find a handful of small vendors with true 2G CDWs that offer both automated analysis of the source systems and a single, multi-trial target repository. This approach offers the greatest reduction in time-to-market and maximizes IT support for biostatisticians.

Using this framework, clinical research organizations can quickly narrow their search from hundreds of proffered products to the few commercial applications that can currently achieve the full potential of clinical data warehousing.

There will still be some important gaps to fill, however, such as data exchange using industry-standard XML schemas (CDISC), interfacing with document management systems and connecting SAS to Microsoft Word for final FDA submission documents.

With these gaps, clinical R&D organizations could equally justify buying or building the CDW; but either way, insisting upon a second-generation architecture will eliminate several project iterations and shave years off delivering a true clinical data warehouse and realizing the enormous financial gains it makes possible.