Friday, April 29, 2011

Vendor Neutral DaaS Architecture

Data-as-a-Service (DaaS) aggregates, cleans, normalizes, packages, secures, customizes, and formats data for use by customers. This blog post presents a vendor-neutral design of a DaaS implementation.

A DaaS implementation will include some or all of the following functions:

Aggregation,

Cleansing and normalization,

Packaging,

Security, and

Formatting and customization.

While a particular business may focus on only a few of these functions, a good DaaS design will be able to absorb changes and new services in all of them.

Role of Excel

Gathering the data is the starting point for DaaS. In any DaaS business, an ever-increasing data set will increase the business's value and number of potential customers. One way to get up and running with an implementation quickly is to use Excel as a primary means of inputting data into the system.

Ideally, data comes into the system through well-defined interfaces (web services, file transfers, etc.). However, when gathering a new dataset -- say from a collection of PDFs -- there may not be an application written to support the intake. A spreadsheet is a viable alternative because it can be worked on iteratively and archived for historical record and audit.

RDBMS

A data store is recommended to warehouse the data coming into the system. An RDBMS is one possible choice. The schema defined in the RDBMS will govern the business; every workbook, whatever its source, must conform to that schema.

This data store is also important because it provides a logical and runtime division between the intake function (data loading processes) and the presentation (like XML-rendering jobs).
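As a rough illustration of a schema that sits between intake and presentation, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the RDBMS. The table and column names (data_provider, data_load, measurement) and the create_store helper are hypothetical and not taken from any particular implementation.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE data_provider (
    provider_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE data_load (
    load_number INTEGER PRIMARY KEY AUTOINCREMENT,
    provider_id INTEGER NOT NULL REFERENCES data_provider(provider_id),
    source_file TEXT NOT NULL,
    loaded_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    load_number    INTEGER NOT NULL REFERENCES data_load(load_number),
    item_code      TEXT NOT NULL,
    value          REAL NOT NULL
);
"""


def create_store(path=":memory:"):
    """Open the data store and create the tables that workbooks must conform to."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Intake processes would only write to these tables and presentation jobs would only read from them, which is what gives the two sides their runtime independence.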

Distinct Presentations

Presentation includes the connection, transport, and formatting of the data. XML or JSON rendered from a SQL query and served as files from a web server is one example of presentation. More dynamic presentations, such as parameterized reports, analysis tools, and queryable web services, are others. As much as possible, new external interface requirements should be handled by an ever-expanding set of presentation jobs. Don't try to chain several presentation jobs together, as they're likely to change at different times.
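One way to picture independent presentation jobs is a small registry in which each job renders its own format straight from a SQL query, with no job consuming another's output. The sketch below assumes a hypothetical measurement table with item_code and value columns; the function names and output file names are also made up for illustration.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def fetch_rows(conn):
    """Run the shared query; each job formats the result in its own way."""
    cur = conn.execute("SELECT item_code, value FROM measurement ORDER BY item_code")
    return [{"item_code": code, "value": value} for code, value in cur]


def render_json(conn):
    return json.dumps(fetch_rows(conn), indent=2)


def render_xml(conn):
    root = ET.Element("measurements")
    for row in fetch_rows(conn):
        element = ET.SubElement(root, "measurement", item_code=row["item_code"])
        element.text = str(row["value"])
    return ET.tostring(root, encoding="unicode")


# Each entry is a standalone job: adding a format means adding an entry,
# not chaining onto an existing job.
PRESENTATION_JOBS = {
    "export.json": render_json,
    "export.xml": render_xml,
}


def run_jobs(conn, out_dir):
    """Write one file per job into out_dir (e.g. a web server's document root)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, job in PRESENTATION_JOBS.items():
        target = out / filename
        target.write_text(job(conn), encoding="utf-8")
        written.append(target)
    return written
```

Because both jobs read from the database rather than from each other, the XML layout can change without touching the JSON export.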

Data Flow Diagram

The following data flow diagram shows a DaaS implementation divided into two parts: Intake and Presentation.

A DaaS Architecture

The intake function will load Excel workbooks after looking up a data provider record and assigning a load number to the activity. Business logic that validates the workbook is applied, and any rejects are handled by failing the entire data load or writing out erroneous records to a suspense file (to be loaded later).
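The intake steps might look something like the following sketch. It assumes the workbook has already been read into a list of dictionaries (reading the Excel file itself is left out), and it assumes three hypothetical tables: data_provider(provider_id, name), data_load(load_number, provider_id), and measurement(load_number, item_code, value). The validation rules, the LoadRejected exception, and the load_rows function are illustrative stand-ins for real business logic.

```python
import csv
import sqlite3


class LoadRejected(Exception):
    """Raised when the whole data load is failed."""


def validate(row):
    """Return an error message for a bad row, or None if the row is acceptable."""
    if not row.get("item_code"):
        return "missing item_code"
    try:
        float(row.get("value"))
    except (TypeError, ValueError):
        return "value is not numeric"
    return None


def load_rows(conn, provider_name, rows, suspense_path=None):
    """Load rows for a provider and return (load_number, loaded, rejected).

    With no suspense_path, any reject fails the entire load. With one,
    rejects are written to that CSV file so they can be fixed and loaded later.
    """
    # Step 1: look up the data provider record.
    provider = conn.execute(
        "SELECT provider_id FROM data_provider WHERE name = ?", (provider_name,)
    ).fetchone()
    if provider is None:
        raise LoadRejected(f"unknown data provider: {provider_name}")

    # Step 2: apply the validation rules to every row.
    good, rejects = [], []
    for row in rows:
        error = validate(row)
        if error:
            rejects.append({**row, "error": error})
        else:
            good.append(row)
    if rejects and suspense_path is None:
        raise LoadRejected(f"{len(rejects)} invalid row(s); load abandoned")

    # Step 3: assign a load number and insert the good rows in one transaction.
    with conn:
        cur = conn.execute(
            "INSERT INTO data_load (provider_id) VALUES (?)", (provider[0],)
        )
        load_number = cur.lastrowid
        conn.executemany(
            "INSERT INTO measurement (load_number, item_code, value) VALUES (?, ?, ?)",
            [(load_number, r["item_code"], float(r["value"])) for r in good],
        )

    # Step 4: park the rejects in the suspense file.
    if rejects:
        with open(suspense_path, "w", newline="") as fh:
            writer = csv.DictWriter(
                fh, fieldnames=["item_code", "value", "error"], extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(rejects)
    return load_number, len(good), len(rejects)
```

Whether to fail the whole load or divert rejects is a per-provider policy decision; here it is reduced to whether the caller supplies a suspense file path.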

Once the data is in the database, the presentation jobs -- tailored to each format -- are run. This may happen immediately after a data load or at a convenient time.

The state of the processing can be studied by running reports against the RDBMS.
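A status report of this kind can be a single query. The sketch below assumes hypothetical tables data_provider(provider_id, name), data_load(load_number, provider_id), and measurement(measurement_id, load_number), and lists each load with its provider and the number of rows it brought in.

```python
import sqlite3

# Hypothetical table and column names; adjust to the real schema.
REPORT_SQL = """
SELECT p.name, l.load_number, COUNT(m.measurement_id) AS row_count
FROM data_load AS l
JOIN data_provider AS p ON p.provider_id = l.provider_id
LEFT JOIN measurement AS m ON m.load_number = l.load_number
GROUP BY p.name, l.load_number
ORDER BY l.load_number
"""


def load_report(conn):
    """Return (provider name, load number, row count) for every load."""
    return conn.execute(REPORT_SQL).fetchall()
```

The LEFT JOIN keeps loads that produced no rows in the report, which is often exactly the case an operator wants to see.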

It's important for a DaaS implementation to anchor onto a data store that is separate from both the intake function (Excel workbook) and the output (XML, JSON). This allows for operational flexibility in adjusting the load schedule, decoupling it from the presentation. The central data store is also a hub to which many different pieces can be added without depending on each other.