Database Elaborations

Database Elaborations explores the human side of building and managing databases. The discussion revolves around data modeling as a semantic process, and how normalization only functions within a business rule context. Communication is hard; the meaning of everything is more fluid than we might wish to admit. Large gaps between logical versus physical or theory versus implementation are always worth reflection.

Data governance is a touchy subject. Many organizations have tried to build a practice…and failed. Other organizations feel too daunted to even try. IT in particular has a very conflicted relationship with data governance.

Petabytes, and more, of data are meaningless without guides to what it all is. Future data users need to know where it came from, how it was filtered, and how it was transformed. Without some functional level of governance we are headed toward a "datapocalypse"—having lots of content but little of value.

The objects and relationships identified within any database design need to match with the very subjective realities considered true within a given organization. An optimal data model must incorporate the semantics of the business and provide for a structural arrangement supporting the necessary corporate instrumentality. And that model should do so, even if in the details things within the model are contrary to what other organizations may consider to be the nature of their reality.

How does an organization acknowledge that data is important? An organization does so by enabling and supporting efforts for gathering and persisting information about the organization's data resources.

In information services, providing a solution that removes a user's pain can be a great boost to one's psyche. Being the hero, wins trust, and may even enhance job security. However, organizationally, too many sideline heroics create chaos. The day after the crisis has passed, things are left in a less than wonderful state.

In the early days of data warehousing, lines were very simple. The data warehouse reflected the business. In providing this reflection, data was summarized, data was cleansed, data was standardized, and data was even drastically reformatted for legibility and reporting usage. But, the big rule of thumb, the never-to-be-crossed line, was that the data warehouse did not create new data.

Many designers seem to approach a database design as if there is a single objective reality that circumscribes all truth and that happens to align with exactly how they see the world. Honestly, is it expected that there is an absolute truth for everything? And that single absolute truth goes from the highest level of abstraction to the lowest level of detail?

Shunryu Suzuki, in his classic book, Zen Mind, Beginner's Mind, tells us that in the beginner's mind there are many possibilities, but in the expert's mind there are few. Both a beginner's and an expert's perspective have value, but bias toward either an expert or a beginner approach seems to consume an organization far too often.

Big data tool vendors try to downplay the notion that data warehouses and data marts still need to exist, even in a big data world. Relational DBMSs are painted as "old-fashioned," "yesterday," and "inadequate." They beckon potential customers to take a dip in the refreshing data lake. The fact that big data, in all of its glory, is only part of a larger business intelligence solution is getting lost in the dialog.

It is fair to say that relational theory is the only solid framework for establishing a rational expression of data that falls anywhere inside the boundaries of formal logic. As people continue to laud the "death of relational" by coming up with one or other "new" physical implementations of coding or data engines, from object-oriented, XML, columnar, or anything else one might name, the primary short-coming is that these are physical implementations that avoid having any formalized logic underpinning them.

"On occasion, the inmates actually do take over the asylum. Similarly, business personnel can be defeated by their IT peers," writes Todd Schraml. Although some might argue that such an unexpected role reversal cannot happen, it may indeed occur when data architects skilled in their craft can identify the options that help achieve the business goals, versus those alternatives that stifle the business goals. "A good data architect must also be a very good business analyst," he notes.

Commitment to the creation of a business intelligence environment means embracing a well-measured and thoughtful allocation of resources with the right skills, the right tools, the right scope, and the right timelines

How does one tame the data beast to start making tangible progress toward an organization's business intelligence goals? For any progress to occur on the business intelligence-side, tasks, projects and goals need to be fashioned into small steps along a much larger path.

Finding the means to harmonize conflicting processing requirements is where a design comes alive. When done well, providing symmetry to a suite of processes greatly empowers those who develop and maintain those processes; changes and enhancement arise seemingly in the blink of an eye. In ETL, there are three key principles to driving exceptional design.

In many organizations, users find it hard to trust their own internal information technology (IT) group, leading them to try any possible option to solve problems own their own. The resulting stealth IT projects can lead to confusion or even complete chaos.

Changes and enhancement to solutions are hard, even under the best of circumstances. It is not usual that, as operational changes roll out into production, the business intelligence area is left uninformed, suggesting that data warehouses and business intelligence be categorized according to the view of the old comedian Rodney Dangerfield because they both "get no respect."

In the world of database products, trends seem to pop up quickly, like prairie dogs testing the air. However, regardless of the marketing jargon, or platform, vendor-product purchased, or open source utility downloaded, the one thing that remains unaltered is that in order to extract value from data, the data must be understood. The individuals within any organization who actually comprehend the data, the data structures, and all the exceptions to the usual rules are individuals who are considered critical resources.

Changes to database structures should be performed in a coordinated fashion as the application processes that support the new functionality are rolled out into production. While the "work" involved in adding a column or a table to a relational database is actually minimal, often there are circumstances where developers and DBAs create additional columns and additional tables in anticipation of future needs. Sadly, this "proactive" effort results in databases littered with half-formed ideas, fits-and-starts, and scattered-about columns and tables that provide no meaningful content.

Database management systems support numerous unique date and time functions - and while the date-related functions are many, they do not go far enough. One date-driven circumstance often encountered has to do with objects having a type of date range that needs to be associated with it. While there are some exceptions, this date range need generally ends up implemented via two distinct date columns—one signaling the "start" and the other designating the "end." Maybe, should the creative juices of DBMS builders' flow, such things as numeric-range-datatypes could be created in addition to a date-range data-type. Who knows where things could end up?

Data models attempt to express the business rules of an organization. A good data model reflects the semantics used within an organization to such an extent that business people within that organization can relate to and easily agree with what is being expressed. In this regard the data modeler's goal is to properly mirror back the organization's concepts onto those people within the organization. The goal is not to force an organization into a "standard" data model, nor is the goal to abstract everything in the creation of a master model that will never need to change even if the business rules were drastically shifted.

One of the principles within relational theory is that each entity's row or tuple be uniquely identifiable. This means the defined structure includes some combination of attributes whose populated values serve to identify an individual row within the table/relation. This, or these, attribute(s) are the candidate key(s) for the structure. The candidate key is also known as the primary key, or if a structure has multiple candidate keys, then one of them is designated as the primary key. When building up a logical design, primary keys should be identified by the actual data points in play.

The grain of a fact table is derived by the dimensions with which the fact is associated. For example, should a fact have associations with a Day dimension, a Location dimension, a Customer dimension, and a Product dimension, then the usual assumption would be for the fact to be described as being at a "by Day," "by Location," "by Customer," "by Product" metrics level. Evidence of this specific level of granularity for the fact table is seen by the primary key of the fact being the composite of the Day dimension key, Location dimension key, Customer dimension key, and Product dimension key. However, this granularity and these relationships are easily disrupted.

It seems that juggling is the most useful of all skills when embarking on a data warehousing project. During the discovery and analysis phase, the workload grows insanely large, like some mutant science fiction monster. Pressures to deliver can encourage rampant corner-cutting to move quickly, while the need to provide value urges caution in order not to throw out the proverbial baby with the bath water as the project speeds along. Change data capture is one area that is a glaring example of the necessary juggling and balancing.

Dimensions are the workhorses of a multidimensional design. They are used to manage the numeric content being analyzed. It is through the use of dimensions that the metrics can be sliced, diced, drilled-down, filtered and sorted. Many people relate to dimensions by thinking of them as reference tables. Such thoughts aren't exactly accurate. A dimension groups together the textual/descriptor columns within a rationalized business category. Therefore, much of the content coming from relational tables may be sourced from reference tables, but the relationship between each source reference table and the targeted dimension is unlikely to be one-for-one. These grouped-format dimensions often contain one or more hierarchies of related data items used within the OLAP queries supported by the structures.

Do not allow well-meaning but confused proponents to obscure concepts related to normalization and dimensional design. Under a normalized approach one usually would not expect for numeric data items and textual data items to fall into different logical relations when connected to the same entity object. Yet within a multidimensional approach that is exactly what happens. Multidimensional design and normal design are not the same, and one should not expect to claim that both approaches were used and that they resulted in the same data model.

Establishing a data warehousing or business intelligence environment initiates a process that works its way through the operational applications and data sources across an enterprise. This process focuses not only on identifying the important data elements the business lives and breathes, but the process also tries very hard to provide rationality in explaining these elements to business intelligence users.

Multi-dimensional design involves dividing the world into dimensions and facts. However, like many aspects of language, the term "fact" is used in multiple ways. Initially, the term referred to the table structure housing the numeric values for the metrics to be analyzed. But "fact" also is used to refer to the metric values themselves. Therefore, when the unique circumstances arise wherein a fact table is defined that does not contain specific numeric measures, such a structure is referred to by the superficially oxymoronic characterization of a "factless fact."

Within the information technology sector, the term architect gets thrown around quite a lot. There are software architects, infrastructure architects, application architects, business intelligence architects, data architects, information architects, and more. It seems as if any area may include someone with an "architect"status. Certainly when laying out plans for a physical building, an architect has a specific meaning and role. But within IT "architect" is used in a much fuzzier manner.

In writing a definition for an entity, an attribute, or any other element within a database design, the desired end is a descriptive text that is clear, factual and concise. Semantics are an ambiguous and often painful tool to employ. Balancing the need for clarity against the desire to avoid redundancy can be a juggling act that is hard to accomplish. One might not easily recognize what is complete versus what is lacking, versus what has gone too far. But even so, within a definition if one finds oneself listing valid values and decoding the value's meaning, then one has likely already moved beyond what is "concise." Lists of values easily add bulk and quantity of verbiage into a definition, yet such lists do not usually increase the quality of a definition.

The beauty of a truly wonderful database design is its ability to serve many masters. And good database designers are able to empathize with those who will use their designs. In business intelligence settings, three perspectives deserve consideration when composing designs.

It seems easy to fall into a state where projects and activities assume such soft-focus that routine takes control, where one simply does necessary tasks automatically, no questions are raised regarding what is moving through the work-life production line and everyone is essentially asleep at the switch. Certainly, we may have one eye open ensuring that within a broad set of parameters all is well, but as long as events are basically coloring inside the borders we continue allowing things to just move along. In this semi-somnambulant state we can easily add columns to tables, or even add new entities and tables, or triggers and procedures to our databases, then eventually at some point down the road have someone turn to us and ask, "Why this?" or, "What does this really mean?" And at that point, we surprise ourselves with the discovery that the only answer we have is that someone else told us it was what we needed, but we do not really understand why it was needed.

A database design may occasionally show evidence that it lacks proper prioritization. Data models should express truths about the business, or about the universe of discourse. But in expressing business truth this does not mean a data model should express absolutely every truth that anyone might conceive. Some relationships are significant while other relationships are not. And as a general rule, database design is not an exercise in trivial pursuit. Insignificant truths only clutter up a design, increasing complexity, causing users' eyes to glaze over more quickly, and adding no real value towards the endeavors of the enterprise.

The whole world can be divided into two groups, these being splitters and lumpers. Design battles are waged across conference rooms as debates rage over whether to split or to lump. Splitters take a group of items divide them up into sub-groups and sub-sub-groups occasionally going so far as to end with each lowest level becoming a group of one. On the other side of the design fence, lumpers combine items until everything is abstracted into group objects covering very broad territory, such as a "Party" construct, or ultimately an "Object" object. Within data modeling, arguments arise, such as whether to sub-type an entity. Or perhaps lumping is discussed as the grain of a multidimensional fact is proposed. This debate underlies much of the decision-making involved in determining what domains to create within a data model. The split-versus-lump issue is ubiquitous and universal. The question to split or lump arises across many kinds of choices, in addition to the entity definition, table grain, or the domain grain mentioned in the previous examples; this issue is at the heart of deliberations regarding establishing functions, overriding methods, or composing an organizational structure.

In the dim, dark past of data warehousing, there was a time when the argument was put forward that "history does not change." It was posited that once a piece of data was received by the data warehouse, it was sacrosanct and nonvolatile. A fact record, once processed, was to remain unchanged forever. Dimensions, due to their descriptive nature, could be changed following the prescribed Type 1, 2, or 3 update strategies, but that was all. It was the expectation that due to their very nature, fact tables would become huge and in being huge would give poor update performance; performance so poor that updates would be virtually impossible to enact.

It seems only reasonable that what one person can do, others can learn. On the other hand, taking people through training does not usually result in the creation of great new database administrators (DBAs). It often appears as if those who are exceptional at the craft operate at higher levels as they dive into a problem. Can training alone provide folks with the attention to detail, the urge to keep digging, or the ability to recall minutiae that allow them to rise from simply holding the DBA title to becoming someone who is a great DBA? Or must the genetic potential exist first, and then one might fall into the DBA occupation and astound those around them. It is very hard to say with any degree of certainty whether great DBAs are made or born; yet again the battle between nature and nurture arises.

Organizations are populated with solutions entitled DW, EDW, BIC, ODS, EIW, IW, CIF, or BW. Why must every organization have a data warehousing or analytics solution identified by monikers from a very limited pool of choices? Why must every deployment of a database that is expected to function as an operational data store be called ODS? For internal solutions it seems that plain and dreary naming approaches are de rigueur. Two and three letter acronyms have long been a part of corporate-speak; but when it comes to IT systems, these TLAs have become exceedingly narrow and soul-less.

Solution development work is usually accomplished via projects, or a combination of programs and projects. This project perspective often leads to thoughts of documentation as project-owned. And while many documents are project-specific, such as timelines, resource plans, and such, not everything is project-specific. Unless projects are established in a fashion whereby each is very limited in scope to the creation or enhancement of a single application or system, specification and design documents belong to the final solution and not to the project.

Lectures related to master data bring forth all sorts of taxonomies intended to help clarify master data and its place within an organization. Sliding scales may be presented: at the top, not master data; at the bottom, very much master data; in the middle, increasing degrees of "master data-ness." For the longest of times everyone thought metadata was confusing enough ... oops, we've done it again. And, we have accomplished the establishment of this master data semantic monster in quite a grand fashion.

Retaining the particulars of change over time is a fairly intricate configuration. Audit log or shadow tables are sometimes employed, but on occasion there is a need for the "old" and "new" rows to exist in a single operation table for application use. Far too often, the implementation of temporal data structures is shoddy, loose, and imprecise; rather than the fairly complex dance move such temporal arrangements must perform in actuality. The sub-optimal result is much like one's performance of the Funky Chicken at a friend's wedding; the desired moves are mimicked, after a fashion, but it is unlikely to earn high marks on "So You Think You Can Dance." The usual temporal implementation simply slaps on start and stop dates, debates a little over default date values versus NULLs, then moves on to the next subject.

The cost for new development can often be easily justified. If a new function is needed, staffing a team to create such functionality and supporting data structures can be quantified and voted up or down by those controlling resources. Money can be found to build those things that move the organization forward; often, the expense may be covered by savings or increased revenue derived from providing the new services.

It is not magic. Building a successful IT solution takes time. And that time is used in various ways: obtaining an understanding of the goal; mapping out what components are necessary and how those components interact; testing components and their interaction; and finally migrating those components into the production environment - otherwise known as analysis, design, development, testing, and deployment. Regardless of the methodology employed, these functions must always be addressed. Different approaches focus on differing needs and aspects. But any complete methodology must fill in all the blanks for accomplishing each of these tasks.

Every project is, or should be, driven by user requirements. Requirements are the organization's way to articulate what needs to happen in order to provide value. Yet time and again requirements are looked at as something overly technical, mysterious, and too confusing to easily handle. Repeatedly, organizations use a template to ensure that requirements are defined early in the solution process. Sadly, the intended purposes are habitually defeated as these templates are filled with a lack of understanding for what information belongs in a given section, resulting in people creating documents that the authors themselves do not understand. Across many organizations requirements documents are created, reviewed, and even agreed to, that far too often are incoherent monstrosities saying nothing of actual value.

When assembling a database design, one of the keys for success is consistency. There should be more than just similarity in the way things are named, the manner in which tables or groups of tables are constructed; the manifestation of these elements should follow standards and practices that are documented and understood. If one tries to rely on the idea that individual developers will simply look at existing tables and glean standards via osmosis as they add on or create new tables, then one does not actually have any standards at all.

The Broadway tune goes, "The sun will come out tomorrow ... it's only a day away." The words from this optimistic jingle are often heard on IT projects that are overburdened with features and functions. On any project of significant size the list of desired things often becomes larger than the budgeted resources or time. Faced with limiting circumstances, the only option becomes aligning the work effort with the constraints and only doing what fits within those constraints. The items are "timeboxed," and the amount of planned work is exactly the amount of allowed work. Some things remain in, while other features are left out. Alternatively, responsibility for controlling a project may be ignored and the end date arrives with some things simply not completed. Essentially, a project is timeboxed by default.

Naïve approaches to business intelligence will occasionally trap designers as they juggle operational data stores and data warehouses. The trap results from an honest endeavor to simplify designs and increase consistency throughout the solution. Under the umbrella of consistency a designer may plan for a reference table used for operational look-ups to perform a second service as a star schema dimensional table. Some or all reference tables then are declared by fiat to also be dimensions. While on a superficial level there are similarities between dimension tables and more normalized look up or reference tables, fundamentally these two concepts are separate things.

Occasionally, one sees a data structure abomination. This atrocity involves an object of almost any type, in almost any database wherein the object has a start date but no end date. It is not that the finish date currently has no value and is null; it is that the end date does not even exist on the table structure. The stop date was never intended to exist. The object in question starts, but it doesn't ever end.