Channel: Enterprise Information Management - Malcolm Chisholm

Data Management: What is Abstraction?

"Abstraction" is a term that gets thrown around a lot these days in data management, but I for one have had a lot of difficulty understanding what people mean by it. Usually it is not worthwhile interrupting valuable conversations on design and architecture to have a discussion on what someone means by "abstraction.” Such a tangential debate is unlikely to help solve the practical problem at hand and will probably waste a good deal of time. However, there comes a point when it is necessary to understand what "abstraction" really means, and particularly if its definition has significance for design or architectural patterns.

A problem I had for a long time was thinking that abstraction signifies a single concept. I no longer think that it does, and believe that in data management it is used to refer to several quite different concepts. This realization has given me greater confidence in challenging the users of the term "abstraction" to explain to me just what they mean. In this article, I concentrate on three common ways I have found "abstraction" to be used.

Classical Abstraction

The medieval scholastics developed a view of abstraction that they refined from the logic they inherited from the ancient world. This type of abstraction involves forming general concepts from more specific ones. In data modeling speak, it involves a bottom-up methodology to build supertypes from subtypes.

Suppose we have a zoo full of animals that we need to classify. Mammals are defined as warm-blooded air-breathing vertebrates that have hair, sweat glands, a four-chambered heart, and give birth to live young, which they suckle. Birds are defined as warm-blooded, air-breathing vertebrates that lay eggs, have a four-chambered heart, feathers, and scales on their legs and feet. Suppose we have one staff member dedicated to looking after mammals and another dedicated to looking after birds, and that we want to combine these positions so we have just one staff member to care for both mammals and birds. A creative way to do this would be to find a way to group mammals and birds. We could create such a grouping by inventing a new concept that keeps the attributes that mammals and birds share, while eliminating the attributes that they do not share. Let us call this new group of animals Class A animals. Class A animals can be defined as air-breathing, warm-blooded vertebrates that have a four-chambered heart. We choose to eliminate these attributes from consideration: giving birth to live young; suckling the young; having hair; having sweat glands; laying eggs; having feathers; and having scales on the legs and feet.

The process of removing attributes from concepts means that we end up with concepts that have fewer attributes, but which cover a lot more instances in the real world. If we continued the process in the zoo example mentioned above, we would probably end up with the concept of "animal," and that would cover all the specimens housed in the zoo. Traditional logicians recognized the inverse relationship between what they called "intension" or "connotation" (the number of attributes an entity type possesses) and "extension" or "denotation" (the number of instances an entity type covers). The greater the intension, the less the extension, and vice versa.
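As a rough illustration, classical abstraction can be sketched as set intersection over attributes. The attribute sets below come from the zoo example; the specimens (lion, otter, parrot) and the function names are assumptions made up for the sketch, not part of any modeling tool.

```python
# Attribute sets taken from the definitions of mammals and birds in the text.
MAMMAL = frozenset({
    "warm-blooded", "air-breathing", "vertebrate", "four-chambered heart",
    "hair", "sweat glands", "live birth", "suckles young",
})
BIRD = frozenset({
    "warm-blooded", "air-breathing", "vertebrate", "four-chambered heart",
    "lays eggs", "feathers", "scales on legs and feet",
})

# Hypothetical specimens, each described by the attributes it exhibits.
ZOO = {
    "lion": MAMMAL,
    "otter": MAMMAL,
    "parrot": BIRD,
}


def generalize(*concepts):
    """Classical abstraction: keep only the attributes all concepts share."""
    return frozenset.intersection(*concepts)


def extension(concept, specimens):
    """Names of the specimens that have every attribute the concept requires."""
    return sorted(name for name, attrs in specimens.items() if concept <= attrs)


CLASS_A = generalize(MAMMAL, BIRD)      # the supertype built bottom-up
DROPPED = (MAMMAL | BIRD) - CLASS_A     # attributes eliminated along the way

print(len(MAMMAL), extension(MAMMAL, ZOO))    # more attributes, fewer instances
print(len(CLASS_A), extension(CLASS_A, ZOO))  # fewer attributes, more instances
```

In this sketch the mammal concept carries eight attributes and covers two specimens, while Class A carries four attributes and covers all three, which is the inverse relationship between intension and extension in miniature.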

So what does this mean for data management? I think we need to recognize that because classical abstraction exists, we must deal with it. In my opinion, that means that the degree of this kind of abstraction that exists in a data model depends on what the requirements are. At this point, we fall into the long-running battle between those who claim that there is only one view of reality and that a data model must reflect it, and those who claim that there is a large element of design choice in a data model. I think there is truth in both positions, but the fact that classical abstraction exists is a warning that we need to take design very seriously because we have choices.

The Aggregate

Classical abstraction is a form of generalization. There is another style of abstraction that I have seen in data management, which is the formation of aggregates.

Suppose I am hosting a dinner party and wondering about what to provide for dessert. Thinking about my guests, I know that Algernon likes peaches, Emma likes pears, Bertie likes cherries, and Camilla likes fruit salad. In a stroke of brilliance, I decide to make a fruit salad by mixing peaches, pears, and cherries. Now each guest can just help themselves to the fruit salad and pick out whatever is their favorite fruit. In the case of Camilla, she likes fruit salad anyway so there will be no complaints from her.

This is essentially what data modelers do in designs such as the party model. In its extreme form, the party model is a single entity type that holds information about all individuals and institutions that have some kind of relationship with the enterprise (e.g., as customers, employees, vendors, etc.). It is like a fruit salad of data. It is not like classical abstraction because we are not going to throw away attributes. We need to retain as much information as possible in our party master design. What we have, therefore, is an aggregate rather than the generalization of classical abstraction.
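The difference can be sketched in a few lines: where generalization keeps only the shared attributes, an aggregate keeps all of them. The role names follow the examples in the text, but every attribute name and the sample record are hypothetical, chosen only to make the sketch concrete.

```python
# Hypothetical role-specific attribute sets; the names are illustrative only.
CUSTOMER = frozenset({"name", "address", "credit_limit"})
EMPLOYEE = frozenset({"name", "address", "hire_date", "salary_grade"})
VENDOR = frozenset({"name", "address", "payment_terms"})


def aggregate(*concepts):
    """Aggregation: keep every attribute of every concept (set union)."""
    return frozenset().union(*concepts)


def generalize(*concepts):
    """Classical abstraction, for contrast: keep only the shared attributes."""
    return frozenset.intersection(*concepts)


PARTY = aggregate(CUSTOMER, EMPLOYEE, VENDOR)
COMMON = generalize(CUSTOMER, EMPLOYEE, VENDOR)


def as_party_row(record):
    """Fit one record into the party shape; attributes it lacks become None."""
    return {column: record.get(column) for column in sorted(PARTY)}


# A made-up vendor record placed into the single party structure.
row = as_party_row(
    {"name": "Acme Ltd", "address": "1 High St", "payment_terms": "net 30"}
)
unused = [column for column, value in row.items() if value is None]

print(sorted(COMMON))  # what generalization would have kept
print(sorted(PARTY))   # what the aggregate keeps
print(unused)          # columns this vendor row leaves empty
```

Nothing is thrown away in the aggregate, but each individual row leaves the columns belonging to the other roles empty.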

To be fair, in a party model there can be some degree of generalization with supertypes and subtypes rather than a single party table that contains everything. However, I have seen the latter, and the degree of generalization is often not very pronounced because you quickly conjure up too many subtypes in the model, which complicates the design.

The party model is the subject of a lot of heated debate. Once again, this is often cast in terms of whether a data model is a true representation of reality (whatever that is) or a design that most closely fits some specific requirements. If your user is like Camilla from the dinner party and really wants the fruit salad of the party model, then that is the right answer. But if your user is only interested in one kind of relationship, e.g., onboarding institutional clients, the party model is probably not appropriate.

The problem with aggregates is that we try to manage dissimilar things within a single unit. In a data model this is not really a problem, but writing SQL against a party model design in a physical database can be a nightmare. Of course, that is a task for someone other than the data modeler.
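A small sketch hints at where the difficulty comes from. It uses Python's standard sqlite3 module with an in-memory database; the table layout, the party_type discriminator column, and the sample rows are all assumptions for illustration, not a recommended party design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One physical table for every kind of party (hypothetical layout).
conn.execute("""
    CREATE TABLE party (
        party_id      INTEGER PRIMARY KEY,
        party_type    TEXT NOT NULL,   -- discriminator: which role this row plays
        name          TEXT NOT NULL,
        credit_limit  REAL,            -- meaningful for customers only
        hire_date     TEXT,            -- meaningful for employees only
        payment_terms TEXT             -- meaningful for vendors only
    )
""")
conn.executemany(
    "INSERT INTO party (party_type, name, credit_limit, hire_date, payment_terms) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("CUSTOMER", "Algernon", 5000.0, None, None),
        ("EMPLOYEE", "Emma", None, "2020-01-15", None),
        ("VENDOR", "Bertie", None, None, "net 30"),
    ],
)

# Every role-specific query must repeat the discriminator filter and must
# know which columns are meaningful for that role.
customers = conn.execute(
    "SELECT name, credit_limit FROM party "
    "WHERE party_type = 'CUSTOMER' ORDER BY name"
).fetchall()

# Without the filter, role-specific columns are mostly NULL.
rows_without_credit_limit = conn.execute(
    "SELECT COUNT(*) FROM party WHERE credit_limit IS NULL"
).fetchone()[0]

print(customers)
print(rows_without_credit_limit)
```

Even with three roles and three role-specific columns, each query depends on knowledge that the table structure itself does not enforce; that dependence grows with every role added to the aggregate.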

Modeling

A third kind of abstraction is modeling itself. When we model something we are trying to represent something in a way that we can manipulate, in place of manipulating whatever is the object of the model.

If I want to renovate my kitchen, I will create a model on paper of what I want the new kitchen layout to look like. I can try various designs, and discover problems and solutions in these designs. This is a lot more effective than actually launching into a building effort whereby I try out each idea by actually constructing it in the kitchen.

This is why we do data modeling - to design first and (hopefully) build once. However, in moving from the concrete reality of the kitchen, for example, to the abstraction of the model, we encounter a couple of problems. The first is that there are always limits to the fidelity with which a model can represent reality. Something always seems to get lost. For me, the inability to populate reference data tables in data models is one such problem. For instance, a model may have a table called Customer Type. If I ask the modeler what types of customer there are, I usually get told that this is up to the user and not the modeler's concern. A second problem is that a model is also a form of reality. It is certainly an abstraction from what it represents, but it is a reality unto itself. Because of this, modelers do strange things. For instance, some modelers will go to any length to stop lines crossing, even if it means excluding entity types from the model. My attitude is that reality is messy, and if lines cross it is too bad. Yet I am constantly astonished at how avoiding crossed lines seems to be the ultimate goal of many modelers.

The lesson for me is that in this form of abstraction, if you are paying more attention to the model than you are to what it represents, then you have a problem.

Conclusion

People throw the term "abstraction" around a lot and it often makes them sound a lot smarter than they really are. My attitude is that, like Humpty-Dumpty semantics, they can use any word in any way they want as long as they define it explicitly and use it consistently. Unfortunately, with "abstraction" that is rarely the case. Indeed, in my experience, "abstraction" is used to cover the three major areas discussed above, and possibly more. A deeper problem is that many of the smart people who use "abstraction" do so in a way that implies they have secret knowledge that is going to make data management a lot easier. They should be challenged on this because, depending on requirements, "abstraction" can create problems as well as solve them.

Malcolm Chisholm, Ph.D., has more than 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of the books: How to Build a Business Rules Engine; Managing Reference Data in Enterprise Databases; and Definition in Information Management. Malcolm writes numerous articles and is a frequent presenter at industry events. He runs the websites http://www.refdataportal.com; http://www.bizrulesengine.com; and http://www.data-definition.com. Malcolm is the winner of the 2011 DAMA International Professional Achievement Award.