Saturday, June 9, 2012

Not All Reference Types Are Entities

I'm working on a scheduling system as part of a side-project, and I couldn't help but notice something interesting about how I'm designing it. The system has a kind of abstraction of events, as events aren't always simple "calendar events" in the traditional sense.

Originally the requirement was that "events need to be able to repeat." Well, any calendar system can do that. So I wanted to know why other calendar systems aren't meeting the need of the application. With a little more back-and-forth in a small domain modeling exercise, it became clear that the requirement by itself didn't really state the business need.

As it turned out, events don't really need to "repeat" in the traditional sense. More specific to the business need, events need to be able to have multiple instances within the same event. For example, a particular "event" might happen "every day for a week" or "every Tuesday for a month" and so on. And it can get pretty complex, such as "from 14:00 to 17:00 every Wednesday for a month and a half, except for the fourth instance because the venue has something else booked, so that one will be on Thursday instead, and will start at 14:30 instead of 14:00." Just making an event be "repeatable" won't cut it.

The solution is simple. Events don't have dates and times associated with them. Instead, Events contain a collection of Sessions which represent individual "instances" in this case. So the Event has a name and a description and other attributes, including a Location. It also has a collection of Sessions which each have a Start date/time, a Stop date/time, and an optional overriding Location.
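As a rough sketch of that shape (written in Python rather than C# purely for brevity), the model might look something like the following. The class and field names, the sample event, and the `location_for` helper are all illustrative assumptions, not the project's actual code. The 17:00 stop time on the Thursday exception is also assumed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    name: str  # hypothetical; a real Location would likely carry more


@dataclass(frozen=True)
class Session:
    start: datetime
    stop: datetime
    location: Optional[Location] = None  # optional override of the Event's Location


@dataclass
class Event:
    id: int
    name: str
    description: str
    location: Location
    sessions: List[Session] = field(default_factory=list)

    def location_for(self, session: Session) -> Location:
        # Hypothetical helper: a Session uses the Event's Location unless it overrides it.
        return session.location if session.location is not None else self.location


# "Every Wednesday from 14:00 to 17:00, except the fourth one, which is on
# Thursday and starts at 14:30." No repeat rule is needed, just a list of Sessions.
first_wednesday = datetime(2012, 6, 6, 14, 0)
regular = [
    Session(first_wednesday + timedelta(weeks=n),
            first_wednesday + timedelta(weeks=n, hours=3))
    for n in range(3)
]
exception = Session(datetime(2012, 6, 28, 14, 30), datetime(2012, 6, 28, 17, 0))

event = Event(1, "Pottery Workshop", "Weekly hands-on class",
              Location("Main Hall"), regular + [exception])
```

The odd Thursday session needs no special handling. It is just another item in the collection.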

So the database structure is also simple. An Events table, a Sessions table, and even a Locations table just for a little extra normalization (and because I plan to do more with Locations in the future in this system). As I designed the code, however, something didn't sit right with the concept of identity on these data structures. As a matter of reflex, I included an ID on the Sessions table. You know, so the system can uniquely identify a Session. But... why?

I was specifically thinking back to a previous project at work where entity identity was a significant problem. In that project, a technical decision was made at some point prior to my involvement that every table in the database would have a GUID as an ID, and a software framework used that to uniquely identify all of the data entities. This led to a pretty serious problem in the data because the business had a very different definition of "identity" for their entities. An ID value (especially a GUID) meant nothing to the business. They were thinking in business terms and defining what attributes of an entity identified that entity. (Essentially, the business was thinking correctly and the technical design was artificially limiting them.)

So, what uniquely identifies a Session in my case? Well, nothing important. In fact, in the absence of an Event, a Session is meaningless. My domain doesn't even need to fetch Sessions from a repository by themselves. They should be attached as attributes to an Event when fetched from the Events repository, that's all. They should never need to be fetched individually outside of the context of an Event.

That is... Sessions are not entities. They are not, individually and atomically, a representation of a meaningful business concept. They are attributes attached to an entity... the Event. Sure, Sessions have their own table in the database. (This is a technical concern, not a domain concern.) They even have an ID to uniquely identify them. (This is a technical concern, not a domain concern.) They are even reference types in the code, not value types. (This is a technical concern, not a domain concern.) But they are not an entity in and of themselves. (This is a domain concern.)
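One way to picture that split between technical and domain concerns is a minimal sketch like this one, in Python rather than C# for brevity. The in-memory rows stand in for the Events and Sessions tables, and `EventRepository`, the row layout, and the sample data are all hypothetical. The Sessions rows have IDs, but those IDs stop at the repository boundary and never reach the domain objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple

# Hypothetical rows standing in for the Events and Sessions tables.
EVENT_ROWS = [{"id": 1, "name": "Pottery Workshop"}]
SESSION_ROWS = [
    {"id": 10, "event_id": 1,
     "start": datetime(2012, 6, 6, 14, 0), "stop": datetime(2012, 6, 6, 17, 0)},
    {"id": 11, "event_id": 1,
     "start": datetime(2012, 6, 13, 14, 0), "stop": datetime(2012, 6, 13, 17, 0)},
]


@dataclass(frozen=True)
class Session:
    # No ID here: the row's ID is a storage detail, not part of the domain.
    start: datetime
    stop: datetime


@dataclass
class Event:
    id: int
    name: str
    sessions: Tuple[Session, ...]


class EventRepository:
    """The only way to reach Sessions is through their Event."""

    def __init__(self, event_rows, session_rows):
        self._event_rows = event_rows
        self._session_rows = session_rows

    def get(self, event_id: int) -> Event:
        # Sketch only: raises StopIteration if the Event does not exist.
        row = next(r for r in self._event_rows if r["id"] == event_id)
        sessions = tuple(
            Session(r["start"], r["stop"])
            for r in self._session_rows
            if r["event_id"] == event_id
        )
        return Event(row["id"], row["name"], sessions)


events = EventRepository(EVENT_ROWS, SESSION_ROWS)
workshop = events.get(1)
```

There is deliberately no Sessions repository in this sketch. Nothing in the domain asks for a Session by itself.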

At the level of the programming language being used (C#), they are not value types. But the business isn't concerned with the intricacies of C#. The business is concerned with the domain. And as far as the domain is concerned, Sessions are value types. You don't care which Session you're talking about, and if you blow one away and replace it with another one of identical values then the two are indistinguishable. The values are all that's important, not the unique identity thereof.
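To make that concrete, here is a small sketch in Python (not the project's C#) of a reference type that still behaves as a value where the domain is concerned. The `Session` class and its fields are illustrative assumptions. A frozen dataclass compares and hashes by its field values, which is the behavior described above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen + default eq: compared and hashed by field values
class Session:
    start: datetime
    stop: datetime


original = Session(datetime(2012, 6, 6, 14, 0), datetime(2012, 6, 6, 17, 0))
replacement = Session(datetime(2012, 6, 6, 14, 0), datetime(2012, 6, 6, 17, 0))

# Two distinct objects as far as the language is concerned...
distinct_objects = original is not replacement
# ...but one and the same value as far as the domain is concerned.
same_value = original == replacement
# A set treats them as a single member, because only the values matter.
unique_sessions = {original, replacement}
```

In C#, the same effect would typically come from overriding equality on the class. The class stays a reference type either way.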

In real life, contrast this with something like a human being. Have you ever known someone with the same name as you? The same birthday? The same address (like a family member)? Any other identical attributes? It's unlikely that you would know someone who shared all of these attributes with you, but it's not impossible. You may need to explicitly seek out such a person and manually line up all of your attributes, but it can be done. (Within reason of the attributes, of course.)

Does that mean the two of you are now the same person? No, not at all. You are unique entities. Your attributes are simply values, they are not the entity itself. Values are often used to identify an entity in the absence of a unique identifier. (For example, business users may uniquely identify customers by their phone number. This may be good enough for a particular business, even though it's possible to have collisions. The phone number isn't the actual identity of the person, it's just a value used by the business to distinguish customers.) But they're just values, not identity.
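The contrast with an entity can be sketched the same way (again in Python, with a made-up `Person` class and made-up data). Here equality is left as object identity, so two people whose attributes all line up are still two people.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(eq=False)  # eq=False keeps Python's default identity-based equality
class Person:
    name: str
    birthday: date
    address: str


you = Person("Pat Smith", date(1980, 1, 1), "1 Main St")
someone_else = Person("Pat Smith", date(1980, 1, 1), "1 Main St")

# Every attribute matches...
same_attributes = (
    (you.name, you.birthday, you.address)
    == (someone_else.name, someone_else.birthday, someone_else.address)
)
# ...and yet they are not the same entity.
same_person = you == someone_else
```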

In this project, Events will have a unique identifier as well. An ID column in the database, which is a simple incrementing integer. It's likely that the business will internally identify Events differently, of course. And the software will need to account for this. A combination of values may be used to identify an Event, including perhaps even the collection of Sessions.

But a Session by itself doesn't need an identity. No more so than your address needs an identity. Your house has an identity, and the value of its address is the most common way to identify it. But the address itself doesn't need an identity. It's not the entity, it's just an attribute value.

This actually reminds me of another project I worked on some time ago when I was working in North Carolina. We were modeling a fairly complex data model for a project and we brought in the database guy to help us. He went about doing very database-y things, including standard relational normalization. A lot of what he showed us was very helpful. The concept of super-typing tables to achieve a kind of inheritance model in the data was new to me, for example, but made a lot of sense and made the design much cleaner and simpler.

But there was one case where his model didn't make sense to me. Naturally, there was an Addresses table to store the addresses of various other entities. People, client businesses, anything that had one or more addresses as an attribute of it. And, being a relational database expert, he naturally normalized that data. But he took it a step further. His goal was to prevent data duplication within the Addresses table. So if two or more other entities had the same address, they should refer to a single record in that table.

Should they? This is where I disagreed with the design. At the time I articulated the concern with a simple use case... Suppose two people share a mailing address. For example, two business contacts at the same office. One of those people moves (transfers to a different office). So someone updates his address. But wait... They just updated the address for both people, and for that office, and for any other entity using that address. This is no good.
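That use case can be reproduced in a few lines. This is a sketch in Python with invented names and addresses, where `AddressRecord` stands in for a single shared row in the Addresses table.

```python
from dataclasses import dataclass


@dataclass
class AddressRecord:
    # Stand-in for one shared, mutable row in the Addresses table.
    id: int
    street: str
    city: str


@dataclass
class Contact:
    name: str
    address: AddressRecord


office = AddressRecord(1, "100 Main St", "Raleigh")
alice = Contact("Alice", office)
bob = Contact("Bob", office)  # deduplicated: both point at the same record

# Alice transfers to a different office, so someone updates "her" address.
alice.address.street = "200 Oak Ave"
alice.address.city = "Durham"

# Bob never moved, but his address changed anyway.
bobs_street = bob.address.street
```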

The conclusion was to simply allow data duplication in the Addresses table. The more we thought about the technical implementation, the more inescapable that conclusion became. There was still some mild objection to data duplication, but the objectors couldn't think of a more elegant solution.

There's a reason they couldn't. They were trying to do something against the domain. The domain made it very clear that addresses were not entities. They don't exist by themselves, they don't have any individual meaning, and they don't need to have identity. Addresses are values, not entities. They exist only as attributes to entities. If one is used more than once, that's ok. If you delete one and replace it with another one of the same value, it's the same one. (The technical implementation will need to maintain the relationship, of course, but that's a technical concern and not a domain concern.)
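Treating the address as a value might look like this sketch (Python, with hypothetical names and data). Each contact holds its own immutable `Address`. Duplicates are harmless because equal values are interchangeable, and a move replaces the value instead of editing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # immutable and compared by value
class Address:
    street: str
    city: str


@dataclass
class Contact:
    name: str
    address: Address


alice = Contact("Alice", Address("100 Main St", "Raleigh"))
bob = Contact("Bob", Address("100 Main St", "Raleigh"))

# Two copies of the same value: duplicated in storage, identical in the domain.
shared_office = alice.address == bob.address

# Alice transfers: her address value is replaced, and nothing is edited in place.
alice.address = Address("200 Oak Ave", "Durham")
bob_unaffected = bob.address == Address("100 Main St", "Raleigh")
```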

Just because something has a POCO in your software doesn't mean it's an entity. Just because something has its own table in your database doesn't mean it's an entity. The domain defines what is an entity and what is not. The technical implementation needs to reflect the domain's definitions, not present its own definitions in the name of some convention or habit of the technical implementors.