TL;DR: Currently when testing for key violation we think of a function f(x, y) where:x is the value (or values) of the key column (or columns) for one tuple, andy is the value (or values) of the key column (or columns) for another tuple, andf is the equality function.

I discuss whether it should be possible that:x is a set of the key columns of "key tuples" relating to one tuple, andy is a set of the key columns of "key tuples" relating to another tuple, andf is the intersection function - which reduces to equality where the key tuples have a single member (ie, this rule reduces to the typical rule described above)

If anyone has seen other discussions along these lines, please send me in that direction! If anyone thinks this hasn't been discussed before, please tell me where I should submit a paper!

---

I have recently been wondering what it means to have a unique "contact" in a data model. I am first going to describe an idea that I first thought of in the context of this example, because that is how I came to this idea. I will then describe it using a more familiar example - the problem of overlapping intervals in time. Note that the idea here is about what it means to be a key. I'm not proposing a solution specific to contacts, or intervals of time. I'm just using them as examples to demonstrate the concept. I'm not sure what to call the idea. "Key sets" or "inverted keys" come to mind.

First I obviously need to define what a "contact" is in my example model. That is actually quite difficult, which is, in a way, the point!

A contact is not a phone number, nor an email address. A contact is something that can "own" any number of phone numbers, and/or any number of email addresses, and/or any number of URL's (twitter, website, etc), and/or any number of some other of this kind of thing, (which I will henceforth call a "protocol endpoint").

In general a contact is the owner of some set of these protocol endpoints. In order for a contact to exist we must have some way to communicate with them, so they must have at least one such endpoint.

A contact may of course have other attributes. For example, there is almost certainly some way to address the contact when communicating with them, which I will call the "alias" of the contact. A contact probably has some kind of "salutation" attribute as well. Most of the time the contact represents an interface to a real human being, but that is not always necessarily true. I could, conceivably, have a contact representing a whole department. For example, a contact with an alias of "ACME corp accounts receivable", and an email address of AR@acme.com.

What, then, are the candidate keys for a contact? Clearly the "alias" attribute cannot be the key for the contact entity, because I might have several contacts all with the alias of "Bob", but one is bob@acme.com whereas the other is bob@bobsburgers.com.

I am going to make a claim now that seems a bit strange, but here it is: I claim that the contact is uniquely identified by any single member of the set of all protocol endpoints we can use to communicate with it, and that these endpoints are rows, not columns. Of course these endpoints are of different types (phone vs email vs URL, etc). For the moment, please imagine that I have some parent "protocol endpoint" type, but in the following example I will use values from the set of inherited types, just for clarity.

To be clear, I don't mean that the set of protocol endpoints is stored as a "multivalued attribute" in a single contact tuple, nor as a set of sparse columns in the tuple. Rather, I suggest that the set of protocol endpoints is itself a set of tuples of arbitrary cardinality - ie, rows - related to the contact tuple.

Of course in the real world we know that sometimes people share phone numbers, like the landline number for a residence. I am going to assert that, for the sake of this model, that situation would result in an incoherent idea for a contact, because it doesn't make sense to initiate a communication through one of these protocol endpoints without knowing who it is you are addressing. My claim, then, is that the "key value" for the contact I've described here is the "key set":

Specifically, I claim it is a key in the following sense: no member of this set can exist within the set defining the key for any other contact. Or, more formally:

the intersection of the "key set" with any other "key set" must be the empty set.

I realise that this implies that the empty set can be the key for any/all tuples, since the intersection of the empty set with itself is the empty set. This is a problem. I can eliminate the problem with an arbitrary rule that the empty set cannot be a key. Perhaps there's a more elegant way to eliminate the problem, (maybe relying on the fact that the empty set it a subset of itself?)

A-fortiori, this implies that the key sets cannot be identical - if two key sets cannot intersect, then no two key sets can be the identical, since identical sets would obviously intersect. So the existing idea of a primary key as a value which must be unique is maintained. In fact, the "key set" definition of a key contains the normal definition of a key: if the key set is a set containing a single member, then intersection reduces to equality.

"A single member" here doesn't imply a single scalar value. We could have a multiple values in the case of a compound (multi-column) key, but the identity simplification would still apply. For example, if I had an existing table with a key composed of two columns {A, B}, then using the familiar definition of a key we check to make sure that there is no pair of values of {A,B} in the table that is identical with any other pair of values {A,B}. But If I have a "set based key", then what we want to check is that the set of all pairs of values {A, B} in the key set for one entity does not intersect with the set of all pairs of values {A, B} in the key set for any other entity.

OK, all of this might seem very strange when we approach it from the point of view of these "protocol endpoints", because I'm describing it in terms of an unfamiliar problem. It just so happens to be what motivated my thinking here. But I can relate this to a much more common problem with which many people will be familiar: The problem of unique intervals.

It is quite common to encounter the problem of needing multiple versions of a thing, which cannot overlap in time. For example in a data warehouse with a slowly changing dimension, the interval represented by a version cannot overlap with the interval representing any other version. Another example might be the case of an employment relationship, where one person can only be employed by a company at most once at any given instant in time.

The problem with this sort of rule is that it cannot be declaratively enforced. I can define the interval via its start date and end date, but I can put no kind of declarative constraint on the start date column, nor the end date column, nor the combination of the two columns, to ensure that two intervals in the table do not overlap.

But there's another way of thinking about the start date/end date definition, and that is to think of it as a set (of times, in this case) which has been intentionally defined (ie, defined by a rule). Given these two dates, you can check whether any point in time is a member of "the set of all times between the specified dates". An easier way to think about might be to think about what would happen if we changed this intentional definition to an extensional definition - ie, a set defined by listing every member. I could create a table containing a person and a date. Then I look at the start and end date of the employment agreement between the person and the company, and put a row in this table for every date >= the start and < the end date. Then we could say that every tuple in this table must be unique over the columns person and date, and thereby we can ensure that there are no overlaps. Or in other words, it would not be possible to create another employment agreement with the same person whose interval overlaps an existing employment interval.

But this is not simply a familiar "many to one" relationship. I am suggesting that the employment agreement is *functionally dependent upon* the set of rows in this extensional table. In other words, the set of all dates representing the interval of the relationship, plus the person in the relationship, is the key for the employment agreement.

Of course, the problem with that approach - in this example of intervals of time - is that I have turned my intentional definition into an extensional definition, and the members aren't the same. I had to pick some level of granularity for the extensional definition. I happened to pick whole days, but what happens when someone says "actually, they can be the same day, but not the same hour", and then later "actually, they can be the same hour, just not the same second", and so on. But the purpose of the example is to demonstrate that the problem of "non overlapping intervals of time" is in fact the *same kind of problem* as the contact problem above: We would like to define a key as being composed of a set, which in this case happens to be intentionally defined compared to the extensional definition in the case of contact protocol endpoints. In other words, if it were able to define set-based keys, and if it were possible to use intentional as well as extensional definitions for those sets, then the problem of creating a key constraint which disallows overlapping intervals in time is the same as creating a key constraint which disallows the intersection of two sets.

The obvious case that doesn't work is where people share a phone number. The situation where a contact numbers takes you to a shared secretary where neither I nor the CEO nor the Creative Director is going to give a direct line contact numer to all and sundry, so the three of us share a phone number which will be answerd by someone else who will act as a gatekeeper/filter for us, passing calls on when appropriate to the relevant one of the three or dealing with them without disturbing any of us. That phone number s then a contact number for 4 people, and your "empty intersection" rule means that your database can only have one of the four of us as a contact with that phone number. Customrs of my former employer would still have that problem today, although I wouldn't be one of the 4 involved since I retired 8 years ago.

Many yearsbefore that, there would have been a problem with my home phone number too. Some people would have that number because there were circumstances at which certain very senior executives of customer companies had to be able to contact me out of office hours (ability built into some contracts). But some people would have that number so they could contact my wife in her role as borough mayor or as chair of the county and borough joint highway committee, and it's quite common for senior people in companies to have to deal both with suppliers and with local government, so your system would be broken because my wife and I - clearly distinct contacts with distinct functins and acting on behalf of different legal entities - would have had for anyone who needed both contacts key sets with a non-empty intersection, so your scheme could not handle them.

I don't think this is at all uncommon, in fact shared phone numbers are extremely common.

"Of course in the real world we know that sometimes people share phone numbers, like the landline number for a residence. I am going to assert that, for the sake of this model, that situation would result in an incoherent idea for a contact, because it doesn't make sense to initiate a communication through one of these protocol endpoints without knowing who it is you are addressing."

What I'm talking about here isn't a solution specifically for the problem of defining contacts in some kind of "universal contact model", although it would work given the assertion I made about this particular model. Rather, what I'm talking about is a much more general idea about what it means to be a key at all. I'm using a particular contact model to express an example of the idea, and the time intervals problem to express another example of the idea.

Another way of saying that would be this: The set based key solution is not a solution to the shared phone number model, precisely because in the shared phone number model a contact endpoint is not a candidate key. So it's not so much that the solution doesn't work in this case, but rather that the whole approach doesn't apply in this case.

It is quite common to encounter the problem of needing multiple versions of a thing, which cannot overlap in time. For example in a data warehouse with a slowly changing dimension, the interval represented by a version cannot overlap with the interval representing any other version. Another example might be the case of an employment relationship, where one person can only be employed by a company once at any given instant in time.

The problem with this sort of rule is that it cannot be declaratively enforced. I can define the interval via its start date and end date, but I can put no kind of declarative constraint on the start date column, nor the end date column, to ensure that two intervals in the table do not overlap.

But there's another way of thinking about the start date/end date definition, and that is to think of it as a set of times which has been intentionally defined - ie, defined by a rule. Given these two dates, you can check whether any point in time is a member of "the set of all times between the specified dates". An easier way to think about might be to think about what would happen if we changed this to an extensionally defined set, ie, one in which the set is defined by listing every member. I could create a table containing an employment agreement and a date. Then I look at the start and end date of the employment agreement, and put a row in this table for every date >= the start and < the end date. Then we could say that the combination of those two columns is a key that ensures there are no overlaps.

Of course, the problem with that approach is that I have turned my intentional definition into an extensional definition, and the members aren't the same. I had to pick some level of granularity for the extensional definition. I happened to pick whole days, but what happens when someone says "actually, they can be the same day, but not the same hour", and then later "actually, they can be the same hour, just not the same second", and so on. But the purpose of the example is to demonstrate that the problem of "non overlapping intervals of time" is in fact the same kind of problem as the contact problem above: We would like to define a key as being composed of a set, which in this case is intentionally defined compared to the extensional definition in the case of contact protocol endpoints. In other words, if it were able to define set-based keys, defined either intentionally or extensionally, then the problem of creating a constraint which disallows overlapping intervals in time is the same as creating a constraint which disallows the intersection of two sets.

This is easily solved in a declarative system, assuming you can identify each individual "thing" whose multiple versions can't overlap in time. It's then a matter of defiing an appropriate counting function (that counts the number of pairs of overlapping periods for an individual thing) and a check constraint that requires the result of that function to be zero. Of course if you can't identify the individual "thing"s then you don't have any sort or usable database at all. Current MS documentation of SQL Server's CHECK Constraints gives a nice example of a sort of similar function and the problem that's caused if you try to require row counts to be non-zero (if you don't have any rows there are no rows with counts that are not non-zero, so the constraint gives UNKNOWN instead of FALSE and therefor doesn't work because CHECK constraints object ony to FALSE). It doesn't give an example where a count is required to be zero, because in that case an absence of rows won't cause a problem: if there are no relevant rows there are certainly none that overlap, and if there are relevant rows the check will deliver TRUE if there are no overlaps and FALSE otherwise. So the pronblem you claim to have for avoiding overlapping intervals is imaginary. And of course the omly granularity involved is the granularity of the datetime type you choose to use for the times, and inventing all the rows needed to cover all intervals would just blow your database up into a rdiculous (and pointlessly) large size compared to storing only those intervals that are actually relevant in the real world because something relevant actually changed at the ends of them.

assuming you can identify each individual "thing" whose multiple versions can't overlap in time [...].So the problem you claim to have for avoiding overlapping intervals is imaginary.

Your initial assumption begs the question. The whole point of my claim is that the interval is part of what identifies the thing.

You said: "It doesn't give an example where a count is required to be zero, because in that case an absence of rows won't cause a problem: if there are no relevant rows there are certainly none that overlap"

...which implies a misunderstanding of what I'm suggesting, because that situation cannot possibly occur: if it did then we would have an entity with a key value of the empty set. We cannot have an employment agreement unless the employment agreement spans some interval of time. My claim, very precisely, is that the interval of time, plus the person, functionally determines all other attributes of the employment agreement. Then, by the relational definition of a key, this set of {date, person} tuples is the natural key for the employment agreement.

If we knew that the granularity only needed to extend to the date level, then this extensional definition would be sufficient. But as you said - repeating what I originally said - this isn't always suitable, because some of these sets need to be intentionally defined. It would be infeasible to extensionally define "instants in time". But whether we define the set intentionally or extensionally doesn't make any difference to the concept, because the concept is still that of a set acting as a key.

assuming you can identify each individual "thing" whose multiple versions can't overlap in time [...].So the problem you claim to have for avoiding overlapping intervals is imaginary.

Your initial assumption begs the question. The whole point of my claim is that the interval is part of what identifies the thing.

You said: "It doesn't give an example where a count is required to be zero, because in that case an absence of rows won't cause a problem: if there are no relevant rows there are certainly none that overlap"

...which implies a misunderstanding of what I'm suggesting, because that situation cannot possibly occur: if it did then we would have an entity with a key value of the empty set. We cannot have an employment agreement unless the employment agreement spans some interval of time. My claim, very precisely, is that the interval of time, plus the person, functionally determines all other attributes of the employment agreement. Then, by the relational definition of a key, this set of {date, person} tuples is the natural key for the employment agreement.

If we knew that the granularity only needed to extend to the date level, then this extensional definition would be sufficient. But as you said - repeating what I originally said - this isn't always suitable, because some of these sets need to be intentionally defined. It would be infeasible to extensionally define "instants in time". But whether we define the set intentionally or extensionally doesn't make any difference to the concept, because the concept is still that of a set acting as a key.

Begs the question? There is some set of intervals which must not overlap. End of story - if I don't know which intervals are members of that set, I can't enforce non-overlapping whether I do it declaratively or somehow else, and neither can you. So either we have some way of determining hich intervals are in the set, and therefor can count the overlaps (if any), or you don't have any realisable model at all. End of story! I can do it declaratively if I can have the set of intervals, if I don't have the set of intervals you don't have any sort of database.

There are no relevant rows - surely to anyone it's utterly obvious that "relevant rows" means rows in the set generated from pairs of overlapping intervals within a set which is supposed not to overlap, not rows in the original set!

A set of Date, Person tuples is precisely the thing that at the front of your reply you said couldn't exist (because you appear to say that there's no such thing as a Person). And claiming it's the key to an employment agreement is nonsense, since a date is a point in time, not an interval (unless you have storage and processing time to burm, and are willing to have 7,305 dates insteaa of just two to cover a 20 year emploment for perhaps each of a few thousand employees. And even if it were an interval the set would not be an employment agreement, it would be multiple employment agreements. I had about a dozen different employments by the European Commision at different dates and in different roles, with separate employment agreements for each. Clearly it would be nonsense for those to overlap (I couldn't be in Brussels and be chairing a meeting in the south of France both at the same time, could I?); and all sorts of other things (like where payment went, whether what I received from them was tax exempt or not, whether they paid directly for my travels and living expenses or gave me a fixed allowance for them, whether I was required to work in English or in French, whether I would do half or more of the job from home or have to be in the appropriate office(s) for all of it) were different in some respect for each employment (legally different, with effects separate in EU law and in British law - you don't expect to ignore the law with your database, I hope). But they were all supposed to be full time employment on the days I worked on them, so they could not (in theory) overlap on dates - and that is something you will need to be able to handle if you want to have a general capacity to represent employments.

On the other hand, employment by two different employers can overlap on dates in various special circumstances (I've been there).

The problem with the set acting as a key is that it works provided you can look within the key, just as in a conventional relational system you can look at the set of named scalars (columns) that form the key. It doesn't work if you can't, just as the relational model would break down if you couldn't look at projections of the key instead of at the full key.

Let's be clear: I'm not saying that what your are aiming for can't be made to work. With enough care and peristence in beating out all (or even just enough of) the problems, it probably can - actually I think it's not joust probably but very probably. But working on non-solutions to non-problems (like the overlap of periods one) isn't what you want to spend time on. Back in 1970 I was an academic looking at some information retrieval problems and I put a lot of effort in and got exactly nowhere becasue I didn't really understand how the existing (pre-relational) models had tackled that problem (the things I did know about back then were error management, programming language design and computation for physics and chemistry problems, not databases and info retrieval, and the things I lectured on were Fortran programming and mathematical algorithms).

I don't know if you've read Ted Codd's early papers on the relational model, but if you haven't you should read them - not to undertsand the relational model, but to understand why Codd arrived at it, what he was trying to achieve, what effect it would have on the whole database idea as it was before then. Looking at someof the pre-relational stuff wold be a good idea too, and so would looking at some of the modern stuff. and then thinking about how what those people were/are trying to do relates to what your new model is aiming at, and how what your new model might be applicable to some of the problems that they were trying to solve (but I would recommend ignoring Chris Date and Fabian Pascal as far as possible). Of course for all I know you've already done all that reading and thinking, and if you have my comments are not much help.

You're mistaking the employment agreement for the person. The key for the person, whatever that might be, is required as part of the identification of the employment agreement because the candidate key for the employment agreement consists of both a set of moments in time and a person. Per my now-repeated claim, it is possible to still have a multi-column key. The employment agreement is an example of something which requires a multi-column key. The person part of the key groups the set of intervals in the intervals part of the key. But the employment agreement itself is identified by the combination of the two.

I'm not sure whether you're actually comprehending the suggestion. I've tried to be quite clear, and I think sentences like "this set of {date, person} tuples is the natural key for the employment agreement" cannot be further clarified, so I'm not sure what more I can do about that. Perhaps I can come up with a completely way of demonstrating the idea. But I must also suggest that more careful consideration of what I've written might also help. I will try to improve my illustration.

"since a date is a point in time, not an interval (unless you have storage and processing time to burm, and are willing to have 7,305 dates insteaa of just two to cover a 20 year emploment for perhaps each of a few thousand employee"

Tom, this seems deliberately obtuse. Two dates which represent the endpoints of an interval equivalently represent a set of moments in time. I am simply going to assume that you don't disagree with this, because if you do then I don't know what you think a set is, and the discussion is pointless. I can, in general, represent a set either extensionally or intensionally. In the case of moments in time an extensional definition is infeasible (indeed, it is impossible if we are referring to reality as opposed to a model on a computer, where it is simply infeasible). I am not suggesting that we actually should turn the "start date and end date" representation into an extensionally defined set. I am simply demonstrating that what is being described is a set, which can, in principle, be represented by an extensional definition in an obvious, though infeasible, way.

You seem to think I am suggesting that we must implement this idea by storing the infinite set of {moment-in-time, person} tuples (which we obviously agree is impossible), or at least the huge finite set of all measurable moments in time (down to the plank time) or at least the very large finite set of { moment-in-time-to-the-maximum-granularity-available-in-the-engine, person } tuples, which we agree is infeasible.. I don't know why you think that.

Let's put it as simply as this: If you agree that an employment agreement could be represented as { person_id, start_date, end_date, [...], primary key (person_id, start_date, end_date)), notwithstanding the objection that this key doesn't quite do what we need it to do, which you will recall is my objection, then you agree that the key for the employment agreement is the combination of the person and a set of moments in time. It's that simple.

"On the other hand, employment by two different employers can overlap on dates in various special circumstances"

Sure, I can imagine some model where we would need to allow that. In my example model I don't need to allow that. This seems to be a common theme... people take an example which uses a model designed to demonstrate a principle, and then claim an objection on the basis that some other models can also exist. Of course some other models can also exist. We could talk about a model of retail sales, or ornithology, or MMORPGs, where this sort of situation may never appear. But if this sort of situation doesn't appear then there'd be little point in my using it as an example model to demonstrate the situation. It's an argument which is formally the denial of the antecedent. I claim that a given model implies a certain idea. You then claim that other models exist, and that this is an argument against the idea. This is an argument of the form A -> B, !A, ∴!B.

Yes, I have read several of the works of Codd, Date and Darwen back as far as the Turing lecture, but not prior to that. I have read no published works by Fabian Pascal other than what commentary he has freely made available online.

I have been trying to think of a different example which might clarify the confusion, and I think I have one. The confusion in the "employment agreement" example seems to stem from the fact that we have a compound key, one of whose elements is a set. Here I will provide an example where the key is not a compound key.

I need to represent areas. Let's start with a trivial requirement, where each area is defined by a triangular polygon. In this model the areas cannot overlap. I also want to represent some other attributes of those areas. Perhaps we want to name the areas. We also want to know who owns each area. And we also want to characterize the area as, say "forest", "ocean", "desert", and so on.

Here is a table we might use to represent the area:

Areas (vertex1, vertex2, vertex3, name, owner, characterization)

Now, what is the key for the area? The best key we can use here is vertex1, vertex2, vertex3. We may have geospatial types available to us, I will get to that in a moment.

I claim that the natural key for an area can be represented by its bounding vertices, which is equivalent to saying that the area is defined by the set of all points on the surface bounded by the vertices. Indeed, I claim that this is the only sensible key for an area bounded by a polygon. The name can't be the key, it's just something descriptive I add about the area. The owner obviously can't be the key, the characterization obviously can't be the key.

Of course, if we simply create a primary key (vertex1, vertex2, vertex3) then we haven't quite met the requirement, because I could make a second triangle one of whose points is within the triangle defined by the three vertices of the first, and we would have an overlap.

But things get worse, because now I extend the requirements and say that the bounding polygon isn't necessarily a triangle. It is a polgyon with any number of vertices. Clearly I cannot implement this as:

Because not every polygon will have N vertices, which means all but the first three vertex columns must be nullable. But that is contrary to the requirements of a key. What I really want to say is that the key for the area is a set of vertices with cardinality greater than or equal to three. Then the key for the area is set.

So, let's imagine I have some kind of geospatial type available to me. This geospatial type has the following properties: It is defined by an arbitrary number of vertices greater than or equal to three, and it can be compared with another value of the same type to see whether they overlap. That solves our problem - we will use that as our key.

But that datatype is precisely an intensionally defined set. Given any point on the surface, I can tell you whether it is a member of the set defined by the value in our geospatial type. So the solution to finding a key in this example is precisely the solution that I have suggested: the key is a set, and the test for uniqueness is that the sets have no intersection.

...,,But things get worse, because now I extend the requirements and say that the bounding polygon isn't necessarily a triangle. It is a polgyon with any number of vertices. Clearly I cannot implement this as:

Because not every polygon will have N vertices, which means all but the first three vertex columns must be nullable. But that is contrary to the requirements of a key. What I really want to say is that the key for the area is a set of vertices with cardinality greater than or equal to three. Then the key for the area is set......

OK, here it's clear what you are getting at. But a set of vertices doesn't work, it has to be an ordered list (because with two dimensional caresian coordinates [(0,1), (0,0), (1,0), (1,1)] is not the same area as [(0.1),(1,0),(1,1),(0,0)] and neither is the same as [(0,1),(1,0), (0,0),(1,1)] although all three have the same set of vertices (and once you allow enough vertices the list has to be allowed to include repetitions to cover all cases, so that's two ways in which a set or vertices doesn't work).The intended set would iof course work if you had enough storage to express it (since physical separation is quantised, a finite area can be expressed as a finite number of points) but you aren't ever going to see that much storage. Of course simple cases like this could be considered as having an extra vertex (at (0.5.0.5)) in the non-convex case, but that still leaves you with two different figures with the same set of vertices if you ignore order. Restricting yourself to polygons without intersecting sidesdoesn't eliminate those problems, as there are many non-convex polygons without what might be called intersecting sides which have them. Restricting to polygons where there is never a meetin of three sides still doesn't eliminate the problem.

But yes, it will work with an ordered list (possibly including repetitions, depending on what restrictions you might want to impose) of vertices as a way of expressing the set of pounts within an area definable by its vertices (and their order). There is of course the interesting question of whether points on a boundary count as intersections (they had better if your areas are topologically closed sets of points, and then intersections on boundaries would not be forbidden, while if all sets were topologically open a shared boundary point would be an utterly bizarre concept that you wouldn't have to worry about.