Sunday, 29 November 2009

Overriding equals(Object): an optimally correct solution

Overriding Object's apparently innocuous equals(Object) method can be a source of difficulties. The first problem is that there appears to be no single implementation that meets all requirements. The second problem is the way that HashMap, the prime suspect for causing errors if the letter of the equals() specification is not met, uses equality to maintain its structure.

When equals() is overridden the initial requirement is to ensure that the Object in equals(Object) belongs to a class that is comparable with the current object. Two different implementations can be found in publications from authorities on the subject. They have some utility but both cause problems if implemented outside their limited applicable usage. Neither can provide a correct implementation of equals() for all inheritance requirements in a class hierarchy. A third published implementation can supply an equals() with correct behaviour in the two most commonly required cases but brings a considerable execution overhead when equals() is overridden in a subclass.

A simple equals() configuration is described that allows a class to have any of the three possible equality relationships with its super classes. For the two most commonly required relationships modifying the standard equals() implementation found in Java library classes to provide correct behaviour is trivial, simply a matter of factoring-off field comparisons to a separate method. Overriding equals() in a subclass is similarly trivial. If no modification to equality is required the new class can be derived without any additional coding related to equals() functionality. The approach brings almost no penalty for execution efficiency and additional coding is minimal.

Background

Overriding equals() allows a different view of object equality from Object's same-object implementation to be introduced. Used with hashCode(), equals() also provides the basis for object inclusion, exclusion and retrieval from a HashMap, a fast storage and retrieval structure appropriate for many applications.

Making an overridden equals() work according to plan brings two requirements: a well defined set of objects that the new method applies to and compliance with the so called equals() contract - in fact a definition of the required equality relationship between all objects. Its requirements are reflexivity, symmetry, transitivity, consistency and that a comparison with a null should always return false. These properties define an equality relationship found generally in most applications outside of quantum mechanics and we are on fairly safe ground.

An exceptionally lucid expose of the equals() contract and discussion of problems with published equals() implementations can be found at

The equals() problem occurs whenever we want to derive a subclass from a concrete superclass other than Object that has already overridden equals(), and to add fields that are significant when computing subclass equality.

Figure 1: A moot choice

As shown in Figure 1 - ClassA, we would probably like to inherit from an existing class getting all its methods and functionality without writing any code, taking advantage of normal OOP provision for code reuse. The requirement to override equals() may prevent this course because the standard equals() implementation found in many Java library classes does not work in this situation.

To get a better understanding of the problem we can formalize this requirement of an object model that makes object equality other than basic identity a property of all objects as:

The first code reuse requirement: ability to derive a class that declares zero or more new significant fields and is not a member of its superclass comparison set.

It is apparent there are four possible requirements and this statement covers two of them. The remaining two are:

Second code reuse requirement: ability to derive a class that declares no new significant fields and is a member of its superclass comparison set.

Third code reuse requirement: ability to derive a class that declares new significant fields and is a member of its superclass comparison set.

We now need a definition for comparison set.

A comparison-set model

A comparison set is a set of classes having a root class for which instances of the root and some but possibly not all of its subclasses are compared for equality using significant fields declared or present in the root. For completeness we must also say that significant fields declared in a subclass may also figure in a comparison but this is an unusual condition satisfying the third code reuse requirement. When an object is compared with an object of a class not in its comparison set equals() is unable to make a meaningful comparison and must always return false.

Figure 2 sketches a class hierarchy with classes in a comparison set coloured green. Other classes are not in the green set and supply a potential root for other comparison sets.

Figure 2: A comparison set

A class that is a descendant of Object and does not override equals() is in a comparison set with Object as its root. The particular contract of Object's equals(), that no two distinct objects are equal, makes the standard implementation viable for overriding equals() in a direct subclass, A in Figure 2, but not in subclasses B and C of a class in a derived comparison set. The standard implementation meets the first code reuse requirement in a class with no superclasses that have used it to override equals(); elsewhere it fails.

The Figure illustrates two important properties of comparison sets. Firstly, all classes in the set are descendants of a single root. The equals() implemented in the root must include a condition that determines whether the argument in equals(Object) is a member of the set of comparable objects for which a field comparison should be done or whether it is of another kind for which equals() should return false.

The 'standard' equals() implementation

The standard implementation employs instanceof the-root to exclude objects not in the comparison set. The instanceof statement can appear in a number of different guises, as illustrated using a class that declares a value field with an overloaded == operator:

All configurations exclude instances of any class that is not a descendant of Root from field comparison and return false. The same-object check shown in the third configuration is of course optional and can be used in any implementation.
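One such configuration might look like the following sketch. The class name Root comes from the text; the int value field, the constructor and the hashCode() override are assumptions made only to keep the example self-contained.

```java
// Sketch of the 'standard' instanceof-based equals().
// Root is the name used in the text; the int field is an assumption.
class Root {
    private final int value;

    Root(int value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;                      // optional same-object shortcut
        }
        if (!(obj instanceof Root)) {
            return false;                     // also rejects null
        }
        return value == ((Root) obj).value;   // field comparison done inline
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(value);
    }
}

class StandardEqualsDemo {
    public static void main(String[] args) {
        System.out.println(new Root(1).equals(new Root(1)));  // true
        System.out.println(new Root(1).equals(new Root(2)));  // false
        System.out.println(new Root(1).equals("1"));          // false
        System.out.println(new Root(1).equals(null));         // false
    }
}
```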

Which brings us to the second property of comparison sets and the problem with the standard equals() implementation: not all classes that are descendants of the Root are members of the same comparison set. The instanceof condition succeeds for instances of all subclasses irrespective of new fields. A superclass object can be found equal to a subclass object if fields declared in the superclass are equal to inherited fields in the subclass. Presenting the same superclass object to an overridden equals() in the subclass its instanceof will reject the superclass object and return false.

The symmetry requirement of object equality is broken and in the first case no proper equality assessment has been done making the result a nonsense.
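The asymmetry can be reproduced with a small sketch. Base and Extended are hypothetical names, and both classes are assumed to use the instanceof form of equals() described above.

```java
// Hypothetical classes showing how instanceof-based equals() breaks symmetry
// once a subclass adds a significant field.
class Base {
    private final int value;

    Base(int value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Base && value == ((Base) obj).value;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(value);
    }
}

class Extended extends Base {
    private final int extra;   // new significant field

    Extended(int value, int extra) {
        super(value);
        this.extra = extra;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Extended
                && super.equals(obj)
                && extra == ((Extended) obj).extra;
    }
}

class SymmetryDemo {
    public static void main(String[] args) {
        Base base = new Base(1);
        Extended extended = new Extended(1, 2);
        System.out.println(base.equals(extended));  // true: extra is never examined
        System.out.println(extended.equals(base));  // false: base is not an Extended
    }
}
```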

There are of course occasions when we want to derive a class from a superclass without overriding equals(): the second code reuse requirement. A subclass might introduce a next field for example, a reference to an instance of the subclass so that a linked list of objects can be constructed. The standard implementation meets the second requirement. It fails when presented with a subclass instance that uses new fields to check equality and is not in the comparison set. It does not enable us to implement ClassA in Figure 1. Can an additional or modified condition be introduced to exclude instances of these classes from field comparison?

equals() using class comparison

Another commonly seen equals() implementation compares the runtime Class reference of the two objects to determine comparison set membership:

obj != null && obj.getClass() == this.getClass()

replaces obj instanceof Root in the condition. It could be used to derive Figure 1 - ClassA if used in the existing and new classes but does not allow any further subclasses that belong to one or other comparison set to be derived. All objects that are not of the same class are excluded and equals() returns false for these objects.

This means that superclass objects are excluded from subclass superclass comparisons and get a false return. If anything this characteristic is more dangerous in use than the instanceof problem. When equals() is not overridden in a subclass its functionality is modified by proxy to get a constant false return for superclass objects. A person who derives the subclass and has not modified equals() may be left unaware of this side effect until the application fails at a later time. A class comparison equals() implementation meets the first code reuse requirement. It cannot meet the second.
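The side effect is easy to see in a sketch. Item and LinkedItem are hypothetical names; LinkedItem adds only a non-significant next reference and does not override equals().

```java
// Hypothetical classes showing the side effect of a getClass()-based equals():
// a subclass that adds nothing significant is still never equal to its superclass.
class Item {
    private final int value;

    Item(int value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object obj) {
        return obj != null
                && obj.getClass() == this.getClass()
                && value == ((Item) obj).value;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(value);
    }
}

class LinkedItem extends Item {
    LinkedItem next;   // list plumbing only, not significant for equality

    LinkedItem(int value) {
        super(value);
    }
}

class ClassComparisonDemo {
    public static void main(String[] args) {
        Item item = new Item(1);
        LinkedItem linked = new LinkedItem(1);
        System.out.println(item.equals(linked));                // false
        System.out.println(linked.equals(item));                // false
        System.out.println(linked.equals(new LinkedItem(1)));   // true
    }
}
```

The result is symmetric, so the contract holds, but the second code reuse requirement is not met.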

The third code reuse requirement

Not surprisingly, overriding equals() from a concrete superclass is seldom seen in Java code but there is one pertinent and frequently quoted example: java.sql.Timestamp. This class is a java.util.Date subclass that overrides equals to include a nanos field. Comparing a Date with a Timestamp returns true if the dates match irrespective of the value of nanos. Comparing a Timestamp with a Date always returns false even with matching dates: 'because the nanos component of a date is unknown.'
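The asymmetry can be checked directly against the JDK classes. The millisecond value used below is arbitrary.

```java
import java.sql.Timestamp;
import java.util.Date;

// Demonstrates the asymmetric equality between java.util.Date and java.sql.Timestamp.
class TimestampAsymmetryDemo {
    public static void main(String[] args) {
        Date date = new Date(1000L);
        Timestamp stamp = new Timestamp(1000L);   // same instant, nanos == 0

        System.out.println(date.equals(stamp));   // true: Date compares getTime() only
        System.out.println(stamp.equals(date));   // false: the argument is not a Timestamp
    }
}
```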

If equals() was overridden in both classes using the implementation proposed here there would be an open choice as to whether to make both comparisons return false or to allow a Timestamp to be equal to a Date if and only if its nanos field was zero (or any other single value). The methodology also allows a Timestamp to have symmetrical equality with a Date if dates match or if a Timestamp is within 500,000 nanos of a Date but these options break the equals() specification requirement for transitivity while the zero nanos option does not.

Here the valid option where a Timestamp with a zero nanos field may be equal to a Date supplies an example of a third possible equality relationship between a sub and superclass and the third if seldom needed code reuse requirement.

Alternatives to the standard implementation

An alternative to producing a flawed equals() in a subclass using the standard or class comparison implementations is to use composition, as illustrated by Figure 1 - ClassB. Using composition to implement structural elements of a program or even a data type where it fits well is one thing. Being forced to use it to implement equals() is something else. Do this and we can be faced with another set of no-win choices: write a lot of methods that do nothing except call composed object methods or include a method that returns the composed object - a possible solution if the object is immutable, otherwise doing this is likely to cause more problems than it solves. There may still be a problem even when the returned object is a clone - the clone does not reflect legitimate modifications to the original.

The standard equals() implementation used in library classes leaves no satisfactory alternative to composition when equals() is overridden but otherwise it is not a desirable alternative.

Taking on board the fact that neither of the commonly used implementations works satisfactorily, an Artima article posts a working solution for overriding equals() to get two of the code reuse requirements identified above:

The third code reuse requirement is seldom needed and is therefore relatively unimportant but a problem with this implementation is that it uses a canEqual(Object) method that introduces an additional instanceof check and execution overhead for all calls to equals() even when comparing same-class objects. Consequently it cannot be seen as an optimal replacement for the standard implementation that can be put in place as a matter of course to allow subclasses to be derived at a later time.

Another downside is that the basic cost of a call doubles for an overridden equals in a subclass: super equals() is called to get superclass field comparison involving a second call to canEqual() and duplicate instanceof tests. The primary indication for using hashed storage is to provide fast access making this implementation not ideal for specific use even when the requirement to override equals() is known in advance.
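For reference, the canEqual(Object) pattern is usually written along the following lines. This is a sketch of the general pattern rather than the article's exact listing; Point and ColoredPoint are illustrative names.

```java
// Sketch of the canEqual pattern; class and field names are illustrative.
class Point {
    private final int x;

    Point(int x) {
        this.x = x;
    }

    // Each class that redefines equality overrides this to name itself.
    public boolean canEqual(Object other) {
        return other instanceof Point;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof Point)) {
            return false;
        }
        Point that = (Point) obj;
        return that.canEqual(this) && x == that.x;   // extra instanceof on every call
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(x);
    }
}

class ColoredPoint extends Point {
    private final String color;

    ColoredPoint(int x, String color) {
        super(x);
        this.color = color;
    }

    @Override
    public boolean canEqual(Object other) {
        return other instanceof ColoredPoint;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof ColoredPoint)) {
            return false;
        }
        ColoredPoint that = (ColoredPoint) obj;
        // super.equals() repeats the instanceof test and calls canEqual() again.
        return that.canEqual(this) && color.equals(that.color) && super.equals(that);
    }

    @Override
    public int hashCode() {
        return 31 * super.hashCode() + color.hashCode();
    }
}
```

Comparing two ColoredPoint objects runs two instanceof tests in equals() and two canEqual() calls, which is the doubled cost described above.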

The artima article shows a way forward and that a solution is possible but additional requirements to be addressed here are to devise an equals() implementation with a minimal code footprint and execution overhead that can be used as a matter of course in any class that overrides equals(). The implementation must allow fully working subclasses to be derived in all three cases where they may be required to meet OOP standards for code reuse.

We have now identified the requirements that an optimal equals() implementation must meet and can assess how the proposed alternative stands up.

Here we consider a base class that is a concrete subclass of Object. Using an abstract base has only minor implications and then only if it declares significant fields. The simplest possible class will supply examples: one with a single integer field is used. The number of fields or even a complete audit of the current state of a data structure as used by HashMap has no bearing on the problem. This class is called GreenBase in the diagram and is the root of a green comparison set.

With the exception of ZGreen, GreenBase descendants in the green comparison set do not override equals(). They introduce no new fields that figure in comparison and are comparable using the equals() inherited from GreenBase. They can of course introduce other fields, the one on the left could be a LinkedGreen with a field referencing the next object in a list.

Implementing these classes requires that the equals() implementation satisfies the second requirement for code reuse: ability to derive a class that declares no new significant fields and is a member of its superclass comparison set. From the Figure we can see that there is a secondary requirement - that the condition used to determine comparison set membership should include all possible descendants of the base on all branches.

The standard implementation satisfies both these requirements using instanceof the-base but now we come to the apparently conflicting code reuse requirement: ability to derive a class that declares zero or more new significant fields and is not a member of its superclass comparison set. The equals() in GreenBase inherited by green subclasses can do an initial assessment for comparison set membership using instanceof or some similar isA condition but if it is going to make provision for derived comparison sets it must exclude instances of RedBase, YellowBase and BlueBase that supply the base class for new sets. All these classes are also GreenBase subclasses and instances will pass the initial assessment.

This problem raised the question as to what additional isA or class comparison check could be put in equals() to exclude instances on the same or a different branch from the root. After some thought and deciphering of the contracts of various Class methods it was decided that there wasn't one. We could perhaps give all comparable objects an additional property, say Class comparisonBase() or a similar final field, but doing this adds additional coding and additional logic in equals().

Otherwise only the received object can do the necessary checks but it was also found that when the decision on membership was passed to that object checking ancestry was no longer a requirement. The same is true when the received object's class has the third equality relationship, ZGreen in Figure 3, but as discussed below an additional isA check is then required.

The following sections assess how this implementation fares with regard to the criteria for minimum code size and effect on execution efficiency over the standard implementation.

Minimizing the code footprint

The requirement here was to develop an implementation that was altogether trivial to code and required a minimum of extra coding over the standard equals(). GreenBase supplies a standard where differences are not overwhelmed by other code when the two implementations are shown side by side:

Excluding braces from the calculation the difference amounts to the two extra lines of code used by the fEquals(GreenBase) method. The field comparison method can have any name even equals(GreenBase) but it is suggested that standardizing the name to fEquals will facilitate use of this implementation and that fEquals should be overloaded when equals() is overridden in descendants to maintain consistency in use.

The extra coding is nothing compared with the complication of using composition to get a subclass equivalent and the chore of writing many do-nothing methods in a real life situation.

Minimizing the overhead

In GreenBase the equals() implementation introduces the smallest possible overhead above the standard, only the cost of calling fEquals(). The initial task is to exclude equals(Object) arguments that are instances of any class not a GreenBase or a subclass. The instanceof operator includes a check that the argument is not null and provides an economical solution:

Having determined that the equals(Object) argument is a GreenBase or a subclass and therefore a type that has an fEquals(GreenBase) method the next step is to call the argument's version passing a reference to the current object. This call is the only addition to the overhead for an implementation with field comparison done inline.

The decision on equality is passed to the received object. It is already known that the object is a GreenBase or a subclass. If the object's class is a member of the same comparison set it will have the same version of fEquals(GreenBase), which does the field comparison and returns the result, otherwise an overridden version simply returns false.
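A minimal sketch of a GreenBase along these lines follows. The names GreenBase, fEquals and LinkedGreen come from the text; the int value field, the protected visibility of fEquals and the hashCode() override are assumptions.

```java
// Sketch of the proposed implementation: equals() checks the type, then hands the
// decision to the argument's fEquals(). Field name and visibility are assumptions.
class GreenBase {
    private final int value;

    GreenBase(int value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object obj) {
        // instanceof rejects null and anything that is not a GreenBase;
        // the received object then decides on equality.
        return obj instanceof GreenBase && ((GreenBase) obj).fEquals(this);
    }

    // Field comparison, factored off so that subclasses can override it.
    protected boolean fEquals(GreenBase other) {
        return value == other.value;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(value);
    }
}

// A member of the green comparison set: no new significant fields, no equals() code.
class LinkedGreen extends GreenBase {
    LinkedGreen next;

    LinkedGreen(int value) {
        super(value);
    }
}

class GreenBaseDemo {
    public static void main(String[] args) {
        GreenBase green = new GreenBase(7);
        LinkedGreen linked = new LinkedGreen(7);
        System.out.println(green.equals(linked));   // true
        System.out.println(linked.equals(green));   // true
        System.out.println(green.equals(null));     // false
    }
}
```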

Deriving a new comparison set

RedBase introduces a new significant field, is not comparable with a GreenBase or instances of any green subclass and forms the root of the red comparison set:

RedBase overrides equals() with a new instanceof condition excluding instances of GreenBase and its other subclasses and calls a new fEquals(RedBase) overloading of the field comparison method on the received object. fEquals(GreenBase) is overridden to return false.

When a RedBase equals() receives a GreenBase or an instance of any of its other subclasses, including those that may have overridden equals(), the instanceof RedBase condition fails and false is returned. When the equals() in a GreenBase or another subclass is passed a RedBase either its overridden instanceof fails − it is a member of another derived comparison set, or it is a member of the GreenBase comparison set and calls the object's overridden version of fEquals(GreenBase). Either case gets a symmetric false result for the two comparisons − exactly what is required.

Implementing classes that form the root of the other two comparison sets shown in Figure 3 follows the same pattern. BlueBase for example, inherits RedBase's constant false fEquals(GreenBase). It will override fEquals(RedBase) to return a constant false and implement a new fEquals overloading to deal with objects in the blue comparison set.
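The pattern described for RedBase and BlueBase can be sketched as follows. The class names and fEquals come from the text; the field names, constructors and protected visibility are assumptions, and GreenBase is repeated so that the example stands alone.

```java
// Sketch of derived comparison sets; field names are assumptions.
class GreenBase {
    private final int value;

    GreenBase(int value) { this.value = value; }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof GreenBase && ((GreenBase) obj).fEquals(this);
    }

    protected boolean fEquals(GreenBase other) { return value == other.value; }

    @Override
    public int hashCode() { return Integer.hashCode(value); }
}

class RedBase extends GreenBase {
    private final int red;   // new significant field

    RedBase(int value, int red) { super(value); this.red = red; }

    @Override
    public boolean equals(Object obj) {
        // 'this' is statically a RedBase, so the fEquals(RedBase) overload is chosen.
        return obj instanceof RedBase && ((RedBase) obj).fEquals(this);
    }

    @Override
    protected boolean fEquals(GreenBase other) { return false; }   // leaves the green set

    protected boolean fEquals(RedBase other) {
        // super.fEquals resolves to GreenBase.fEquals(GreenBase): compares value.
        return red == other.red && super.fEquals(other);
    }
}

class BlueBase extends RedBase {
    private final int blue;   // new significant field

    BlueBase(int value, int red, int blue) { super(value, red); this.blue = blue; }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof BlueBase && ((BlueBase) obj).fEquals(this);
    }

    @Override
    protected boolean fEquals(RedBase other) { return false; }     // leaves the red set

    protected boolean fEquals(BlueBase other) {
        // super.fEquals resolves to RedBase.fEquals(RedBase): compares red and value.
        return blue == other.blue && super.fEquals(other);
    }
}
```

The inherited hashCode() stays consistent with equals() because equal objects always share the same value; a subclass may still override it to spread hash codes more evenly.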

The logic is completely straightforward, but now we move on to implementing the unusual equality relationship where a class declares a new field that is used for comparison with instances of the same class and its subclasses but comparison with a superclass and other subclasses in the comparison set is permitted.

Equality of the third kind

ZGreen introduces a z field that is significant for equality. Comparison with instances of GreenBase and all other classes in the green comparison set is allowed.

The only circumstance where doing this makes any sense is if GreenBase is seen as supplying some kind of default z. If z is a string the default might be "" or null. If it is an integer it could be zero but the default can be any value from the range, yes even 42 is a possible candidate for an integer default. No approximate or wildcard value is permitted for the default or transitivity will no longer apply.

A default of zero seems to make a little sense so ZGreen is implemented with an integer z field and all other objects in the green comparison set are seen as having z == 0:

In GreenBase the first instanceof condition admits objects that are GreenBase or a subclass and false is returned for any other object. Here a second grouping of the comparison set is made to include instances of ZGreen and its subclasses using a second instanceof comparison. Other objects that may or may not belong to the green comparison set are treated separately.

ZGreen arguments are dealt with by calling the object's fEquals(ZGreen). For other types, a check for the default z value is made and if this succeeds the object's fEquals(GreenBase) is called. The local fEquals(GreenBase) is overridden to deal with direct calls from other objects in the comparison set when their equals() receives a ZGreen argument. Other GreenBase subclass objects will not be calling fEquals(GreenBase) because a ZGreen will fail their initial instanceof condition.

hashCode() is not overridden - the inherited version returns a value based on the inherited value field so that GreenBase and ZGreen objects with a matching value both have the same hashcode.
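A sketch of ZGreen that follows this description is given below, with zero as the default z. The field names and protected visibility are assumptions, and a minimal GreenBase is repeated so that the example stands alone.

```java
// Sketch of the third equality relationship; field names are assumptions.
class GreenBase {
    private final int value;

    GreenBase(int value) { this.value = value; }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof GreenBase && ((GreenBase) obj).fEquals(this);
    }

    protected boolean fEquals(GreenBase other) { return value == other.value; }

    @Override
    public int hashCode() { return Integer.hashCode(value); }
}

class ZGreen extends GreenBase {
    private final int z;   // significant; every other green object is treated as z == 0

    ZGreen(int value, int z) { super(value); this.z = z; }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof GreenBase)) {
            return false;
        }
        if (obj instanceof ZGreen) {
            return ((ZGreen) obj).fEquals(this);            // ZGreen with ZGreen
        }
        // Other green objects: equal only if this object holds the default z.
        return z == 0 && ((GreenBase) obj).fEquals(this);
    }

    // Called when a GreenBase-style equals() receives a ZGreen argument.
    @Override
    protected boolean fEquals(GreenBase other) {
        return z == 0 && super.fEquals(other);
    }

    protected boolean fEquals(ZGreen other) {
        return z == other.z && super.fEquals(other);
    }
}

class ZGreenDemo {
    public static void main(String[] args) {
        GreenBase green = new GreenBase(5);
        System.out.println(green.equals(new ZGreen(5, 0)));             // true
        System.out.println(new ZGreen(5, 0).equals(green));             // true
        System.out.println(green.equals(new ZGreen(5, 1)));             // false
        System.out.println(new ZGreen(5, 1).equals(green));             // false
        System.out.println(new ZGreen(5, 1).equals(new ZGreen(5, 1)));  // true
    }
}
```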

Implementing an isA comparison

In ZGreen the second instanceof comparison includes a redundant check that the Object argument is not null. It can be eliminated by substituting an isA comparison using the Class method isInstance(Object).

Interpreting the meaning of Class' isInstance() method is quite brain numbing. One trick is to correct the English in your head to what it actually means - 'obj.getClass().hasInstance(this)' for example. Another trick is to read the statement right to left: this isInstance obj.getClass(). Failing these you can put your own isA() method in the base class - the only con is execution efficiency:
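The reading direction of isInstance() can be checked with a small sketch. The classes below are bare stubs that reuse the GreenBase and ZGreen names with all equality code omitted, and the isA(Class) helper is a hypothetical illustration of the idea, not a JDK method.

```java
// Stubs used only to illustrate Class.isInstance(); equality code is omitted.
class GreenBase {
    // Hypothetical helper: reads left to right as "this isA type".
    boolean isA(Class<?> type) {
        return type.isInstance(this);
    }
}

class ZGreen extends GreenBase {
}

class IsInstanceDemo {
    public static void main(String[] args) {
        GreenBase green = new GreenBase();
        GreenBase zGreen = new ZGreen();

        // Class.isInstance is the dynamic equivalent of instanceof.
        System.out.println(ZGreen.class.isInstance(zGreen));      // true
        System.out.println(ZGreen.class.isInstance(green));       // false
        System.out.println(ZGreen.class.isInstance(null));        // false

        // Read right to left: "zGreen is an instance of green's class".
        System.out.println(green.getClass().isInstance(zGreen));  // true
        System.out.println(zGreen.getClass().isInstance(green));  // false

        System.out.println(zGreen.isA(GreenBase.class));          // true
        System.out.println(green.isA(ZGreen.class));              // false
    }
}
```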

The equals() implementation shown here produces a correct comparison that complies with the equals() meta-contract for all of the three possible equality relationships that a subclass can have with a superclass. Analysis using the comparison-set model demonstrates that there are only three such relationships that equals() needs to support.

It is suggested that any additional overhead from calling a factored-off field comparison method is insignificant in most cases and that the ability to produce subclasses with a fully functional overridden equals(), allowing OOP code reuse without employing composition indicates that this implementation should be used as a matter of course when coding any class that overrides equals().

There are other related topics that may be addressed in a later posting: using an abstract superclass, the HashMap put(K, V) gotcha and implications of the Liskov Substitution Principle. Currently all that needs to be said on the LSP is that the second Liskov and Wing paper excludes relationships between objects from general discussion but groups this aspect with other 'safety properties (nothing bad happens)'. Certainly the LSP does not appear to address anything beyond Object's view of equality. It is down to us to make sure 'nothing bad happens' when equals() is overridden but it will certainly help if the implementation used works in the first place.

3 comments:

First, congratulations on a very interesting article. I particularly like your breakdown of the business of code reuse into the 3 scenarios you describe; however, I have two objections to your solution (which, incidentally, I partially implement in most of my objects; viz: an 'equalTo()' method that takes an object of the same type).

1. The simplicity of the 'canEqual' solution is sacrificed on the basis of efficiency (one less 'instanceof') which smacks to me of premature optimization.

2. Your solution presupposes a knowledge of the hierarchy; the 'canEqual' solution does not. It can be implemented by anyone wishing to extend another class.

That said, I would certainly consider it as a possibility if I was ever faced with scenario 3.

Thanks for your comments Winston - the first real feedback I have had. Glad you liked the article, bit long I thought. Most of it was written while trying to identify the problems - not quite in keeping with the altogether trivial solution maybe.

Seems to me my solution is about as simple as it can get: all that needs to be done to allow for any of the three scenarios in subclasses is to factor off field comparisons to a separate method - like your equalTo(). Additional overhead is minimal, less than canEqual() even in the base class and coding follows what might be done anyway.

On premature optimization, isn't fast access what using a hashed structure is all about? Otherwise a tree side-steps equality problems and provides optimal in-order retrieval if required. Unless optimization reduces clarity might as well code it that way straight off, don't you think?

Believe I see what you mean about knowledge of the hierarchy. If a subclass that does not override equalTo() is the immediate super then knowledge of the superclass where the last equalTo() is located is required. Not a great drawback perhaps - using @Override the compiler soon lets you know if you get it wrong.
