Java Ordered Collections: Trees and Skip Lists

by Mark Grand

Explore Java collections framework classes that implement the SortedSet interface and how they keep a set of objects in a guaranteed order, independent of the order that they are added to the collection.

The Java collections framework includes classes you use to maintain collections of other objects. These collection classes implement interfaces that support different organizations of the objects they contain. For example, classes that implement the List interface keep objects in the order that they are added to the collection and can take a long time to search (proportionate to the number of objects in the collection). Classes that implement the Map interface keep objects in no particular order but are very fast to search (search time is independent of the number of objects in the collection). Classes that implement the SortedSet interface keep a set of objects in a guaranteed order, independent of the order they are added to the collection; this makes them fast to search.

SortedSet

Collection classes that implement the SortedSet interface impose a total ordering on the objects that they contain. There are two kinds of orderings that can be used with a SortedSet.

Instances of SortedSet classes can use the natural ordering of objects in the collection if the objects in the collection implement the Comparable interface. This means that the order of the objects is determined by the objects themselves.

A SortedSet collection imposes a natural ordering on the objects it contains by calling the compareTo method that is part of the SortedSet interface. An object's compareTo method takes one argument that is the other object it compares the object to. The compareTo method returns a positive integer, 0, or a negative integer depending on whether the object is greater than, equal to, or less than the other object.

A SortedSet object can use the natural ordering of objects it contains only if

all of the objects implement the a SortedSet interface

all of the objects can be passed to each other's compareTo method

and the natural order is the order that is wanted.

If any of these things are false, natural ordering cannot be used.

The alternative to using natural ordering is to use an object that implements the Comparator interface. The Comparator interface includes a compare method that takes two objects as its arguments. The compare method returns a positive integer, 0, or a negative integer depending on whether the first object is greater than, equal to, or less than the second object. A Comparator object can be used to impose any order on the contents of a SortedSet, so long as all possible comparisons produce consistent results.

The SortedSet interface does not specify what determines whether a SortedSet object uses the natural order of its contents or uses a Comparator object to impose an ordering. Implementations of SortedSet commonly use different constructors to determine this issue. Constructors that take a Comparator argument create a SortedSet object that uses the Comparator object. Constructors that do not take a Comparator argument create a SortedSet object that uses the natural ordering of its contents.

Java 1.6 comes with two classes that implement the SortedSet interface. These are java.util.TreeSet and java.util.concurrent.ConcurrentSkipListSet. Choosing to use one of these classes has implications for performance and concurrency. To understand these implications, you need to have some understanding of the data structures these classes use to organize a set of objects.

TreeSet

The TreeSet class uses a binary tree to organize a set of objects. Figure 1 shows an example of this data structure that contains Integer objects with the values 1 to 9.

Figure 1: Binary Tree

A TreeSet creates objects, called nodes, for its own internal use. A TreeSet object uses nodes to organize objects added to the TreeSet into a binary tree as shown in Figure 1. A TreeSet object creates a node for each object added to the TreeSet. Each node references the object it is responsible for organizing. A node also refers to zero, one, or two other nodes.

There are a few different roles that a node can play in the binary tree:

If a node refers to other nodes, it is called the parent of the nodes it refers to.

The other nodes a parent node refers to are called its children.

Only one node in a binary tree has no parent. This node is called the root of the tree. In Figure 1, the root is drawn at the top of the diagram.

Nodes in a binary tree that have no children are called leaves. In Figure 1, the leaf nodes are drawn towards the bottom of the diagram.

Nodes that are not the root or a leaf node are called interior nodes.

Searching a binary tree is fast (it takes time proportionate to log(n), where n is the number of objects in the tree) because of two rules that the TreeSet enforces on the organization of the binary tree.

The first rule is about the relationship between a parent node and its child nodes. One of a parent node's references is either null or refers to a child node with a lesser value than the parent. The other node reference is either null or refers to a child node with a greater value that the parent. In Figure 1, the child on the left always has a lesser value than its parent; the child on the right always has a greater value than its parent.

Using this first rule, the tree shown in Figure 1 could have been created by adding values to an empty TreeSet in this order: 5, 8, 6, 2, 4, 1, 9, 3, 7. Figure 2 shows the organization of the binary tree after each value is added.

Using just the one rule worked well when the values are added (or removed from) the tree in an arbitrary order. To find something in this tree requires at most four comparisons. For example, to find 3 in this tree, the TreeSet would look at the four nodes shown in red in Figure 3.

Figure 3: Nodes looked At When Searching for 3

The tree produced using just the one rule is not as fast to search when values are added to the tree in sorted order. For example, adding the same values to the TreeSet in the order 1, 2, 3, 4, 5, 6, 7, 8, 9 produces a binary tree that is internally organized, as shown in Figure 4.

Figure 4: Adding Values in Ascending Order

The binary tree shown in Figure 4 does not branch. To find the node for 9 requires searching very other node in the binary tree. A tree like this is not very fast to search. This brings me to the second rule that the TreeSet uses for its binary tree.

The second rule is that the tree must be balanced. Balanced means that the tree branches and the leaf nodes are about the same distance from the root of the tree. By distance, I mean the number of nodes between the root and the leaf. To keep its binary tree balanced, a TreeSet object may need to adjust the organization of its binary tree after each add or remove operation.

If the tree is perfectly balanced, the leaf node farthest from the root node is no more than one node farther from the root node than the nearest leaf node. The binary tree in Figure 1 is perfectly balanced.

On average, it takes less time to search for objects in a perfectly balanced binary tree than to search in a tree that is less well balanced. However, it turns out algorithms for keeping a tree perfectly balanced can require a lot of effort in some cases. Algorithms that keep a binary tree partially balanced within a guaranteed limit require a lot less effort.

The effort to keep a tree perfectly balanced may be justified if it is expected that many more search operations than adds or removes will be made. However, because TreeSet is a general purpose class, it uses an algorithm called red black tree. Red black tree guarantees that the distance from the root to the most distant leaf is no more than twice the distance from the root to the nearest leaf.

Earlier in this article, I mentioned that the choice to use a TreeSet has concurrency implications. None of its methods are synchronized. If two threads attempt to modify a TreeSet at the same time, it will break. To avoid this problem, it is up to the caller to arrange for calls to a TreeSet to be single threaded. If a TreeSet object's methods will have multiple callers, the solution is to have them call the TreeSet object's methods through a wrapper object whose methods are synchronized. This is commonly arranged code similar to this:

SortedSet set s = Collections.synchronizedSortedSet(new TreeSet());

The challenge this presents for concurrency is that access to the entire TreeSet object is forced to be single threaded. If there are more than a few threads trying to add objects to or remove object from the same TreeSet object, they will all need to wait for their turn. For situations like this, the better choice is ConcurrentSkipListSet.

ConcurrentSkipListSet

java.util.concurrent.ConcurrentSkipListSet is a thread-safe implementation of the SortedSet interface. Multiple threads can add and remove objects at the same time and work correctly. However, individual search, add, and remove operations take longer for a ConcurrentSkipListSet than for a TreeSet, so you usually should not use ConcurrentSkipListSet unless concurrency is an issue.

One other caution about the ConcurrentSkipListSet is that the size() method can take a long time. To avoid needing locks, the ConcurrentSkipListSet does not keep a running count of the number of objects that are currently in the set. For this reason, the size method works by iterating over the objects in the ConcurrentSkipListSet and counts them.

You now have the information you need to decide when it is appropriate to use a TreeSet or a ConcurrentSkipListSet. However, when using a ConcurrentSkipListSet, it may be helpful to understand the underlying data structure.

The ConcurrentSkipListSet is based on a data structure called a skip list. Unlike a binary tree, the internal organization of a skip list does not need to be adjusted because of the order that data is added or removed. The internal organization of a skip list is based on random numbers and is not dependent on the actual data in the skip list.

Figure 5 shows an example of a skip list.

Figure 5: Skip List

The first node in the skip list is called the header. It refers to other nodes or an object called NIL. NIL is at the end of every skip list. Both the header and NIL are in every skip list, even if the skip list is empty.

The header is the first node in multiple linked lists. These linked lists are called level 1, level 2, … In Figure 5, the level 1 list is shown towards the bottom of the diagram and the level 4 list is shown towards the top of the list.

Nodes that appear in the linked lists between the header and NIL have associated value objects. The order in which nodes appear in the linked lists is based on their value objects and determined by the natural ordering of the value objects or by a Comparator.

All of the nodes in a skip list are in the level 1 linked list. Roughly ½ of the nodes in a skip list are in the level 1 and 2 linked lists. About ½ of those nodes (about ¼ of all the nodes) are also in the level 3 linked list. If there are higher levels of linked lists, each linked list includes about ½ of the nodes in the linked list in the next lower level.

The algorithm for searching a skip list starts at the highest level linked list, looking for a node whose value is greater than or equal to the value being searched for. If the value of a node is equal to what is being searched for, the search is done. Otherwise, the search backs up a node and continues one level of linked list down until either the search is successful or there are no more nodes to search. To clarify the search process, here is an example:

Suppose that you are searching the skip list shown in Figure 6 for an object with the value 5. There are four levels in the skip list, so you start with the level 4 linked list. The first node on the level 4 linked list (after the header) has the value 2. Because 2 is less than 5, you look at the next node on the level 4 linked list.

The next node on the level 4 linked list is NIL. Because NIL is the end of the linked list, the search continues on the level 3 linked list.

The next node on the level 3 linked list after the node with the value 2 has the value 9. Because 9 is greater than 5, the search continues on the level 2 linked list.

The next node on the level 2 linked list after the node with the value 2 has the value 4. Because 4 is less than 5, the search continues with the next node on the level 2 linked list.

The next node on the level 2 linked list after the node with the value 4 has the value 6. Because 6 is greater than 5, the search continues on the level 1 linked list.

The next node on the level 1 linked list after the node with the value 4 has the value 5. Because that is equal to the value we are looking for, the search is done.

Figure 6 shows, in red, the links in the skip list that were traversed in the search for 5.

Figure 6: Skip List Search for 5

When a node is added to a skip list, a random number generator is used to determine what levels of linked list the node will be included in. There is a ½ chance that a new node will be included in the level 2 linked list. There is a ¼ chance that the new node will be included in the level 3 linked list. For each higher level of linked list, the chance of a new node being included in the linked list is ½ of the chance of its being included in the next lower level linked list.

SortedMap

The SortedMap interface is similar to the SortedSet interface in that it also involves a set of objects that are sorted by their value. Like SortedSet, Java 1.6 includes two implementations of the SortedMap interface. These are TreeMap and ConcurrentSkipListMap. The preceding discussion of TreeSet and ConcurrentSkipListSet also applies to TreeMap and ConcurrentSkipListMap

Summary

This article has explained when it is best to use a TreeSet or ConcurrentSkipListSet to manage an ordered set of objects within the SortedSet interface. It has also explained the data structures that these classes use to organize their contents.

About the Author

Mark Grand is a consultant and book author with over 30 years of experience who specializes in Distributed Systems, Object Oriented Design and Java. He was the architect of the first commercial business-to-business e-commerce product to use the Internet.

Mark Grand is most widely known for his best-selling design pattern books. Mark has taught for U.C. Berkeley, Sun, and other organizations.

Mark is based in the Atlanta area. He has been involved with object-oriented programming and design since 1982.