Eliminating some duplicates with the RecentSet
class

Abstract

In some situations, it may make sense to try to significantly reduce
the number of duplicates in a collection, while ensuring that memory usage
does not increase without bound. To help with this, I created a Python
utility called a RecentSet, which serves as a sort
of first-pass filter for significantly reducing the number of duplicates
in a collection.

Sets, both mathematically and in practice as data structures, are useful
for identifying and removing duplicate items from any collection. Python makes
this particularly easy for the developer, but as you might expect, its
implementation utilizes only memory for storage, so a large enough number of
unique items will cause the program to terminate for lack of memory. Somewhat
smaller numbers will simply make the program, and likely the system, run very
slowly, particularly if it is forced into frequent swapping.
Neither situation is particularly pleasant.

When I ran into this problem, I looked at using a database system to
eliminate the duplicates, since databases use disk-based data structures, and
disks are generally far larger than available memory. This works, but it can
still be expensive, because we must feed the entire collection of items to the
database system and allow it to eliminate the duplicates. If the entire
collection of items contains a large number of duplicates, then the size of
the total collection of items will be significantly larger than that of the
corresponding set of items. If we can somehow first reduce the size of the
total collection by eliminating some of the duplicates, then we can
significantly improve performance.

To do this, I implemented something that I called a
RecentSet, which provides a mechanism for approximately
determining whether an item is already a member of some abstract set. It works
by keeping track of a fixed number of items in the order in which they were
last checked for membership. This means that a
RecentSet works best on collections that have clusters
of identical items. In fact, the membership test will be completely accurate
if and only if the distance between identical members within the collection is
always less than or equal to the size of the RecentSet
(and the accuracy generally diminishes as this distance increases). Thus, a
RecentSet serves as a sort of first-pass filter for
significantly reducing the number of duplicates in a collection.

I needed to be able to efficiently order items and move items from their
current position in the order to the head of the order. These operations are
best suited to a doubly-linked list, but it doesn't seem that an
implementation of this data structure exists in the core Python library, so I
implemented my own (or, more precisely, a class for nodes that combine to form
a doubly-linked list, called DllNode). I've provided
this implementation below:

And the following is the implementation of
RecentSet, which depends on the previous doubly-linked
list implementation. It's currently quite simple. The name is perhaps
misleading, because it does not actually represent a set, but rather a sort of
approximate oracle about whether new items were present previously in an
abstract set.

Upon request, I would be happy to provide some examples of how you might
utilize a RecentSet. Until then, that will just have to
remain an exercise for the reader. I would also be happy to flesh these
components out further, to make them more generally useful, if anyone is
interested in using them.