Friday, January 4, 2013

This Week's Coding Practice: Linked Lists

Linked lists are elementary data structures that any programmer worth their salt should know back to front. In contrast to the contiguous array, linked lists naturally grow or contract as the data stored in them increases or decreases, respectively. Furthermore, the memory occupied by the data does not need to be allocated contiguously. However, this flexibility comes at a price. Accessing random elements in constant time is no longer possible -- with a linked list, accessing the nth element takes O(n) time.

There are many ways to skin a cat, and there are also many ways to implement linked lists. Some important decisions to make along the way include determining the degree of abstraction; picking between singly and doubly-linked lists; and picking between sentinel nodes and sentinel values. I'll discuss each decision briefly below.

At a high-level, a linked list can be defined recursively: a linked list is a node followed by a linked list. You have at most two entities to worry about -- the Node, which contains the data you want to store and a pointer to the rest of the linked list, and the List itself. Ideally, you want to model the list as a "black box" and abstract away the fact that it consists of a chain of Nodes. This requires defining and implementing the List and Node explicitly as separate data types (as separate structs, for example). Frameworks, like the C++ STL's vector or Python's list(), do this for you. Home-brew implementations of the linked list often avoid defining an explicit data type for the list itself, and model the list implicitly as a chain of nodes (see how I implemented Bins in a hash table a few weeks ago). This saves the developer the effort of having to define separate List and Node data types, at the expense of reduced abstraction. If the particular linked list implementation is only used internally (as opposed to being part of a framework), then the loss of abstraction is not a big deal.

List linking comes in two flavors: single and double. A node of singly-linked lists contains a pointer to the next node only. A node of a doubly-linked list contains pointers to both the previous and the next nodes. Singly-linked lists are the simplest and most flexible -- so flexible, in fact, that they can be "abused" by chaining lists together in a tree-like structure. Again, this flexibility comes at a price. Updating (inserting and removing elements) singly-linked lists can be a pain in the ass, since such operations require access to the node that is before the update location. Since singly-linked lists can only be traversed in one direction (from the head to the tail), locating this previous node requires a full traversal from the head of the list -- which can cost O(n). Doubly-linked lists allow the previous node to be determined directly, at the expense of increased storage overhead, due to the extra pointer in each node, and reduced flexibility -- abusing the list as a tree is no longer possible.

All good things come to an end, and linked lists are no exception (OK, with the exception of circular lists...). Primarily, there are two ways of indicating the end of a list: sentinel values and sentinel nodes. Sentinel values are special values like NULL that indicate that the previous or next node doesn't exist. It's a simple idea, but requires additional conditional code to handle search and retrieval operations near the head and tail of the list. Sentinel nodes simplify the code by removing such conditionals, at the expense of adding nodes that do not hold any real data.

How one goes about making the above-mentioned three decisions depends on the application -- what will you be using the linked list for? This week, the tasks were to implement a simple singly-linked list and:

Write a function to remove a list's 3rd from last element. Challenge: can you do it in a single list traversal?

Write a function to remove all duplicates from a linked list. Challenge: can you do it without storing any extra data?

Write a function to detect a loop in a linked list.

For the abstraction level, I decided to roll with abstracting away the Node and exposing only the top-level List. This looks a bit awkward at first, since all the List is is just a pointer to the first Node. However, it does simplify the representation of border cases, such as empty lists. I also decided not to use sentinels this time, as they would be more trouble than they are worth in some cases. Specifically, removing duplicates from a list without storing any extra data requires pre-sorting the list with an algorithm such as merge-sort. If we're using sentinels, then the divide step of merge-sort would require us to create additional sentinel nodes, and the merge step would require us to delete those additional sentinel nodes.