Languages that are purely functional or near-purely functional benefit from persistent data structures because they are immutable and fit well with the stateless style of functional programming.

But from time to time we see libraries of persistent data structures for (state-based, OOP) languages like Java. A claim often heard in favor of persistent data structures is that because they are immutable, they are thread-safe.

However, the reason that persistent data structures are thread-safe is that if one thread were to "add" an element to a persistent collection, the operation returns a new collection like the original but with the element added. Other threads therefore see the original collection. The two collections share a lot of internal state, of course -- that's why these persistent structures are efficient.

But since different threads see different states of data, it would seem that persistent data structures are not in themselves sufficient to handle scenarios where one thread makes a change that is visible to other threads. For this, it seems we must use devices such as atoms, references, software transactional memory, or even classic locks and synchronization mechanisms.
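To make the "atom" idea concrete, here is a minimal sketch in Java. It assumes a hand-rolled immutable cons list (`Node`) and uses `java.util.concurrent.atomic.AtomicReference` as a rough stand-in for a Clojure-style atom. The names `AtomDemo`, `Node`, `publish` and `run` are hypothetical and not taken from any library. The only mutable thing is the reference; each list it points to never changes.

```java
import java.util.concurrent.atomic.AtomicReference;

final class AtomDemo {
    // Minimal immutable singly linked list; null represents the empty list.
    static final class Node {
        final int head;
        final Node tail;
        Node(int head, Node tail) { this.head = head; this.tail = tail; }
    }

    static int size(Node list) {
        int n = 0;
        for (Node p = list; p != null; p = p.tail) n++;
        return n;
    }

    // The single mutable cell shared by all threads.
    static final AtomicReference<Node> shared = new AtomicReference<>(null);

    static void publish(int value) {
        // updateAndGet re-applies the function if another thread changed
        // the reference in the meantime, so no update is lost.
        shared.updateAndGet(old -> new Node(value, old));
    }

    // Starts `threads` workers that each publish `perThread` values,
    // then returns the size of the final list.
    static int run(int threads, int perThread) {
        shared.set(null);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) publish(i);
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return size(shared.get());
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000)); // prints 4000
    }
}
```

The persistent list supplies cheap, safe snapshots; the atomic reference supplies the visibility between threads that the list alone cannot.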

Why, then, is the immutability of PDSs touted as something beneficial for "thread safety"? Are there any real examples where PDSs help with synchronization or with solving concurrency problems? Or are PDSs simply a way to provide a stateless interface to an object in support of a functional programming style?

You keep saying "persistent". Do you really mean "persistent" as in "able to survive a restart of the program", or just "immutable" as in "never changes after its creation"?
–
Kilian Foth, Jun 29 '13 at 9:37

@KilianFoth Persistent data structures have a well-established definition: "a persistent data structure is a data structure that always preserves the previous version of itself when it is modified". So it's about re-using the previous structure when a new structure based on it is created rather than persistency as in "able to survive the restart of a program".
–
Michał Kosmulski, Jun 29 '13 at 10:02

Your question appears to be less about use of persistent data structures in non-functional languages and more about which parts of concurrency and parallelism aren't solved by them, regardless of paradigm.
–
delnan, Jun 29 '13 at 10:30

2 Answers
2

Persistent/immutable data structures don't solve concurrency problems on their own, but they make solving them much easier.

Consider a thread T1 that passes a set S to another thread T2. If S is mutable, T1 has a problem: it loses control of what happens with S. Thread T2 can modify it, so T1 can't rely at all on the contents of S. And vice versa: T2 can't be sure that T1 won't modify S while T2 operates on it.

One solution is to add some kind of a contract to the communication of T1 and T2 so that only one of the threads is allowed to modify S. This is error prone and burdens both the design and implementation.

Another solution is that T1 or T2 clone the data structure (or both of them, if they aren't coordinated). However, if S isn't persistent, this is an expensive O(n) operation.

If you have a persistent data structure, you're free of this burden. You can pass a structure to another thread and you don't have to care what it does with it. Both threads have access to the original version and can do arbitrary operations on it - it doesn't influence what the other thread sees.
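As a rough illustration of the T1/T2 scenario, the sketch below uses a tiny immutable cons list in place of the set S (an assumption made to keep the example short). The names `ShareDemo`, `Node`, `show` and `run` are hypothetical. Each thread derives its own version from S in O(1), and neither derived version affects S or the other thread's view.

```java
final class ShareDemo {
    // Minimal immutable singly linked list; null represents the empty list.
    static final class Node {
        final String head;
        final Node tail;
        Node(String head, Node tail) { this.head = head; this.tail = tail; }
    }

    static String show(Node list) {
        StringBuilder sb = new StringBuilder("[");
        for (Node p = list; p != null; p = p.tail) {
            sb.append(p.head);
            if (p.tail != null) sb.append(", ");
        }
        return sb.append("]").toString();
    }

    // Returns the three views: S itself, T1's version, T2's version.
    static String[] run() {
        final Node s = new Node("b", new Node("a", null)); // S = [b, a]
        final Node[] t2View = new Node[1];

        // T2 receives S and "adds" an element; this creates a new head
        // node that points at S and leaves S untouched.
        Thread t2 = new Thread(() -> t2View[0] = new Node("from-T2", s));
        t2.start();

        // T1 does the same concurrently, without any locking or cloning.
        Node t1View = new Node("from-T1", s);

        try {
            t2.join(); // join makes T2's write to t2View visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new String[] { show(s), show(t1View), show(t2View[0]) };
    }

    public static void main(String[] args) {
        for (String view : run()) System.out.println(view);
        // prints:
        // [b, a]
        // [from-T1, b, a]
        // [from-T2, b, a]
    }
}
```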

Ah, so "thread safety" in this context just means that one thread doesn't have to worry about other threads destroying the data they see, but has nothing to do with synchronization and dealing with data we want to be shared between threads. That's in line with what I thought, but +1 for elegantly stating "don't solve concurrency problems on their own."
–
Ray Toal, Jun 29 '13 at 17:59

@RayToal Yes, in this context "thread safe" means exactly that. How data are shared between threads is a different problem, which has many solutions, as you've mentioned (personally I like STM for its composability). Thread safety ensures that you don't have to worry what happens with data after being shared. This is actually a big deal, because threads don't need to synchronize who works on a data structure and when.
–
Petr Pudlák, Jun 29 '13 at 18:17

@RayToal This allows elegant concurrency models such as actors, which spare developers from having to deal with explicit locking and thread management, and which rely on immutability of messages - you don't know when a message is delivered and processed, or to which other actors it's forwarded.
–
Petr Pudlák, Jun 29 '13 at 18:18

Thanks Petr, I'll give actors another look. I'm familiar with all of the Clojure mechanisms, and did note that Rich Hickey explicitly chose to not use the actor model, at least as exemplified in Erlang. Still, the more you know the better.
–
Ray Toal, Jun 29 '13 at 19:36

@RayToal An interesting link, thanks. I only used actors as an example; I'm not saying they'd be the best solution. I haven't used Clojure, but it seems that its preferred solution is STM, which I'd definitely prefer over actors. STM also relies on persistence/immutability - it wouldn't be possible to restart a transaction if it irrevocably modified a data structure.
–
Petr Pudlák, Jun 29 '13 at 21:00

One can imagine a data structure which would be persistent but mutable. For example, you could take a linked list, represented by a pointer to the first node, and a prepend-operation which would return a new list, consisting of a new head node plus the previous list. Since you still have the reference to the previous head, you can access and modify this list, which has meanwhile become also embedded inside the new list. While possible, such a paradigm doesn't offer the benefits of persistent and immutable data structures, e.g. it is certainly not thread safe by default. However, it may have its uses as long as the developer knows what they're doing, e.g. for space efficiency. Also note that while the structure may be mutable at the language level in that nothing prevents the code from modifying it, it may in practice be used as if it were immutable: the application logic may by convention not mutate the state even though theoretically it could.
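The persistent-but-mutable list described above can be sketched as follows. This is only an illustration under assumed names (`MutablePersistentDemo`, `Node`, `prepend`, `sum` and `run` are all hypothetical): the `value` field is deliberately left non-final so that a write through the old head becomes visible inside the newer list that embeds it.

```java
final class MutablePersistentDemo {
    static final class Node {
        int value;        // deliberately NOT final: nodes can be mutated
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Persistent prepend: the new list re-uses the old list as its tail.
    static Node prepend(int value, Node list) {
        return new Node(value, list);
    }

    static int sum(Node list) {
        int total = 0;
        for (Node p = list; p != null; p = p.next) total += p.value;
        return total;
    }

    // Returns the sum of the newer list before and after the older
    // list is modified through its own head reference.
    static int[] run() {
        Node older = prepend(2, prepend(1, null)); // [2, 1]
        Node newer = prepend(3, older);            // [3, 2, 1]
        int before = sum(newer);                   // 3 + 2 + 1
        older.value = 20;                          // mutate via the old head
        int after = sum(newer);                    // 3 + 20 + 1
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " -> " + r[1]); // prints 6 -> 24
    }
}
```

If another thread performed the write to `older.value`, readers of `newer` would additionally face a data race, which is why such a structure is not thread safe by default.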

So long story short, without immutability (enforced by the language or by convention), persistence of data structures loses some of its benefits (thread safety) but not others (space efficiency for some scenarios).

As for examples from non-functional languages, Java's String.substring() uses what I would call a persistent data structure, at least in JDKs before Java 7 update 6 (later versions copy the characters instead). In those older versions, the String is represented by an array of characters plus an offset and a count delimiting the range of the array which is actually used. When a substring is created, the new object re-uses the same character array, only with a modified offset and count. Since String is immutable, it is (with respect to the substring() operation, not others) an immutable persistent data structure.

The immutability of data structures is the part relevant to thread safety. Their persistence (re-use of existing chunks when a new structure is created) is relevant to efficiency when working with such collections. Since they are immutable, an operation like adding an item doesn't modify the existing structure but returns a new one that also contains the additional element. If the whole structure were copied each time, then starting with an empty collection and adding 1000 elements one by one to end up with a 1000-element collection would create temporary collections holding 0+1+2+...+999 = 499500 elements in total, which would be a huge waste. With persistent data structures this can be avoided: the 1-element collection is re-used in the 2-element one, which is re-used in the 3-element one, and so on, so that in the end no garbage nodes are allocated; every node ends up being used in the final state of the data structure.
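The arithmetic can be checked with a small sketch. It compares a copy-on-add array with a persistent cons list; the names `SharingCost`, `Node`, `copiedElements` and `allocatedNodes` are hypothetical, and the cons list is only one simple kind of persistent structure, chosen here for brevity.

```java
import java.util.Arrays;

final class SharingCost {
    // Minimal immutable singly linked list; null represents the empty list.
    static final class Node {
        final int head;
        final Node tail;
        Node(int head, Node tail) { this.head = head; this.tail = tail; }
    }

    // Copy-on-add: every "add" copies all existing elements into a new
    // array. Returns how many elements were copied in total.
    static long copiedElements(int n) {
        int[] current = new int[0];
        long copied = 0;
        for (int i = 0; i < n; i++) {
            int[] next = Arrays.copyOf(current, current.length + 1);
            copied += current.length;   // elements carried over by the copy
            next[current.length] = i;   // the newly added element
            current = next;
        }
        return copied;
    }

    // Persistent prepend: every "add" allocates exactly one node and
    // re-uses the previous list as its tail.
    static long allocatedNodes(int n) {
        Node list = null;
        long allocated = 0;
        for (int i = 0; i < n; i++) {
            list = new Node(i, list);
            allocated++;
        }
        return allocated;
    }

    public static void main(String[] args) {
        System.out.println(copiedElements(1000)); // prints 499500
        System.out.println(allocatedNodes(1000)); // prints 1000
    }
}
```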

Sometimes it's useful to have quasi-immutable objects in which all but one aspect of state is immutable: the ability to make an object whose state is almost like a given object. For example, an AppendOnlyList<T> backed by power-of-two growing arrays could produce immutable snapshots without having to copy any data for each snapshot, but one could not produce a list which contained the contents of such a snapshot, plus a new item, without recopying everything to a new array.
–
supercat, Feb 27 '14 at 21:43