Anti-pattern: parallel collections

(Note that I’m not talking about "processing collections in parallel", which is definitely not an anti-pattern…)

I figured it was worth starting to blog about anti-patterns I see frequently on Stack Overflow. I realize that some or all of these patterns may be collected elsewhere, but it never hurts to express such things yourself… it’s a good way of internalizing information, aside from anything else. I don’t guarantee that my style of presenting these will stay consistent, but I’ll do what I can…

The anti-patterns themselves are likely to be somewhat language-agnostic, or at the very least common between Java and C#. I’m likely to post code samples in C#, but I don’t expect it to be much of a hindrance to anyone coming from a background in a similar language.

Context

You have related pieces of data about each of several items, and want to keep this data in memory. For example, you’re writing a game and have multiple players, each with a name, score and health.

Anti-pattern

Each kind of data is stored (all the names, all the scores, all the health values) in a separate collection. Typically I see this with arrays. Then each time you need to access related values, you need to make sure you’re using the same index for each collection.
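As a minimal sketch of what this tends to look like (the `Game` class, its field names and the sample values are all hypothetical, chosen only for illustration):

```csharp
using System;

class Game
{
    // Three parallel arrays: index i in each array refers to the same player.
    // Nothing in the type system enforces that relationship.
    private readonly string[] names = { "Alice", "Bob", "Carol" };
    private readonly int[] scores = { 120, 75, 200 };
    private readonly int[] health = { 90, 40, 100 };

    // Every access has to use the same index for each array.
    public string Describe(int index)
    {
        return $"{names[index]}: score = {scores[index]}, health = {health[index]}";
    }

    public void PrintAll()
    {
        // An index-based loop is needed, because the index is the only thing
        // tying the three arrays together.
        for (int i = 0; i < names.Length; i++)
        {
            Console.WriteLine(Describe(i));
        }
    }
}
```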

Preferred approach

Code written this way fails to represent an entity which seems pretty obvious when you look at the description of the data: a player. Whenever you find yourself describing pieces of data which are closely related, you should make sure you have some kind of representation of that in your code. (In some cases an anonymous type is okay, but often you’ll want a separate named class.)

Once you’ve got that type, you can use a single collection, which makes the code much cleaner to work with.
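A sketch of the single-collection version, assuming a hypothetical `Player` class with the three values from the example (the names and sample data are illustrative, not taken from any real code base):

```csharp
using System;
using System.Collections.Generic;

class Player
{
    public string Name { get; }
    public int Score { get; set; }
    public int Health { get; set; }

    public Player(string name, int score, int health)
    {
        Name = name;
        Score = score;
        Health = health;
    }
}

class Game
{
    // One collection; each element carries all the values for one player.
    private readonly List<Player> players = new List<Player>
    {
        new Player("Alice", 120, 90),
        new Player("Bob", 75, 40),
        new Player("Carol", 200, 100)
    };

    public IEnumerable<string> DescribePlayers()
    {
        // No index is involved at all.
        foreach (Player player in players)
        {
            yield return $"{player.Name}: score = {player.Score}, health = {player.Health}";
        }
    }
}
```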

Note how we can now use a foreach loop to iterate over our players, because we don’t need to use the same index for two different collections.

Once you perform this sort of refactoring, you may well find that there are other operations within the Game class which would be better off in the Player class. For example, if you also had a Level property, and increasing that would automatically increase a player’s health and score, then it makes much more sense for that "level up" operation to be in Player than in Game. Without the Player concept, you’d have nowhere else to put the code, but once you’ve identified the relationship between the values, it becomes much simpler to work with.
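For instance, a "level up" operation might look something like this once it lives in `Player`. This is only a sketch: the `LevelUp` method name and the bonus amounts are assumptions made up for the example.

```csharp
using System;

class Player
{
    public string Name { get; }
    public int Level { get; private set; } = 1;
    public int Score { get; private set; }
    public int Health { get; private set; }

    public Player(string name, int score, int health)
    {
        Name = name;
        Score = score;
        Health = health;
    }

    // The related values change together, in one place.
    // The bonus amounts below are arbitrary illustrative numbers.
    public void LevelUp()
    {
        Level++;
        Health += 10;
        Score += 100;
    }
}
```

With parallel arrays, the same operation would have to update three separate arrays at a matching index from somewhere inside `Game`.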

It’s also much easier to modify a single collection than multiple ones. For example, if we wanted to add or remove a player, we now just need to make a change to a single collection, instead of making sure we perform the same operation to each "single value" collection in the original code. This may sound like a small deal, but it’s easy to make a mistake and miss out on one of the collections somewhere. Likewise if you need to add another related value – like the "level" value described above – it’s much easier to add that in one place than adding another collection and then making sure you do the right thing in every piece of code which changes any of the other collections.

Summary

Any time you find yourself with multiple collections sharing the same keys (whether those are simple list indexes or dictionary keys), think about whether you could have a single collection of a type which composes the values stored in each of the original collections. As well as making it easier to handle the collection data, you may find the new type allows you to encapsulate other operations more sensibly.

Update: what about performance?

As some readers have noted, this transformation can have an impact on performance. Originally, all the scores were kept close together in memory, as were all the health values, and so on. If you perform a bulk operation on the scores (finding the average score, for example) that locality of reference can have a significant impact. In some cases that may be enough justification to use the parallel collections instead… but this should be a conscious decision, having weighed up the pros and cons and measured the performance impact. Even then, I’d be tempted to encapsulate the parallel collections in a separate type (a PlayerCollection, say), allowing it to implement IEnumerable<Player> where useful. (If you wanted the Player to be mutable, you’d need it to be aware of the PlayerCollection itself.)
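One possible shape for such an encapsulating type is sketched below. It assumes an immutable `Player` that is created on demand; the member names (`Add`, `AverageScore`) are hypothetical, and whether this actually helps performance in a given program is something only measurement can tell you.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

sealed class Player
{
    public string Name { get; }
    public int Score { get; }
    public int Health { get; }

    public Player(string name, int score, int health)
    {
        Name = name;
        Score = score;
        Health = health;
    }
}

sealed class PlayerCollection : IEnumerable<Player>
{
    // The parallel lists are private, so the "same index" invariant
    // is maintained in exactly one class.
    private readonly List<string> names = new List<string>();
    private readonly List<int> scores = new List<int>();
    private readonly List<int> health = new List<int>();

    public int Count => names.Count;

    public void Add(string name, int score, int healthValue)
    {
        names.Add(name);
        scores.Add(score);
        health.Add(healthValue);
    }

    // Bulk operation that only touches the scores.
    // Throws InvalidOperationException if the collection is empty.
    public double AverageScore() => scores.Average();

    // Callers that don't care about layout still see a sequence of players.
    public IEnumerator<Player> GetEnumerator()
    {
        for (int i = 0; i < names.Count; i++)
        {
            yield return new Player(names[i], scores[i], health[i]);
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```

Because the `Player` objects here are snapshots, changing one would not affect the underlying lists, which is why a mutable `Player` would need a reference back to its collection.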

In almost all these anti-patterns, there will be cases where they’re the lesser of two evils – but novice developers need to be aware of them as anti-patterns to start with. As ever with performance trade-offs, I believe in first deciding on concrete performance goals, then implementing the code in the simplest possible way that meets the non-performance goals, measuring against the performance goals, and tweaking if necessary to achieve them, relying heavily on measurement.

31 thoughts on “Anti-pattern: parallel collections”

I’d propose that one place this pattern is typically preferred is where physical data layout and access can have a significant impact on performance, trumping potential encapsulation wins. One example that comes to mind is in GPU (SIMD/SPMD) computing, where (at least historically) accessing or writing to a “struct of arrays” is often preferable to operating on an “array of structs”.

As far as I understand, there are completely legitimate cases where parallel collections are much better.

Mainly, it is about the performance of iteration over a collection, which, especially in games, can occur many times a second in different contexts (pathfinding, rendering, physics, logic, etc.).

In that case there are two main problems that cause an awful lot of cache misses:
* the data structure is bigger than needed for a particular task (it contains data not needed at that exact moment)
* reference types will make the memory access pattern almost random in the worst case.

In general, it seems there is a tradeoff rather than a “pure-evil” antipattern. Cheers.

And in fact, it should sound like a small deal only to the most naïve, inexperienced programmer.

The guidance to avoid this anti-pattern is spot-on. In fact, it is a special case of the more general “don’t repeat yourself” (DRY), which itself has a number of benefits.

But certainly one of the foremost is, as implied here, that when repeating oneself (e.g. duplicating some logic over multiple collections instead of authoring it just once), not only is it easy to make a mistake, it is hard to fix a mistake.

Even if you duplicate the initial implementation exactly right in each place, if that initial implementation is flawed, it’s that much harder to ensure a fix for the flaw is applied everywhere it needs to be.

There are a bunch of other reasons for following the “DRY” principle, and in general they are also reasons for following your advice here too! :)

Even in game development, there are few places where such locality is actually an important factor. Even in game development, the goal should be correct code first, fast code second. It’s fine to break good code if it turns out to be a measured bottleneck which can be significantly improved by making the code worse. But otherwise, worrying about something like locality is just another premature optimization.

In any case, genuine performance optimizations regularly are in direct contradiction with proper software engineering practices. That’s the nature of optimizations; if they were aligned with proper software engineering practices, they wouldn’t be “optimizations” per se. They’d just be normal code.

It is likely that any discussion regarding good and bad programming patterns will find performance optimizations on the side of “bad”. Given this near-universal truth, it doesn’t seem all that fruitful to me to bother mentioning optimizations as a counter-argument to good programming practices in any specific discussion.

When a footnote winds up applied to every discussion in a theme (e.g. evil code, anti-patterns), why bother with the footnote at all? Seems like we can just take it as granted and move on.

Fact is, performance optimizations usually do wind up resulting in evil code. Oh well…c’est la vie! But let’s not pretend that the code doesn’t wind up being evil.

This antipattern is common in games because it’s actually a pattern. You’re describing what game developers sometimes refer to as AofS (array of structs) vs SofA (struct of arrays) patterns. There are pros and cons of each of these ways of structuring your code. Performance is often a motivator for preferring parallel containers, but it’s also a way of maintaining separation of responsibilities. I’m not sure whether you’ve done any research on the performance and code maintenance benefits of the “anti pattern” you describe, but if you have, I’d love to see you address those benefits directly in the article.

@NotAnAntiPattern: Hence my update at the bottom. I haven’t done any research on this for games, as I haven’t worked in the games industry – but I’ve usually seen this anti-pattern in Stack Overflow posts from novice developers.

I would suggest that in the games industry where performance can be so critical, there are quite a few times where the performance requirements outweigh the benefits of what *I’d* view as clean code. Mutable structs for positions, for example, may well make sense in high performance gaming code – but I’d still urge the vast majority of developers to avoid them.

As for the *maintenance* benefits of the anti-pattern – I really haven’t seen any.

I think the example also misses a key aspect of the reason for the “anti pattern” by only showing a single entity type (player). In the original code it is likely that players will have a name and a score, but it won’t just be players that have health – doors might have health, weapons might have health, non-player creatures might have health. It’s not an anti pattern to model capabilities as entities in their own right, particularly when not all entities of a particular type have a capability and when entities of different types have the same type of capability.

@NotAnAntiPattern: In that case, I’d model that as an interface implemented by the various classes. I wouldn’t just keep health as a separate array.
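A minimal sketch of that kind of interface-based modelling follows. The names `IDamageable`, `Door` and `Combat.Explode` are hypothetical, invented for this example rather than taken from any actual game code.

```csharp
using System;
using System.Collections.Generic;

// The shared capability: anything with health that can be damaged.
interface IDamageable
{
    int Health { get; }
    void TakeDamage(int amount);
}

class Player : IDamageable
{
    public string Name { get; }
    public int Health { get; private set; }

    public Player(string name, int health)
    {
        Name = name;
        Health = health;
    }

    public void TakeDamage(int amount) => Health = Math.Max(0, Health - amount);
}

class Door : IDamageable
{
    public int Health { get; private set; }

    public Door(int health)
    {
        Health = health;
    }

    public void TakeDamage(int amount) => Health = Math.Max(0, Health - amount);
}

static class Combat
{
    // Works on anything damageable, without a separate health array.
    public static void Explode(IEnumerable<IDamageable> targets, int damage)
    {
        foreach (IDamageable target in targets)
        {
            target.TakeDamage(damage);
        }
    }
}
```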

And again, that simply *isn’t* the sort of situation where I’ve usually seen the anti-pattern. On Stack Overflow in particular, I’ve pretty much always seen it in novice code where the coder just hasn’t *thought* of creating another class to encapsulate the data.

@JDT: I explicitly mention performance in the update at the bottom. I regard this as an anti-pattern in that it’s *usually* a bad idea – at the very least as a first step. I would regard it as something that you might *try* after you’ve discovered a performance bottleneck, but which should only be done when it’s proved that the performance benefit outweighs the maintenance cost.

As for anti-pattern only being a suitable word when something is *always* worse – there are very few situations where that’s absolutely the case. I think it’s sufficient to be something that you should be very aware of, and only violate when you have very good reason to. It’s mostly about raising awareness – for folks who haven’t even *considered* the single-collection alternative.

For those arguing that this AP is a P due to performance concerns, keep in mind that you can still write an adapter that lets you iterate the structures without having to maintain the index manually. At least then you debug the complexity in only one place. That’s an easy way to keep your SofA DRY.

This is an interesting point, and I’d really like to discuss the non-beginner scenarios. I’m very interested in what sort of API you should put *on top* of your component to leave yourself room to refactor the implementations as necessary.

I’ve encountered this situation over and over again. Game entities. Topological entities (vertices, edges, polygons). Anything where you have millions of uniform entities. In these cases, the per-object overheads (in both space and time) can be relatively high. And in some cases the cost of doing non-vectorized operations can be enormous. So there’s a huge incentive to structure these things in a way that doesn’t have a separate object for each entity. (That might be one array for each feature. Or it might be an array of structs. Or maybe it involves some sort of hybrid data-structure, like a tree of arrays.) But you want to hide that implementation from users so far as is possible. What’s a good API?

One approach is to have proxy objects that represent the entities in the collection, which are created on-demand and handle extracting and updating the data from the underlying implementation. In this way you can have a very natural object-oriented API on top of a data-structure with a very different implementation. But it can have lousy performance if you’re creating and discarding millions of proxies, and it can be difficult to handle invalidation of the proxies if their entity is removed from the collection or the collection is otherwise mutated.

Another approach is to have cursor objects that are a bit like proxies, except they can be explicitly “navigated” to point to different parts of the collection. This makes for a slightly less natural API, but it does allow a lot of flexibility in implementation of the collection. And much like iterators, cursors are typically expected to be invalidated by mutations to the collection, so they don’t have some of the issues of proxy objects. Or you can make cursors disposable and simply make it an error to mutate a collection with extant cursors.
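To make the cursor idea concrete, here is a rough sketch over two parallel lists. Everything in it (`PlayerTable`, `Cursor`, `MoveTo`, the version counter used for invalidation) is an assumption made for illustration; a real implementation would likely differ.

```csharp
using System;
using System.Collections.Generic;

sealed class PlayerTable
{
    private readonly List<string> names = new List<string>();
    private readonly List<int> scores = new List<int>();
    private int version; // bumped on every structural change

    public int Count => names.Count;

    public void Add(string name, int score)
    {
        names.Add(name);
        scores.Add(score);
        version++;
    }

    public Cursor CreateCursor() => new Cursor(this);

    // A single reusable object that can be pointed at any row.
    public sealed class Cursor
    {
        private readonly PlayerTable table;
        private readonly int expectedVersion;
        private int index;

        internal Cursor(PlayerTable table)
        {
            this.table = table;
            expectedVersion = table.version;
        }

        public void MoveTo(int newIndex)
        {
            CheckValid();
            if (newIndex < 0 || newIndex >= table.Count)
            {
                throw new ArgumentOutOfRangeException(nameof(newIndex));
            }
            index = newIndex;
        }

        public string Name
        {
            get { CheckValid(); return table.names[index]; }
        }

        // Writes go straight through to the underlying list.
        public int Score
        {
            get { CheckValid(); return table.scores[index]; }
            set { CheckValid(); table.scores[index] = value; }
        }

        // Like an iterator, the cursor is invalidated by structural mutation.
        private void CheckValid()
        {
            if (expectedVersion != table.version)
            {
                throw new InvalidOperationException("Collection was modified; cursor is no longer valid.");
            }
        }
    }
}
```

Only one cursor object is allocated however many rows are visited, which avoids the churn of creating a proxy per entity.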

Of course, neither of these approaches on their own really let you exploit the benefits of an optimized data-structure. But I think they’re valuable because they let your optimized vectorized whole-collection operations co-exist with code that is not performance-critical and which can benefit from dealing with a simpler view of the collection and its contents.

I have a lot of respect for your knowledge, but I disagree with you calling this an anti-pattern. In the wrong hands (i.e. beginner programmers), it’s probably not the best way to go, but there are many examples of successful game engines built around this concept.

To me, an Anti-pattern should be something that is almost always a bad idea for beginners and pros alike, and is demonstrably harmful to the project.

@Postie: I think it’s telling that games have come up so often here. In my experience, a lot of what I hear about game-focused optimization goes directly against what I think of as maintainable code.

I would suggest that the vast majority of programmers are *not* writing games – and outside games, this *is* an anti-pattern which is almost always a bad idea for beginners and pros alike. (I’d argue that even within games, in *most* scenarios it would be a bad idea, and shouldn’t be used without careful consideration and thought first.)

Parallel collections are even more useful in languages such as C++ where memory contiguity is useful when interfacing with C libraries which work with arrays. My latest use case was using TA_LIB functions.

@Postie: I don’t think an anti-pattern is something that is universally wrong; if it were universally wrong, we’d just say it’s bad code.

I think this is a good example of a typical anti-pattern and if @skeet hadn’t used a game as an example (which is one of the edge cases that benefit from this anti-pattern) there wouldn’t be this much commotion about it.

“You must name your coding anti-pattern in order to defeat it.” – Sun Tzu.

I’d also add that not only does moving your parallel collections into a named class give a name to your concept, you also have a handy place to put cohesive behaviors manipulating the data in the collections, rather than having those behaviors strewn all throughout your code elsewhere.

@intrueder: No – Zip is a great way of getting *from* a situation where you have parallel collections for some unavoidable reason (e.g. the data is coming from different sources, or it’s the serialization format) to a single collection.
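As a sketch of that conversion, assuming three hypothetical input arrays that arrive separately, LINQ’s `Enumerable.Zip` can pair them up into `Player` objects (the `Player` class and `PlayerLoader.Combine` are illustrative names):

```csharp
using System.Collections.Generic;
using System.Linq;

class Player
{
    public string Name { get; }
    public int Score { get; }
    public int Health { get; }

    public Player(string name, int score, int health)
    {
        Name = name;
        Score = score;
        Health = health;
    }
}

static class PlayerLoader
{
    // Note: Zip stops at the end of the shorter sequence, so mismatched
    // lengths are silently truncated unless you check them first.
    public static List<Player> Combine(string[] names, int[] scores, int[] health)
    {
        return names
            // First pair names with scores, using an anonymous type as a stepping stone...
            .Zip(scores, (name, score) => new { name, score })
            // ...then bring in the health values and build the real type.
            .Zip(health, (pair, h) => new Player(pair.name, pair.score, h))
            .ToList();
    }
}
```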

If something needs no possible justifiable benefit to be considered an anti-pattern there is quite frankly no such thing as an anti-pattern.

Honestly I see this as an anti-pattern, even when it is necessary after extensive performance checking. I would just see it as the lesser of evils.

Performance is just one aspect of code and this approach is really only a benefit for performance. It hurts readability, comprehension, maintainability, and makes the code more fragile because you rely on a reasonable coincidence (the index being correct for both lists).

Calling something that helps one aspect of code but hurts so many others anything but an anti-pattern seems disingenuous to me. That’s not to say an anti-pattern isn’t useful, but anti-patterns should always be a last resort, with the reason documented and supported with strong evidence.

I think people object so strongly to it being called an anti-pattern because then they can’t just say, off-the-cuff, “well, we need it”; they are more likely to have to show it is a benefit. An anti-pattern should always be avoided when possible; sometimes that just isn’t possible.

I really wish we could have the best of both worlds.
That is, the code would be written as a single collection of objects, but behind the scenes it would be parallel arrays.
Or maybe a bit differently: declared as parallel arrays, but then you define some magic and you can henceforth treat a “vertical slice” (the same index across all those arrays) as an object.
Or maybe there’s an even better way, I don’t know. I just know that the current situation is bad. Though it happens a lot, I don’t want to choose between code that is fast (and allows SIMD) and code that looks good – I definitely want both.

The ubiquitous “vector” package for Haskell uses type families to implement arrays-of-structs as structs-of-arrays. An unboxed Vector (Int64, Char, (Bool,Word16)), for example, will be represented by an array of Int64 (64 bits per entry), an array of Char (32 bits per entry), an array of Bool (8 bits per entry), and an array of Word16 (16 bits per entry), but it acts like an array each of whose entries has an Int64, a Char, and a pair of Bool and Word16.