Optimize ToArray and ToList by providing the number of elements

The ToArray and ToList extension methods are convenient ways to eagerly materialize an enumerable sequence (e.g. a LINQ query) into an array or a list. However, there’s something that bothers me: both of these methods are inefficient when they don’t know the number of elements in the sequence (which is almost always the case when you use them on a LINQ query). Let’s focus on ToArray for now (ToList has a few differences, but the principle is mostly the same).

Basically, ToArray takes a sequence and returns an array that contains all the elements from the sequence. If the sequence implements ICollection<T>, it uses the Count property to allocate an array of the right size and copies the elements into it; here’s an example:

List<User> users = GetUsers();
User[] array = users.ToArray();

In this scenario, ToArray is fairly efficient. Now, let’s change that code to extract just the names from the users:
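A minimal sketch of that change, assuming User exposes a string Name property. The User class and the body of GetUsers below are hypothetical stand-ins, since the original types are not shown:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the User type used in the text.
public class User
{
    public string Name { get; set; } = "";
}

public static class NamesExample
{
    // Hypothetical stand-in for GetUsers().
    public static List<User> GetUsers() =>
        new List<User> { new User { Name = "Alice" }, new User { Name = "Bob" } };

    public static string[] GetUserNames()
    {
        List<User> users = GetUsers();
        // Select returns a lazy IEnumerable<string> that does not implement ICollection<string>.
        return users.Select(u => u.Name).ToArray();
    }
}
```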

Now, the argument of ToArray is an IEnumerable<string> returned by Select. It doesn’t implement ICollection<string>, so ToArray doesn’t know the number of elements and cannot allocate an array of the appropriate size. So here’s what it does:

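start with a small array (4 elements in the classic .NET Framework implementation)

copy elements from the source into the array until it is full

if the source has more elements, allocate a new array twice as large, copy the existing elements into it, and keep filling

once the source is exhausted, if the array is longer than the number of elements, trim it: allocate a new array with exactly the right size, and copy the elements from the previous array

return the array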
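The growth strategy can be sketched roughly as follows. This is an illustrative re-implementation, not the framework's actual source: it assumes a doubling buffer that starts at 4 elements, as in the classic .NET Framework implementation (newer runtimes may use a different strategy). The names GrowingBufferSketch and ToArrayByDoubling, and the allocations counter, are hypothetical and exist only to make the cost visible.

```csharp
using System;
using System.Collections.Generic;

public static class GrowingBufferSketch
{
    // 'allocations' reports how many arrays were created along the way.
    public static T[] ToArrayByDoubling<T>(IEnumerable<T> source, out int allocations)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        T[] buffer = new T[4];   // small initial buffer (assumed size)
        allocations = 1;
        int count = 0;

        foreach (T item in source)
        {
            if (count == buffer.Length)
            {
                // Buffer is full: allocate one twice as large and copy everything over.
                T[] bigger = new T[buffer.Length * 2];
                Array.Copy(buffer, bigger, count);
                buffer = bigger;
                allocations++;
            }
            buffer[count++] = item;
        }

        if (count < buffer.Length)
        {
            // Trim: one more allocation and copy to get an array of exactly the right size.
            T[] trimmed = new T[count];
            Array.Copy(buffer, trimmed, count);
            buffer = trimmed;
            allocations++;
        }

        return buffer;
    }
}
```

Under these assumptions, a sequence of 1,000,000 elements goes through buffers of 4, 8, 16, ... up to 1,048,576 elements (19 arrays), plus one final trimmed array, for 20 allocations in total.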

If there are few elements, this is quite painless; but for a very long sequence, it’s very inefficient, because of the many allocations and copies.

What is annoying is that, in many cases, we know the number of elements in the source! In the example above, we only use Select, which doesn’t change the number of elements, so we know that it’s the same as in the original list; but ToArray doesn’t know, because the information was lost along the way. If only we had a way to help it by providing this information ourselves…

Well, it’s actually very easy to do: all we have to do is create a new extension method that accepts the count as a parameter. Here’s what it might look like:
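A minimal sketch of such a method, assuming an overload named ToArray that takes the expected count as a second parameter. The containing class name ArrayCountExtensions is hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class ArrayCountExtensions
{
    // Materializes 'source' into an array of exactly 'count' elements.
    // The caller is responsible for passing the correct count.
    public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        var array = new TSource[count];   // single allocation, no resizing
        int i = 0;
        foreach (var item in source)
        {
            // Throws IndexOutOfRangeException if the source has more than 'count' elements.
            array[i++] = item;
        }
        return array;
    }
}
```

With the earlier example, the call would then read `users.Select(u => u.Name).ToArray(users.Count)`.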

Note that if you specify a count that is less than the actual number of elements in the sequence, you will get an IndexOutOfRangeException; it’s your responsibility to provide the correct count to the method.

So, what do we actually gain by doing that? From my benchmarks, this improved ToArray is about twice as fast as the built-in one, for a long sequence (tested with 1,000,000 elements). This is pretty good!

Note that we can improve ToList in the same way, by using the List<T> constructor that lets us specify the initial capacity:
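As a sketch under the same assumptions (a ToList overload that takes the expected count; the class name ListCountExtensions is hypothetical), it could look like this:

```csharp
using System;
using System.Collections.Generic;

public static class ListCountExtensions
{
    // Materializes 'source' into a list whose backing array is pre-sized to 'count'.
    public static List<TSource> ToList<TSource>(this IEnumerable<TSource> source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        // List<T>(int capacity) allocates the internal array up front.
        var list = new List<TSource>(count);
        foreach (var item in source)
        {
            list.Add(item);
        }
        return list;
    }
}
```

Unlike the array version, an underestimated count does not throw here: the list simply grows as usual, so only the performance benefit is lost.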

3 thoughts on “Optimize ToArray and ToList by providing the number of elements”

“..but for a very long sequence, it’s very inefficient, because of the many allocations and copies..”

It’s not inefficient – amortized complexity of an insert is still O(1) even though for some specific iteration it may need extra copying.

Your optimization works fine if the filter retains a lot of items, but if it doesn’t you’d end up with large but mostly empty lists. Wasting space in turn puts pressure on GC which may affect performance, so it’s not that straightforward! 🙂

Another solution might be some “SelectAsList(filter)” extension method which counts the retained elements exactly and so creates the array with more accurate size.

But still I would expect a very modest performance improvement overall – definitely not the 50% speedup. It must be something very specific in your benchmarks!!