Iterators in Python – Part 2

In this second installment of my intermediate Python instructable series, I'll show you just one of the nifty things you can do with iterators, in a way that's faster and more beautiful than other approaches.

Rolling Windows

There's a good chance you'll encounter this at some point while coding: you've got a list of objects and you want to access them a few at a time, but only advancing one at a time. This sort of iteration pattern is referred to as a rolling window. For example:
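As a small illustration (the list and the window size of 3 are made up for this sketch, not taken from the original example), here is what the windows over a five-item list look like when written out by hand:

```python
# Hypothetical sample data, used only for illustration.
letters = ["a", "b", "c", "d", "e"]

# A rolling window of size 3 sees three items at a time,
# but only moves forward by one item per step.
windows = [
    letters[0:3],  # ['a', 'b', 'c']
    letters[1:4],  # ['b', 'c', 'd']
    letters[2:5],  # ['c', 'd', 'e']
]

for window in windows:
    print(window)
```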

list(iterable) can be thought of as shorthand for [i for i in iterable][1][2].

Rolling windows are especially common with time series — a sequence of data points made over a time interval. Like just about everything else in programming, there is no lack of solutions to the problem.

All we're doing is taking a series of slices from the input iterable and making a list out of them. For the input rolling_window(3, range(5)), your output is essentially [li[0:3], li[1:4], li[2:5]], where li is the input as a list.
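A minimal sketch of this slicing approach might look like the following. The name rolling_window_naive and the argument order (window size first, matching the call above) are assumptions made for this sketch, not necessarily the original code.

```python
def rolling_window_naive(window_size, seq):
    """Return a list of every window_size-long slice of seq.

    Assumes seq supports len() and slicing (e.g. a list or tuple).
    """
    last_start = len(seq) - window_size + 1
    # Each slice creates a new list; the whole result lives in memory.
    return [seq[i:i + window_size] for i in range(last_start)]


li = list(range(5))
print(rolling_window_naive(3, li))
# Same as [li[0:3], li[1:4], li[2:5]]
```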

Like a lot of Python one-liners, the simplicity is misleading: while it's not a morass of nested list comprehensions, there are a couple of problems, both involving how Python represents the underlying data.

Array slicing. If you're using CPython[3], every time you take a slice of a list, a new list is created and the references to the sliced items are copied into it.

It's all in memory. The iterable passed to rolling_window needs to support slicing, and an object that supports slicing usually has to hold all of its data in memory.

The Functional Way

Now let's try to find a solution that doesn't require indexed access to individual items in an iterator. We're going to need two functions, tee and izip, from the itertools module, which is part of the Python standard library.

Tee Time

The first function is tee(iterable, n=2). Given an iterator, it returns a tuple of n iterators, each an almost-copy of the input iterator. What tee really does is keep track of every time one of the 'copied' iterators gets the next element. If that iterator is the furthest ahead of all the copies, tee also stores the output temporarily until all of the other iterators have consumed that element. For instance:
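A short sketch of that behaviour, written for Python 3 (the variable names are invented for this illustration):

```python
from itertools import tee

source = iter(range(5))
first, second = tee(source)  # n defaults to 2

# Advance `first` two steps; tee buffers 0 and 1 for `second`.
ahead = [next(first), next(first)]

# `second` has not moved yet, so it still starts from the beginning.
behind = next(second)

# Both copies eventually see every remaining element.
rest_first = list(first)
rest_second = list(second)

print(ahead, behind, rest_first, rest_second)
```

Once tee has been called, the original iterator (source here) should not be advanced directly, or the copies will miss those elements.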

Is it faster?

Yup, it's almost twice as fast to run. More to the point, if we used the iterator version of rolling_window as intended (rather than applying list), doing our operations on its elements one at a time, it would occupy a constant amount of memory.
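The iterator version referred to here can be sketched roughly as follows. This is an assumption about its shape rather than the original code: it uses Python 3, where the built-in zip is lazy and plays the role of Python 2's izip, and it keeps the (window_size, iterable) argument order used earlier.

```python
from itertools import count, islice, tee


def rolling_window(window_size, iterable):
    """Lazily yield tuples of window_size consecutive items."""
    copies = tee(iterable, window_size)
    # Stagger the copies: the i-th copy starts i items ahead.
    for offset, copy in enumerate(copies):
        for _ in range(offset):
            next(copy, None)
    # zip stops as soon as the furthest-ahead copy runs out.
    return zip(*copies)


print(list(rolling_window(3, range(5))))

# Consumed one window at a time, it even works on an endless stream,
# because only the items inside the current window are buffered.
first_two = list(islice(rolling_window(3, count()), 2))
print(first_two)
```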

What about Yield?

Yield has been a part of Python since version 2.3 and in its current form since 2.5. I'm not going to go fully into it here, because https://wiki.python.org/moin/Generators does a much better job explaining it. Go read it if you're unfamiliar with generators – I'll wait for you.

...

...

...

Alright let's get to it.

1st Attempt

Rather than building up the entire list of slices, as in the naive example above, we're only returning one slice at a time. Generators, despite some complexity under the hood, are iterators; that is, you can call next on them.
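A first attempt along these lines could look like the sketch below. The function name and argument order are assumptions for this sketch; the point is only that each slice is produced on demand.

```python
def rolling_window_sliced(window_size, seq):
    """Yield one slice of seq at a time instead of building a list."""
    for start in range(len(seq) - window_size + 1):
        # Still relies on len() and slicing, so seq must be a sequence.
        yield seq[start:start + window_size]


windows = rolling_window_sliced(3, [0, 1, 2, 3, 4])
first_window = next(windows)       # generators support next()
remaining_windows = list(windows)  # drain whatever is left
print(first_window, remaining_windows)
```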

Do you see any problems with this approach though? We still need to be able to reference sub-elements of iterable, which often means building it fully in memory first.

2nd Attempt

If we can't access all elements of iterable, we'll need to make sure to hang on to the last window_size elements to build our window while stepping through the iterable.

We could use a list, but we'd probably want to yield something like a tuple. Why? Tuples in Python are immutable, whereas lists are mutable. For instance, when I say a = tuple(xrange(5)), a refers to a structure in memory with the numbers 0 through 4.

If I were to take a slice of it, for instance b = a[1:], a new tuple is created in memory for b. Likewise, if I try to set an element in b:

actually results in a whole new object getting created. While this can be inefficient (creating a copy of a large object can be expensive in both time and space), there are advantages.
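The snippet that belongs with this passage is not shown above, so the following is only an illustrative sketch of tuple immutability in Python 3 (range stands in for xrange, and the names c and error_message are made up here):

```python
a = tuple(range(5))  # (0, 1, 2, 3, 4)
b = a[1:]            # a new tuple: (1, 2, 3, 4)

# Item assignment is not allowed on a tuple.
try:
    b[0] = 99
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# "Changing" a tuple really means building a new one.
c = b + (5,)

print(b, c, c is b)
```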

Because we're yielding these tuples to other code, if we were to yield a list, a mutable object, the consuming code[4] could modify it, and because my iterator is relying on the state of the list to return the next window, future windows might be inaccurate. Sure, we could tell users of our function to not modify the yielded list, but that's just another thing to worry about.
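Putting these ideas together, a second attempt might be sketched as below. This is a hedged illustration rather than the original code: it assumes window_size is at least 1, keeps the (window_size, iterable) argument order, and uses islice from itertools to fill the first window.

```python
from itertools import islice


def rolling_window(window_size, iterable):
    """Yield immutable tuples of the last window_size items seen."""
    it = iter(iterable)
    window = tuple(islice(it, window_size))
    if len(window) < window_size:
        return  # not enough items for even one window
    yield window
    for item in it:
        # Build a new tuple each step; consumers cannot alter our state.
        window = window[1:] + (item,)
        yield window


# Works on any iterable, including a generator with no slicing support.
squares = (n * n for n in range(5))
print(list(rolling_window(3, squares)))
```

Because each yielded window is a fresh tuple, consuming code cannot affect the windows that follow.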

[1] list is actually significantly faster than the naive list comprehension in both Python 2 and 3 (at least for CPython).

[2] When you call list on your iterable, it's going to negate the memory advantages you might have gained by using iterators in the first place, as Python will construct a list in memory, pointing to each item in the iterable.

[3] Python is a programming language. If you write code in Python, you still need an interpreter to understand and run your code. CPython is one such interpreter. In fact, it's the most popular by far, and chances are you're using it. To read more, check out the StackOverflow discussion on it.

[4] When I say consuming code, I mean the code that takes the items from the iterator and does something with them.