Introduction

I want to spend some time messing about with iterators in Julia. I think they
not only provide a familiar and useful entry point into Julia’s type system and dispatch
model, they’re also interesting in their own right.1 Clever application of iterators can
help to simplify complicated loops, better express their intent, and improve
memory usage.

A word of warning about the code here. Much of the it isn’t idiomatic Julia and I wouldn’t
necessarily recommend using this style in a serious project. I also can’t speak
to its performance vis-a-vis more obvious Julian alternatives. In some cases,
the style of the code examples below may help reduce memory usage, but
performance is not my main concern. (This may be the first blogpost about Julia
unconcerned with speed). Instead, I’m just interested in different ways of
expressing iteration problems.

For anyone who’d like to play along at home, there’s an IJulia notebook of
this material on Github, which can be viewed on nbviewer here.

The Iterator Protocol

What do I mean by iterators?2 I mean any I in Julia that works on
the right hand side of the statement for i = I .... That is, anything you can for-loop
over. This includes not only data collections like Arrays, Dicts, and Sets, but
also more abstract types like Ranges, as well as what I’ll call “higher order”
iterators such as those that result from zip or enumerate functions.

As an equivalent definition, an iterator in Julia is any type that implements
the iterator protocol. The iterator protocol is comprised of three methods:
start, next, and done. So any type in Julia for which these three methods
are defined is an iterator. It might be a dumb iterator or a broken iterator,
but it’s an iterator.

Since the for statement works on iterators, and iterators are just a
collection of methods, we can define any for loop using calls to those methods.

In this example, the start method provides the initial state of the iterator;
the next method returns the value of the array at a given state, as well as what the
next state is. Finally, the done method returns true when we’ve gone past
the end of the iterator, informing the loop that it should stop.

If you know Python, the idea of the iterator protocol is probably familiar. In
Python, any object can be an iterator if it has the methods __iter__ and
__next__. But notice the lack of side effects in the Julia implementation
—calling start or next on the array has no affect on the array itself. start is
basically a constant, always returning the value of the initial state whenever
you pass it the same type of iterator. And next doesn’t really increment
anything; it’s just a mapping from current state &rightarrow; (value, next
state). In general, the iterator itself has no internal state being incremented or
changed as you pass through a loop.

An iterator’s state

More concretely, what’s the state of an iterator? How the state is
defined, and an iterator’s sequence of states depends on the type of
the iterator itself. It’s best to look at some examples.

Arrays.

Arrays are very intuitive iterators. They have states that are just integer
values from 1 to the length of the array. So start returns 1.

arr=["one","two","three","four","five","six"]start(arr)

1

The next mapping is i &rightarrow; i+1, and the value of the iterator at any state
i is just a[i].

If this were a multidimensional array, say 3×2 instead of 6×1, we’d
get the same result; iteration would just proceed along the rows of the matrix.

The done method returns true when the state is i =
length(a) + 1. You might think it’d be length(a), but recall the for-equivalent while loop
above. Having done return true at the last index of the array would prevent
the loop from executing on the last element. So in our 6-element array, done
is true when the state hits 7.

println(done(arr,6))# not yetprintln(next(arr,6))

false
("six",7)

done(arr,7)

true

Ranges

Ranges have states that looks similar to arrays, except they start at zero.

rng=11:20# length 10 rangestart(rng)# 0

0

But the relationship between the current and next state is the same: i &rightarrow; i+1.

Since we start at zero, the done state is one less than the equivalent array.

done(rng,10)

true

Unordered collections: Dicts, Sets, etc.

Arrays and ranges have a natural order, so the evolution of state is
straightforward. But what about collections such as dictionaries and sets that have no inherent
order? Like in many languages, such things can be iterated over, but the order
of iteration is not easily predictable.

For example, here’s a dictionary:

dictit={:one=>1,:three=>3,:five=>5,:five=>"five!"}

{:one=>1,:three=>3,:five=>"five!"}

The starting state isn’t 0 or 1, as would be natural for an ordered collection.

s0=start(dictit)

3

And while next maps state i to state j, the relationship between i and j
is not obvious. Here, while the first state is 3, the second is 11, and the rest
are similarly weird.

_,s1=next(dictit,s0)

((:one,1),11)

_,s2=next(dictit,s1)

((:three,3),13)

_,s3=next(dictit,s2)

((:five,"five!"),17)

done(dictit,s3)

true

The states, you probably and correctly suspect, are tied to the internal
implementation of the dictionary, e.g. how the keys are hashed. So the state
doesn’t follow a predictable 1, 2, 3, … order, and what order of elements we
get when iterating is essentially unpredictable.

But because for loops handle the iterator’s states for us, we rarely if ever have to worry about
the representation of an iterator’s state. The for loop implicitly calls the start,
done, and next methods, which do all this bookkeeping for us.

Iterators and Delayed Evaluation

While many iterators are collections of data in memory, like Arrays, Dicts, or
Sets, iterators can also represent abstract collections that aren’t held in memory.

Range is a good example. When we iterate over the range 1:10, we get the
sequence 1, 2, 3, …, 10. But in memory, this range is comprised of only two
integers, 1 and 10. The values in between are only evaluated when we’re looping over it.

Notice that the next method calculates the value of the iterator in state
i. This is different from an Array iterator, which just reads the element
a[i] from memory.

Iterators that exploit delayed evaluation like this can have important performance
benefits. If we want to iterate over the integers 1 to 10,000, iterating over an
Array means we have to allocate about 80MB to hold it. A Range only requires
16 bytes; the same size as the range 1 to 100,000 or 1 to 100,000,000.

Application: Iterating over Fibonacci numbers

Here’s another example of an iterator which computes values on demand, using the
next method to do the work. fibit(n) is an iterator over the first n
Fibonacci numbers. When the iterator is constructed, it doesn’t calculate all of
these numbers. Instead it waits for its next method to be called, providing
the next Fibonacci number depending on the current one.

Tasks/Co-routines

This talk of iterators with delayed evaluation may remind Pythonistas of
generators. And Julia has a type that is basically equivalent to Python’s
generators, called Tasks. A Task is constructed by calling the Task()
constructor (or
@task macro) on a function with a produce statement, which issimilar to Python’s
yield.

Instead of using the Fibit type above, we could get an equivalent iterator by
defining a Task that produces sequential Fibonacci numbers.

Whether you create an iterator using a type with the iterator protocol, or by
constructing a Task, is up to you. There are pros and cons to each approach. By
defining your iterator as a specific type, you can dispatch lots of other
functions on it. Here, on the other hand, fibtask is just a Task type, so
defining methods for it means defining methods for all Tasks, which may be
undesirable or infeasible. On the other hand Tasks give you iterators with less
code. Below I’ll show an example of an iterator that’s hard to define with the
iterator protocol methods, but easy to define as a Task. And of course, Tasks
are coroutines, and can be used in those contexts.

Realizing Iterators without loops

So far, we’ve talked about iterators in the context of for loops. We saw that
for i = I was a construct for calling I‘s start, done and next
methods, letting us realize and operate on the values in the iterator.

But there are functions which can take iterators as inputs and implicitly iterate over them
to some desired result. This obviates the need for explicit for loops, and can
make for cleaner more functional code. Some examples follow.

collect and reduce

The collect function takes an iterator input, realizes all its values, and
collects them into an array.

collect(fibit(10))

10-element Array{Any,1}:
1
1
2
3
5
8
13
21
34
55

The reduce function similarly realizes the values of an iterator, but then successively
applies an operator to them to give a scalar result.

reduce(+,fibit(10))

143

That reduce operation is equivalent to the sum function called with an
iterator argument.

sum(fibit(10))

143

In this next line of code, I compute the sum of the reciprocals of the first
10,000 Fibonacci numbers (which should be close to this), using collect to first put them into an array.

Comprehensions

The collect function may remind you of an array comprehension, and it is
similar, but here we see array comprehension don’t work on our iterator:

[fforf=fibit(10)]

no method length(FibIt{Int64})
while loading In[26], in expression starting on line 1
in anonymous at no file

What’s going on is that the array comprehension wants to allocate an array,
then fill it in with the values of the iterator. Since it doesn’t know the
iterator’s length (how many values it will produce), it doesn’t know how large
an array to allocate.3 We can fix this for our Fibonacci iterator by
giving it a length method.

Base.length(it::FibIt)=it.n[fforf=fibit(10)]

10-element Array{Int64,1}:
1
1
2
3
5
8
13
21
34
55

Now we can redefine our sum-of-reciprocals using a comprehension instead of collect.

no method length(Task)
while loading In[27], in expression starting on line 1
in anonymous at no file

We get the same issue; Tasks don’t have a length method. The advantage of using
the FibIt type is that we can easily define a length method for it. We can
only give our Fibonacci task a method if we give all Tasks a length method,
which wouldn’t make sense.

The Iterator Package

When we calculated the sum of the reciprocals of Fibonacci number above, we had
to realize the values of the Fibonacci iterator before taking the
reciprocal, and then sum a collection of all those values. Alternatively, we could have called sum on an
iterator that produced 1/x for each Fibonacci number x.

One way to do this would be to create a new iterator type, called
ReciprocalFibIt, and given it its own start, next, and done methods. But that
feels a little excessive. Wouldn’t it be nicer to be able to construct that iterator from
the Fibonacci iterator we already have? Essentially saying, “hey, I want another
iterator that gives one over the values of that other iterator.”

That would be an example of what I’ll call a higher-order iterator, which is
an iterator constructed from one or more other iterators. zip and enumerable
are common examples.

It turns out Julia has a neat little package of useful higher-order iterators;
called, obviously, Iterators. In the rest of (this already very long) post, I’ll
explore some of things in the package. Pythonistas will notice similarities with
the itertools module in the Standard Library.

usingIterators

Imap

The Imap iterator provides us with our wish above: a new iterator that is the
result of applying a function to the values of an existing iterator. In the case of our
reciprocal Fibonacci numbers, that function is x -> 1/x.

recipricalfib=imap(x->1/x,fibit(10_000,BigInt))# A new iterator, composed# from a FibItpsi=sum(recipricalfib)# No collect needed

So reciprocalfib is itself an iterator, whose values are only realized when
it’s passed to the sum function. We didn’t have to allocate any arrays before
calling sum as with the collect and comprehension examples above.

An IFilter iterator

Since we have a map-like iterator, why not a filter?4 How would it work? Given an
iterator that produces values v1, v2, v3, …, the filter iterator would
only produce the values that met some predicate, skipping any that didn’t.

This isn’t implemented in the Iterators package (because Base.filter will
already do this, see footnote 4). It’s a neat idea, but it turns
out to be tricky to define in terms of the iterator protocol. It’s easy with a
Task, though.

Repeat and RepeatForever

Another surprisingly useful iterator is Repeat, which simply produces an object
some number of times. Here the iterator is just the string “echo!” five times.

fori=repeated("echo!",5)println(i)end

echo!
echo!
echo!
echo!
echo!

If we didn’t provide the second argument, the result would be an iterator that
goes on infinitely, so its for loop would never terminate. Why would you want
that? I’ll show some examples of its use below.

Extension: Repeating impure functions

One thing about the Repeat iterator though, is that the object or value it
repeats is fixed at its construction. If you pass it a called function, it will
call that function once in the constructor, and then repeatedly return the
result of that first call. For pure functions, that’s fine. The first call of
sqrt(100) is the same as the second, third, or ten-thousandth call of
sqrt(100).

Here, the rand function was called once in the constructor, and that result was repeated again
and again. I’d prefer if I could get 10 separate calls to rand. Here’s one way
to get this to work.

Base.next(it::Iterators.Repeat{Function},state)=it.x(),state-1Base.next(it::Iterators.RepeatForever{Function},state)=it.x(),nothing# Note the function isn't called in the constructor;# the `next` function does this.fori=repeated(rand,10)println(i)end

What I’ve done is create new next methods for the Repeat and RepeatForever
iterators. When the object of the iterators is a function, the next methods
call the function. By passing the iterator an uncalled function object, I avoid the call
in the constructor, and defer it to the next method.

Take and Drop

The Take iterator only iterates over some specified first values of its input
iterator. It works well in combination with infinite iterators, like RepeatForever

Let’s also create a TakeUntil iterator, which takes values until it finds one that
passes the test. So the last value produced by this iterator will pass the test
and all values before that will have failed.

Application: How long does it take a Poisson process to produce a prime number?

As an application of the TakeUntil iterator, an experiment. How many draws do
we have to make from a Poisson process until we draw a prime number? For this
example, I’ll use a Poisson with mean 5,000.

In the code, we make a Repeat iterator that repeatedly draws from the
Poisson. We pass this into takeuntil and this creates an iterator that draws
from the Poisson until we find a prime number. While this is happening, we keep track of the
number of steps we took through this iterator.

# Draw random integers from a distibrution d until you get a prime number.# Return the number of draws.function primetime(d,dparams)randgen=()->rand(d(dparams...))tu=takeuntil(repeated(randgen),isprime)time=0fori=tutime+=1endtimeendprimetime_poiss5k=()->primetime(Poisson,5000)

What’s the average wait for a prime? Repeating the experiment 10,000 times, we
find the average number of draws is between 7 and 8.

mean(repeated(primetime_poiss5k,10_000))

7.6783

To see the distribution of waiting times, I’ll collect each repetition of the
experiment in an array that we can plot.

times=[t::Intfort=repeated(primetime_poiss5k,10_000)]

Partition

The Partition iterator split its input iterator into pieces, producing an
iterator over iterators. For example we could use it to partition the Range
iterator 1:100 into two iterators, 1:50 and 51:100. We can also make
overlapping partitions, for example, 1:50, 2:51, 3:52, etc.

constgoogdata="http://ichart.finance.yahoo.com/table.csv?s=GOOG&d=0&e=7&f=2014&g=d&a=0&b=7&c=2013&ignore=.csv"|>download|>open|>readall|>s->split(s,"\n")|>a->map(l->split(l,","),a)|>a->filter(l->contains(l[1],"201"),a)|>reverse# Dates start at most recent, so reverse for chron order.

We then create iterators over the dates and closing prices in the Array. These
iteratively extract and parse values from the relevant columns.

Groupby

While the Partition iterator makes partitions of specified lengths, the
Groupby iterator splits an iterator based on some condition. One caveat is
that the input iterator has to be sorted in some way on the groupby condition,
so that values with the same condition are adjacent in the iterator.

Application: Do Labor Force figures follow Benford’s Law?

In this example, I’m going to look at Benford’s Law using the Groupby
iterator. Benford’s Law posits that the leading digits of numbers in many
data sources follows a regular distribution. I’ll use the Groupby iterator to
group the data by first digit and check this.

The data I’ll examine is the size
of the labor force population in each U.S. county in 2012.

The analysis is simple with a Groupby iterator. It splits up the data by
leading digit, and then I just calculate the frequency of each leading digit in
the data by taking the length of each leading-digit group as a share of
the total length of the data.

dgroups=groupby(lfdata,s->s[1])# Groups by first digit# Extract the digit from the group membersdigits=imap(i->parseint(string(i[1][1])),dgroups)# Compute the frequencyfrequency=imap(x->length(x)/length(lfdata),dgroups)

Benford’s Law posits that the frequency of digit d in data should be
log10(d+1) - log10(d). This function prints out a
table of the observed frequencies next to the expected frequencies per Benford’s Law.

Iterate

Though its name might be a little confusing, the Iterate iterator is one of my
favorites. It recursively applies a function to a starting value, that is
f(...f(f(f(x)))...). I come across applications for it all over the place.

Application: Autoregressive time series processes

One application is producing autoregressive time series processes. An AR(1)
process has the form xt+1 = pxt + et+1, where
e is some random noise. If
We define the function f(x) = px + e, then xt+2 as a function
of xt is f(f(xt)). Subsequent values can be similarly
produced by iteratively applying the function.

First the code for the AR(1) function itself, along with a helper function for
plotting a realization of the process.

Defining a coefficient and a standard deviation for the random variable, I pass
them through a process that creates an iterator that recursively applies the
function, starting with a randomly-drawn initial value. Then I collect 250 values of
that iterator and plot them.

constar1coef=0.9constar1sigma=0.15(ar1coef,ar1sigma)|>x->apply(ar,x)|>f->iterate(f,ar1sigma*rand())|>i->collect(take(i,250))|>s->plotar(s,"AR(1) Time Series")

This idea can be extended an AR(p) process, where the current value of x
depends on several past values. Whereas the coefficient was a scalar in the
AR(1) model, it’s a matrix now, but the formula is otherwise the same.

constar3coeffs=[.9,-.1,-.25]constar3sigma=0.15(ar3coeffs,ar3sigma)|>x->apply(ar,x)|>f->iterate(f,ar3sigma*rand(3))|>i->map(first,take(i,250))|>s->plotar(s,"AR(3) Time Series")

Conclusion

Most iteration you’ll see in the wild uses simple collections or ranges as the
iterator, performing extensive work inside the loop. Sometimes our problem can be
better expressed using more complicated iterators whose structure represents the
logic of our iteration. One thing to notice in all the examples was that once
the iterators were defined, there was very little to do after iterating over
them. Typically I was just collecting the iteration values into an array, or
reducing them with an operation to a scalar result. We were also able to build
the problems in such a way that calculation of values in the iterators was
delayed until absolutely necessary.

There are tradeoffs to this sort of style, and much of the stuff in this
post was more cute than practical. But it was a fun exploration of how to create types
that interact with protocols in Julia. Julia’s type system and dispatch design
are very powerful and interesting, and gives programmers a lot of flexibility in
expressing their problems.

While
we’ll see lots of examples of extending Julia’s base functions dispatched on
newly-defined types, we won’t see much multiple dispatch, which is an
important design feature of Julia. In fact, pretty much everything here could be
implemented in a single-dispatch OO language.

Pythonistas
may be thinking about the distinction between iterators and iterables. (See,
e.g. this Stack Overflow thread.) That distinction doesn’t really apply to
Julia. So I won’t use the term iterable here, and I’ll define an iterator in the two
ways discussed above: (1) it is valid in a for i = I statement, and (2) it
implements the iterator protocol.

This limitation seems
to come from the idea that only iterators with known lengths can be counted on
to produce multidimensional arrays. This may be changed in future versions of
Julia. See, e.g. Issue #550. The collect function uses the
push! function to dynamically allocate the array, but collect can only give
a 1-D Array output, whereas comprehensions can be multidimensional.

Actually, Julia’s filter function already does this. If you pass
that function a predicate or condition function and an iterator, it produce a
Filter object that you can then iterate over. This is different from
map which can take an input iterator, but returns the result of
mapping the function immediately.