This tutorial departs from the very beginner nature of the previous three, so it may be of more interest to readers who already have some programming experience in another language. (That said, see also the section on matching in Scala in Part 3.)

Iteration, the Scala way(s)

Up to now, we have (mostly) accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”. Here is how to say this in Scala.
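A minimal sketch, assuming animals is a List holding the four strings that show up in the outputs later in this section (the definition itself is an assumption):

```scala
// Assumed definition: the four animals that appear in the printed output further down.
val animals = List("newt", "armadillo", "cat", "guppy")

// Apply println to each element of the list, in order.
animals.foreach(println)
```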

This says to take each element of the list (indicated by foreach) and apply a function (in this case, println) to it, in order. There is some underspecification going on in that we aren’t providing a variable to name elements. This works in some cases, such as above, but won’t always be possible. Here is how it looks in full, with a variable naming the element.
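As a sketch of the fully spelled-out form, again assuming the same four-element animals list (the variable name animal matches the examples that follow):

```scala
// Assumed definition of the example list.
val animals = List("newt", "armadillo", "cat", "guppy")

// Name each element explicitly as `animal`, then pass it to println.
animals.foreach(animal => println(animal))
```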

This is useful when you need to do a bit more, such as concatenating a String element with another String.

scala> animals.foreach(animal => println("She turned me into a " + animal))
She turned me into a newt
She turned me into a armadillo
She turned me into a cat
She turned me into a guppy

It is also useful if you are performing a computation with the element, like outputting the length of each element in a list of strings.

scala> animals.foreach(animal => println(animal.length))
4
9
3
5

We can obtain the same result as foreach using a for expression.

scala> for (animal <- animals) println(animal.length)
4
9
3
5

For what we have been doing so far, these two ways of iterating over the elements of a List are equivalent. However, they differ in an important way: a for expression can return a value when combined with yield, whereas foreach simply applies a function to every element of the list and returns nothing useful. The latter kind of use relies on side effects: by printing out each element, we are not creating new values, we are just performing an action on each element. With for expressions, we can yield values that create transformed Lists. For example, contrast using println with the following.
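A minimal sketch of a for expression with yield, assuming the same animals list as in the outputs above and using the name lengths for the result:

```scala
// Assumed definition of the example list.
val animals = List("newt", "armadillo", "cat", "guppy")

// yield collects one value per element into a new List instead of printing.
val lengths = for (animal <- animals) yield animal.length
// lengths: List[Int] = List(4, 9, 3, 5)
```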

The result is a new list that contains the lengths (number of characters) of each of the elements of the animals list. (You can of course print its contents now by doing lengths.foreach(println), but typically we want to do other, usually more interesting, things with such values.)

What we just did was map the values of animals into a new set of values in a one-to-one manner, using the function length. Lists have another function called map that does this directly.
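Roughly, and under the same assumption about the contents of animals, the map version might look like this (lengthsViaMap is a name introduced here for illustration):

```scala
// Assumed definition of the example list.
val animals = List("newt", "armadillo", "cat", "guppy")

// map applies the given function to every element and returns a new List.
val lengthsViaMap = animals.map(animal => animal.length)
// lengthsViaMap: List[Int] = List(4, 9, 3, 5)
```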

So, the for-yield expression and the map method achieve the same output, and in many cases they are pretty much equivalent. Using map, however, is often more convenient because you can easily chain a series of operations together. For example, let’s say you want to add 1 to a List of numbers and then get the square of that, so turning List(1,2,3) into List(2,3,4) into List(4,9,16). You can do that quite easily using map.

val nums = List(1,2,3)
nums.map(x=>x+1).map(x=>x*x)

Some readers will be puzzled by what was just done. Here it is more explicitly, using an intermediate variable nums2 to store the add-one list.
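A sketch of the two-step version, taking nums to be List(1,2,3) as in the description above (squared is a name introduced here for illustration):

```scala
val nums = List(1, 2, 3)

// Step 1: add one to every element and keep the intermediate list.
val nums2 = nums.map(x => x + 1)     // List(2, 3, 4)

// Step 2: square every element of the intermediate list.
val squared = nums2.map(x => x * x)  // List(4, 9, 16)
```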

Since nums.map(x=>x+1) returns a List, we don’t have to assign it to a variable to use it; we can use it immediately, including calling map on it again. (Of course, one could do this computation in a single go, e.g. map(x=>(x+1)*(x+1)), but often one is using a series of built-in functions, or functions one has already defined.)

You can keep on mapping to your heart’s content, including mapping from Ints to Strings.
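For instance, here is one possible chain that ends by turning each Int into a String (the message text and the name messages are made up for illustration):

```scala
val nums = List(1, 2, 3)

// Add one, square, then build a String from each resulting Int.
val messages = nums.map(x => x + 1).map(x => x * x).map(x => "The result is " + x)
// messages: List[String] = List(The result is 4, The result is 9, The result is 16)
```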

Note: the name x in all these cases is not important. The variables could have been named x, y, z, or turlingdromes42, or any other valid variable name.

Iterating through multiple lists

Sometimes you have two lists that are paired up and you need to do something to elements from each list simultaneously. For example, let’s say you have a list of word tokens and another list with their parts-of-speech. (See the previous tutorial for discussion of parts-of-speech.)
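One common way to handle this is the zip method, which pairs up the elements at the same positions in two lists. The sketch below uses made-up example data, and the variable names are assumptions:

```scala
// Hypothetical example data: word tokens and their parts-of-speech, in the same order.
val tokens = List("The", "happy", "cat", "sleeps")
val tags = List("DT", "JJ", "NN", "VBZ")

// zip produces a List of (token, tag) pairs.
val tokenTagPairs = tokens.zip(tags)

// Iterate over the pairs, naming both members of each pair.
for ((token, tag) <- tokenTagPairs) println(token + "/" + tag)

// The same pattern works with yield to build a new List.
val slashStrings = for ((token, tag) <- tokenTagPairs) yield token + "/" + tag
```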

Ripping a string into a useful data structure

It is common in computational linguistics to need to convert string inputs into useful data structures. Consider the part-of-speech tagged sentence mentioned in the previous tutorial. Let’s begin by assigning it to the variable sentRaw.

Now, let’s turn it into a List of Tuples, where each Tuple has the word as its first element and the postag as its second. We begin with the single line that does this so that you can see what the desired result is, and then we’ll examine each step in detail.
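A sketch of such a single line follows. The sentence assigned to sentRaw is an assumed example in word/TAG format and may differ from the one in the previous tutorial; tokenTagPairs is a name introduced here.

```scala
// Assumed example sentence: each token is followed by a slash and its Penn Treebank tag.
val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

// Split on spaces, convert to a List, split each item on "/", and build a Tuple2 from the two parts.
val tokenTagPairs = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1)))
// tokenTagPairs starts with (The,DT), (index,NN), (of,IN)
```

The first step on its own, sentRaw.split(" "), returns an Array of Strings rather than a List.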

What’s an Array? It’s a kind of sequence, like List, but it has some different properties that we’ll discuss later. For now, let’s stick with Lists, which we can do by using the toList method. Additionally, let’s assign it to a variable so that the remaining operations are easier to focus on.
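In code, that step might look like the following, using the same assumed example sentence:

```scala
// Assumed example sentence in word/TAG format.
val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

// split returns an Array[String]; toList converts it to a List[String].
val tokenTagSlashStrings = sentRaw.split(" ").toList
// tokenTagSlashStrings starts with The/DT, index/NN, of/IN
```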

Now, we need to turn each of the elements in that list into pairs of token and tag. Let’s first consider a single element, turning something like “The/DT” into the pair (“The”,“DT”). The next lines show how to do this one step at a time, using intermediate variables.
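A sketch of those steps, using a shortened stand-in for tokenTagSlashStrings (firstSplit is a name introduced here for the intermediate Array):

```scala
// Shortened stand-in for the full list of slash-separated strings.
val tokenTagSlashStrings = List("The/DT", "index/NN", "of/IN")

// Take the first element of the list.
val first = tokenTagSlashStrings.head                  // "The/DT"

// Split it on the slash, giving an Array with two elements.
val firstSplit = first.split("/")                      // Array(The, DT)

// Build a pair from the two elements of the Array.
val firstPair = Tuple2(firstSplit(0), firstSplit(1))   // (The,DT)
```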

So, firstPair is a tuple representing the information encoded in the string first. This involved two operations, splitting and then creating a tuple from the Array that resulted from the split. We can do this for all of the elements in tokenTagSlashStrings using map. Let’s first convert the Strings into Arrays.
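Here is one way this could go, again on a shortened stand-in list (tokenTagArrays and tokenTagPairs are names introduced for illustration):

```scala
// Shortened stand-in for the full list of slash-separated strings.
val tokenTagSlashStrings = List("The/DT", "index/NN", "of/IN")

// Step 1: split every String on the slash, giving a List of Arrays.
val tokenTagArrays = tokenTagSlashStrings.map(x => x.split("/"))

// Step 2: turn every two-element Array into a Tuple2.
val tokenTagPairs = tokenTagArrays.map(x => Tuple2(x(0), x(1)))
// tokenTagPairs: List((The,DT), (index,NN), (of,IN))
```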

Note: if you are comfortable with using one-liners that chain a bunch of operations together, then by all means use them. However, there is no shame in using several lines involving a bunch of intermediate variables if that helps you break apart the task and get the result you need.

One of the very useful things about having a List of pairs (Tuple2s) is that the unzip method gives us back two Lists, one with all of the first elements and another with all of the second elements.
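As a small sketch with a shortened stand-in list of pairs (the names tokens and tags are assumptions):

```scala
// Shortened stand-in for the full list of (token, tag) pairs.
val tokenTagPairs = List(("The", "DT"), ("index", "NN"), ("of", "IN"))

// unzip splits a List of pairs into a pair of Lists.
val (tokens, tags) = tokenTagPairs.unzip
// tokens: List(The, index, of)
// tags:   List(DT, NN, IN)
```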

With this, we’ve come full circle. Having started with a raw string (such as we are likely to read in from a text file), we now have Lists that allow us to do useful computations, such as converting those tags into another form.

Providing a function you have defined to map

Let’s return to the postag simplification exercise we did in the previous tutorial. We’ll modify it a bit: rather than shortening the Penn Treebank parts-of-speech, let’s convert them to coarse parts-of-speech using the English words that most people are familiar with, like noun and verb. The following function turns Penn Treebank tags into these coarse tags, for more tags than we covered in the last tutorial (note: this is still incomplete, but serves to illustrate the point).
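A sketch of such a function is given below. The function name coarseTag, the output labels, and the exact set of tags covered are assumptions made for illustration; the tag list is taken from an assumed example sentence.

```scala
// Hypothetical helper: map a Penn Treebank tag to a coarse, plain-English category.
def coarseTag(tag: String): String = tag match {
  case "NN" | "NNS" | "NNP" | "NNPS" => "Noun"
  case "JJ" | "JJR" | "JJS" => "Adjective"
  case "VB" | "VBD" | "VBG" | "VBN" | "VBP" | "VBZ" => "Verb"
  case "RB" | "RBR" | "RBS" => "Adverb"
  case "DT" => "Determiner"
  case "IN" => "Preposition" // simplification: IN also covers subordinating conjunctions
  case _ => "Other"          // everything not listed above
}

// Assumed example tags.
val tags = List("DT", "NN", "IN", "DT", "CD", "JJS", "NNP", "JJ", "NNS", "VBD", "RB", "IN", "RB", ".")

// A function defined with def can be handed directly to map.
val coarseTags = tags.map(coarseTag)
// coarseTags starts with Determiner, Noun, Preposition, Determiner, Other, Adjective
```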

Voila! If we want to convert the tags in this manner and then output them as a string like what we started with, it’s just a few steps. We’ll start from the beginning and recap. Try running the following for yourself.
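The recap might look roughly like this. The sentence, the coarseTag function, and all variable names other than sentRaw are assumptions for illustration.

```scala
// Assumed example sentence in word/TAG format.
val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

// Hypothetical helper: map a Penn Treebank tag to a coarse category.
def coarseTag(tag: String): String = tag match {
  case "NN" | "NNS" | "NNP" | "NNPS" => "Noun"
  case "JJ" | "JJR" | "JJS" => "Adjective"
  case "VB" | "VBD" | "VBG" | "VBN" | "VBP" | "VBZ" => "Verb"
  case "RB" | "RBR" | "RBS" => "Adverb"
  case "DT" => "Determiner"
  case "IN" => "Preposition"
  case _ => "Other"
}

// Raw string -> List of (token, tag) pairs -> separate Lists of tokens and tags.
val tokenTagPairs = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1)))
val (tokens, tags) = tokenTagPairs.unzip

// Convert the tags, pair them back up with the tokens, and rebuild a single String.
val coarseTags = tags.map(coarseTag)
val output = tokens.zip(coarseTags).map(pair => pair._1 + "/" + pair._2).mkString(" ")
// output starts with: The/Determiner index/Noun of/Preposition
```

Note that the function passed to the last map, pair => pair._1 + "/" + pair._2, is written inline at the point of use rather than defined beforehand with def.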

This is similar to defining functions as we had previously (e.g. def addOne (x: Int) = x+1), but it is more convenient in certain contexts, which we’ll get to later. For now, the thing to realize is that whenever you map, you are either using a function that already existed or creating one on the fly.
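To make the contrast concrete, here is a small sketch that passes map a function defined with def, a function stored in a val, and a function written on the fly. The names addOneFn and the three result variables are made up for illustration.

```scala
// A function defined ahead of time with def.
def addOne(x: Int) = x + 1

// The same computation as a function value assigned to a variable.
val addOneFn = (x: Int) => x + 1

val nums = List(1, 2, 3)

val viaDef = nums.map(addOne)            // using an existing function
val viaFunctionValue = nums.map(addOneFn) // using a function stored in a val
val viaOnTheFly = nums.map(x => x + 1)   // creating the function at the point of use
// All three are List(2, 3, 4)
```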

Filtering and counting

The map method is a convenient way of performing computations on each element of a List, effectively transforming a List from one set of values to a new List with a set of values computed from each corresponding element. There are yet more methods that have other actions, such as removing elements from a List (filter), counting the number of elements satisfying a given predicate (count), and computing an aggregate single result from all elements in a List (reduce and fold). Let’s consider a simple task: count how many tokens are not a noun or adjective in a tagged sentence. As a starting point, let’s take the list of mapped postags from before.
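A step-by-step sketch using filter follows. The list of coarse tags and its labels ("Noun", "Adjective", and so on) are assumed example values, and the intermediate variable names are made up.

```scala
// Assumed list of coarse tags for a 14-token example sentence.
val coarseTags = List("Determiner", "Noun", "Preposition", "Determiner", "Other", "Adjective", "Noun", "Adjective", "Noun", "Verb", "Adverb", "Preposition", "Adverb", "Other")

// Keep only the tags that are not "Noun".
val noNouns = coarseTags.filter(x => x != "Noun")

// From those, keep only the tags that are not "Adjective".
val noNounsOrAdjectives = noNouns.filter(x => x != "Adjective")

// Count what is left.
val numRemaining = noNounsOrAdjectives.length
// numRemaining: Int = 9
```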

However, because the function we give to filter just needs to return a Boolean value, we can of course use Boolean conjunction and disjunction to simplify things. And we don’t need to save intermediate variables. Here’s the one-liner.
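A sketch of such a one-liner, on the same assumed list of coarse tags, together with an equivalent that uses count (the result variable names are made up):

```scala
// Assumed list of coarse tags for a 14-token example sentence.
val coarseTags = List("Determiner", "Noun", "Preposition", "Determiner", "Other", "Adjective", "Noun", "Adjective", "Noun", "Verb", "Adverb", "Preposition", "Adverb", "Other")

// One filter with a combined condition, followed by length.
val viaFilter = coarseTags.filter(x => x != "Noun" && x != "Adjective").length

// count takes the same kind of Boolean-returning function and skips the intermediate List.
val viaCount = coarseTags.count(x => x != "Noun" && x != "Adjective")
// Both are 9 for this list
```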

As an exercise, try doing a one-liner that starts with sentRaw and provides the value “resX: Int = 9” (where X is whatever you get in your Scala REPL).

In the next tutorial, we’ll see how to use reduce and fold to compute aggregate results from a List.

Copyright 2011 Jason Baldridge

The text of this tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to http://www.jasonbaldridge.com and to this original tutorial.

Suggestions, improvements, extensions and bug fixes welcome — please email Jason at jasonbaldridge@gmail.com or provide a comment to this post.