This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map that contains the counts of each word type in a given text. However, it assumed that the text was available in a String variable, whereas we typically want to know things about files that live on the file system or on the internet. This tutorial shows how to read a file’s contents into Scala for processing, either by building a single String for the file or by consuming it line by line in a streaming fashion. Along the way, mutable Maps are introduced as a way to enable word counting without reading an entire file into memory.

Calling Source.fromFile (from the scala.io package) with the name of a file creates a BufferedSource, from which you can easily get all of the file’s contents as a String.

scala> import scala.io.Source
scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many more lines...>

With this, you can do the same things as shown in tutorial 7 to get the word counts (except that here we’ll split on whitespace sequences rather than just a single space).
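As a minimal sketch of that step, here is the split-and-count pattern applied to a short made-up String rather than the full holmes text, so that it runs on its own. The names sample, tokens and counts are just for illustration.

```scala
// A short stand-in for the book; in the tutorial, holmes holds the whole file.
val sample = "the cat saw   the dog\nand the dog saw the cat"

// Split on runs of whitespace (spaces, tabs, newlines) rather than one space.
val tokens: Array[String] = sample.split("\\s+")

// Group identical tokens together, then count the size of each group.
val counts: Map[String, Int] =
  tokens.groupBy(identity).map { case (word, occurrences) => (word, occurrences.length) }
```

Looking up a word is then just counts("the"), which gives 4 for this sample.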

Lest you think it strange that Watson only shows up four times, keep in mind that we split on whitespace, and that means that in a sentence like the following, the token of interest is Watson,” rather than Watson.

“You could not possibly have come at a better time, my dear Watson,” he said cordially.

Looking that and others up shows more tokens containing Watson in the story.
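One way to look for such tokens is to filter the counts Map down to the keys that contain Watson. The sketch below does this on a small made-up text (not the book itself), so the numbers it produces say nothing about the real file.

```scala
// Made-up sample text in which Watson appears with and without punctuation.
val watsonSample = """my dear Watson," he said. Watson smiled; "Watson!" cried Holmes"""

// Count whitespace-separated tokens, as before.
val watsonTokenCounts: Map[String, Int] =
  watsonSample.split("\\s+").groupBy(identity).map { case (tok, occ) => (tok, occ.length) }

// Keep only the entries whose token contains the substring "Watson".
val containingWatson: Map[String, Int] =
  watsonTokenCounts.filter { case (token, _) => token.contains("Watson") }

// Add up the counts of all those variants.
val totalWatson: Int = containingWatson.values.sum
```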

Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly generally takes a good hand-built tokenizer (one that is able to keep tokens like e.g. and Mr. and Yahoo! intact while splitting punctuation off most words) or a machine-learned one that is trained on data hand-labeled for tokens. For an example of the latter, see the Apache OpenNLP toolkit tokenizers, which include pre-trained models for English.

Working line by line

Quite often, you need to work through a file line by line, rather than reading the entire thing in as a single String as we did above. For example, you might need to process each line differently, so just having it as a single String isn’t particularly convenient. Or, you might be working with a large file that cannot easily fit into memory (which is where the whole file goes when you read it in as one String). You can obtain the lines in the file as an Iterator[String], in which each item is a single line from the file, using the getLines method.
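Here is a minimal sketch of getLines in use. It writes a three-line temporary file so that it runs without pg1661.txt; demoPath and the withLines helper are hypothetical names introduced only for this example, and the helper also closes the file when it is done.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

// Write a tiny stand-in file so the sketch does not depend on pg1661.txt.
val demoPath: Path = Files.write(
  Files.createTempFile("lines-demo", ".txt"),
  "first line\nsecond line\nthird line\n".getBytes("UTF-8"))

// Hypothetical helper: hand the Iterator of lines to f, then close the file.
def withLines[A](path: Path)(f: Iterator[String] => A): A = {
  val source = Source.fromFile(path.toFile, "UTF-8")
  try f(source.getLines()) finally source.close()
}

// Only as many lines as are asked for get pulled from the Iterator.
val firstTwo: List[String] = withLines(demoPath)(_.take(2).toList)
val lineCount: Int = withLines(demoPath)(_.size)
```

Note that f has to finish consuming the Iterator before it returns, since the file is closed right afterwards; that is why the first call ends with toList.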

This iterator is ready for you to consume lines, but it doesn’t read all of the file into memory right away — instead it buffers it such that each line will be available for you as you ask for it, essentially reading off disk as you demand more lines. You can think of this as streaming the file to your Scala program, much like modern audio and video content is streamed to your computer: it is never actually stored, but is just transferred in parts to where it is needed, when it is needed.

Of course, Iterators share much with sequence data structures like Lists: once we have an Iterator, we can use foreach, for, map, etc. on it. So to print out all of the lines in the file, we can do the following.

scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
<...many more lines...>

That creates a lot of output, but it shows you how you can easily create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

scala.io.Source.fromFile(args(0)).getLines.foreach(println)

And then call that with the name of the file to list its contents.

$ scala cat.scala pg1661.txt

Back in the REPL, it is somewhat less-than-ideal to see the entire file. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included

The take method is quite useful in general with any sequence, and provides the complement of the drop method, as shown in the following examples on a simple List[Int].
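A minimal sketch of take and drop on a List[Int] (the value names are just for illustration):

```scala
val numbers: List[Int] = List(1, 2, 3, 4, 5, 6)

// take(n) keeps the first n elements.
val firstThree: List[Int] = numbers.take(3)   // List(1, 2, 3)

// drop(n) discards the first n elements and keeps the rest.
val lastThree: List[Int] = numbers.drop(3)    // List(4, 5, 6)

// Together they split the list in two without losing anything.
val rejoined: List[Int] = numbers.take(3) ++ numbers.drop(3)
```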

Word counting line by line, first try

Now that we’ve seen how to read a file and start working with it line-by-line, how do we count the number of occurrences of each word? Recall from tutorial 7 and above that the starting point was to have a sequence (Array, List, etc) of Strings in which each element is a word token. To start moving toward that, we can simply use the toList method on the Iterator[String] obtained from getLines.

scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file. Please do not remove it. Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file. Included is, important information about your specific rights and restrictions in, how the file may be used. You can also find ou...

We now have the contents of the file as a List[String], and may proceed to do useful things with it. For example, we could map each line (a String) to a sequence of whitespace-separated Strings.
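As a sketch of that mapping, here it is on a two-line List standing in for the lines of the file. It also shows flatMap, which flattens the per-line Arrays into one List of tokens that can be counted as before. The value names are just for illustration.

```scala
// Stand-in for the List[String] of lines obtained from getLines.toList.
val lineList: List[String] = List("the cat saw", "the dog")

// map gives one Array of tokens per line.
val tokensPerLine: List[Array[String]] = lineList.map(_.split("\\s+"))

// flatMap does the same split but flattens the results into a single List.
val allTokens: List[String] = lineList.flatMap(_.split("\\s+"))

// With a flat List of tokens, the groupBy counting pattern applies directly.
val listCounts: Map[String, Int] =
  allTokens.groupBy(identity).map { case (word, occ) => (word, occ.length) }
```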

But you should be a bit bothered by all this: wasn’t the idea here (in part) not to read all of the lines in at once? Indeed, with what we did above, as soon as we said toList on the Iterator, the whole file was read into memory. However, we can do without the toList step and just directly flatMap the Iterator and get a new Iterator over the tokens rather than the lines.
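A minimal sketch of that idea, with an in-memory Iterator standing in for the one returned by getLines. Grouping needs a concrete collection, so this version calls toList on the token Iterator before counting.

```scala
// Stand-in for Source.fromFile(...).getLines
val lineIterator: Iterator[String] = Iterator("the cat saw", "the dog")

// flatMap on an Iterator yields another Iterator; nothing is read yet.
val tokenIterator: Iterator[String] = lineIterator.flatMap(_.split("\\s+"))

// toList pulls every token through the Iterator and stores them all.
val countsViaList: Map[String, Int] =
  tokenIterator.toList.groupBy(identity).map { case (word, occ) => (word, occ.length) }
```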

Oops: that works, but calling toList on the token Iterator in order to group the tokens brings the whole file into memory once again, because the resulting List holds every token in the file. We’ll see next how to use a mutable Map to get around this.

Word counting by streaming with an Iterator and using mutable Maps

In all of the tutorials so far, I’ve pretty much stuck to immutable data structures except when mutable ones show up due to context (like Arrays coming out of the split method). It’s good to try to make use of immutable data structures where possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps we saw in the previous tutorial, you could not change the value assigned to a key, nor could you add a new key. Mutable Maps, found in scala.collection.mutable, allow both.

Note: when you start with some values already in a Map, Scala can infer the types of the keys and the values, but when initializing an empty Map, it is necessary to explicitly declare the key and value types.
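A small sketch of both points, using scala.collection.mutable.Map. The keys and values are made up, and the updates are wrapped in a block only so that the snippet stands on its own.

```scala
import scala.collection.mutable

// Key and value types are inferred from the initial entries.
val updated: mutable.Map[String, Int] = {
  val m = mutable.Map("a" -> 1, "b" -> 2)
  m("a") = 10    // change the value assigned to an existing key
  m("c") = 3     // add a new key
  m("b") += 1    // read the current value, add one, store it back
  m
}

// An empty Map has nothing to infer from, so the types are given explicitly.
val emptyCounts: mutable.Map[String, Int] = mutable.Map[String, Int]()
```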

With this in hand, here is how we can use flatMap plus a mutable Map to count words in a text without reading the entire text into memory.
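The following is a sketch of one way to do it, not necessarily the exact listing used originally. The function name countWords is hypothetical, the default value of 0 for unseen words is an assumption that keeps the update to one line, and an in-memory Iterator stands in for getLines so the example is self-contained.

```scala
import scala.collection.mutable

// Count tokens one at a time; only the Map of counts is kept in memory.
def countWords(lines: Iterator[String]): mutable.Map[String, Int] = {
  // Unseen words start at 0, so the first += 1 sets them to 1.
  val counts = mutable.Map[String, Int]().withDefaultValue(0)
  lines.flatMap(_.split("\\s+")).foreach { word => counts(word) += 1 }
  counts
}

// Stand-in for Source.fromFile("pg1661.txt").getLines
val streamedCounts = countWords(Iterator("the cat saw", "the dog"))

// An immutable copy of the finished counts.
val fixedCounts: Map[String, Int] = streamedCounts.toMap
```

Each line is split, counted and then discarded before the next one is requested, so memory use grows with the number of distinct words rather than with the length of the file.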

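Once the counting is done, the mutable Map can be converted to an immutable one, for example by calling toMap on it and assigning the result to fixedCounts. Now we can’t modify the values on fixedCounts, which has advantages in many contexts, e.g. we can’t accidentally destroy values or add unwanted keys, and there are (positive) implications for parallel processing.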

Source.fromURL works in the same way as Source.fromFile, but reads its content from a URL rather than from a local file. If you are just going to analyze the same file again and again, this is probably not what you need; just download the file and use it locally. However, it can be quite useful in contexts where you are exploring links within pages (e.g. while processing Wikipedia or Twitter data) and need to read in content from URLs on the fly.
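As a sketch of reading from a URL with Source.fromURL, the example below points at a temporary local file through a file: URL so that it runs without network access; an http URL would be passed in the same way. The names pageUrl and linesFromUrl are hypothetical.

```scala
import java.net.URL
import java.nio.file.Files
import scala.io.Source

// A local file exposed as a URL, standing in for a page on the internet.
val pageUrl: URL = Files.write(
  Files.createTempFile("url-demo", ".txt"),
  "alpha beta\ngamma\n".getBytes("UTF-8")).toUri.toURL

// Read all lines from the URL, closing the underlying stream afterwards.
def linesFromUrl(url: URL): List[String] = {
  val source = Source.fromURL(url, "UTF-8")
  try source.getLines().toList finally source.close()
}

val urlLines: List[String] = linesFromUrl(pageUrl)
```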

Use (up) the Source

A final note on the Iterators you get with Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient, since they aren’t holding all that text in memory. So, don’t be surprised if you get the following behavior.

scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
holmesIterator: Iterator[String] = non-empty iterator
scala> holmesIterator.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many lines of output...>
This Web site includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.
scala> holmesIterator.foreach(println)
<...nothing output!...>

So, the Iterator is used up! If you want to go through the file again, you’ll need to spin up a new Iterator just like you did the first time around. The neat thing about staying with the Iterators and not converting to Lists (and thus bringing everything into memory) is that each mapping operation we do on the Iterator applies only for the current item we are looking at, so we never need to read the whole file into memory.

Of course, if you have a reasonably small file to work with, you should feel absolutely free to toList it and work with it that way if you prefer. It will often be more convenient, since you can use the groupBy and mapValues pattern.
