Errata for Learning Spark

The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".

The following errata were submitted by our customers and approved as valid errors by the author or editor.

In Example 3-21, return types for the getMatches* methods are incorrect. getMatchesFunctionReference() should return RDD[Boolean], and getMatchesFieldReference() and getMatchesNoReference() should either return RDD[Array[String]] or change the implementation to use flatMap instead of map.

Note from the Author or Editor:I've updated the return types, thanks for catching this.
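The map-versus-flatMap choice behind this erratum can be seen without Spark. A minimal plain-Scala sketch, using List in place of RDD (the sample strings are made up for illustration):

```scala
val lines = List("hello world", "hi")

// map keeps one Array[String] per input line, so the result is nested
val mapped: List[Array[String]] = lines.map(_.split(" "))

// flatMap concatenates the split pieces, so the result is flat again
val flatMapped: List[String] = lines.flatMap(_.split(" "))
```

This mirrors the two options the erratum describes: keep map and return RDD[Array[String]], or switch to flatMap and return RDD[String].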

Anonymous

Feb 19, 2015

Mar 27, 2015

PDF

Page vii
2nd paragraph

duplicated wording
READ:
"You’ll learn how to learn how to download..."
SHOULD READ:
"You’ll learn how to download..."

Note from the Author or Editor:Fixed in fe6dc3e1dd493a83464e115a4309ab806cf240cb

Page 9 has the following text:
This will download a compressed tar
file, or “tarball,” called spark-1.1.0-bin-hadoop1.tgz .
On page 10, a different tarball is referenced:
cd ~
tar -xf spark-1.1.0-bin-hadoop2.tgz
cd spark-1.1.0-bin-hadoop2

In order to match with the code in Github:
com.oreilly.learningsparkexamples.mini.Scala.WordCount
should be:
com.oreilly.learningsparkexamples.mini.scala.WordCount
and
com.oreilly.learningsparkexamples.mini.Java.WordCount
should be:
com.oreilly.learningsparkexamples.mini.java.WordCount
Lower case scala and java as paths. Compilation fails otherwise.

Note from the Author or Editor:I've fixed this in the copy edit version we got back.

Note from the Author or Editor:Thanks for catching this, I've added in the missing semicolons to example 3-17.

Tatsuo Kawasaki

May 01, 2015

PDF

Page 32
1

I was earlier asked by the author to return the book because I reported an issue.
I am therefore not writing this with the intention of it being corrected, but just to let others know how the code can be run correctly.
The following code does not run:
-----------------------------------------------------------------------------------------------
class SearchFunctions(val query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }
  def getMatchesFunctionReference(rdd: RDD[String]): RDD[String] = {
    // Problem: "isMatch" means "this.isMatch", so we pass all of "this"
    rdd.map(isMatch)
  }
  def getMatchesFieldReference(rdd: RDD[String]): RDD[String] = {
    // Problem: "query" means "this.query", so we pass all of "this"
    rdd.map(x => x.split(query))
  }
  def getMatchesNoReference(rdd: RDD[String]): RDD[String] = {
    // Safe: extract just the field we need into a local variable
    val query_ = this.query
    rdd.map(x => x.split(query_))
  }
}
-----------------------------------------------------------------------------------------------
And it can be modified or updated to the following so that it can run:
-----------------------------------------------------------------------------------------------
// The RDD class is not automatically imported, so we have to import it explicitly
import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }
  def getMatchesFunctionReference(rdd: RDD[String]): RDD[Boolean] = {
    // Problem: "isMatch" means "this.isMatch", so we pass all of "this"
    rdd.map(isMatch)
  }
  def getMatchesFieldReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Problem: "query" means "this.query", so we pass all of "this"
    rdd.map(x => x.split(query))
  }
  def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Safe: extract just the field we need into a local variable
    val query_ = this.query
    rdd.map(x => x.split(query_))
  }
}
-----------------------------------------------------------------------------------------------
Regards,
Gourav

Note from the Author or Editor:We should include the import org.apache.spark.rdd.RDD in the standard imports.

Note from the Author or Editor:I've fixed this in the latest build for author provided images, but if O'Reilly has already started remaking the images you may need to redo the Figure 3-3 bottom right as the submitter has suggested.

Note from the Author or Editor:Fixed by author in 54759cf2cf0e41b81bdd56eaa5adb308ac911845

Anonymous

Jan 25, 2015

Mar 27, 2015

Printed

Page 45
Example 3-40

In example 3-40, "result.persist(StorageLevel.DISK_ONLY)" will not work, as it is not imported in the example.
Adding "import org.apache.spark.storage.StorageLevel" will fix this.

Note from the Author or Editor:Thanks for pointing this out, we mention the package that the StorageLevels come from in the persistence table but I've added an import in the example code for clarity.

Tom Hubregtsen

Apr 19, 2015

May 08, 2015

PDF

Page 50
Table 4-2

Right outer join and left outer join "Purpose" descriptions are reversed; in the right outer join, the key must be present in the "other" RDD, not "this" RDD. Reverse mistake is made in the left outer join purpose description.
It's clear from looking at the "Result" columns, which are correct, that in the right-join case the only key in the result is from "other", while in left-join the keys in the results are from "this".
From scaladoc for right outer join:
For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k.

Note from the Author or Editor:Great catch, I've swapped the two.
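The swapped descriptions are easy to check against the semantics quoted from the scaladoc. A plain-Scala sketch (no Spark needed; pair sequences stand in for the pair RDDs, and the sample data is illustrative):

```scala
val thisPairs  = Seq((1, 2), (3, 4), (3, 6)) // plays the role of "this"
val otherPairs = Seq((3, 9))                 // plays the role of "other"

// Right outer join: iterate over `other`, so every key in the result
// is guaranteed present in `other`; values from `this` are optional.
val rightOuter = otherPairs.flatMap { case (k, w) =>
  val vs = thisPairs.collect { case (`k`, v) => v }
  if (vs.isEmpty) Seq((k, (None, w))) else vs.map(v => (k, (Some(v), w)))
}
// rightOuter == Seq((3, (Some(4), 9)), (3, (Some(6), 9)))
```

A left outer join would iterate over `thisPairs` instead, making the values from `other` the optional side — exactly the opposite of what the book's "Purpose" column said.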

Wayne M Adams

Feb 24, 2015

Mar 27, 2015

Printed

Page 53
Example 4-11. Second line.

Shouldn't "rdd.flatMap(...)" be "input.flatMap(...)"

Note from the Author or Editor:Fixed in Atlas

Jim Williams

Apr 06, 2015

May 08, 2015

PDF

Page 54
United States

Example 4-12 does not print out its results as the others do. Also, 4-13 should arguably use a foreach to print as it uses side effects.

Note from the Author or Editor:Fixed print and swapped to foreach in 6f5d7e5d065f88e4df46e03a61fb5b70d8982649

Justin Pihony

Jan 25, 2015

Mar 27, 2015

PDF

Page 57
Example 4-16

Apparent cut-and-paste mistake: the "Custom parallelism" example is the same as the default one, in that no parallelism Int was specified in the example call.

collectAsMap() doesn't return a multimap, so the result should be
Map{(1,2), (3, 6)}
https://github.com/apache/spark/blob/b0d884f044fea1c954da77073f3556cd9ab1e922/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L659
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
* one value per key is preserved in the map returned)

Note from the Author or Editor:Thanks for catching that, I've updated the example.
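The one-value-per-key behavior quoted from the scaladoc can be reproduced without Spark, since Scala's own toMap also keeps a single value per duplicate key (the sample pairs are illustrative):

```scala
val pairs = Seq((1, 2), (3, 4), (3, 6))

// Like collectAsMap(), toMap is not a multimap: for the duplicate key 3
// only one value survives, giving Map(1 -> 2, 3 -> 6)
val asMap = pairs.toMap
```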

Tatsuo Kawasaki

May 08, 2015

PDF

Page 64
4th paragraph

Page 64, Section "Determining an RDD’s Partitioner", second line says, "or partitioner() method in Java".
There is no method "partitioner()" available on "org.apache.spark.api.java.JavaPairRDD." (spark version 1.3.1)
Is this a typo for the method "partitions()"?

Note from the Author or Editor:Seems that the partitioner() function doesn't exist. I'll drop it. (partitions() doesn't return the partitioner, but rather a list of the partitions.)

Gaurav Bhardwaj

May 09, 2015

Printed

Page 65
Example 4-24

In example 4-24, "val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))" will not work, as it is not imported in the example. Either an import or a change into "new org.apache.spark.HashPartitioner(2)" would work.

Note from the Author or Editor:Thanks for pointing this out, I'll update our example to include the import. Fixed in cd090206381a9bbf0466468bf7128a808085522f.

Tom Hubregtsen

Mar 10, 2015

Mar 27, 2015

PDF

Page 66
JSON

It is mentioned that liftweb-json is used for JSON-parsing, however Play JSON is used for parsing and then liftweb-json for JSON output. This is a bit confusing.

Note from the Author or Editor:I've fixed this in the latest push.

Anonymous

Aug 05, 2014

Jan 26, 2015

PDF

Page 67
United States

The case under //Run 10 iterations shadows the links variable. This might be confusing for new developers

Note from the Author or Editor:Thanks, that's a good point, since that behaviour may be confusing for people coming from other languages. I've clarified it by using a different local variable in the dev version.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 70
United States

feildnames

Note from the Author or Editor:Fixed in the latest build (typo)

Anonymous

Aug 17, 2014

Jan 26, 2015

PDF

Page 70
first paragraph

"In Python if an value isn’t present None is used and if the value is present the regular value"
should be
"In Python if a value isn’t present None is used and if the value is present the regular value"

Note from the Author or Editor:Fixed in Atlas

Mark Needham

Nov 30, 2014

Jan 26, 2015

PDF

Page 72
United States

"The input formats that Spark wraps all
transparently handle compressed formats based on the file extension."
is an awkwardly worded sentence.

Note from the Author or Editor:Improved a bit :)

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 73
Example 5-4

(p. 91 of the PDF doc; p. 73 of the book). This is a total nitpick, but the file url is
file://home/holden/salesFiles
and instead should be
file:///home/holden/salesFiles

Note from the Author or Editor:Thanks, fixed :)

Wayne M Adams

Feb 26, 2015

Mar 27, 2015

PDF

Page 73
United States

"Sometimes it’s important to know which file which piece of input came from"
should probably be
"Sometimes it’s important to know which file each piece of input came from"

Note from the Author or Editor:Thanks for catching this. I've fixed the case issue in this example.

Myles Baker

May 11, 2015

PDF

Page 78
United States

import Java.io.StringReader should use a lowercase j
This happens in a number of locations actually.

Note from the Author or Editor:Thanks for catching this. I've applied a global fix to the dev copy.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 79
United States

"If there are only a few input files, and you need to use the wholeFile() method,"
should be
"If there are only a few input files, and you need to use the wholeTextFile() method,"

Note from the Author or Editor:Thanks, fixed to wholeTextFiles.

Justin Pihony

Apr 27, 2015

May 08, 2015

Printed

Page 82
example 5-20

Example 5-20. Loading a SequenceFile in Python should drop the "val" on "val data = ..." Works otherwise.

Note from the Author or Editor:Thanks for catching this, I went ahead and fixed this in Atlas.

jonathan greenleaf

Apr 09, 2015

May 08, 2015

PDF

Page 84
United States

"A similar function, hadoopFile(), exists for working with Hadoop input formats implemented with the older API."
This sentence is in respect to newAPIHadoopFile and should be moved up by one sentence. Maybe as a side-note as it would then throw off the flow into talking about the 3 classes.

Note from the Author or Editor:re-arranged that paragraph for clarity.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 85
Example 5-13/5-14

Minor issue; there should be a
import java.io.StringReader
statement in your CSV loading examples in Scala (and presumably Java)

Note from the Author or Editor:Fixed in commit a9f9f34a3b8513885325f47c1101e657cb5faa89

Timothy Elser

Oct 07, 2014

Jan 26, 2015

ePub

Page 87

"We have looked at the fold, combine, and reduce actions on basic RDDs". There is no RDD.combine(), did you mean aggregate()?

Note from the Author or Editor:Replace combine with aggregate (fixed in f7df06b0c1d730a3a20f173dea8d4ce5c137aa0d).

Thomas Oldervoll

Jan 25, 2015

Mar 27, 2015

PDF

Page 90
United States

"you can specify SPARK_HADOOP_VERSION= as a environment variable" should be as AN environment variable.

Note from the Author or Editor:Fixed :)

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 91
Example 5-31

(p. 109 PDF document; page 91 of book). Minor -- with the import of the HiveContext class, there's no need to fully qualify the class name when invoking the HiveContext constructor.

Note from the Author or Editor:Thanks for catching this, I've simplified the code as suggested in b9d7e376aae27e2f8d4de6d431691a62852d92ba.

Wayne M Adams

Feb 26, 2015

Mar 27, 2015

PDF

Page 95
United States

Why is the SparkContext and JavaSparkContext in example 5-40 and 5-41 using different arguments? If no reason, then they should be synchronized.

Note from the Author or Editor:Unified :)

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 101
United States

Example 6-3 creates the SparkContext, while the other examples do not.

Note from the Author or Editor:Unified, thanks.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 102
Third paragraph

Don't need a comma before the word "or" in:
"... when there are multiple values to keep track of, or when the same value needs..."
"... percentage of our data to be corrupted, or allow for the backend to fail..."

Note from the Author or Editor:Fixed.

Anonymous

Feb 04, 2015

Mar 27, 2015

PDF

Page 103
United States

Example 6-5 outputs Too many errors: # in #
But, this would be "invalid in valid", where what is really needed is "invalid in total" or another wording.

Note from the Author or Editor:True, changed the output in the example to clarify.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 111
United States

String interpolation in example 6-17 needs to be in brackets as it uses a property of the y object.

Note from the Author or Editor:Good catch, fixed in the dev build.
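The brace rule this erratum relies on can be shown in plain Scala; the class and field names here are hypothetical, not the actual code of Example 6-17:

```scala
case class Contact(name: String)
val y = Contact("spark")

// Without braces, only the identifier `y` is interpolated and ".name"
// is left as literal text in the output
val unbracketed = s"$y.name"   // "Contact(spark).name"

// With ${...}, the property access is part of the interpolation
val bracketed = s"${y.name}"   // "spark"
```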

Justin Pihony

Apr 27, 2015

May 08, 2015

ePub

Page 112
3rd

Text reads: “Spark has many levels of persistence to chose from based on what our goals are. ”
should read: “Spark has many levels of persistence to choose from based on what our goals are. ”

Note from the Author or Editor:This is a good catch, since split on an empty string returns an array with a single element the result isn't what we want. Swapping the order of the map/filter does what we want. Fixed in 0374336d16ebb32ca3452b37c7bb1642ca0755a3.
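The split-on-empty-string behavior the note mentions can be checked in plain Scala (the sample lines are illustrative): splitting "" yields a one-element array rather than an empty one, which is why the filter has to run before the split-based map, not after.

```scala
// "".split(" ") yields Array(""), a one-element array -- not Array()
val pieces = "".split(" ")

val lines = Seq("hello world", "")

// Splitting first leaves a stray "" in the output
val splitFirst = lines.flatMap(_.split(" "))

// Filtering out empty lines first gives the intended result
val filterFirst = lines.filter(_.nonEmpty).flatMap(_.split(" "))
```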

Wayne M Adams

Mar 10, 2015

Mar 27, 2015

PDF

Page 147
United States

"To trigger computation, let’s call an action on the counts RDD and collect() it to the driver, as shown in Example 8-9" might read better as "To trigger computation, let’s call an action on the counts RDD BY collect()ING it to the driver, as shown in Example 8-9."

Note from the Author or Editor:Does sound better, thanks updated in dev.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 156
United States

Example 8-11 coalesces to 5, but the number of partitions is listed as 4.

Justin Pihony

Apr 27, 2015

May 08, 2015

Printed

Page 157
6th line down of text

Extra "this".

Note from the Author or Editor:Thanks for catching this, fixed in Atlas.

Note from the Author or Editor:Thanks for catching this, I think this was from an indexing tag that accidentally got included in the text. I've changed this in Atlas and it should be removed in the next update.

Wayne M Adams

Apr 02, 2015

May 08, 2015

PDF

Page 163
Table

Table 9.1 lists the Scala and Java types/imports for Timestamp.
java.sql.TimeStamp
should be
java.sql.Timestamp

Note from the Author or Editor:Fixed.

Anirudh Koul

Feb 03, 2015

Mar 27, 2015

PDF

Page 165
United States

Example 9-8 does not match 9-6 and 9-7 in that it does not show the creation of the SparkContext.

Note from the Author or Editor:Unified.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 179
United States

Example 9-39 has a collect and println, whereas 9-36 and 9-37 do not.

Note from the Author or Editor:Removed the println from the Java example (done in Atlas).

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 185
United States

Example 10-5 @override is missing on the call method.

Note from the Author or Editor:Added missing @override

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 195
United States

In Figure 10-7, an arrow is missing on the inverse graph from {4,2} to 20

Note from the Author or Editor:On the right side of the graph we need to add an arrow from the {4,2} box to 20 and remove its line to box 22.

Justin Pihony

Apr 27, 2015

May 08, 2015

PDF

Page 199
United States

"...it needs to have a consistent date format..."
I am pretty sure this should be datA format

Note from the Author or Editor:fixed.

Justin Pihony

Apr 28, 2015

May 08, 2015

PDF

Page 201
United States

Example 10-32 uses a helper with a map for the final action, whereas 10-33 simply calls print

Note from the Author or Editor:There was a difference, I've made both just print.