Datasets provide an RDD-like API, but with all the performance advantages given by Catalyst and Tungsten. groupBy(), for example, which was always a performance no-no on RDDs, can be done efficiently with Datasets without worrying for example about some groups being too large for a single node because Catalyst can spill large groups. And it's all strongly typed and type-safe.

Saturday, October 3, 2015

You would think it would be easy to find an example pom.xml for Scala somewhere on the web. It's not. And the example one at http://www.scala-lang.org/old/node/345 doesn't work because its <sourcedir>src/main/java</sourcedir> excludes all your Scala files from src/main/scala!