GS Collections by Example – Part 1

I am a Java Developer, Tech Fellow and Managing Director at Goldman Sachs. I am the creator of the GS Collections framework that Goldman Sachs open sourced in January 2012. I am also a former Smalltalk developer.

When I started working with Java, there were two things that I missed.

Smalltalk’s block closures (aka lambdas)

The wonderfully feature-rich Smalltalk Collections Framework.

I wanted both of these features as well as compatibility with the existing Java Collections interfaces. Around 2004, I realized that no one was going to give me everything I was looking for in Java. I also knew at this point that I would likely be programming in Java for at least the next 10-15 years of my career. So I decided to start building what I was looking for.

Fast forward 10 years. I now have almost everything I ever wanted in Java. I have support for lambdas in Java 8, and I now get to use lambdas and method references with arguably the most feature-rich Java collections framework available – GS Collections.

Here is a comparison of the features available in GS Collections, Java 8, Guava, Trove and Scala. These may not be all of the features you are looking for in a Collections framework, but they are the features that I have needed or other GS developers I work with have needed in Java over the last 10+ years.

| Features | GSC 5.0 | Java 8 | Guava | Trove | Scala |
|---|---|---|---|---|---|
| Rich API | ✓ | ✓ | ✓ | | ✓ |
| Interfaces | Readable, Mutable, Immutable, FixedSize, Lazy | Mutable, Stream | Mutable, Fluent | Mutable | Readable, Mutable, Immutable, Lazy |
| Optimized Set & Map | ✓ (+Bag) | | | ✓ | |
| Immutable Collections | ✓ | | ✓ | | ✓ |
| Primitive Collections | ✓ (+Bag, +Immutable) | | | ✓ | |
| Multimaps | ✓ (+Bag, +SortedBag) | | ✓ (+Linked) | | (Multimap trait) |
| Bags (Multisets) | ✓ | | ✓ | | |
| BiMaps | ✓ | | ✓ | | |
| Iteration Styles | Eager/Lazy, Serial/Parallel | Lazy, Serial/Parallel | Lazy, Serial | Eager, Serial | Eager/Lazy, Serial/Parallel (Lazy Only) |

I described the combination of the features that I believe make GS Collections interesting in an interview with jClarity last year. You can read them in their original form here.

Why would you use GS Collections now that Java 8 is out and includes the Streams API? While the Streams API is a big improvement to the Java Collections Framework, it doesn’t have all the features you might want. As shown in the matrix above, GS Collections has multimaps, bags, immutable containers, and primitive containers. GS Collections has optimized replacements for HashSet and HashMap, and its Bags and Multimaps build on those optimized types. The GS Collections iteration patterns are on the collections interfaces so there’s no need to “enter” the API with a call to stream() and “exit” the API with a call to collect(). This results in much more succinct code in many cases. Finally, GS Collections is compatible back to Java 5. This is a particularly important feature for library developers, since they tend to support their library on older versions of Java well after a new major release comes out.

I will show some examples of how you can leverage these features in many different ways. These examples are variants of the exercises in the GS Collections Kata, a training class we use inside of Goldman Sachs to teach our developers how to use GS Collections. We open sourced this training as a separate repository on GitHub.

Example 1: Filtering a collection

One of the most common things you will want to do with GS Collections is to filter a collection. GS Collections provides several different ways to accomplish this.

In the GS Collections Kata, we will often start with a list of customers. In one of the exercises, I want to filter the list of customers down to a list which only includes customers who live in London. The following code shows how I can accomplish this using an iteration pattern named “select”.

The select method on MutableList returns a MutableList. This code executes eagerly, meaning all computation to select the matching elements from the source list and add them into the target list has been performed by the time the call to select() completes. The name “select” comes from the Smalltalk heritage. Smalltalk has a basic set of collection protocols named select (aka filter), reject (aka filterNot), collect (aka map, transform), detect (aka findOne), detectIfNone, injectInto (aka foldLeft), anySatisfy and allSatisfy.
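To make the eager behaviour concrete, here is a minimal JDK-only sketch. It is not GS Collections code: the Customer class, its livesIn method, the sample data and the static select helper are all assumptions made for illustration. The counter shows that every element has already been tested by the time the call returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// JDK-only illustration of eager "select" semantics (not the GS Collections API).
class EagerSelectSketch {

    // Hypothetical stand-in for the kata's Customer class.
    static final class Customer {
        private final String name;
        private final String city;

        Customer(String name, String city) {
            this.name = name;
            this.city = city;
        }

        String getName() {
            return name;
        }

        boolean livesIn(String cityName) {
            return city.equals(cityName);
        }
    }

    // Eager: every element is tested and copied before this method returns.
    static <T> List<T> select(Iterable<T> source, Predicate<? super T> predicate) {
        List<T> result = new ArrayList<>();
        for (T each : source) {
            if (predicate.test(each)) {
                result.add(each);
            }
        }
        return result;
    }

    // Made-up sample data: two customers in London, one elsewhere.
    static List<Customer> sampleCustomers() {
        return Arrays.asList(
                new Customer("Fred", "London"),
                new Customer("Mary", "Liphook"),
                new Customer("Bill", "London"));
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        List<Customer> londoners = select(sampleCustomers(), c -> {
            evaluations.incrementAndGet();
            return c.livesIn("London");
        });
        // The predicate has run once per customer by the time select() returned.
        System.out.println(londoners.size() + " selected after "
                + evaluations.get() + " evaluations");
    }
}
```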

If I wanted to accomplish the same thing using lazy evaluation, I would write it as follows:

Here I have added a call to a method named asLazy(). All of the other code has stayed pretty much the same. The return type of select has changed because of the call to asLazy(). Instead of a MutableList<Customer>, now I get back a LazyIterable<Customer>. This is pretty much the equivalent of the following code using the new Streams API in Java 8:
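A minimal sketch of what such a Streams pipeline might look like, using only the JDK. The Customer class, its livesIn method and the sample data are hypothetical. The counter is there to show the laziness: building the pipeline with stream() and filter() evaluates nothing, and the predicate only runs once collect() is called.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamSketch {

    // Hypothetical stand-in for the kata's Customer class.
    static final class Customer {
        private final String city;

        Customer(String city) {
            this.city = city;
        }

        boolean livesIn(String cityName) {
            return city.equals(cityName);
        }
    }

    // Made-up sample data: two customers in London, one elsewhere.
    static List<Customer> sampleCustomers() {
        return Arrays.asList(
                new Customer("London"),
                new Customer("Liphook"),
                new Customer("London"));
    }

    // Only describes the pipeline; the predicate has not run when this returns.
    static Stream<Customer> londonStream(List<Customer> customers, AtomicInteger evaluations) {
        return customers.stream().filter(c -> {
            evaluations.incrementAndGet();
            return c.livesIn("London");
        });
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        Stream<Customer> stream = londonStream(sampleCustomers(), evaluations);
        System.out.println("evaluations before collect: " + evaluations.get());

        // The terminal operation pulls the elements through the filter.
        List<Customer> londoners = stream.collect(Collectors.toList());
        System.out.println(londoners.size() + " selected, evaluations after collect: "
                + evaluations.get());
    }
}
```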

Here the method stream() and then the call to filter() return a Stream<Customer>. In order to test the size, I have to either convert the Stream to a List as above, or I can use the Java 8 Stream.count() method:
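As a short sketch of the count() variant, again with a hypothetical Customer class and made-up data, the size check can be done without materializing a List at all:

```java
import java.util.Arrays;
import java.util.List;

class StreamCountSketch {

    // Hypothetical stand-in for the kata's Customer class.
    static final class Customer {
        private final String city;

        Customer(String city) {
            this.city = city;
        }

        boolean livesIn(String cityName) {
            return city.equals(cityName);
        }
    }

    // count() is a terminal operation that returns a long, so no List is built.
    static long countLivingIn(List<Customer> customers, String city) {
        return customers.stream()
                .filter(c -> c.livesIn(city))
                .count();
    }

    public static void main(String[] args) {
        List<Customer> customers = Arrays.asList(
                new Customer("London"),
                new Customer("Liphook"),
                new Customer("London"));
        System.out.println(countLivingIn(customers, "London"));
    }
}
```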

Both GS Collections interfaces MutableList and LazyIterable share a common ancestor named RichIterable. In fact, I could write all of this code just using RichIterable. Here’s the example using only RichIterable<Customer>, first lazily:

There is a common parent interface for both MutableList and ImmutableList named ListIterable. It can be used in place of either type as a more general type. RichIterable is the parent type for ListIterable. So this code can also be more generally written as follows:

I will leave it to the reader to decide whether this impacts readability. I tend to break fluent calls up and introduce intermediate types if I feel it will help future readers of the code to understand things better. This comes at the cost of more code to read, but this in turn can lower the cost of understanding, which can be more important for less frequent readers of the code.

I can accomplish the conversion of the List to a Set in the select method itself. The select method has an overloaded form defined which takes a Predicate as the first parameter, and a result collection as a second parameter:

In the following case I get back a CopyOnWriteArrayList, which is part of the JDK. The point is that this method will return whatever type I specify, but it has to be a class that implements java.util.Collection:
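The idea of a filter that fills a caller-supplied target and hands that same target back can be sketched with the JDK alone. The helper below is hypothetical and is not the GS Collections overload itself; it uses plain city names instead of customers to stay short. The generic bound R extends Collection<T> is what lets the caller get back exactly the type that was passed in.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

class SelectIntoTargetSketch {

    // Hypothetical helper: adds the matching elements to the supplied target
    // and returns that same instance, preserving its static type.
    static <T, R extends Collection<T>> R select(
            Iterable<? extends T> source,
            Predicate<? super T> predicate,
            R target) {
        for (T each : source) {
            if (predicate.test(each)) {
                target.add(each);
            }
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList("London", "Liphook", "London");

        // A Set target removes the duplicate match.
        Set<String> asSet = select(cities, c -> c.equals("London"), new HashSet<String>());

        // Any java.util.Collection implementation works as a target.
        CopyOnWriteArrayList<String> asCopyOnWrite =
                select(cities, c -> c.equals("London"), new CopyOnWriteArrayList<String>());

        System.out.println(asSet.size() + " in the set, "
                + asCopyOnWrite.size() + " in the list");
    }
}
```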

The string “London” is passed as the second parameter to each call of the method defined on Predicate2. The first parameter will be the Customer from the list.

The method selectWith, like select, is defined on RichIterable. Therefore, everything I have previously demonstrated with select will work with selectWith. This includes support on all the different mutable and immutable interfaces, support for different covariant types, and support for lazy iteration. There is also a form of selectWith which takes a third parameter. Similar to select with two parameters, the third parameter in selectWith can take a target collection.

For instance, the following code filters from a List to a Set using selectWith:
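A JDK-only sketch of the same shape, under stated assumptions: java.util.function.BiPredicate stands in for the two-argument predicate, and the selectWith helper, the Customer class and its livesIn method are hypothetical. Because the extra argument is passed separately, the predicate can be a plain method reference that captures nothing.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

class SelectWithSketch {

    // Hypothetical stand-in for the kata's Customer class.
    static final class Customer {
        private final String name;
        private final String city;

        Customer(String name, String city) {
            this.name = name;
            this.city = city;
        }

        String getName() {
            return name;
        }

        boolean livesIn(String cityName) {
            return city.equals(cityName);
        }
    }

    // Hypothetical helper: each element is the first argument of the predicate,
    // the fixed parameter is the second, and matches go into the given target.
    static <T, P, R extends Collection<T>> R selectWith(
            Iterable<? extends T> source,
            BiPredicate<? super T, ? super P> predicate,
            P parameter,
            R target) {
        for (T each : source) {
            if (predicate.test(each, parameter)) {
                target.add(each);
            }
        }
        return target;
    }

    // Made-up sample data: two customers in London, one elsewhere.
    static List<Customer> sampleCustomers() {
        return Arrays.asList(
                new Customer("Fred", "London"),
                new Customer("Mary", "Liphook"),
                new Customer("Bill", "London"));
    }

    public static void main(String[] args) {
        // Customer::livesIn receives the customer as receiver and "London" as argument.
        Set<Customer> londoners = selectWith(
                sampleCustomers(), Customer::livesIn, "London", new HashSet<Customer>());
        System.out.println(londoners.size());
    }
}
```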

The last thing I will show is that the select and selectWith methods can be used with any collection that extends java.lang.Iterable. This includes all of the JDK types as well as any third party collection libraries. The first class that ever existed in GS Collections was a utility class named Iterate. Here is a code example that shows how to select from an Iterable using Iterate.

There are also variations that take target collections. All of the basic iteration protocols are available on Iterate. There is also a utility class that covers lazy iteration (named LazyIterate), and it also can work with any container that extends java.lang.Iterable. For example:
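Lazy filtering over an arbitrary java.lang.Iterable can also be sketched with the JDK alone, here through StreamSupport rather than the GS Collections utility classes. The lazySelect helper name is an assumption for illustration. The example uses java.nio.file.Path, which implements Iterable<Path> but is not a java.util.Collection.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LazyOverIterableSketch {

    // Hypothetical helper: wraps any Iterable in a sequential Stream and adds a
    // filter. Nothing is tested until a terminal operation runs on the result.
    static <T> Stream<T> lazySelect(Iterable<T> source, Predicate<? super T> predicate) {
        return StreamSupport.stream(source.spliterator(), false).filter(predicate);
    }

    public static void main(String[] args) {
        // Iterating a Path visits its name elements: usr, local, bin.
        Path path = Paths.get("usr", "local", "bin");

        List<String> threeLetterNames = lazySelect(path, p -> p.toString().length() == 3)
                .map(Path::toString)
                .collect(Collectors.toList());

        System.out.println(threeLetterNames);
    }
}
```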

SetAdapter can be used similarly for any implementation of java.util.Set.

Now if you have the kind of problem that can benefit from data level parallelism, you could use one of two approaches for parallelizing this problem. First I will demonstrate how to use the ParallelIterate class to solve this problem using an eager/parallel algorithm:

The ParallelIterate class will take any Iterable as a parameter, and always returns java.util.Collection as its result. ParallelIterate has been around in GS Collections since 2005. Eager/parallel has been the only form of parallelism GS Collections has supported until the 5.0 release, when we added a lazy/parallel API to RichIterable. We do not have an eager/parallel API on RichIterable, as we felt that lazy/parallel makes more sense as a default case. We may add an eager/parallel API directly to RichIterable in the future, depending on feedback we receive on the usefulness of the lazy/parallel API.
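The eager/parallel idea can be sketched with nothing but java.util.concurrent. This is not how ParallelIterate is implemented; the parallelSelect helper, its parameters and the batching strategy are assumptions chosen for illustration, and the sketch takes a List rather than any Iterable so that it can slice batches with subList. The call blocks until every batch is filtered, which is what makes it eager.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

class EagerParallelSelectSketch {

    // Hypothetical helper: filters fixed-size batches on the executor and joins
    // the partial results in source order before returning.
    static <T> Collection<T> parallelSelect(
            List<T> source,
            Predicate<? super T> predicate,
            ExecutorService executor,
            int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Callable<List<T>>> tasks = new ArrayList<>();
        for (int start = 0; start < source.size(); start += batchSize) {
            List<T> batch = source.subList(start, Math.min(start + batchSize, source.size()));
            tasks.add(() -> {
                List<T> matches = new ArrayList<>();
                for (T each : batch) {
                    if (predicate.test(each)) {
                        matches.add(each);
                    }
                }
                return matches;
            });
        }
        List<T> result = new ArrayList<>();
        try {
            // invokeAll blocks until all batches are done and keeps task order.
            for (Future<List<T>> future : executor.invokeAll(tasks)) {
                result.addAll(future.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        }
        return result;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
            System.out.println(parallelSelect(numbers, n -> n % 2 == 0, executor, 3));
        } finally {
            executor.shutdown();
        }
    }
}
```

The predicate must be safe to call from several threads at once, since batches run concurrently.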

If I wanted to solve the same problem using the lazy/parallel API, I would write the code as follows:

Today, the asParallel() method only exists on a few concrete containers in GS Collections. The API has not yet been promoted to any interfaces like MutableList, ListIterable or RichIterable. The asParallel() method takes two arguments – an ExecutorService and a batch size. In the future, we may add a version of asParallel() that calculates the batch size automatically.
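For comparison, the JDK's own lazy/parallel counterpart is a parallel stream. The sketch below uses only JDK calls and made-up data. Unlike the two-argument asParallel() described above, parallelStream() takes no ExecutorService or batch size: by default it runs on the common fork/join pool and decides how to split the work itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyParallelStreamSketch {

    static List<Integer> evens(List<Integer> numbers) {
        // Lazy: parallelStream() and filter() only describe the pipeline.
        Stream<Integer> pipeline = numbers.parallelStream().filter(n -> n % 2 == 0);

        // The terminal collect() triggers the parallel evaluation. For an
        // ordered source such as a List, toList() keeps encounter order.
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(evens(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)));
    }
}
```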

There is a hierarchy of ParallelIterable that includes ParallelListIterable, ParallelSetIterable and ParallelBagIterable.

I have demonstrated several different ways of filtering a collection in GS Collections using select() and selectWith(). I have shown you many combinations of eager, lazy, serial and parallel iteration, using different types from the GS Collections RichIterable hierarchy.

In part 2 of this article, to be published next month, I will cover examples including collect, groupBy and flatCollect, as well as some primitive containers and the rich API available on them. The examples I go through in part 2 will not go into quite as much detail or explore as many options, but it is worth noting that those details and options are likely available.

About the Author

Donald Raab manages the JVM Architecture team, which is part of the Enterprise Platforms group in the Technology division at Goldman Sachs. Raab served as a member of the JSR 335 Expert Group (Lambda Expressions for the Java Programming Language) and is one of the alternate representatives for Goldman Sachs on the JCP Executive Committee. He joined Goldman Sachs in 2001 as a technical architect on the PARA team. He was named technology fellow in 2007 and managing director in 2013.

This article reflects information available to the Technology Division of Goldman Sachs only and not any other part of Goldman Sachs. It should not be relied upon or considered investment advice. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. Goldman, Sachs & Co. (“GS”) does not warrant or guarantee the accuracy, completeness or efficacy of this article, and recipients should not rely on it except at their own risk. This article may not be forwarded or otherwise disclosed except with this disclaimer intact.