Thursday, 20 April 2017

Java SE 9 - JPMS module naming

The Java Platform Module System (JPMS) is soon to arrive, developed as Project Jigsaw. This article follows the introduction and looks at how modules should be named.

As with all "best practices", they are ultimately the opinion of the person writing them. I hope however to convince you that my opinion is right ;-). And as a community, we will certainly benefit if everyone follows the same rules, just like we benefited from everyone using reverse-DNS for package names.

Taking Joda-Time as the example, the module contains a set of packages (exported and hidden), all under one super-package.
The module name is the same as the super-package name, org.joda.time.
The author of the module is asserting control over all names below org.joda.time, and could create a module org.joda.time.i18n in the future if desired.

To understand why this approach makes sense, and the finer details, read on.

JPMS naming

Naming anything in software is hard. Unsurprisingly then, agreeing an approach to naming modules has also turned out to be hard.

The naming rules allow dots, but prohibit dashes, thus lots of name options are closed off.
As a side note, module names in the JVM are more flexible, but we are only considering names at the Java level here.
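
To see the dots-versus-dashes rule in action, here is a minimal sketch that leans on the JDK's own `java.lang.module.ModuleDescriptor` builder (Java 9 or later) to validate a name. The class `ModuleNameCheck` and the helper `isLegalModuleName` are made up for illustration; only the builder call is real API.

```java
import java.lang.module.ModuleDescriptor;

public class ModuleNameCheck {

    // Returns true if the JDK's ModuleDescriptor builder accepts the name.
    // The builder rejects illegal module names with IllegalArgumentException.
    static boolean isLegalModuleName(String name) {
        try {
            ModuleDescriptor.newModule(name).build();
            return true;
        } catch (IllegalArgumentException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"org.joda.time", "joda.time", "joda-time"};
        for (String name : candidates) {
            System.out.println(name + " -> " + isLegalModuleName(name));
        }
    }
}
```

The two dotted names should be accepted and the dashed one rejected, which is why a Maven-style name such as joda-time cannot be used unchanged.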

These are the two basic approaches which I think make sense:

1) Project-style. Short names, as commonly seen in the jar filename from Maven Central.

2) Reverse DNS. Full names, exactly as we've used for package names since Java v1.0.

Here are some examples to make it more clear:

Project          Project-style    Reverse-DNS
Joda-Time        joda.time        org.joda.time
Commons-IO       commons.io       org.apache.commons.io
Strata-Basics    strata.basics    com.opengamma.strata.basics
JUnit            junit            org.junit

All things being equal, we'd choose the shorter name - project-style.
It is certainly more attractive when reading a module-info.java file.
But there are some clear reasons why reverse-DNS must be chosen.

It is worth noting that Mark Reinhold currently indicates a preference for project-style names. However, the linked mail doesn't really deal with the global uniqueness or clashing elements of the naming problem, and others in the expert group disagreed with project-style names.

Ownership and Uniqueness

The original designers of Java made a very shrewd choice to propose reverse-DNS names for packages. This approach has scaled very well, through the incredible rise of open source software. It provides two key properties - Ownership and Uniqueness.

The ownership aspect of reverse-DNS delegates control of part of the global DNS namespace to an individual or company. It is a universally agreed approach with enough breadth of identifiers to make clashes rare. Within that namespace, developers are then responsible for ensuring uniqueness. Together, these two aspects result in globally unique package names. As such, it is pretty rare that code has two colliding packages, despite modern applications pulling in hundreds of dependent jar files. For example, the Spark framework and Apache Spark co-exist despite having the same simple name. But look what happens if we only use project-style names:

Project            Project-style    Reverse-DNS
Spark framework    spark.core       com.sparkjava.core
Apache-Spark       spark.core       org.apache.spark.core

As can be seen, the project-style names clash!
JPMS will simply refuse to start a modulepath where two modules have the same name, even if they contain different packages.
(Since these projects haven't chosen module names yet, I've tweaked the example to make them clash. But this example is far from impossible, which is the point here!)
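
The clash is easy to check for mechanically. The following is only a sketch of the kind of check a build tool or a developer could run over the module names on a modulepath; it is not how JPMS itself detects the problem. The class `ModuleNameClash` and the method `duplicateNames` are hypothetical, and the names fed in are the tweaked Spark examples from above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ModuleNameClash {

    // Returns every module name that appears more than once in the list.
    static Set<String> duplicateNames(List<String> moduleNames) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String name : moduleNames) {
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Spark framework and Apache-Spark, named project-style
        List<String> projectStyle = List.of("spark.core", "spark.core");
        // The same two projects, named reverse-DNS
        List<String> reverseDns = List.of("com.sparkjava.core", "org.apache.spark.core");

        System.out.println("project-style clashes: " + duplicateNames(projectStyle));
        System.out.println("reverse-DNS clashes:   " + duplicateNames(reverseDns));
    }
}
```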

Not convinced?
Well imagine what would happen if package names were not reverse-DNS.
If your application pulls in hundreds of dependencies, do you think there would be no duplicates?

Of course we have project-style names today in Maven - the jar filename is the artifactId which is a project-style name.
Given this, why don't we have problems today?
Well it turns out that Maven is smart enough to rename the artifact if there is going to be a clash.
The JPMS does not offer this ability - your only choice with a clash will be to rewrite the module-info.class file of the problematic module and all other modules that refer to it.

As a final example of how project-style name clashes can occur, consider a startup creating a new project - "willow".
Since they are small, they choose a module name of "willow".
Over the next year, the startup becomes fantastically successful, growing at an exponential rate, meaning that there are now 100s of modules within the company depending on "willow".
But then a new Open Source project starts up, and calls itself "willow".
Now, the company can't use the open source project.
Nor can the company release "willow" as open source.
These clashes are avoided if reverse-DNS names are used.

To summarize this section, we need reverse-DNS because module names need to be globally unique, even when writing modules that are destined to remain private. The ownership aspect of reverse-DNS provides enough namespace separation for companies to get the uniqueness necessary.
After all, you wouldn't want to confuse Joda-Time with the freight company also called Joda, would you?

Modules as package aggregates

The JPMS design is fundamentally simple - it extends JVM access control to add a new concept "modules" that groups together a set of packages. Given this, there is a very strong link between the concept of a module and the concept of a package.

The key restriction is that a package must be found in one and only one module.

Given that a module is formed from one or more packages, what is the conceptually simplest name that you can choose? I argue that it is one of the package names that forms the module. And thus a name you've already chosen.
Now consider a project with three packages: org.joda.time and two sub-packages beneath it. Which of the three should be the module name?

Again, I'd argue there isn't really a debate. There is a clear super-package, and that is what should be used as the module name - org.joda.time in this case.
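
Finding the super-package can be done by hand, but it is also simple to compute. Here is a minimal sketch that takes the longest dot-separated prefix shared by all the packages in a jar. The class `SuperPackage` and its method are hypothetical, and the sub-package names `chrono` and `format` are assumed purely for the example.

```java
import java.util.Arrays;
import java.util.List;

public class SuperPackage {

    // Returns the longest dot-separated prefix shared by every package name,
    // or an empty string when the packages have no common prefix.
    static String superPackage(List<String> packages) {
        if (packages.isEmpty()) {
            return "";
        }
        String[] common = packages.get(0).split("\\.");
        int length = common.length;
        for (String pkg : packages) {
            String[] parts = pkg.split("\\.");
            length = Math.min(length, parts.length);
            for (int i = 0; i < length; i++) {
                if (!common[i].equals(parts[i])) {
                    length = i;  // prefix ends at the first differing segment
                    break;
                }
            }
        }
        return String.join(".", Arrays.copyOf(common, length));
    }

    public static void main(String[] args) {
        // Sub-package names are assumptions for illustration
        List<String> packages = List.of(
                "org.joda.time",
                "org.joda.time.chrono",
                "org.joda.time.format");
        System.out.println(superPackage(packages));
    }
}
```

A result that looks too short to be a sensible module name is a hint that the jar does not have a clear super-package.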

Hidden packages

With JPMS, a module can hide packages. When hidden, the internal packages are not visible in Javadoc, nor are they visible in the module-info.java file. This means that consumers of the module have no immediate way of knowing what hidden packages a module has.

Now consider again the key restriction that a package must be found in one and only one module.
This restriction applies to hidden packages as well as exported ones.
Therefore if your application depends on two modules and both have the same hidden package, your application cannot be run as the packages clash.
And since information on hidden packages is difficult to obtain, this clash will be surprising.
(There are some advanced ways to get around these clashes using layers, but these are designed for containers, not applications.)
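
To make the hidden-package clash concrete, here is a sketch that builds two module descriptors with the JDK's `ModuleDescriptor` builder and reports any package that appears in both. The module and package names are invented for the example, as are the class `SplitPackageCheck` and its method. The real check is performed by the module system itself when it resolves the modules; this only mimics the rule.

```java
import java.lang.module.ModuleDescriptor;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SplitPackageCheck {

    // Maps each package found in more than one module to the modules containing it.
    static Map<String, Set<String>> splitPackages(List<ModuleDescriptor> modules) {
        Map<String, Set<String>> modulesByPackage = new TreeMap<>();
        for (ModuleDescriptor module : modules) {
            // packages() covers exported and hidden packages alike
            for (String pkg : module.packages()) {
                modulesByPackage.computeIfAbsent(pkg, k -> new TreeSet<>()).add(module.name());
            }
        }
        modulesByPackage.values().removeIf(owners -> owners.size() < 2);
        return modulesByPackage;
    }

    public static void main(String[] args) {
        // Two hypothetical modules that both hide a package called util.internal
        ModuleDescriptor alpha = ModuleDescriptor.newModule("com.example.alpha")
                .exports("com.example.alpha")
                .packages(Set.of("util.internal"))
                .build();
        ModuleDescriptor beta = ModuleDescriptor.newModule("org.example.beta")
                .exports("org.example.beta")
                .packages(Set.of("util.internal"))
                .build();
        System.out.println(splitPackages(List.of(alpha, beta)));
    }
}
```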

The best solution to this problem is exactly as described in the last section.
Consider a project with three exported packages and two hidden ones.
So long as the hidden packages are sub-packages of the module name, we should be fine.

By using the super-package name as the module name, the module developer has taken ownership of that package and everything below it.
So long as all the non-exported packages are conceptually sub-packages, the end-user application should not see any hidden package clashes.
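
A module author can verify this property before publishing. The sketch below describes a module with three exported and two hidden packages using the JDK's `ModuleDescriptor` builder, then lists any package that falls outside the namespace implied by the module name. The sub-package names are assumptions for illustration, and `NamespaceCheck` is a hypothetical helper, not part of any tool.

```java
import java.lang.module.ModuleDescriptor;
import java.util.Set;
import java.util.TreeSet;

public class NamespaceCheck {

    // Returns the packages (exported or hidden) that are neither the module
    // name itself nor a sub-package of it.
    static Set<String> packagesOutsideNamespace(ModuleDescriptor module) {
        String name = module.name();
        Set<String> outside = new TreeSet<>();
        for (String pkg : module.packages()) {
            if (!pkg.equals(name) && !pkg.startsWith(name + ".")) {
                outside.add(pkg);
            }
        }
        return outside;
    }

    public static void main(String[] args) {
        // Roughly equivalent to a module-info.java that exports three packages
        // and leaves two more unexported. Sub-package names are assumed.
        ModuleDescriptor jodaTime = ModuleDescriptor.newModule("org.joda.time")
                .exports("org.joda.time")
                .exports("org.joda.time.chrono")
                .exports("org.joda.time.format")
                .packages(Set.of("org.joda.time.base", "org.joda.time.tz")) // hidden
                .build();
        System.out.println("outside namespace: " + packagesOutsideNamespace(jodaTime));
    }
}
```

An empty result means every package, hidden or not, sits under the module name, so it cannot collide with a package from a module that follows the same rule under a different name.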

Automatic modules

JPMS includes a feature whereby a regular jar file, without a module-info.class file, turns into a special kind of module just by placing it on the modulepath.
The automatic module feature is controversial in general, but a key part of this is that the name of the module is derived from the filename of the jar file.
In addition, it means that people writing module-info.java files have to guess the name that someone else will use for a module. Having to guess a name, and having the Java platform pick a name based on the filename of a jar file are both bad ideas in my opinion, and that of many others, but our efforts to stop them seem to have failed.
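
To see which name the platform picks, you can ask it directly. This sketch writes a jar containing nothing but a manifest into a temporary directory and uses `java.lang.module.ModuleFinder` (Java 9 or later) to report the derived module name. The class `AutomaticModuleName`, the method `derivedName` and the jar file name with its version number are assumptions for the example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.ModuleFinder;
import java.lang.module.ModuleReference;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class AutomaticModuleName {

    // Creates a manifest-only jar with the given file name and returns the
    // module name that ModuleFinder derives for it.
    static String derivedName(String jarFileName) {
        try {
            Path dir = Files.createTempDirectory("automatic-module");
            Path jar = dir.resolve(jarFileName);
            Manifest manifest = new Manifest();
            manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                // No classes are needed: there is no module-info.class, so the
                // jar is treated as an automatic module.
            }
            try {
                ModuleReference ref = ModuleFinder.of(jar).findAll().iterator().next();
                return ref.descriptor().name();
            } finally {
                Files.deleteIfExists(jar);
                Files.deleteIfExists(dir);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name in the usual Maven artifactId-version form
        System.out.println(derivedName("joda-time-2.9.9.jar"));
    }
}
```

For a file named joda-time-2.9.9.jar this should report joda.time, a project-style name that anyone depending on the jar then has to write into their own module-info.java.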

The naming approach outlined in this article provides a means to mitigate the worst effects of this.
If everyone uses reverse-DNS based on the super-package, then the guesses that people make should be reasonably accurate, as the selection process of a name should be fairly straightforward.

What if there isn't a clear super-package?

There are two cases to consider.

The first case is where there really is a super-package, it's just that it has no code. In this case, the implied super-package should be used. (Google Guava is an example: its packages sit under com.google.common, which doesn't have guava in the name!)

Can you have sub-modules?

Yes. When a module name is chosen, the developer is taking control of a namespace. That namespace consists of the module name and all sub-names below it - sub-package names and sub-module names.

Ownership of that namespace allows the developer to release one module or many. The main constraint is that there should not be two published modules containing the same package.

As a side effect of this, the practice of larger projects releasing an "all" jar will need to stop. An "all" jar is used when the project has lots of separate jar files, but also wants to allow end-users to depend on a single jar file. These "all" jar files are a pain in Maven dependency trees, but will be a disaster in JPMS ones, as there is no way to override the metadata, unlike in Maven.

What if my existing project does not meet these guidelines?

The harsh suggestion is to change the project in an incompatible manner so it does meet the guidelines.
JPMS in Java SE 9 is disruptive. It does not take the approach of providing all the tools necessary to meet all the edge cases in current deployments. As such, it is not surprising that some jar files and some projects will require some major rework.

Why ignore the Maven artifactId?

JPMS is an extension to the Java platform (language and runtime). Maven is a build system. Both are necessary, but they have different purposes, needs and conventions.

JPMS is all about packages, grouping them together to form modules and linking those.
In this way, developers are working with source code, just like any other source code.
What artifacts the source code is packed up into is a separate question.
Understanding the separation is hard, because currently there is a one-to-one mapping between the module and the jar file. However, we should not assume this will always be the case in the future.

Another example of this separation is versioning. JPMS has little to no support for versions, yet build systems like Maven do. When running the application, Maven is responsible for collecting a coherent set of artifacts (jar files) to run the application, just as before. It's just that some of those might be modules.

Finally, the Maven artifactId does not exist in isolation. Maven makes unique identifiers by combining the groupId, artifactId and classifier. Only the combination is sufficiently globally unique to be useful. Picking out just the artifactId and trying to make a unique module name from it is asking for trouble.

Summary

JPMS module names, and the module-info.java in general, are going to require real thought to get right. The module declaration will be as much a part of your API as your method signatures.

The importance is heightened because, unlike Maven and other module systems, JPMS has no way to fix broken metadata.
If you rely on some modular jar files, and get a clash or find some other mistake in the module declarations, your only options will be to not use JPMS or to rewrite the module declarations yourself.
Given this difficulty, it is not yet clear that JPMS will be a success, thus your best option may be to not modularize your code.

See the TL;DR section above for the summary of the module name proposal.
Feedback and questions welcome.

11 comments:

Hi, it appears to be a good strategy in general, but it won't work in each and every case, i.e. it's not something which can be applied in an automated fashion. E.g. take Guava which you mentioned. It actually contains these packages:

I.e. the "common super-package" would be "com.google" which doesn't seem like a good module name as it's too generic.

But even when pretending everything in Guava's JAR was under com.google.common, there'd be an issue when also using Guava's test lib. This also has com.google.common as its common super-package. Now you'd have two modules with the same name.

Or consider the case where someone publishes a patched version of a module, e.g. to fix a bug in a library not under their direct control. Wouldn't it make sense to keep the original package names but use another module name, allowing to tell the two apart?

Doing large-scale package renamings in order to arrive at unique super-package names does not seem appropriate to me.

An important thing you mention is taking ownership of a namespace. I think that's key, but not by claiming packages but by claiming group ids. Together with the artifact id (and perhaps classifier if needed) it seems one can arrive at reasonably unique names.

I've had good experiences combining the group and artifact id, omitting any redundant middle part. E.g. in the case of com.google.guava:guava and com.google.guava:guava-testlib one would arrive at com.google.guava and com.google.guava.testlib, respectively. And the good thing is that Maven Central already prevents conflicts by assigning group ids to individual parties. E.g. I as a non-Guava contributor couldn't publish a module with an automatically derived module name of com.google.guava as I'm not allowed to publish to this group id (unlike any approach based on package names).

Agh, the indentation of the package tree was removed during posting. There are "com.google.common.annotations", "com.google.common.base", "com.google.common. ..." and "com.google.thirdparty.publicsuffix".

Guava is indeed weird, but the thirdparty package is intended to be private if you read the docs. That's why I suggested that case should be sharded. That said, in theory Google takes ownership of `com.google` in general, and can do anything it likes within that namespace providing it is consistent and sensible.

Using groupId and artifactId to generate a module name is appealing, yet IMO a bad idea. A lot of the time it will result in the super-package name, so appear to work. But as I've tried to emphasise, artifacts and modules are different things. And because JPMS modules are built of packages, it is the package name that must be used. Let me try to clarify the distinction.

JSRs are released as specifications, and it is often the case that different teams produce their own jar file artifact containing the specification. So, for example there might be an `apache-jsr-340.jar` and a `jboss-jsr-340.jar`. But both of these jar file artifacts contain the same package and the same code. If you get a conflict today with Maven you can exclude one of the artifacts (although in most cases it won't matter, as both jar files are functionally identical, the first on the classpath is fine). However, with JPMS modules, that conflict cannot be resolved.

But there shouldn't be a conflict in the first place. The correct module name is the super-package of the JSR code, `javax.servlet`, not something referring to apache or jboss. ie. there are two different artifacts supplying one conceptual module (both artifacts have equivalent functionality). Taking this approach, other modules that depend on the jar will use the module name `javax.servlet` in their module-info.java files.

When assembling this using Maven, the pom.xml will still use the groupId and artifactId to reference one of the two artifacts. But there is not a 1:1 mapping from artifact to module.

This also answers your question about changing the module name when patching it to fix a bug. Again, this is two different artifacts but one module name. Tooling like Maven is going to have to resolve this many-to-one problem.

In your example, you're describing the difference between the interface vs the implementation. This of course makes sense, but how many of the legacy automodules does this really apply to? I surmise it's really an edge case.

Since Mark has removed the Module Name mechanism for authors to provide a sensible default, we're back to the original problems.

Ignoring groupIds in module naming ignores relevant details about the providers of the module and their licensing regimes. It is not enough to just depend upon 'javax.servlet' since there won't be only one provider of that package with universally acceptable licensing. This is why you often see these API jars from Apache since people re-author the interfaces under Apache license to avoid the license on the originally published API jar. This same problem won't go away with modules. This is exactly why depending upon module names is a bad idea. Depending upon package names is what really matters. (Full disclosure, I am the OSGi CTO.)

See the next blog in the series, modules != artifacts - http://blog.joda.org/2017/04/java-se-9-jpms-modules-are-not-artifacts.html . As for package names, I tend to agree that it isn't that clear what is obtained by depending on a module name rather than depending on a package name. In theory, it is an extra abstraction level, but I'm struggling to see what I'd do with it.