Sonatype Blog

Download it All at Once: A Maven Idea

Consider, for a moment, the big corporate project you work with every day. I know. It's huge. I see several of these projects on a constant basis. Maybe you have one big project with multiple modules. Maybe you have a more mature approach that splits a very large project into several multi-module projects. Whatever it is, there's a chance you also work in the kind of environment that has a huge build with hundreds of dependencies spanning tens of thousands of lines of code. Your build spends most of the day juggling dependencies, both internal and external.

...and the build takes forever the first time you run it. Correction: the build takes forever every time you run it, because it is just that big, and because you have the sort of environment that demands you always check for snapshot updates. Welcome to the reality of using Maven on a very large-scale project.

What happens when a new programmer rolls up and runs the build for the first time? What happens every night when you clear a CI build's local repository? Maven downloads the Internet. That's what happens. No really, Maven downloads megabytes of dependencies from a repository, and it does so one by one, pulling dependency metadata, POMs, and binaries down from a repository. It makes hundreds (if not thousands) of requests back to a repository.

Wasting Time

I'd like to put forward the idea that this is a problem with the design of the tool. Maybe it wasn't a problem years ago, but it is today. I'm both busy and impatient, and I can see a better way. If the repository already has the artifacts, metadata, and dependency information, why not save some time? Maven (or Gradle or Ant or whatever your poison) should make a single request, passing in a collection of GAV coordinates, and the repository should calculate all of the dependency information and send back a compressed archive of everything the build would need.

This. I want this:

Making Maven more like Git

Let's look to Git as inspiration for how Maven should work going forward. Git makes a single request to a server; the server bundles up everything Git needs, compresses it, and then sends down an entire repository. Quick: no waiting for XML parsing on the client side, no pages and pages of "Downloading XYZ..." on your screen, just a simple status that updates you on compression progress and a single network interaction. Git does it right, Maven does it wrong. Let's fix it.

I shopped this idea around to a few people. Some of the responses have been positive, but a few people have had strong negative reactions. ("This is unnecessary.") I really only care about this for selfish reasons. I have a large project, and I don't want to have to sit next to another new developer and apologize for Maven. In fact, on this same project we check out a 300 MB Git repository and it is as fast as could be. Then I have to show them how to run the Maven build... we start it, and then we go for a long coffee break during which I apologize for Maven:

"Maven isn't like Git, it takes time to make those requests. Maven has to download POM files, artifacts, metadata. Then it has to create a bunch of dependency graphs and sort out conflicts. Once those conflicts have been solved it then has to go get all the artifacts. One by one. I apologize, listen, the coffee's on me, ok?"

The Difference: Latency is annoying.

I did some quick metrics with a real project. Actually, it's more like a megaproject. This project has an insane number of internal dependencies. Running the build with a prepopulated local repo takes approximately 3 minutes. Running the build without a prepopulated local repo (against a populated Nexus repo) takes approximately 11 minutes, and it downloads about 120 MB of dependencies. That 8 minute difference is the download time (over a 100 Mbps connection) plus the latency required to set up HTTP connections, parse XML, and calculate dependencies.

Think about that: it took Maven 8 minutes to calculate dependencies for all projects (and all Maven plugins) and then download 120 MB. All the files, all the POMs, all the metadata were already present in Nexus. Let's use a 120 MB download over a 10 MBps link to establish our baseline: if all we were doing was throwing this file down to the client, it should take ~10 seconds. I'm going to be generous and assume that it would take Nexus 20 seconds to calculate all of the dependency information on the server side (assuming that it didn't have to retrieve anything from a proxy repo). Let's then assume that compressing the stream of files would add another 5 seconds of overhead.
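To make that estimate explicit, here is the arithmetic as a few lines of Python. Every number is a guess from the paragraph above, not a measurement, and note the raw transfer works out to 12 seconds, which gets rounded down to ~10 in the prose:

```python
# Back-of-the-envelope model for the proposed single-archive download.
# All figures are the post's estimates, not measurements.
payload_mb = 120            # total size of all dependencies
link_mb_per_s = 10          # assumed effective link speed (MB/s)
server_resolution_s = 20    # guessed server-side dependency calculation
compression_s = 5           # guessed overhead to compress the file stream

transfer_s = payload_mb / link_mb_per_s
total_s = server_resolution_s + compression_s + transfer_s

print(f"transfer ~{transfer_s:.0f}s, total ~{total_s:.0f}s")
# transfer ~12s, total ~37s -- roughly the ~35 second figure below
```

Set against the measured 8 minutes, even generous server-side overhead leaves the single-request model faster by an order of magnitude.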

This is where I'd like to go. Shift the calculation of all dependencies to the server side, send down an archive, and get rid of this back and forth between Maven and the repository. I think we can take this 8 minute Maven dependency mess and turn it into a 35 second process that involves 25 seconds of waiting and 10 seconds of transfer.

This should be easy, who's up for the challenge?

I think we can do this without having to muck with Maven. I'm not a big fan of the Maven codebase (it's a monster), and I'd rather not break into the tool itself. I think we could just write a simple CLI wrapper that interacts with a Nexus REST service. All this wrapper would have to do is parse a pom.xml, gather a collection of dependencies and Maven plugins, and then send a request to Nexus. Nexus would take this collection of GAV coordinates, along with the name of a single repository group to use for artifact resolution, calculate the full list of transitive dependencies, and then stream them back to the CLI wrapper as a compressed collection of files.

At this point, the CLI wrapper just unpacks everything and drops it in ~/.m2/repository. Maven would then run as normal, not even aware of the antics of what I'm calling "Maven Rides the Lightning".
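A minimal sketch of that wrapper, in Python for brevity. Everything here is hypothetical: the endpoint URL, the JSON payload shape, and the tar.gz response format are invented for illustration, since no such Nexus service exists today.

```python
import io
import json
import tarfile
import urllib.request
from pathlib import Path

# Hypothetical bulk-resolution endpoint -- Nexus ships no such service yet.
NEXUS_RESOLVE_URL = "http://nexus.example.com/service/local/bulk-resolve"

def fetch_dependency_archive(gavs, repo_group):
    """POST a list of GAV coordinates plus a repository group name; the
    server resolves all transitive dependencies and streams back a single
    compressed archive."""
    body = json.dumps({"repositoryGroup": repo_group,
                       "coordinates": gavs}).encode()
    request = urllib.request.Request(
        NEXUS_RESOLVE_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read()  # assumed: a tar.gz of repository-layout paths

def unpack_into_local_repo(archive_bytes,
                           repo=Path.home() / ".m2" / "repository"):
    """Drop the archive's contents into the local repository; Maven then
    runs as normal, unaware of the shortcut."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        tar.extractall(repo)
```

The wrapper would call `fetch_dependency_archive` with the GAVs parsed out of the pom.xml, hand the result to `unpack_into_local_repo`, and then exec the real `mvn` command.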

Note, this is not my idea. Nambi Sankaran brought this idea up during a training class several weeks ago, and I very much want to see it happen. Nambi has created a GitHub repository for this project, and if you are interested it would be a good opportunity to get some like-minded people together and start thinking about how this could be done. Don't get too excited, there's nothing in that repo yet, but I know we have all the pieces available to make this happen. (Plus, I would really like to see some other people create a good open source Nexus plugin.)

Very smart, and probably the way to go forward.
But before doing that, could Maven stop being part of the problem? I mean, with a build containing only a few plugins, each of those plugins requires a different version of the Maven libraries (core, plugins, utils...), so to get started on that project, it needs to download Maven 2.0, 2.0.2, 2.0.6, 2.0.8, 2.0.9, 2.2.1, 3.0, 3.0.3!!! And I'm using Maven 3.0.4.

Manfred Moser

Great idea. I like it! This could be a warm-up tool for any local repository, including Maven but also Gradle, Grails and others...

lolek

After each downloaded POM, Maven might discover new dependencies - that is why it cannot download all the files at once.

Anon

Great idea, Nambi

Manfred Moser

Nexus could build the whole dependency graph on the server. That is the whole point ...

wytten

If Maven worked that way, I wouldn't have the time to read blog posts like this one ;-)

Caspar MacRae

What would be even better is if every single deployed artifact conformed to semantic versioning and all dependencies were stated with version ranges. Local repos would be smaller, networks less hammered, etc... (In an ideal world)

Baptiste MATHUS

Apart from the fact that it doesn't have anything to do with the current subject (well, actually, maybe it does): I personally don't want to use any version range at all.
I actually think using version ranges is bad practice.

What Maven provides you (given you do some things correctly) is build reproducibility. Version ranges jeopardize just that.
One day my build works. The next day, even though I changed nothing, my build could start failing because it decided to use a different dependency version than the one used previously.

Build instability is what can happen with snapshots; since snapshots are changing artifacts by definition, that's normal. Version ranges transform RELEASE dependencies into a kind of snapshot, and that's really not what you want.

rupert

we build continuously, put that built version somewhere and test it. if we are happy we release it, ideally without rebuilding. in an emergency we need to rebuild with a slight change. as it is a big project, others do as we do. we want to depend on their newest stuff, which they build continuously. we somehow want both: to get the newest version, and, in case of emergency, to start off from the exact versions that were used when going into production.
how would we do that with snapshots?

Frédéric Camblor

By going this way, won't you enforce the dependency resolution mechanism on the server side, even though it could differ between build tools (Maven / Ivy / Gradle / whatever), especially when it comes to conflict resolution and grabbing transitive dependencies?

We cannot assume that sending *only* a list of GAVs is enough because, for instance in Maven, we have dependency exclusions which should be taken into account when calculating the dependency graph, shouldn't they?

Xavier Hanin

The idea is interesting, and I would go even a bit further: the client computes the effective POM and sends it to the server (maybe in a compacted format, since the original POM format is very verbose); the server can then calculate the list of all dependencies and send it back to the client as a bill of materials (BOM) of everything needed for that effective POM. The client can then check what is already in its local repo and what is not, and send back a filtered BOM with only what it needs. Then the server can send everything in one package.

That way it could be used efficiently even to update your dependencies.

Baptiste MATHUS

As I just said on Twitter, I think that would run into the first issue again: having to download, for example, the parent POM just to resolve the effective one would defeat the purpose.

But your idea of informing the server of what you already have is still very good and goes exactly in the right direction, as Tim says: git-principle based. That's what Git is doing: only downloading the difference.
So generating a compressed list of everything present in the local repository (I think this would be only a few kB, even if you have a lot of things) and optionally sending it to the MRM would further improve download performance.

Not a bad idea, but now, to have a reproducible build, you depend on both the Maven version and the remote server version.

Baptiste MATHUS

Maybe not, if it's just used as a bootstrap mode, falling back to the classic mode afterwards.
Say that special mode is tried first with the right switch (or even in a separate tool to begin with), and any failure is simply ignored.

Thor Åge Eldby

I would recommend a second simple test before you venture into the unknown:

Take all the URLs from the output of a clean-repo build. Write a program to download all those URLs in sequence and measure the time. That is the optimal time with today's download strategy. Everything between this and Maven's time is Maven's dependency resolution and what-not.
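That baseline can be scripted in a few lines. This is only a sketch: the regex assumes Maven logs lines like "Downloading: <url>", so adjust the pattern to whatever your build actually prints.

```python
import re
import sys
import time
import urllib.request

# Assumed Maven log format: lines such as "Downloading: http://repo/...".
URL_PATTERN = re.compile(r"Downloading:?\s+(https?://\S+)")

def urls_from_log(log_text):
    """Extract every downloaded URL from a saved clean-repo build log."""
    return URL_PATTERN.findall(log_text)

def timed_sequential_download(urls):
    """Fetch each URL one after another, as Maven does, and time it."""
    start = time.monotonic()
    total_bytes = 0
    for url in urls:
        with urllib.request.urlopen(url) as response:
            total_bytes += len(response.read())
    return time.monotonic() - start, total_bytes

if __name__ == "__main__":
    urls = urls_from_log(sys.stdin.read())
    elapsed, nbytes = timed_sequential_download(urls)
    print(f"{len(urls)} files, {nbytes} bytes, {elapsed:.1f}s")
```

Pipe a saved build log into it; the gap between its total and the full Maven run is the cost of Maven's own resolution and chatter.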

Caspar MacRae

Hi Baptiste,

The topic is a proposed improvement, speeding up Maven downloads; my comment is simply that this can also be achieved by downloading less (so I believe it's on topic: same premise, different conclusion).

It's completely possible to have a reproducible build with version ranges: you simply lock them down at release time (see the versions-maven-plugin's resolve-ranges goal).
I hear this same tired, false argument (build reproducibility) from detractors of version ranges all the time; it seems none of them have heard of semantic versioning.

Version ranges do not transform dependencies into SNAPSHOTs; they allow you to state compatibility in terms of versions carrying semantic meaning. Why would you not want to (normally) build against the latest released version of a compatible module?
Semantically enforced version ranges make modularity that much more powerful (look at OSGi's runtime usage of version ranges).

We only release what's changed. Without version ranges, the standard practice seems to be to bump all modules in a project to the same arbitrary version number and roll out the entire project (and this then recurses into dependencies). This static everything-at-the-same-version approach is not in the spirit of Maven and certainly not a modular one.

In summary, version ranges are not bad practice; they are used throughout OSGi. Applying version ranges without any semantic notion of compatibility is just plain wrong (about as wrong as incrementing a project's version, and so all its modules' versions, for a single backward-compatible, non-API-changing bug fix).

Please note the original post ended in "In an ideal world" =)
Best regards, Caspar

Philippe Marschall

Great idea. One thing that I'd like to see is compressing all JARs using pack200. We did a test and some of our largest JARs are a tenth of the size when compressed using pack200.

Marcel Ammerlaan

I saw a similar solution whereby a run was done, the local repo was zipped and stored in Nexus. After that, each build would first fetch this file from Nexus, unzip it, and then start the build. Saved a lot of time for everyone, especially as we are using Hudson and each build has a private repository (since sharing a Maven repository among multiple builds does not work).