So I am working on a feature that I hope to get merged into cargo,
Rust's package manager. The feature would allow a developer to specify a
license for a new project (or for all new projects), and would automatically
record that license in the project's Cargo.toml, as well as add the LICENSE
file to the project.

So I got to the point where the feature was working, but I had to figure out
how many, and which licenses, to support in the tool. My intuition was to
include: MIT, BSD (2- and 3-clause), Apache-2.0, and GPL, both -2.0 and -3.0.

However, we are all about data these days, right? So forget my intuition, let's
see what actual Rustaceans are using!

Process

So my first step was to collect some data from crates.io, the central repository
for Rust crates. You can easily get a list of all the crates on the site by
cloning the index repository that the Cargo team hosts on GitHub:

$ git clone https://github.com/rust-lang/crates.io-index
$ cd crates.io-index

Now, let's query the crates.io API for information about these crates.
I ended up saving the information to a file, though you don't necessarily have to
do that. It helped with iterating on the data, as I didn't have to repeatedly hit
crates.io's servers for the info (it saved them bandwidth, and me time, since
crates.io will cut you off if you make too many requests in too short a time).
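The fetch-and-save step could look roughly like the sketch below. It is not the script I actually ran, and several things in it are assumptions: the endpoint https://crates.io/api/v1/crates/<name>, the shape of the JSON it returns (a "versions" list whose first entry is the newest and carries a "license" field), the User-Agent text, the one-second delay, and the helper names crate_names, license_of, fetch and write_csv.

```python
import csv
import json
import os
import time
import urllib.request

# Assumed endpoint; check the crates.io API documentation before relying on it.
API_URL = "https://crates.io/api/v1/crates/{name}"


def crate_names(index_dir):
    """Yield crate names from a local clone of the index.

    Each crate has one file, named after the crate, inside a subdirectory.
    """
    for root, dirs, files in os.walk(index_dir):
        # Skip hidden directories such as .git.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        # Files at the top level (e.g. config.json) are not crates.
        if os.path.abspath(root) == os.path.abspath(index_dir):
            continue
        for fname in files:
            if not fname.startswith("."):
                yield fname


def license_of(payload):
    """Return the license string of the first listed version, or ''.

    Assumes the response has a 'versions' list with a 'license' field.
    """
    versions = payload.get("versions") or []
    if not versions:
        return ""
    return versions[0].get("license") or ""


def fetch(name):
    """Fetch the JSON description of one crate (makes a network request)."""
    request = urllib.request.Request(
        API_URL.format(name=name),
        headers={"User-Agent": "license-survey (hypothetical example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def write_csv(index_dir, out_path="license.csv", delay=1.0):
    """Write one 'crate name, license string' row per crate."""
    with open(out_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile, dialect="excel")
        for name in crate_names(index_dir):
            writer.writerow([name, license_of(fetch(name))])
            time.sleep(delay)  # be polite: space the requests out
```

Calling write_csv("crates.io-index") from the directory that holds the clone would then produce license.csv. Nothing is fetched until that call is made, so the helper functions can be tried out offline first.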

OK, so now we have a nice .csv file with the name of each crate and the license string it
uses. Now, let's read that information back in, and count licenses:

import csv
from collections import Counter

license_counter = Counter()
with open("license.csv") as csvfile:
    data = csv.reader(csvfile, dialect='excel')
    for crate_name, row in data:
        # some projects multi-license, and they almost always use a '/' to join
        # the license names
        licenses = row.split("/")
        for license in licenses:
            # we just want the general class of the license,
            # so the trailing '+' characters are unnecessary
            cleaned = license.strip().rstrip("+")
            if cleaned:
                license_counter.update([cleaned])

for x, n in license_counter.most_common():
    print("{x:30}{n}".format(x=x, n=n))
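To sanity-check the splitting rule before running it over the whole file, the normalization step above can be pulled out into a small function and tried on a few license strings. This is only a sketch: normalize_licenses is a hypothetical helper name, and the sample strings are made up for illustration rather than taken from the crates.io data.

```python
from collections import Counter


def normalize_licenses(license_field):
    """Split a '/'-joined license string and drop trailing '+' markers."""
    cleaned = (part.strip().rstrip("+") for part in license_field.split("/"))
    return [name for name in cleaned if name]


# Made-up license strings, for illustration only.
samples = ["MIT/Apache-2.0", "GPL-3.0+", "MIT", ""]

counts = Counter()
for field in samples:
    counts.update(normalize_licenses(field))
```

Note that a dual-licensed crate counts once toward each license it names, so the totals can add up to more than the number of crates.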

Results

So what were the results? Well, my intuition was about half correct. The top 2
most-used licenses were the MIT license and Apache-2.0. After that, the number
of projects using a particular license drops off considerably, with
BSD-3-Clause coming in 3rd. The Mozilla Public License came in 4th. I did not
have the MPL on my list, which was obviously foolish, considering Rust is a
Mozilla project. "non-standard" came in 5th, but that is kind of a wash because
it appears to be a kind of "default value" that cargo (or crates.io) gives a
project that has no "license" key in its configuration, but rather
a "license-file" key holding a path. The handful of these that I looked at were
using MIT, but just didn't name it in their Cargo.toml configs. It made me
chuckle, but the "Unlicense" came in 6th. The GPL-3.0 came in 7th, and the
BSD-2-Clause in 8th. So all the licenses from my list were in the top 8, but
they were definitely not the top 5. Here is a table of my counts:

Conclusion

Given the results, I am probably going to take the GPL-3.0 and BSD-2-Clause out
of my PR, and add the MPL in. The "Unlicense" seems to be slightly controversial
(at least from the little digging I did on the internet), but I don't want to
exclude it while including licenses that were represented less in the data, so
taking the top 4 instead of the top 5 seems more fair.

I am not sure if the cargo devs will be interested in my feature when I get a
PR opened, but either way I enjoyed this quick little dip into the crates.io
ecosystem.