Wednesday, August 29, 2007

Sandy Payette has plans to expand Fedora to support open access publishing, eScience and eScholarship. With a recent $4.9M grant from the Moore Foundation, it looks like she might have the opportunity to do this...

Abstract: It has become clear that scholarly practice and scholarly communication across a wide range of disciplines are being transfigured by a series of developments in IT and networked information. While this has been widely discussed at the national and international levels in the context of large-scale advanced scientific projects, the challenges at the level of individual universities and colleges may prove more complex and more difficult. This presentation will focus on these challenges, as well as the development of truly institution-wide strategies that can support and advance the promises of e-research.

"...that the output per contributor in open source projects is much higher when licenses are less restrictive and more commercially oriented."

and observe:

"Projects written for the Linux operating system have lower output per contributor than projects written for other operating systems..."

and:

"Output per contributor in projects oriented towards end users (DESKTOP) is significantly lower than that in projects for developers."

They also observed that the median number of contributors in "restrictive" projects (13) was much lower than in "non-restrictive" projects (35).

They chose the 71 most active projects on SourceForge in January 2000 and studied them over an 18-month period starting in January 2002, measuring each project every 2 months for a total of 9 samples. The metrics they used include: source lines of code (SLOC), number of contributors, the "restrictiveness" of the license (GPL = very restrictive; LGPL, Mozilla, NPL, MPL = moderately restrictive; BSD = non-restrictive), operating system, age of project, whether it is a desktop or system application, language (C or C++ = 1; all others = 0), and others. They took into account differences in LOC across languages by also looking separately at just the C and C++ projects.
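As a rough illustration of the study's core metric, here is a minimal sketch of computing output per contributor from bi-monthly SLOC snapshots. The snapshot values and contributor count below are invented for illustration; they are not from the paper.

```python
# Hypothetical bi-monthly SLOC snapshots for one project
# (9 samples over 18 months); numbers are invented for illustration.
sloc = [10000, 11000, 12500, 13000, 14200, 15000, 16800, 17500, 19000]
contributors = 13  # median for "restrictive" projects, per the study

# Output per period = change in SLOC between consecutive samples.
deltas = [b - a for a, b in zip(sloc, sloc[1:])]
total_output = sum(deltas)  # SLOC added over the 18 months
output_per_contributor = total_output / contributors
print(total_output, round(output_per_contributor, 1))  # -> 9000 692.3
```

Note that this kind of raw SLOC delta is exactly the quantity that language choice and project type could distort, which is presumably why the study controls for them.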

I do not understand the lag between choosing the projects (January 2000) and the start of the data sampling (January 2002). This in itself could have skewed the results: the 71 most active projects in 2000 would almost certainly NOT be the most active 2 years later. I think this may be a major flaw in this study.

I also don't think that the sample size is large enough, and the sampling method should have been a random selection of projects that met some reasonable criteria, like:

had at least C contributors

had at least L lines of code contributed over the last M months

had at least D downloads over the last M months (this would penalize very new & very unpopular projects?)
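The criteria above could be sketched as a simple eligibility filter followed by a random draw. Everything here is hypothetical: the project records, the field names, and the threshold values for C, L and D are placeholders for illustration only.

```python
import random

# Hypothetical project records; fields and values are invented for illustration.
projects = [
    {"name": "proj-a", "contributors": 12, "loc_last_m": 8000,  "downloads_last_m": 500},
    {"name": "proj-b", "contributors": 2,  "loc_last_m": 300,   "downloads_last_m": 40},
    {"name": "proj-c", "contributors": 40, "loc_last_m": 25000, "downloads_last_m": 9000},
    {"name": "proj-d", "contributors": 7,  "loc_last_m": 5000,  "downloads_last_m": 1200},
]

# Placeholder thresholds C, L, D over the trailing M months.
C, L, D = 5, 1000, 100

def meets_criteria(p):
    """A project qualifies only if it clears all three thresholds."""
    return (p["contributors"] >= C
            and p["loc_last_m"] >= L
            and p["downloads_last_m"] >= D)

eligible = [p for p in projects if meets_criteria(p)]
random.seed(0)  # fix the seed so the draw is reproducible
sample = random.sample(eligible, k=min(2, len(eligible)))
print([p["name"] for p in sample])
```

Drawing the sample at random from the eligible pool, rather than taking the top-N most active projects, avoids baking activity level into the selection itself.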

I also believe that they made another possible error: they observe in their discussion that the median number of LOC per project was 53K for "non-restrictive" and 60K for "restrictive" projects. They suggest that this is not a big difference (though they do not appear to statistically verify the distribution of LOC in projects by license grouping). But I would suggest that 500 lines of code contributed to a project with 5K LOC can often be a more significant contribution than 500 LOC contributed to a 100K LOC project. They should have looked into the effect of normalizing the contributed LOC by the total LOC of the project.
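The arithmetic behind this normalization point is simple enough to show directly, using the illustrative numbers from the paragraph above:

```python
def relative_contribution(contributed_loc, total_loc):
    """Contributed LOC normalized by the project's total LOC."""
    return contributed_loc / total_loc

# The same 500-LOC contribution, relative to two project sizes.
small = relative_contribution(500, 5_000)    # 0.10  -> 10% of the project
large = relative_contribution(500, 100_000)  # 0.005 -> 0.5% of the project
print(small, large)  # -> 0.1 0.005
```

A 20x difference in relative contribution disappears entirely when output is measured in raw LOC, which is the crux of the objection.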

I haven't taken too much time to go over all of their experimental design, model & stats....

"Projects geared toward end-users tend to have restrictive licenses, while those oriented toward developers are less likely to do so. Projects that are designed to run on commercial operating systems and whose primary language is English are less likely to have restrictive licenses. Projects that are likely to be attractive to consumers—such as games—and software developed in a corporate setting are more likely to have restrictive licenses. Projects with unrestricted licenses attract more contributors."

I believe the level of acceptance among researchers is high enough to move forward on a national data archive, and clearly there also needs to be a better education campaign by SSHRC and other Canadian research funding bodies, both at the strategic level - read "policy and funding" - and at the tactical level - read "engaging, informing and educating researchers".

University Affairs has an interesting article on Open Science that examines the patents and licensing regime and its impacts on science and the ability to do science. While at times advocating an Open Source-like model of Open Science, the author is a little too wishy-washy and supports hybrid models, which are too much of a slippery slope for me.

I also don't agree with a number of statements including:

But now an international scientific counterculture is emerging. Often referred to as "open science" this growing movement proposes that we err on the side of collaboration and sharing.

Counter-culture? I think he has it backwards: despite the many biotechnologists, biotech companies and other science-based industries that use the patent system to support their business interests - usually encumbering further scientific discovery - the vast majority of scientists - at least those working in academia, and of course with exceptions - have long been working, and will continue to work, in an Open Science environment. This is not to take away from the Open Science movement and what it is trying to do. But it existed before someone decided to call it Open Science, and it is the default model / mode for most scientists in academia. The tail is wagging the dog a little here...

Thanks to Mary Zborowsky and Michel Sabourin for pointing out this article.