Teaching basic lab skills for research computing

New Challenges

As some of you already know, my contract with the University of Toronto runs out this spring, and I have decided not to seek renewal. I've learned a lot in this job, and had a chance to work with some great people, but it's time for new challenges.

What I'd most like to do next is spend a year working full-time on the Software Carpentry course—of all the things I've done, it's the one that I think has the most potential to make scientists' lives better. My goal is to raise approximately CDN$25,000 from each of half a dozen sponsors so that I can reorganize and revamp the content, add screencasts and video lectures, and generally drag it into the 21st Century. An abbreviated proposal is included below the cut—if you or anyone you know would be interested in discussing possibilities, please give me a shout.

Computers are as important to modern science as telescopes and test tubes. From analyzing climate data to modeling the internals of cells, they allow scientists to study problems that are too big, too small, too fast, too slow, too expensive, or too dangerous to tackle in the lab.

Unfortunately, most scientists are never taught how to use computers effectively. After a generic first-year programming course, and possibly a numerical methods or statistics course later on, graduate students and working scientists are expected to figure out for themselves how to build, validate, maintain, and share complex programs. This is about as fair as teaching someone arithmetic and then expecting them to figure out calculus on their own, and about as likely to succeed.

It doesn't have to be like this. Since 1997, the Software Carpentry course has taught scientists the concepts and skills they need to use computers more effectively in their research. This training has consistently had an immediate impact on participants' productivity by making their current work less onerous, and new kinds of work feasible. The materials [1], which are available under an open license, have been viewed by over 140,000 people from 70 countries, and have been used at Cal Tech, the Space Telescope Science Institute, and other universities, labs, and companies around the world.

Despite its popularity, some of the material is now out of date (and users' expectations are higher than they used to be). Our goal is therefore to upgrade the course to bring this training to the widest possible audience. Using lessons learned in the July 2009 offering sponsored by MITACS [2] and Cybera [3], we will create a self-paced version of this material that students can use independently, while also offering them somewhere to turn when they have questions or problems. As described in [4], the revised course will cover the things that working scientists most need to know [5], including:

Program design

Version control

Task automation

Agile development

Provenance and reproducibility

Maintenance and integration

User interface construction

Testing and validation

Working with text, XML, binary, and relational data
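To give a flavor of the "task automation" item above, here is a minimal sketch of the kind of skill the course teaches: replacing a repetitive by-hand chore with a short script. The file names, directory layout, and cleaning rule are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: automate a chore (cleaning many data files)
# instead of editing each one by hand. Names and rules are hypothetical.
from pathlib import Path

def clean(text):
    """Drop comment lines (starting with '#') and blank lines."""
    return "\n".join(
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )

def clean_all(raw_dir, out_dir):
    """Apply clean() to every .txt file in raw_dir, writing to out_dir."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in sorted(Path(raw_dir).glob("*.txt")):
        (out / path.name).write_text(clean(path.read_text()))
```

A scientist who can write a loop like this once never has to clean a hundred files by hand again, and the script itself becomes a record of exactly what was done to the data.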

We expect the revised course will reach thousands of graduate students and working scientists, and will increase their productivity in direct and measurable ways. It will also prepare them to tackle the challenges of large-scale parallelism, cloud computing, and reproducible research.

We are currently seeking contributions of $20-25K toward the $130K needed to realize this goal. By helping us, you will help current and future staff be more productive and associate yourself publicly with best practices. If you would like to help, please contact Greg Wilson at team@carpentries.org.

Biography:

Greg Wilson (http://pyre.third-bit.com/blog/cv) holds a Ph.D. in Computer Science from the University of Edinburgh, and has worked on high-performance scientific computing, data visualization, and computer security. He is currently an Assistant Professor in Computer Science at the University of Toronto, where his primary interests are lightweight software engineering tools and education. Greg has served on the editorial boards of Dr. Dobb's Journal and Computing in Science and Engineering; his most recent books are Data Crunching (Pragmatic, 2005), Beautiful Code (O'Reilly, 2007), and Practical Programming (Pragmatic, 2009).

This page describes the revised course, and this one describes its target audience.

In 2009, we conducted the largest survey ever done of how scientists actually use computers. The results are reported in this article, a shorter and more readable version of which is here. This and this explain why scientists need to learn these skills before tackling parallelism, cloud computing, and other leading-edge technologies.