Archive

pathological (for working with file paths), runittotestthat (for converting RUnit tests to testthat tests), and rebus (formerly regex, for building regular expressions in a human readable way) all make their CRAN debuts.

assertive, for run-time testing your code has more checks for the state of your R setup (is_r_devel, is_rstudio, r_has_png_capability, and many more), checks for the state of your variables (are_same_length, etc.), and utilities (dont_stop).

sig (for checking that your function signatures are sensible) now works with primitive functions too.

learningr (to accompany the book) has a reference URL fix but is otherwise the same.

I encourage you to take a look at some or all of them, and give me feedback.

the big problem with being a data scientist is that you have to be a statistician and a programmer, which is really two full time jobs, and consequently a lot of hard work. The webcast will cover some tools in R that make the programming side of things much easier.

You’ll get to hear about:

Writing stylish code.

Finding bad functions with the sig package.

Writing robust code with the assertive package.

Testing your code with the testthat package.

Documenting your code with the roxygen2 package.

It’s going to be pitched at a beginner’s-plus level. There’s nothing hard, but if you haven’t used these four packages, I’m sure you’ll learn something new. Register here.

Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.

At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here a very quick rundown of the ideas.

With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.

The thing is though, software developers have spent decades figuring out how to make writing code easier, so there are lots of tips and tricks that can make your life easier.

The minor downside it that, although very readable, it’s about 850 pages, so it takes some getting through.

The good news it that there are a couple of simple things that you can do that I think have the highest productivity-to-effort ratio.

Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.

Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)

The sig package helps you determine whether your code lives up to this ideal.

Even from just the first four signatures, you can see that the tools package has a style problem. add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.

I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.

write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.

There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)

The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.