RStudio, Jupyter, Emacs, Vim: nothing that works properly is easy to use and nothing that is easy to use works properly

So I am preparing to teach quantitative analysis of social media data using R, the open source language for statistical programming. I usually do anything code-related in Emacs, because I already know how to use Emacs and you can do everything code-related in Emacs and I don’t want to install and learn the quirks of loads of different IDEs. But that argument won’t make sense from the point of view of my students, firstly because they won’t need to do everything code-related, they’ll just need to create R notebooks, and secondly because they don’t already know how to use Emacs, and learning how to use Emacs is hard because Emacs is weird.

If you’re an Emacs user and you don’t believe me, then just imagine using Vim because that’s how weird Emacs is to someone who isn’t an Emacs user. And if you’re a Vim user and you’re feeling all superior, then try reading the preceding paragraph again after switching every mention of Emacs to a mention of Vim because the same point applies. Both Emacs and Vim are very difficult to learn because neither of them makes any sense from the point of view of someone who doesn’t already know how to use it: these days, people come to software applications with expectations formed from their use of other software applications, and neither Emacs nor Vim has an interface that works quite like any other software application’s interface. This means that there’s a steep learning curve. The payoff is that, once you’ve got the hang of Emacs or Vim, you’ll never need to learn anything else for your coding-related requirements. But not everybody needs that payoff.

Enter RStudio. RStudio is a dedicated open source IDE for R, and it has built-in support for ‘notebooks,’ which are documents that enable you to combine the code in which you do your analysis with the text in which you write up that analysis. What happens – or is supposed to happen – is that every time you save your .Rmd (R Markdown) file, RStudio first compiles (‘knits’) it into an .md (regular Markdown) file using an open source tool called knitr and then compiles the .md file into an HTML page using another open source tool called Pandoc, and the really neat thing is that it does all this for you, supposedly without your having to think about it. Also, RStudio is relatively easy to learn to use, because its interface is more like the interfaces of other contemporary software applications than the interfaces of Emacs and Vim. Hence my decision to teach my students with RStudio.
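For the curious, those two steps can be reproduced by hand from the R console. Here is a minimal sketch, assuming knitr is installed and Pandoc is on the system PATH (the demo filenames are made up for illustration):

```r
library(knitr)

# Create a tiny R Markdown file to stand in for a real notebook
rmd <- "demo.Rmd"
writeLines("Two plus two is `r 2 + 2`.", rmd)

# Step 1: knitr runs the embedded R code and produces plain Markdown
knit(rmd, output = "demo.md", quiet = TRUE)

# Step 2: Pandoc turns the Markdown into an HTML page
system2("pandoc", c("demo.md", "-o", "demo.html"))
```

This is, as far as I can tell, essentially what the Knit button automates on your behalf.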

In preparation for all this neatness, I’ve switched to using RStudio for my own research – and moreover, to using it on one of the Windows PCs that my employer provides, because that’s what the students will be using. And on the face of it, RStudio is pretty darn terrific. There’s a window for your notebook (or script, or whatever), and when you run code that creates a table or a chart, it appears in the notebook itself, right below the chunk of code. There’s also a window showing the environment (i.e. all the variables and functions that you have defined) and the history (i.e. a chronological list of all the commands that have been executed), a window for help text and for display of images outside of the notebook, and a console window, which is where things actually happen: when you run the code in your notebook, what it actually does is to send that code to the console, line by line, where it runs just as if you’d typed it there. You can also try out lines of code in the console, then put them into your notebook via the history window if you like the result. It’s a lot like using Emacs Speaks Statistics, except not in Emacs.

The results of my little experiment have been mixed. I’ve got some work done that I’m relatively satisfied with, including a piece intended to teach how opinion polls work. But RStudio – at least on this ordinary Windows PC – constantly hangs. I don’t know why it does this. It seems to have nothing to do with the memory or processing requirements of what I’m using it to ask R to do – though it seems to happen more often when nothing has been sent to or typed in at the console for a while. Maybe they just lose touch; I don’t know. Sometimes, I’ll ask it to do a calculation as trivial as 1+1, and it will hang (yes, I have tried this and it did). After a minute or three, it might start working again. Or I might get tired of waiting and click the menu option to restart R. Eventually, a pop-up window – or sometimes a whole series of pop-up windows – will appear, telling me that the connection with R has been lost. Then, a little while later, the answer to the calculation will appear, and an instant after that, R will restart. This isn’t so bad if I’m at the beginning of a notebook, but by the end, when later calculations may depend upon the results of earlier calculations, it can mean that I need to re-run the whole thing, which again means waiting because, even when it’s not hanging, RStudio often becomes painfully sluggish for no apparent reason, drip-feeding lines of code to the console in slow, slow motion.
I also end up doing that when one of two other things that tend to happen happens: either the source window stops sending code to the console altogether but the console keeps working (which means that I can at least test bits of code by copy-pasting them from the source window to the console by hand, though that becomes inconvenient quite quickly), or I tell the source window to execute some particular chunk of code and the clock icon appears to tell me that it’s scheduled to run after some other chunk of code has finished executing (but there is no other chunk of code executing – or if there is, it’s executing without telling me that it’s executing and without my having told it to execute). So what I’ve started doing is making myself go and do something else to pass the time whenever either of those things seems to be happening.

To give you an idea of how much time I’ve wasted like that today, it’s how this blogpost got written.

But that’s not all. Once I eventually got my current piece of work into some sort of near-readable form, RStudio started refusing to knit my .Rmd file into an .md file, giving up with the message ‘Error creating notebook: no lines available in input’ at the top of the source window and in most cases telling me which code chunk it had given up at with the console message ‘Quitting from lines X-Y (filename.Rmd)’. Each time this happened, I checked and tested and fiddled with the code, but there was never anything wrong with any of it. Sometimes, I’d run a chunk by hand and try to save again, and then it would breeze past that chunk only to give up at a different one. But it didn’t always tell me which chunk it had choked on, and sometimes it did tell me but the trick of running the chunk by hand and trying again didn’t work. I tried clearing the environment, restarting R, and re-running; I tried clearing the environment, restarting RStudio, and re-running. Same problem. Did I mention that ‘re-running’ takes maybe fifteen minutes?

Eventually, I gave up on using RStudio’s point-and-click interface and called the knitr program directly from the R console with library(knitr); knit('my-Rmd-file.Rmd'), which worked – albeit rather slowly – and proved that there really was nothing wrong with my code. It also gave me an .md file that I’ll presumably be able to compile into an HTML notebook… once I’ve figured out how to get Pandoc working on this Windows PC, that is (because RStudio was doing all that behind the scenes where I didn’t think I had to think about it). I was at the point of wondering whether to email the .md file to myself so that I could convert it to HTML on my Linux machine (hey, doesn’t that sound like a great workflow?), when I decided to try clearing the environment, closing RStudio, turning the computer off and then on again, and then re-running all the code chunks. That did the trick.
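Incidentally, it may not even be necessary to install Pandoc separately on Windows: RStudio bundles its own copy, and the rmarkdown package can locate and drive it. A hedged sketch, assuming rmarkdown is installed and the code is run from within RStudio (which sets the RSTUDIO_PANDOC environment variable); the filename is the illustrative one from above:

```r
library(rmarkdown)

# rmarkdown locates Pandoc itself, preferring the copy that ships
# with RStudio (found via the RSTUDIO_PANDOC environment variable)
find_pandoc()

# Convert the knitted Markdown file to HTML without a system-wide
# Pandoc install (filename purely illustrative)
pandoc_convert("my-Rmd-file.md", to = "html",
               output = "my-Rmd-file.html")
```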

I think that this latest problem might have something to do with the size that my document has reached, because I previously hit a problem where the cursor started jumping around randomly within paragraphs towards the end of the opinion polls piece once that got beyond a certain length. Or maybe it was memory use. R and notebooks both conspire to make memory management difficult. But this was after I’d moved the most memory-intensive bits of computation out of the notebook and into separate scripts whose output was loaded by the notebook code. It might also be the two together. Perhaps the solution to the long documents problem (if there is a long documents problem and not just a memory problem) is to split the notebook up into smaller files, although I can’t see a way of recombining them in RStudio, and it wouldn’t solve the memory problem (if there is a memory problem and not just a long documents problem), and anyway, neither of the notebooks I’m talking about is a particularly long document: the one I’m working on at the moment is just over 4000 words long and the opinion polls one is just under 6500 words long, whereas most journal articles I’ve published have been around 8000 words long. And if it’s a memory problem and not (or as well as) a long documents problem, then maybe the solution is to do much less computation in the notebook itself and much more in scripts that save their output for the notebook to load in and display – though as I’ve said, I’m already doing quite a bit of that, and to be honest it kind of defeats the object of a code notebook.

This doesn’t sound amazingly appealing, does it? RStudio is supposed to make things easier by automating the boring stuff and hiding it behind a nice point-and-click interface, but a lot of the time it just doesn’t work™.

So right now, I’m torn. Do I expect my students to put up with this crud? Or do I expect them to put up with the different crud that is having to learn Emacs — as well as Pandoc and makefiles, to do the work that RStudio was supposed to do behind the scenes where you don’t have to think about it? (On second thoughts, I don’t even know whether Windows has makefiles.) Or do I give up on both RStudio and Emacs, and have them create notebooks in Jupyter? Because Jupyter has its own headaches – in particular, that you can’t inspect objects if you’re writing a script rather than a notebook, and that there’s no console if you’re on Windows, and that it’s actually really awkward to do everything inside a browser window, and that Jupyter’s reference management plugin can’t handle page numbers, and that the format in which it saves notebooks creates severe complexities for version control. (Not that I’ve managed to get RStudio’s version control integration working yet, either. The instructions suggest that a particular dialogue will appear, but it doesn’t appear.[1])

That doesn’t sound amazingly appealing either, does it? Nothing I’ve mentioned here would sound even remotely appealing to a reasonable person, as opposed to the sort of masochist who makes things difficult for himself in order to prove how serious he or she is. (Moi?) Thinking about everything in terms of how I’m going to teach it to students (possibly across a language barrier) is making me keenly aware of what a vast chain of small and irritating obstacles I’ve had to overcome to be able to do the sort of research that I now do. Having overcome those obstacles, I keep clambering forward along the chain of new obstacles because, having come this far, I can hardly give up. But my students are somehow going to have to leap the whole lot in a single bound. Which means that I’m going to somehow have to coach them to do that.

The trouble with anything code-related and open source (which includes not only RStudio, but also Jupyter, Vim, Emacs, and R itself and all its packages) is the implicit assumption that anyone worthy of using it is a software engineer at heart. The result is that nothing that works properly is easy to use and that nothing that is easy to use works properly – and that an awful lot of things barely work at all and are monstrously difficult to interact with (and woefully under-documented to boot). But that apparently doesn’t matter, because – as a software engineer at heart – you will happily solve the problems that arise by yourself (because obviously, you have nothing better to do this week). And if not, then I suggest that you ask for your money back (ha ha, very amusing).

I’ll tolerate this crud when it’s just me and my research, but when you’re teaching a class, it’s a very different matter. If I’ve got 20 students trying to follow along with me, but every two minutes, somebody’s IDE stops working for no apparent reason, then what am I supposed to do? (Protip: saying ‘I suggest that you ask for your money back’ is a really bad idea when the person that you’re talking to is paying your employer for the privilege. Furthermore: regularly interrupting a class to sort out problems on individual students’ computers could be an equally bad idea when all the other students who are sitting around waiting for the class to resume are also paying your employer for the privilege.)

A way forward will eventually become apparent. But solving problems of this nature isn’t a good use of my time – and I’m feeling less and less inclined to avoid the conclusion that it would be better for my employer to pay for commercial software that was designed to be used by people who have other things to do besides getting it to work. MATLAB or Mathematica, for instance. But – thanks to the macho ideology of open source and its disdain for anyone who can’t (or doesn’t always have time to) deal with endless and very tedious technical problems – some people can get very sniffy about those, hence borderline-unusable software becoming standard not only in academia but in industry as well. And a large part of the point of the course I’m going to be teaching is going to be employability skills, which means that there’s no point teaching students to use something that will get them sneered at in industry. N.B. by ‘industry’, I mean the bit of industry that does social media analytics. Engineers other than software engineers have more sense than to sneer at people for using Mathematica just because it isn’t open source.

I’ve ranted about this kind of thing before with regard to open source typesetting software, but things are just as bad pretty much everywhere (see e.g. this amusing rant about web app deployment). The culture of free but utterly ramshackle software is underpinned by a profoundly counterproductive elitism. The implicit message is always ‘If you can’t – or aren’t willing to – spend hours, days, and months getting our beautiful gift to the world to work despite all the problems with it that we couldn’t be bothered to fix, then it’s high time you got back to something more suited to your abilities – like cowering under a rock and eating mould, you worthless pleb.’

I think I know the way around this particular problem. It involves prioritising what works properly over what’s easy to use, using the university’s Windows machines only in order to remotely access its high performance computing system so that we can work (albeit at one remove) under an operating system that is horribly difficult to use but that doesn’t place quite so many obstacles in our path (i.e. Linux), forgetting all about RStudio, and figuring out how to teach students who don’t all speak English particularly confidently what C-x C-s and M-0 C-k mean before I can even start teaching them how to analyse data.

Nut, meet sledgehammer.

But a part of me will still be wondering what coding might be like if only the easy stuff wasn’t quite so hard.

[1] After extensive testing, a very smart IT support guy called Matt Catlow figured out the problem: RStudio’s Git/SVN integration does not work at all if the local version of the project you want to keep under version control is saved to a network drive (at least on Windows). So I’m now coping by keeping my R projects on the C drive of the computer that the university provides me with. That’s fine for me — or at least, it is now that Matt’s given me edit rights to a folder on the C drive (thanks, Matt!). However, it won’t be fine for the students, because they will be using the PCs in the computer labs, where saving to a network drive is the only practical option.

6 thoughts on “RStudio, Jupyter, Emacs, Vim: nothing that works properly is easy to use and nothing that is easy to use works properly”

I have to say, the thought of teaching my colleagues R and having Step 1 be getting them in a text editor that doesn’t let them point and click to move the cursor, etc. sounds like something that would cause a mutiny on day 1. If RStudio performance is a concern, it seems to me that it would be better to either get them going in the R GUI or the tried and true practice of using the terminal and text files.

As for cryptic failures with RStudio’s knitting, it took me a while to realize that many of my issues were due to not realizing (it is not obvious unless one dives into documentation on RStudio’s website, if it’s even there) that RStudio knits in a fresh environment. Many times I’ve had problems because there was an object in my local environment that was not actually defined in the script or, even more often, a package loaded in my local environment that isn’t loaded in the script.

I have also been puzzled at the occasional lags when sending commands to R from RStudio after some downtime and having either R or RStudio (unclear who is actually lagging, though it’s certainly RStudio’s fault) take many seconds to react.

I have to say, the thought of teaching my colleagues R and having Step 1 be getting them in a text editor that doesn’t let them point and click to move the cursor, etc. sounds like something that would cause a mutiny on day 1.

Yeah, you’re right. There’s no pedagogical problem so big that making students use Emacs or Vim won’t make it substantially worse.

As for cryptic failures with RStudio’s knitting, it took me a while to realize that many of my issues were due to not realizing (it is not obvious unless one dives into documentation on RStudio’s website, if it’s even there) that RStudio knits in a fresh environment.

Good tip! And that behaviour is counter-intuitive, I think. You run a chunk of code; it gives you the output and the output stays there. You could go through a whole notebook like that, running every chunk of code. But — as I understand it — you’re saying that won’t make any difference to what happens when you tell RStudio to knit the notebook because it won’t do so in the environment that you ran those chunks in. I suspected that something like that was going on, but it’s good to have it confirmed.
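That difference can actually be demonstrated: knitr’s knit(), called from the console, evaluates chunks in the calling environment by default, whereas the Knit button effectively starts from a clean slate, which can be simulated by passing a fresh environment. A small sketch, assuming knitr is installed (the object and file names are invented):

```r
library(knitr)

# A one-chunk notebook that uses an object it never defines itself
rmd <- "fresh-env-demo.Rmd"
writeLines(c("```{r, error=TRUE}", "print(mystery_object)", "```"), rmd)

# Define the object interactively, as you might at the console
mystery_object <- 42

# knit() called from the console uses the calling environment by
# default, so the chunk can see mystery_object and the output has 42
knit(rmd, output = "works.md", quiet = TRUE)

# Simulating the Knit button's fresh environment: the same notebook
# now fails to find the object, and the error lands in the output
knit(rmd, output = "fails.md", quiet = TRUE,
     envir = new.env(parent = baseenv()))

any(grepl("42", readLines("works.md")))        # TRUE
any(grepl("not found", readLines("fails.md"))) # TRUE
```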

If RStudio performance is a concern, it seems to me that it would be better to either get them going in the R GUI or the tried and true practice of using the terminal and text files.

I’ve persevered with RStudio and kind of reconciled myself to its problems. The weird lag seems to happen only on Windows, so I can either run RStudio remotely on my employer’s Linux-based high performance computing service or just take a deep breath every time it happens and wait for things to start moving again. But yeah, terminal and text files are the answer in a lot of cases.

Anyway, thanks for this comment, which gives me an excuse to add my more considered reflections on R and the notebooks it enables you to create. The main thing I’ve realised is that there are real limits to the notebook model of working. For instance, if you’re defining new functions, it makes sense to keep them all in one place — but putting them all into a chunk at the top of your notebook means that you’re constantly scrolling up and down between the lines where a function is defined and the lines where it’s called. The solution to this problem is to put all your functions into a separate text file and then source it at the beginning. That makes practical sense, but it means that there’s a lot going on invisibly and outside your notebook. And the way that source, library, and require all load functions into the same namespace makes it really hard for your readers to see where each function came from. (I’m used to working in Python, where it’s considered bad practice to do the equivalent, i.e. from module import *.)
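In practice that pattern looks something like this (the helper file and function here are invented for illustration):

```r
# Write a helper file containing the function definitions; in real
# use this file would be edited by hand and live alongside the notebook
writeLines("clean_text <- function(x) tolower(trimws(x))", "helpers.R")

# First chunk of the notebook: pull the definitions into the session
source("helpers.R")

# Later chunks can then call the helpers without scrolling back to a
# definitions chunk -- at the cost of the definitions being invisible
clean_text("  Hello World  ")  # "hello world"
```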

Another issue (exacerbated by R’s general slowness and memory inefficiency) is that if there’s a big computation to do, the only really practical way to handle it is to put the code in a separate file (quite possibly using a different programming language) and run it from the terminal as a batch job, then have your R notebook read the output and do something further with it (e.g. make a chart). Again, this means that only some of what you’re doing is really documented in the notebook. For example, right now I have a script running that’s creating a dissimilarity matrix (fingers crossed!). The code in my notebook will load the matrix into memory and then run hclust on it. The code that created the matrix will be outside the notebook — though I’ll have to discuss (and maybe quote) it inside the notebook, because it’s not using a standard algorithm.
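As a sketch of that division of labour (with a toy random matrix standing in for the real, non-standard dissimilarity computation):

```r
## --- batch script, run separately (e.g. Rscript make_matrix.R) ---
set.seed(1)
d <- dist(matrix(rnorm(50), nrow = 10))  # toy stand-in for the real matrix
saveRDS(d, "dissim.rds")                 # save the result for the notebook

## --- notebook chunk: load the precomputed result and cluster it ---
d <- readRDS("dissim.rds")
hc <- hclust(d)   # hierarchical clustering on the loaded matrix
plot(hc)          # the dendrogram appears below the chunk in RStudio
```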

And there’s nothing wrong with that and it’s really the only sensible way of doing things. But it does mean that there’s something performative about an R notebook. I mean, why even put the call to hclust in my notebook? That could also be in a separate text file, run from the terminal, with the notebook loading in and doing further processing on the output. If you can’t put everything into your notebook, then what do you put in? The decision ‘I (do not) want my readers to be able to see the call to hclust’ is an aesthetic or a didactic one, I guess.

I’m not dismissing R notebooks, of course. In fact, I’m doing more and more of my research in them. But it’s important to be aware of what they are good for and what they aren’t — which is why I wrote this blog post, I guess. Open science is great, but even though the notebook format seems to promise to allow you to make all your workings public, you still end up making all these decisions about what to include and what not to include. And some of those decisions are virtually made for you by the technical limitations of the tools you use to create your notebooks, as well as by the notebook concept itself.

Most open source software seems to originate on UNIX-like operating systems. On ms-windows things like creating and managing processes and interprocess communication are handled very differently. And it seems that GUI-programs have some message-passing / event loop related choke-points. For an interesting story, find and read “24-core CPU and I can’t move my mouse”.

Also, while a fresh Linux install is like a nicely filled toolbox, a fresh ms-windows install is more like an empty box. Add to that the lack of a decent package manager and all the other infrastructure that a modern UNIX-like system has (plus a host of other small but nagging details) and you’ll have lots open source programmers running away screaming.
[As an example w.r.t. nagging details: until relatively recently, each version of the runtime of microsoft’s C/C++ toolchain was incompatible with all other versions. That means e.g. that if you want to build a DLL module for Python 2.7, you’d have to dig up and install Visual Studio 2008 (or the equivalent standalone compiler), because that’s the version Python 2.7 was built with.]

Even though I’ve managed to install Python, vim and git on my ms-windows machine at work, I still prefer to use them on an old FreeBSD laptop that I also keep around; *they just work better there*. Startup times are much less for one thing even with said laptop being a much lower spec than the ms-windows machine. Also things like tab completion in the bash shell are really slow on ms-windows.

A thing you might do is give your students a virtual machine image with a pre-configured Linux that just fires up RStudio when started. Both VirtualBox and Virtual PC are freely available for ms-windows, as far as I can tell.

As for your suggestion of giving the students a virtual machine image with a pre-configured Linux setup, that is actually a great idea. I can’t do it, unfortunately: everything I ask them to do has to work on the university’s Windows PCs, and installing a virtual machine is apparently not allowed on those. But it’s a great idea nonetheless.

I’m figuring out ways around the problems described here, but it seems to be a general rule that everything that involves open source software is more difficult in Windows.

I’m about to face exactly the same issue and I’d be very interested to know how things go for you. A colleague of mine did try RStudio with his students on Uni Windows PCs and he too had major hanging issues, leading him to rethink. I’ve been teaching students with the basic console/editor set-up. It works, but for the first few weeks we do absolutely no analyses as students get comfortable. It’s a bit of a waste of time and doesn’t have all the potential functionality of RStudio. I’m starting to use it too, but on Linux so…

Thanks for your comment, Sara. It’s still a good while before I have to teach the course to students, so I was very interested to hear your perspective. I’m thinking of setting up a couple of voluntary sessions for research postgraduates to see how they respond.

Having persevered with it for a few more months, my conclusion is that RStudio really doesn’t work very well on Windows. The hanging is the worst thing, but losing control of the cursor in longer documents is also a fairly big deal. This is a shame because RStudio is intended to make R more accessible and, whether we like it or not, most people have easier access to Windows machines than to Linux or Unix machines (including Macs).

The best solution I’ve found so far has been to open a remote connection to the university’s high performance computing service and run RStudio on that. This gives access to the Linux version of RStudio, as well as the Linux command line. But working in a virtual desktop creates its own problems, both technical (the screen looks poor and the text is hard to read) and conceptual (beginning students have enough trouble understanding the file system without having to grasp the difference between the file system on the computer they’re looking at and the file system on the computer they’re connecting to remotely).

I’ve grown attached to RStudio and I don’t want to give up on it, despite the drawbacks. In many ways, Jupyter is superior, but RMarkdown is a much better medium in which to create notebooks than Jupyter’s JSON format. And of course, some of Jupyter’s functionality doesn’t work at all on Windows.

I get that many — perhaps most — developers prefer *nix. I prefer it too. But we should try to meet our students halfway, and we have no choice but to work with the hardware that our employers provide.