Developing Texts Like We Develop Software

Recently I was asked to speak at a conference for university librarians, about how the future of academic publication looks to me as a computer scientist. It’s an interesting question. What do computer scientists have to teach humanists about how to write? Surely not our elegant prose style.

There is something distinctive about how computer scientists write: we tend to use software development tools to “develop” our texts. This seems natural to us. A software program, after all, is just a big text, and the software developers are the authors of the text. If a tool is good for developing the large, complex, finicky text that is a program, why not use it for more traditional texts as well?

Like software developers, computer scientist writers tend to use version control systems. These are software tools that track and manage different versions of a text. What makes them valuable is not just the ability to “roll back” to old versions — you can get that (albeit awkwardly) by keeping multiple copies of a file. The big win with version control tools is the level of control they give you. Who wrote this line? What did Joe write last Tuesday? Notify me every time section 4 changes. Undo the changes Fred made last Wednesday, but leave all subsequent changes in place. And so on. Version control systems are a much more powerful relative of the “track changes” and “review” features of standard word processors.
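
As a toy illustration of the “who wrote this line?” query, here is a minimal Python sketch. It is not how git or any real version control system works internally. The `Revision` class and the `blame` function are hypothetical names chosen for this example, and the sketch simply compares successive whole-file snapshots using the standard `difflib` module.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Revision:
    """One saved snapshot of the text (hypothetical structure)."""
    author: str
    lines: list


def blame(history):
    """Return (author, line) pairs for the most recent revision.

    A line is credited to the author of the earliest revision from which
    it has survived unchanged.
    """
    if not history:
        return []
    prev_lines = history[0].lines
    credit = [history[0].author] * len(prev_lines)
    for rev in history[1:]:
        # Assume the new author wrote everything, then hand credit back
        # for every line that matches the previous snapshot.
        new_credit = [rev.author] * len(rev.lines)
        matcher = SequenceMatcher(a=prev_lines, b=rev.lines, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_credit[block.b + k] = credit[block.a + k]
        prev_lines, credit = rev.lines, new_credit
    return list(zip(credit, prev_lines))


history = [
    Revision("Joe", ["Intro", "Method", "Results"]),
    Revision("Fred", ["Intro", "Method (revised)", "Results", "Conclusion"]),
]
for author, line in blame(history):
    print(f"{author:>5}: {line}")
```

Real tools store much richer metadata (timestamps, commit messages, branches), which is what makes queries like “what did Joe write last Tuesday?” possible.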

Another big advantage of advanced version control is that it enables parallel development, a style of operation in which multiple people can work on the text, separately, at the same time. Of course, it’s easy to work in parallel. What’s hard is to merge the parallel changes into a coherent final product, which is a huge pain in the neck with traditional editing tools but is easy and natural with a good version control system. Parallel development lets you turn out a high-quality product faster (it’s a necessity when you have hundreds or thousands of programmers working on the same product), and it vastly reduces the amount of human effort spent on coordination. You still need coordination, of course, but you can focus it where it matters, on the conceptual clarity of the document, without getting distracted by version-wrangling.
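
To make the merging step concrete, here is a rough Python sketch of a line-based three-way merge: two people start from the same base text, edit separately, and their changes are combined automatically unless they touch the same region. This is a deliberately simplified illustration built on the standard `difflib` module, not the algorithm any particular tool uses. The names `merge3`, `edits`, and `MergeConflict` are made up for the example.

```python
from difflib import SequenceMatcher


class MergeConflict(Exception):
    """Raised when both versions changed the same region of the base."""


def edits(base, other):
    """List the (start, end, replacement) edits that turn base into other."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [
        (i1, i2, other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def merge3(base, ours, theirs):
    """Combine two independently edited copies of base (lists of lines)."""
    ours_edits = edits(base, ours)
    theirs_edits = edits(base, theirs)
    # Be conservative: differing edits that overlap or even touch the same
    # spot in the base are reported as a conflict for a human to resolve.
    for a in ours_edits:
        for b in theirs_edits:
            if a != b and a[0] <= b[1] and b[0] <= a[1]:
                raise MergeConflict(f"both versions changed base lines {a[0]}-{a[1]}")
    # Identical edits made by both sides are applied only once.
    combined = ours_edits + [e for e in theirs_edits if e not in ours_edits]
    merged = list(base)
    # Apply from the end of the text backward so earlier indices stay valid.
    for start, end, replacement in sorted(combined, key=lambda e: e[0], reverse=True):
        merged[start:end] = replacement
    return merged


base = ["Intro", "Method", "Results", "Discussion", "Conclusion"]
ours = ["Intro", "Method (revised)", "Results", "Discussion", "Conclusion"]
theirs = ["Intro", "Method", "Results", "Discussion", "Conclusion (expanded)"]
print(merge3(base, ours, theirs))
```

The point of the sketch is the division of labor: the tool handles the mechanical combination, and people are only asked to step in when two edits genuinely collide.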

Interestingly, version control and parallel development turn out to be useful even for single-author works. Version control lets you undo your mistakes and reconstruct the history of a problematic section. Parallel development is useful if you want to try an experiment (what happens if I swap sections 3 and 4?) and work with the new approach for a while, yet retain the ability to accept or reject the experiment as a whole. These tools are so useful that experienced computer scientists tend to use them to write almost anything longer than a blog post.

While version control and parallel development have become standard in computer science writing, there are other software development practices that are only starting to cross the line into CS writing: issue tracking and the “release early and often” strategy.

Issue tracking systems are used to keep track of problems, bugs, and other issues that need to be addressed in a text. As with version control, you can do this manually, or rely on a simple to-do list, but specialized tools are more powerful and give you better control and better visibility into the past. As with software, issues can range from small problems (our terminology for X is confusing) to larger challenges (it would be nice if our dataset were bigger).

“Release early and often” is a strategy for rapidly improving a text by making it available to users (or readers), getting feedback, and rapidly turning out a new version that addresses the feedback. Users’ critiques become issues in the issue tracking system; authors modify the text to address the most urgent issues; and a new version is released as soon as the text stabilizes. The result is rapid improvement, aligned with the true desires of users. This approach requires the right attitude from users, who need to be willing to tolerate problems, in exchange for a promise that their critiques will be addressed promptly.
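
Here is a small Python sketch of how reader critiques might be recorded as issues and the most urgent open ones picked for the next release. Everything in it is an assumption for illustration: the `Issue` fields, the numeric urgency scale, and the `next_batch` function are hypothetical and not taken from any real issue tracker.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """One tracked problem with the text (hypothetical fields)."""
    title: str
    urgency: int            # higher means more urgent; the scale is made up
    reporter: str = "reader"
    resolved: bool = False


def next_batch(issues, limit=2):
    """Pick the most urgent unresolved issues to address before the next release."""
    open_issues = [issue for issue in issues if not issue.resolved]
    return sorted(open_issues, key=lambda issue: issue.urgency, reverse=True)[:limit]


issues = [
    Issue("Our terminology for X is confusing", urgency=3),
    Issue("It would be nice if our dataset were bigger", urgency=1),
    Issue("Typo in the abstract", urgency=5, resolved=True),
]
for issue in next_batch(issues):
    print(issue.urgency, issue.title)
```

A real tracker also keeps the history of each issue, which is where the visibility into the past comes from.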

What does all of this mean for writers who are not computer scientists? I won’t be so bold as to say that the future of writing will be just exactly like software development. But I do think that the tools and techniques of software development, which are already widely used by computer scientist writers, will diffuse into more common usage. It will be hard to retrofit them into today’s large, well-established editing software, but as writing tools move into the cloud, I wouldn’t be surprised to see them take on more of the attributes of today’s software development tools.

One consequence of using these tools is that you end up with a fairly complete record of how the text developed over time, and why. Imagine having a record like that for the great works of the past. We could know what the writer did every hour, every day while writing. We could know which issues and problems the author perceived in earlier versions of the text, and how these were addressed. We could know which issues the author saw as still unfixed in the final published text. This kind of visibility will be available into our future writing — assuming we produce works that are worthy of study.

Comments

This is a fascinating post. Some cloud-based word processors are already vindicating the key prediction: Google’s word processor keeps track both of multiple concurrent authors and of sequential revisions, albeit not with the finesse of best-in-class computer science tools. Other innovations like Etherpad, Google Wave, and the various shared whiteboards and “spaces” in tools like Skype and, long ago, ICQ, seem to be part of an ongoing thread of thought within the software development community about how writing could get better.

The post’s conclusion has an unstated premise. It’s true that more and more data about the author’s writing process, including interim tinkering before the final product, is being captured. But this data will be available to future scholars only if it is also preserved. Given that preservation even of final documents is increasingly being recognized as a vexing problem, and given that in the humanities context information about revisions and marginalia is often of lower preservation priority than final text, it’s reasonable to wonder how much of this will be accessible to future scholars.

One question is, can we take some of the revision data we have now — on recently authored works of scholarly interest — and start to use it in ways that demonstrate the utility of preserving such data, as a general principle? Or, does the revision data only gain value — or only become practically available to scholars — long after the original acts of authorship?

As you point out, version control doesn’t create a long-time archive, only an opportunity to archive.

This issue came up in the Q&A after my talk. My suggestion is that some library can solve two problems at once by offering a free version control server for scholars to use, on the understanding that everything on the server will be permanently archived.

These days I tend to use git for version control on code and papers. Certainly, tools like git, svn, cvs, and SourceSafe are not “housebroken” for ordinary users, but I’m convinced that their underlying concept of operation can be adapted for a broader audience, especially if the audience adapts a bit over time to use them more effectively.

The user interface on git is challenging even for experts. Very powerful but deceptively unintuitive.

Most wiki software provides version control, visual display of differences between versions, and some sort of concurrent editing option. This is not too bad, especially if you break up a big text into sections with a wiki-page per section.

I’ve had some experience teaching the use of SVN (using the TortoiseSVN interface) to non-developers, and people seem generally to pick up the basics pretty quickly, even if they’re not good with computers.

However, this has always been in situations where good technical support from experienced sysadmins has been available.

If I may be allowed to shill, I’m on a team developing a collaborative editor with the specific goal of bringing code management tools to a wider audience. One of our main findings has been that unconstrained whiteboards are much less easy to use than “chunked” systems where users get temporary write-locks on small portions of text.

Writing natural language text in an ordinary linear editor, even on my own, has begun to feel constraining.

I’ve seen examples of the systems you mention popping up even today.
Cory Doctorow lazywebbed and got Flashbake, an automatic git version control system for his texts. It commits the current state of your text every X minutes, and the commit message records what song you have playing on your computer, what the weather is, what’s happening in the news, etc. Instant version control and context for historians. (Search for “Flashbake”.)
I’ve also noticed Robin Sloan doing real-time collaboration and review and small story iterations. He’s written about these smaller, tighter loops of reviewing. The links were removed by the spam filter, so search for Robin Sloan and Utility belt.

Something has gone wrong. For months, there has been a mean of 3.25 new posts here each Saturday when I check, with a small standard deviation. That is, there’ve been either three or four new posts, usually three, exceedingly rarely two, and never just one or as many as five.

The last two weeks the number of new posts has been, respectively, zero and one.

Zero is a twelve-sigma outlier. One is still several standard deviations from the mean.

I’m CSO of a startup with a horse in this race, so please allow for parental enthusiasm.

I agree completely that version management software has a lot to offer writers and scholars way outside the IT field, and even more strongly that the future of writing will not be just exactly like software development. It is quite hard to get programmers (who already think in complex structural terms) to fit themselves into version control systems, and really hard with non-engineers. Analogously, a gear lever adds control that improves performance for a user who deeply understands engines, but automatic works better for most drivers. Unlike Ed Felten I am not “convinced that their underlying concept of operation can be adapted for a broader audience, especially if the audience adapts a bit over time to use them more effectively”: what is under the hood should be less obtrusive.

On top of that, code is usually developed within a group that commits to one version control system. A new-company business plan may involve an engineer, an accountant, a lawyer, an investor, etc., etc., all with their own software habits: collaborating scholars may be scattered over the globe, and over the writing software landscape.

For such reasons, we avoided starting with code management tools of the current kind. In particular, we don’t track changes as they are made: we allow import of versions that may arrive in different formats, without version metadata, and find the differences by algorithms borrowed from gene comparison. (Also, we identify moves as moves, not as Track Changes’ unrelated deletions and insertions.) Then we serve them up in a single-window compare-and-merge interface that took more thought and time (simplification is complex!) than the underlying code.
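
To illustrate what “identifying moves as moves” means, here is a rough Python sketch. It is not TextFlow’s algorithm: instead of gene-style sequence alignment it uses the standard `difflib` module, and the `classify_changes` function is a hypothetical name for this example. It compares two versions paragraph by paragraph with no version metadata, then treats any paragraph that was both deleted from one place and inserted in another as a move.

```python
from difflib import SequenceMatcher


def classify_changes(old, new):
    """Compare two lists of paragraphs and separate moves from real edits."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old[i1:i2])
        if tag in ("insert", "replace"):
            inserted.extend(new[j1:j2])
    # A paragraph that vanished in one place and reappeared unchanged in
    # another was moved, not deleted and rewritten.
    moved = [p for p in deleted if p in inserted]
    return {
        "moved": moved,
        "deleted": [p for p in deleted if p not in moved],
        "inserted": [p for p in inserted if p not in moved],
    }


old = ["Intro", "Background", "Method", "Results"]
new = ["Intro", "Method", "Results", "Background", "Conclusion"]
print(classify_changes(old, new))
```

The sketch only catches paragraphs moved verbatim, and it does not handle repeated paragraphs carefully. Recognizing a paragraph that was moved and lightly edited needs fuzzier matching than this.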

We think TextFlow has all the advantages discussed in the blog entry and comments, and avoids the hard learning curve of an IT-style VCS. Obviously we are biased, but you can take a look at http://www.textflow.com to make your own judgement. You can also find it in the Google Apps Marketplace, where it was launched this week.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.