tikadiff: graphical diff for text from “binary” files


Version Control Systems (VCSs)

VCSs like mercurial, git and bazaar (to mention only a few) are great for keeping track of changes to source files, but their utility doesn’t stop there. If you’re working on documents in applications like Word, OpenOffice or LibreOffice, especially when you are asking others to review those documents, a VCS program can save you a lot of anguish.

However, people who work not with source code, but with research papers, academic assignments and the like, are not inclined to make themselves familiar with the tools that geeks have grown used to. Considering how long it took for software developers to embrace those tools, it’s hardly surprising.

Limitations of VCSs

Unfortunately, VCSs aren’t set up well for managing such files. Their efficient version management depends on tracking line-by-line differences within a file. This allows them to maintain the minimal set of changes from version to version, and to readily show what changes have occurred between any two versions. That works well for source files, which are written in plain text, but not for files which maintain their own complex internal formats, and which generally only become readable when translated by the “mother” application.

VCSs provide for such files, but they mark them as binary, and make no attempt to track the differences between them. Instead, for each version, they simply keep a copy of the entire file. While this is still better by a long stretch than not being able to track, version by version, the history of a document, it makes it impossible for the native VCS to show what the differences between versions actually were.
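To make the contrast concrete, here is a minimal sketch of the kind of line-by-line comparison a VCS can do on plain text, using Python's standard difflib module. The two sample versions and the labels v1 and v2 are invented for illustration; this is not how any particular VCS implements its diff.

```python
import difflib


def line_diff(old_text, new_text):
    """Return the unified-diff lines between two versions of a plain text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile="v1", tofile="v2", lineterm=""))


# Two invented versions of a short document.
v1 = "Dear reviewer,\nPlease read the draft.\nThanks.\n"
v2 = "Dear reviewer,\nPlease read the second draft.\nThanks.\n"

for line in line_diff(v1, v2):
    print(line)
```

Only the changed line shows up with a - and a + marker. Run the same comparison on the raw bytes of two Word files and there are no meaningful lines to compare, which is why the VCS falls back to storing whole copies.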

Other solutions

This problem has been partially addressed before. For those who generate Open Document Text (ODT) files with, for example, OpenOffice, the program odt2txt will extract text from those documents. odt2txt, in turn, enabled oodiff, which is a script to generate a diff from the text of two ODT files, extracted with odt2txt. It’s not a complete solution, because there is still no merge facility, but it at least allows the textual differences between versions to be examined from within the VCS.

The combination of such a file comparison program with a VCS graphical interface presents a much lower barrier to adoption for users from outside the asylum. Because I’m on OS X, I’ve been using Atlassian’s free product, SourceTree, which has the advantage of working with both mercurial (hg) and git. If you’re working with mercurial only, or if you are on a Linux distribution, you can use TortoiseHg as a graphical front end.

That wasn’t quite enough, because my wife, whose requirements got me thinking about this, works mainly with Word files. I needed an equivalent of odt2txt and oodiff for both .doc and .docx files — which have very different formats.

Thank you, Tika

Fortunately, the problem of extracting text from these formats has already been conveniently solved by the Apache Tika project. Tika can extract metadata, plain text, XML and HTML from a dazzling array of file types. For my purposes, plain text and metadata will suffice, with metadata used only for file types, like images, from which text cannot be gleaned. Tika provides not merely a substitute for odt2txt, but an extension to virtually any commonly (and many a not so commonly) used file format. All that remained was to provide a substitute for oodiff.

tikadiff

That’s tikadiff. Given two filenames, it generates a graphical diff of the two files. Given two directories, it generates, for each file in the first, a diff against the file of the same name (if it exists) in the second. tikadiff depends on Tika (obviously) and on a graphical diff program. By default that is kdiff3, but it is currently written to look also for Perforce p4merge, and it can be instructed to use a diff program of your choice, provided that it accepts the same two arguments.
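The directory mode and the search for a diff program can be sketched roughly as follows. This is an illustration in Python, not the tikadiff script itself: the function names pair_files and pick_diff_tool are hypothetical, the comparison is assumed to be non-recursive, and the lookup order (your own choice first, then kdiff3, then p4merge) is my assumption about a sensible default.

```python
import os
import shutil


def pair_files(first_dir, second_dir):
    """Pair each regular file in first_dir with the same-named file in second_dir.

    Files with no counterpart in second_dir are skipped (assumed behaviour).
    """
    pairs = []
    for name in sorted(os.listdir(first_dir)):
        left = os.path.join(first_dir, name)
        right = os.path.join(second_dir, name)
        if os.path.isfile(left) and os.path.isfile(right):
            pairs.append((left, right))
    return pairs


def pick_diff_tool(preferred=None, candidates=("kdiff3", "p4merge"),
                   which=shutil.which):
    """Return the path of the first diff program found on PATH, or None.

    The `which` parameter exists only so the lookup can be faked in tests.
    """
    names = ([preferred] if preferred else []) + list(candidates)
    for name in names:
        path = which(name)
        if path:
            return path
    return None
```

Each pair returned by pair_files would then have its text extracted and the two results handed to whatever program pick_diff_tool finds.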

tikadiff depends also on a number of scripts that are distributed with it.

tika

tika is a convenience script to run the CLI from the tika-app jar file. It passes all its arguments to tika-app. In addition to that basic function, it will preset certain arguments to tika-app depending on the name by which it is invoked.

tikatype —prints the mimetype of the file named in its argument

tikameta —prints the metadata of the file named in its argument

tikatext —prints the plain text extracted from the file named in its argument

tikaxml —prints the XML extracted from the file named in its argument

tikahtml —prints the HTML extracted from the file named in its argument
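The trick of changing behaviour according to the name a script is invoked by can be sketched like this. The options in the table are tika-app's own output switches, but the mapping from alias to switch is my reading of the list above, and the jar location tika-app.jar and the function name build_command are placeholders for illustration.

```python
import os

# Alias name -> tika-app output option (assumed mapping, based on the list above).
PRESETS = {
    "tikatype": ["--detect"],    # print the detected mimetype
    "tikameta": ["--metadata"],  # print metadata only
    "tikatext": ["--text"],      # print extracted plain text
    "tikaxml": ["--xml"],        # print extracted XML
    "tikahtml": ["--html"],      # print extracted HTML
}


def build_command(invoked_as, args, jar="tika-app.jar"):
    """Build the java command line for the name the script was called by.

    Unknown names (such as plain "tika") get no preset, so every
    argument is passed through to tika-app untouched.
    """
    name = os.path.basename(invoked_as)
    preset = PRESETS.get(name, [])
    return ["java", "-jar", jar] + preset + list(args)


print(build_command("/usr/local/bin/tikatext", ["report.docx"]))
```

In a real script the first argument would come from the program name (sys.argv[0] in Python, $0 in a shell script), with each alias being a link to the one script.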

tikserve

tikserve, like tika, is a convenience script to run tika-app. Unlike tika, it is not invoked (except during setup) under its own name, but only through a series of links. It runs tika-app as a server, which performs some operation on any file which is written to its open TCP port.

tikstype —set up a server to return the mimetype of the file

tiksmeta —set up a server to return the metadata of the file

tikstext —set up a server to return the plain text extracted from the file

tiksxml —set up a server to return the XML extracted from the file

tikshtml —set up a server to return the HTML extracted from the file
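From the client side, talking to such a server amounts to writing the file to the port and reading back whatever comes out. The sketch below assumes the simplest possible exchange: send the bytes, close the sending half of the connection to mark the end of the document, then read until the server closes. That protocol, the function name ask_server and the default timeout are my assumptions, not something taken from the scripts.

```python
import socket


def ask_server(port, payload, host="127.0.0.1", timeout=10.0):
    """Send payload to host:port, signal end of input, and return the full reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        # Half-close: tells the server the whole document has been sent,
        # while leaving the connection open for the reply.
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # server closed its side: reply is complete
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

With a text server listening on, say, port 1024, the extracted text of a document would be obtained by reading the file in binary mode and passing its contents as the payload.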

tikadiff uses tikstype, tiksmeta and tikstext only when comparing directories, on the theory that it will be faster to use a server when testing and comparing multiple files. I have not checked whether this theory is valid.

In order to run the server(s), tikserve needs a port on localhost. To support this habit, tikserve looks to two other scripts.

freeport

freeport hands back the next available port on localhost, starting at 1024. You may optionally give it a minimum port number and, sub-optionally, a maximum port number as constraints. In order to perform this task, freeport requires—

localhostports

localhostports prints the ports which, in the opinion of netstat, are currently associated with localhost.
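How the two scripts fit together can be sketched in a few lines. This is an illustration only: the sample netstat lines are invented, the regular expression assumes localhost appears as 127.0.0.1 or localhost followed by a colon (Linux style) or a dot (OS X style) and the port, and the function names are hypothetical.

```python
import re

# A localhost address followed by "." or ":" and a port number.
_LOCAL = re.compile(r"(?<![\d.])(?:127\.0\.0\.1|localhost)[.:](\d+)\b")


def localhost_ports(netstat_output):
    """Return the set of port numbers that appear next to a localhost address."""
    ports = set()
    for line in netstat_output.splitlines():
        for match in _LOCAL.finditer(line):
            ports.add(int(match.group(1)))
    return ports


def free_port(used, minimum=1024, maximum=65535):
    """Return the lowest port in [minimum, maximum] not in `used`, or None."""
    for port in range(minimum, maximum + 1):
        if port not in used:
            return port
    return None


# Invented sample, mixing the two address styles.
sample = """\
tcp4  0  0  127.0.0.1.1024    *.*          LISTEN
tcp   0  0  127.0.0.1:1025    0.0.0.0:*    LISTEN
tcp   0  0  192.168.1.5:1026  0.0.0.0:*    LISTEN
"""

print(free_port(localhost_ports(sample)))
```

Port 1026 is in use in the sample, but not on localhost, so under these assumptions it is the first port handed back. A port can of course be taken between the check and the moment the server binds to it, so the result is a good guess rather than a guarantee.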