K.Mandla's blog of Linux experiences

Four console document converters

I am cursed as a member of middle management, which means part of my responsibility in real life is to filter the resumes of potential employees. It does occasionally give me something to laugh about, but mostly it just gives me the chance to mess with document converters.

By and large, most of the resumes I get are in Microsoft Word format, which means antiword is almost a necessity. antiword is probably my favorite for the transition to plain text, mostly because its requirements are extremely light — only glibc, if that can be believed.

At the same time it has enough frills to keep it usable, including showing hidden text or document revisions, formatting for specific outputs and so forth. Again, as I am usually just looking for something amusing in some poor innocent applicant’s resume, I don’t put a whole lot of demands on it. And if you look at its home page, it’s been ported to just about every architecture from the Amiga up, which not a lot of software can say.

On the other hand, it is slowly becoming out-of-date, with version 0.37 stamped October 2005. It hasn’t failed me yet though, which probably speaks more to the efforts of applicants, who probably pick a safe file format to make sure their resume is readable. How quaint. … :twisted:

As an alternative, wv2 is a lot more recent, with its last update in October 2009. It may be newer, but in my book it has a few strikes against it. First, it seems to require some outside programs to do the job: every time I try to convert a doc file to text it clamors after something like elinks or lynx, and I don’t know enough about its mechanics to say why.

On top of that, it has libgsf as one of its dependencies, and anything with the word “Gnome” in it is anathema to me. I know, I’m being superficial, but antiword already does this job for me rather well, and installing something heavier to do the same job goes against my principles. :roll:

Very rarely I get a docx file, and when that happens I have to rely on docx2txt to get at the guts of it. docx2txt is a Perl script that apparently, mystically, magically relies on nothing else to get the job done. It does that job very well though, and can carry along some of the document information or even some formatting, if told to.

I’ve only needed it once or twice, but in neither of those situations did it give me any hassles. Again, it wasn’t doing anything mission-critical and it may be that it strikes you as rather light on features, but in the case of simple document conversion, it did what it claimed it would do.
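For the curious, the reason a small script can handle docx at all is that a docx file is just a zip archive, with the main body text stored as XML in word/document.xml. Here is a minimal Python sketch of that idea. It is not how docx2txt itself is written, the function name docx_to_text is made up for illustration, and it ignores headers, footnotes, tables of contents and anything else fancier than plain paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` may be a filename or a binary file-like object.
    docx_to_text is a hypothetical helper, not part of any tool named here.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # w:t elements hold the visible text runs of a paragraph
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Calling docx_to_text("resume.docx") (the filename is only an example) would give back the paragraphs as a single string, ready for grep or a pager.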

I have never gotten a resume in OpenOffice format, but when I do, I shall hire that person on the spot. Provided I can read their resume of course, and in that case I’ll probably rely on o3read. I’ve used o3read plenty of times in the past and it always pleases me to see it spin quickly through a document, spitting out the good stuff and tossing the bad stuff down /dev/null. Or wherever it goes.

If you check the link you’ll see that the source for o3read is the home of Siag Office, which is probably one of the lightest full-featured office suites you can get for nothing. If you don’t live at the command line you might want to take a look at Siag, because while it’s not particularly gorgeous, it’s impressive as an example of lightweight graphical software.

o3read is similarly clean-cut and to-the-point. It can handle sxw and sxc files and does a very nice job separating out the information I want, and giving it to me in a manageable form. Best of all, you don’t have to install OpenOffice.org to use it. It’s the little things in life that you must savor. … ;)
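Part of why a tool like this can stay so small is that an sxw file is also a zip archive, with the document text sitting in a file called content.xml. The Python sketch below shows the general idea only. It is not o3read's actual code, the name writer_to_text is invented for this example, and it only looks at paragraph and heading elements, so spreadsheets and anything nested in frames are out of scope.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces for text elements in the two Writer file generations
TEXT_NAMESPACES = (
    "http://openoffice.org/2000/text",                 # OpenOffice.org 1.x (.sxw)
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",  # OpenDocument (.odt)
)

# Fully qualified tag names for paragraphs (text:p) and headings (text:h)
PARAGRAPH_TAGS = {
    "{%s}%s" % (ns, name) for ns in TEXT_NAMESPACES for name in ("p", "h")
}


def writer_to_text(source):
    """Return the text of an .sxw or .odt file, one line per paragraph.

    `source` may be a filename or a binary file-like object.
    writer_to_text is a hypothetical helper, not part of o3read.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    lines = []
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            # itertext() also picks up text inside nested spans
            lines.append("".join(element.itertext()))
    return "\n".join(lines)
```

Matching on both namespaces is a design choice for the sketch, so the same few lines cope with the older sxw files and the newer odt ones.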

That’s about it. Between two or more of those tools I can generally accomplish my real-life job requirements, and perhaps get a small bit of comedy on the side. The moral of the story is, don’t send me your resume. :mrgreen:

10 thoughts on “Four console document converters”

For the OpenOffice odt to text conversion, in Debian we use odt2txt. It says that it can handle the “new” .odt format and the old .sxw. It also extracts text from spreadsheets and presentations. The old o3read only converts .sxw and similar “old format” files (or at least that is what its website says).

That’s the problem with management, you can never please everyone. You prefer plain text resumes (and I admire you for that,) but others might think of using plain text as a sign of being unprofessional, lazy, or lacking in technical skills. Personally, if I ever had to apply for a job, I’d send my resumes in many different formats and let the employer take their pick.

Also, I once thought of using troff and formatting my resume like a man page (‘career objectives’ in the SYNOPSIS section, ‘skills’ in the OPTIONS section, etc.,) but I never got around to doing it. What would you think if someone did that? ;)

Speaking of document converters, do you know of any programs that convert pdf to text?

I have to say I’m impressed with wv, it does a much better job of converting docs than catdoc. Plus it can do latex, html, pdf etc, which is handy if the file has images in it (not that a resume would :P but some of the files I get sent do). And as far as gnome dependencies go, libgsf isn’t that bad as it doesn’t drag the rest in with it.

On Arch it is the package wv, not wv2, that has the executables; wv2 seems to be a library?

To be honest, I wasn’t exactly sure. The PKGBUILD points at the wv2 pages, unless I am mistaken. But both projects seem to be linked, so it may be that installing one implies the other, or that one has functions that overlap the other. You might have to peek at the PKGBUILDs to be sure.

Both PKGBUILDs point to the wvWare project page, but I have been using wv without wv2 installed, so while they are from the same project they are independent packages. If you look at the file list for wv2 on the Arch packages page it is just a bunch of header files and a .so.