I'm experimenting with pdftohtml, but it occasionally fails to parse tables correctly. It groups the text from two columns into a single cell, which breaks my attempts to parse the resulting data!

Note that this occurs only once or twice within a PDF and is quite unpredictable.

I've tried the latest versions of pdftohtml (including the 0.40a beta), but to no avail.

Is anyone aware of any Linux-compatible equivalents that might be worth trying?

Have you submitted a bug report? PDFs are notoriously difficult to parse, and an incredible amount of time has gone into the poppler tools. Your best bet might be to see what you can do to help upstream.
– efrey May 15 '12 at 14:11