PDFReflow is a utility that reflows PDF text. It takes a PDF document as input, reflows the text, removes page numbers, headers, footers, and hyphenation, and writes the result as an HTML file.

The reflow logic for PDFReflow is in the command line utility pdfreflow. The graphical user interface also uses the pdftohtml command to generate the XML input for pdfreflow.

Graphical User Interface

There is now a graphical user interface for Windows, Mac, and Ubuntu (actually, any platform that runs Java).

Windows: Download the attached PDFReflow-0.8.6.1-Setup.zip. Extract PDFReflow-0.8.6.1-Setup.exe from the zip file, and run it. It contains all the necessary binaries (i.e. pdfreflow.exe and pdftohtml.exe). Requires Windows XP or newer. You must have Java installed. You can download Java at http://java.com/download.

Ubuntu, Linux: Download PDFReflow-0.8.6.1.jar.zip. Extract PDFReflow-0.8.6.1.jar from the zip file. You must separately download the pdftohtml and pdfreflow command line utilities, and they must be in your path. Java must be installed. Download Java at http://java.com/download. To run:
java -jar PDFReflow-0.8.6.1.jar

The Command Line Interface

In the attached pdfreflow-0.8.6.zip file is a version that will run under Windows XP, Mac OSX 10.5 Leopard, and Ubuntu 8.04 (and later). There is also pdfreflow.html, which is documentation of how to use the command, and how to find a prebuilt version of pdftohtml from Poppler (http://poppler.freedesktop.org/).

pdfreflow is open source, licensed under the GNU GPL, and the source is available at SourceForge (http://sourceforge.net/projects/pdfreflow/).

Synopsis

pdfreflow [options] [filename]

Description

Pdfreflow, in conjunction with pdftohtml, will convert a PDF into a reflowed HTML file. Pdfreflow operates on the XML output from pdftohtml (from the Poppler (http://poppler.freedesktop.org/) utilities), converting it into an HTML file. To get the XML input for pdfreflow, use pdftohtml as follows:
pdftohtml -xml mybook.pdf
The output of pdftohtml is in the file mybook.xml.

General Usage

Pdfreflow is designed for text-only ebook PDFs with minimal formatting, the kind of formatting you would find in a novel. By default pdfreflow expects justified text, but you can specify that the input is rag right with the following option:
pdfreflow --ragright mybook.xml
The output of pdfreflow is in the file mybook.html.

You might not want to reflow every page in your ebook. To specify which pages are NOT to be reflowed, use the following option:
pdfreflow --dontreflow="1-6,10,198-201" mybook.xml
The --dontreflow option takes a comma separated list of page ranges. The first page in a book is page 1. Note that the page number is not the printed page number, but the page number shown in the thumbnail view of PDF viewers such as Acrobat, Preview, and Evince.
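To make the page range format concrete, here is a small Python sketch of how a spec such as "1-6,10,198-201" expands into individual page numbers. This is only an illustration of the format described above, not pdfreflow's own parser; the function name parse_page_ranges is made up for this example.

```python
def parse_page_ranges(spec):
    """Expand a spec such as "1-6,10,198-201" into a sorted list of page numbers.

    Hypothetical helper for illustration only; it mirrors the format
    described for --dontreflow but is not taken from pdfreflow's source.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(first, last + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_ranges("1-6,10,198-201"))
```

A check like this can be handy for confirming that a range covers the thumbnail page numbers you intended before running pdfreflow.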

Cropping

While pdfreflow does its best to remove page numbers, headers, and footers, you may have to assist by specifying the cropping options, --top=TOP_Y and --bottom=BOTTOM_Y. To find the Y values of a header or footer, look inside the .xml file and find the line of text that contains the header or footer. A sample entry looks as follows:
<text top="36" left="203" width="209" height="11" font="0">Self Knowledge</text>
⋮
<text top="506" left="506" width="209" height="11" font="0">Self Realization</text>

pdfreflow --top=36 --bottom=506 mybook.xml
In this example, every text line that has a "top" value less than or equal to 36 will be cropped, and every text line that has a "top" value that is greater than or equal to 506 will be cropped.
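Reading the .xml by hand gets tedious, so here is a rough Python sketch that lists the "top" value of each text line and applies the same rule as above (top less than or equal to --top, or greater than or equal to --bottom, is cropped). It assumes the text elements sit inside page elements carrying a number attribute; the sample document and the helper names list_lines and is_cropped are invented for this example and are not part of pdfreflow.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of the entries shown above; the <page>
# wrapper with a "number" attribute is an assumption about the XML layout.
SAMPLE = """<pdf2xml>
<page number="1" height="792" width="612">
<text top="36" left="203" width="209" height="11" font="0">Self Knowledge</text>
<text top="80" left="72" width="400" height="11" font="0">Body text on page one.</text>
<text top="506" left="506" width="209" height="11" font="0">Self Realization</text>
</page>
</pdf2xml>"""


def list_lines(xml_text):
    """Return (page, top, text) for every <text> element, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            content = "".join(text.itertext())  # includes text inside <i>, <b>
            rows.append((int(page.get("number")), int(text.get("top")), content))
    return rows


def is_cropped(top, crop_top=None, crop_bottom=None):
    """Rule from the text: top <= --top or top >= --bottom is cropped."""
    if crop_top is not None and top <= crop_top:
        return True
    if crop_bottom is not None and top >= crop_bottom:
        return True
    return False


kept = [text for _, top, text in list_lines(SAMPLE) if not is_cropped(top, 36, 506)]
print(kept)
```

Printing list_lines() for a couple of pages makes it easier to spot which "top" values repeat as headers or footers.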

Centered Text

Pdfreflow does its best to detect centered text. Sometimes, especially with rag right text, it is hard to detect the center point. To improve the center detection, you can point pdfreflow at a centered line in your document by giving its page number and line number. For example, if the 2nd line on page 3 is a centered line, pass page:line as the argument to the --center option as follows (page numbers and line numbers both start at 1):
pdfreflow --center=3:2 mybook.xml
To discover the line number to specify for the --center option, you can use the --print option to print the contents of a page, with line numbers, to standard error.
pdfreflow --print=3 mybook.xml

Reflow Specified Pages

It is also possible to reflow only a subset of the ebook by specifying the --first=FIRSTPAGE and --last=LASTPAGE options. This is useful if a book has sections with vastly different formatting. Create a separate HTML file for each differently formatted section, then concatenate the files. If you are creating an e-book, concatenation is not necessary, because ebook creation software can take multiple HTML files as input.
pdfreflow --first=1 --last=100 mybook.xml
cp mybook.html section1.html
pdfreflow --first=101 --last=200 mybook.xml
cp mybook.html section2.html

Files

If the filename command line argument is specified, the file suffix is replaced with .html and the output is written to that file, i.e. an input file of mybook.xml produces an output file of mybook.html. If no input file is specified, standard input is used as the input and standard output is the output.
pdfreflow < mybook.xml > out.html
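The file naming across the two steps (mybook.pdf to mybook.xml to mybook.html) can be sketched in Python as follows. This only builds the two command lines described in this post and derives the expected output name; the helper name pipeline_commands is hypothetical, and nothing is executed until you pass the lists to something like subprocess.run.

```python
from pathlib import Path


def pipeline_commands(pdf_path, options=()):
    """Build the pdftohtml and pdfreflow command lines for one book.

    Hypothetical helper: it returns the commands and the expected HTML
    output path, following the suffix replacement described above.
    """
    pdf = Path(pdf_path)
    xml = pdf.with_suffix(".xml")    # pdftohtml -xml writes mybook.xml
    html = xml.with_suffix(".html")  # pdfreflow writes mybook.html
    commands = [
        ["pdftohtml", "-xml", str(pdf)],
        ["pdfreflow", *options, str(xml)],
    ]
    return commands, html


commands, html = pipeline_commands("mybook.pdf", ["--ragright"])
for command in commands:
    print(" ".join(command))
print("expected output:", html)
```

To actually run the steps, each list can be passed to subprocess.run(command, check=True), assuming both tools are in your path.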
Options
Here is the usage output for pdfreflow.
usage: pdfreflow [options] [inputfile]
Options:
      --absolute          font sizes are the same as the original document
                          (not the default)
  -b, --bottom=MAXTOP     crop text whose top is greater than or equal to maxtop
  -c, --center=SPEC       argument is page:line, ie 2:1 is line 1 on page 2
                          is a centered line (sometimes this hint is needed)
  -d, --dontreflow=PAGES  don't reflow comma separated page ranges,
                          i.e. "1,2,4-9,100"
  -f, --first=FIRSTPAGE   starting page (default is 1)
  -l, --last=LASTPAGE     ending page (default is last page of the document)
      --nonfiction        for books that use block quoting at the same
                          inset as the paragraph indent
  -r, --ragright          text is rag-right, NOT justify (default is justify)
  -t, --top=MINTOP        crop text whose top is less than or equal to mintop
      --shortlines        paragraphs end with short lines (only necessary
                          for rag right documents with no paragraph
                          indent and no after paragraph spacing)
      --showdebug         print debugging options
  -v, --version           print current version
  -?, --help              print this help

Example
Options can be combined. An example using a combination of the options in the description section is:
pdfreflow --dontreflow="1-6,10,198-201" --top=36 --bottom=506 mybook.xml

Troubleshooting

While pdfreflow tries its best, sometimes it cannot correctly reflow a document. Here are some tips to get a better output document.

Paragraphs are too large

If your book does not have paragraph indenting or vertical spacing after every paragraph, too much text may be reflowed into each paragraph. Try the --shortlines option. The argument is a percentage between 1 and 100; if 0 is specified, you get the default value (currently 80). The percentage is applied to the longest line width in the document, and any line shorter than the result is considered the end of a paragraph.
pdfreflow --shortlines=0 mybook.xml
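As an illustration of the short line rule, here is a minimal Python sketch that groups lines into paragraphs using a percentage of the longest line width. It is a simplified reading of the description above, not pdfreflow's actual code: the function name split_paragraphs and the sample widths are invented, and the longest line is taken from the lines passed in rather than from a whole document.

```python
def split_paragraphs(lines, percent=80):
    """Group (text, width) lines into paragraphs; a short line ends a paragraph.

    Illustrative only: a line is "short" when its width is less than
    `percent` of the longest width among the given lines.
    """
    if not lines:
        return []
    if percent == 0:
        percent = 80  # 0 selects the default, as described above
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100 (or 0 for the default)")
    longest = max(width for _, width in lines)
    threshold = longest * percent / 100
    paragraphs, current = [], []
    for text, width in lines:
        current.append(text)
        if width < threshold:  # short line: close the paragraph here
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines that never hit a short line
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    ("It was a dark", 400),
    ("and stormy night.", 210),
    ("The rain fell", 398),
    ("in torrents.", 150),
]
print(split_paragraphs(sample, 0))
```

With the sample widths, the threshold is 320, so the 210 and 150 wide lines each close a paragraph. Raising the percentage makes more lines count as short and so produces more, smaller paragraphs.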
Paragraphs are incorrectly reflowed

If your input document is not justified, make sure you specified the --ragright option.

Pdfreflow is configured to deal with fiction, which often has indented paragraphs and/or vertical spacing after a paragraph. If your book has indenting, but is not fiction with dialog, try using the --nonfiction option.
pdfreflow --nonfiction mybook.xml
If your book has vastly differently formatted sections, see the Reflow Specified Pages section above.

Limitations

Only simple book formats are supported. This is not a general purpose reflower for an MS Word or desktop publishing document.
Pictures are not supported.
Multiple columns are not supported.
Footnotes will cause problems. At this point they just show up wherever they are in the paragraph, potentially splitting a paragraph into two pieces.

Getting pdfreflow

There are binaries for Windows XP, Ubuntu 8.04, and Mac OSX 10.5 (and later) attached to this post. pdfreflow is open source, licensed under the GNU GPL, and the source is available at SourceForge (http://sourceforge.net/projects/pdfreflow/).

Getting pdftohtml

To get a copy of pdftohtml, without building it from source, here are some options:

Ubuntu: Use Synaptic Package Manager to fetch poppler-utils

Macintosh: Download Calibre for Mac. There is a copy of pdftohtml inside Calibre.app under /Applications/calibre.app/Contents/Frameworks/
PATH=$PATH:/Applications/calibre.app/Contents/Frameworks
pdftohtml -xml mybook.pdf

Windows: Download Calibre for Windows. There is a copy of pdftohtml inside Calibre under C:\Program Files\Calibre2. Make sure to add C:\Program Files\Calibre2 and C:\Program Files\Calibre2\DLLs to your path, i.e.:
PATH=%PATH%;C:\Program Files\Calibre2;C:\Program Files\Calibre2\DLLs
pdftohtml -xml mybook.pdf

Prana

frabjous

05-10-2010, 06:13 PM

Thanks for uploading this.

Poppler's pdftohtml is what calibre uses when converting PDFs, isn't it? So the output should be a lot like calibre's, except it'll give you HTML, which is nice. (You'd have to resort to some workarounds to save it as html using calibre alone...)

Or does someone know better than I?

I assume the Ubuntu executable will work under 10.04 too?

Pranananda

05-11-2010, 12:01 AM

frabjous,

Yes, calibre uses pdftohtml. People who don't want to build pdftohtml from source can use the copy found inside of calibre. But if you are on Ubuntu, you can use Synaptic to install poppler-utils.

The output of pdfreflow is going to have multiline paragraphs rather than the 1 line paragraphs of the default pdftohtml.

The Ubuntu executable will also run on 10.04 (I just tried it.)

frabjous

05-13-2010, 01:19 AM

Nice tool.

It doesn't seem to work with double spaced PDFs, though you're aiming towards PDF ebooks which are unlikely to come double spaced. (This problem may be on the pdftohtml end... not sure.)

roger64

05-13-2010, 03:17 AM

It did process something but I am a little dumb how to use it. I put it in the same folder as my working file to begin with. I thought it could not harm. :o

But I do not know what to do with the resulting xml file. Sorry for that.

Pranananda

05-13-2010, 04:09 AM

Frabjous,

Yes, pdfreflow is not going to like double spaced text, as it will see the double spaced lines as new paragraphs. I can add an option to make this work though.

--update

I think I can make this work without adding a new option, but just detect that the lines are double spaced. I'll put this in the next update.

Pranananda

05-13-2010, 04:15 AM

Roger64

Try pdfreflow Bowden\,\ Mark\ -\ Killing\ Pablo.xml

Or perhaps ./pdfreflow Bowden\,\ Mark\ -\ Killing\ Pablo.xml

Also, read the pdfreflow.html to see the command line options.

roger64

05-13-2010, 07:04 AM

Thank you :thanks:

pdfreflow is amazingly quick and efficient for reflowing text-based PDF thru xml !!
A great tool. Congratulations and thanks. :thumbsup:

PS: With Ubuntu I needed to use: ./pdfreflow ...:o

frabjous

05-13-2010, 01:04 PM

I think I can make this work without adding a new option, but just detect that the lines are double spaced. I'll put this in the next update.

That would be excellent. Don't rush on my part, though...

Pranananda

05-13-2010, 05:36 PM

I've posted pdfreflow-0.8.4.zip to the original post. Here are the release notes:

now building for Ubuntu 8.04 Hardy Heron and Mac OSX 10.5 Leopard (and later)
documents using double spaced lines are supported
don't print all debug options in --help, but added --showdebug option instead
documents with only large fonts would not reflow correctly
added --lineheight debugging option to print line height frequency

frabjous

05-13-2010, 10:51 PM

Seems to work well now with double-spaced PDFs... or at least the one I tried on. Thanks a lot!

My dream tool for something like this would be able to recognize footnotes and treat them appropriately, but knowing how PDFs work (and the fact that they don't semantically mark footnotes as such), this is probably a pipe dream.

A more reasonably accomplished feature would break up typographical ligatures, though I could script this myself easily enough with sed or similar.
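For what it's worth, the ligature idea could be sketched in Python as a simple character translation over the HTML text. This is only an illustration of the approach mentioned above, not a pdfreflow feature; the table covers just the common Latin f-ligatures from the Unicode presentation forms block, and the name expand_ligatures is made up.

```python
# Common Latin f-ligatures (Unicode Alphabetic Presentation Forms).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with their plain-letter equivalents."""
    return text.translate(_TABLE)


print(expand_ligatures("\ufb01nal o\ufb03ce \ufb02ow"))
```

The same table could be applied to the whole output file after pdfreflow has run.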

roger, if you want to be able to use it without using ./ before it, just copy the executable into your PATH, such as into the ~/bin/ folder (restart bash if need be).

roger64

05-14-2010, 11:26 AM

roger, if you want to be able to use it without using ./ before it, just copy the executable into your PATH, such as into the ~/bin/ folder (restart bash if need be).

Thanks for the tip.

Pranananda

05-22-2010, 08:44 PM

I've posted pdfreflow-0.8.5.zip to the original post. Here are the release notes:

No more small fonts! The HTML output now uses relative font sizes, i.e. font-size=120% versus font-size=12px. It's possible to get the previous behavior with absolute font sizes using the --absolute flag. See pdfreflow.html for more info.
Sometimes pdfreflow can't find the center X position of the page. This happens with rag right documents. Added a way to specify the center X position to use for centered paragraphs with the --center=line_spec option. See pdfreflow.html for more info.
The HTML output and embedded CSS styles are simpler because the default font is used more often.
Fixed lots of bugs, updated pdfreflow.html

Pranananda

05-24-2010, 03:50 AM

I've posted pdfreflow-0.8.6.zip to the original post. Here are the release notes:
Added --shortlines=PERCENT option, to help with documents that don't use indented paragraphs and don't have vertical spacing after paragraphs. The argument is a value between 1 and 100. If 0 is specified, the default value is used (80%). If a paragraph has a line that is less than the specified percentage of the longest line, it will be considered the end of a paragraph. This option is only necessary for poorly formatted fiction books, or perhaps for ebooks that are oriented for very tiny screens and don't want to waste any vertical spacing or lose the space from the paragraph indent.
Added --nonfiction option, to specify that short lines don't necessarily mean end of paragraph. This is not necessary for typical fiction books that use either indented paragraphs or have vertical spacing after a paragraph. It is necessary for books that use block quoting that has an inset margin that is the same as the paragraph indent.
Added --print option, to print out the contents of a single page with line numbers, to standard error. This is useful for determining the line number argument for the --center option.
And bug fixes, of course!

greenapple

05-24-2010, 10:15 PM

This sounds like a very useful tool. Could you also make a front-end, windows GUI for this? I'm not very good with DOS stuff. Thanks.

Pranananda

05-26-2010, 05:39 AM

Hi greenapple,

I will try to get out a GUI front end for Windows within a week or so. I'm having to do this in java, and it's been a long time since I did any java programming.

greenapple

05-26-2010, 06:15 AM

Thanks, Pranananda. Looking forward to it! :)

Pranananda

05-27-2010, 04:22 PM

There is now a graphical user interface. You must have Java installed to run this interface, which you can get from http://java.com/download.

See the original post for the binaries and instructions.

For Windows, the zip file contains all the binaries you need: pdfreflow.exe, pdftohtml.exe, and the Java jar file.

For other platforms, you must already have pdfreflow and pdftohtml installed, and they must be in your path.

There is a Help button that will bring up some online help.

http://i1026.photobucket.com/albums/y324/pranananda/PDFR0860win.jpg

jackie_w

05-27-2010, 06:44 PM

This looks like a promising new PDF utility, Pranananda. Thank you for your hard work. :)

Fat Abe

05-28-2010, 10:11 PM

:) Pranananda, the exe version you have on sourceforge seems to differ from the version you posted on MR. However, the newest version (0.8.6), with a GUI:thanks:, is pretty good. I reflowed a pdf novel in 10 minutes flat, 9.5 of which were spent editing/proofing the resulting html file. The only fix, related to pdfreflow, was to change the style of p2 to text-align: center. The culprit Xml line was as follows:

For some strange reason, there was no fontspec id="5" in the xml file, so I'm not sure how you interpreted the above.

Pranananda

05-29-2010, 04:39 AM

jackie_w & Fat Abe, thanks for the positive feedback.

Abe, I just downloaded the zip files here on the original posts, and they do have the correct version (0.8.6) in them. The build times might be different because of my non automated techniques. But I did run the --version, and it reported 0.8.6.

If people are having PDFs that should work but don't, I would love to hear about it and perhaps even get the PDF that is showing any defect in the reflow logic.

Pranananda

05-29-2010, 05:20 AM

There is now an installer for a Macintosh user interface. It runs on Mac OS X Leopard and Snow Leopard, and it is in PDFReflow-0.8.6.1.dmg.zip.

The Windows version and the Ubuntu version of the user interface have been updated - PDFReflow-0.8.6.1-Setup.zip for Windows, and PDFReflow-0.8.6.1.jar.zip for Ubuntu.

The command line version remains unchanged.

The help has been corrected and enhanced on the user interface.

http://i1026.photobucket.com/albums/y324/pranananda/screenshot.png

Fat Abe

05-29-2010, 03:44 PM

Wow, this keeps getting better and better. How about adding a box for font family, and a set of presets to automate the conversion process? I thank you for the effort you have put into the program. It is a godsend for those of us who are given documents in pdf format, but have to read them on small form factor eReaders.

jackie_w

06-03-2010, 06:27 PM

Hi Pranananda,

I'm having a few problems with a PDF I'm trying to reflow.

In the output HTML, the first few lines of each chapter are out of sequence. Also, sometimes a multi-line chapter heading has its words out-of-sequence. I have attached a 2-page extract PDF which demos the problems. I would be grateful if you could find time to look at it and advise if/where I may be going wrong.

I only set 2 parameters: Crop top = 98 and Crop bottom = 591

It doesn't surprise me that the chapter's initial DropCap might cause a problem, but the first few lines seem to be in the correct sequence in the XML but not the HTML.

In addition, an unrelated minor problem I have found is that in the reflowed HTML, the opening <body> tag seems to have been output as a closing </body> tag, i.e. the file has 2 closing body tags and no opening body tag. I assume this is a coding typo.

As presented, the page number 88 is specified on the 3rd line above, but is actually the last line of page 1. I have not looked at the source code for pdfreflow, but the actual line order that it should have decoded from the xml are the top locations 45, 205, 221, 248, 275, etc. However, the line heights of the sequence:

THE JOURNEY FROM
PLATFORM NINE
AND THREE-QUARTERS

cause the rendered sequence to be

THE JOURNEY FROM
AND THREE-QUARTERS
PLATFORM NINE

Just manually edit the xml file, and change the font size from 3 to 2 (in these lines), and then it will be in order again. Manually reorder the lines at top="299" and top="591". At top="463", there is a line height jump to 20 instead of the usual +16 due to an oversized font.

After analyzing the xml file (which is a product of pdftohtml), I can sympathize with those developers who are working on pdf re-flowers. They seem to have to do some form of layout decoding and correction, as well as sorting and correction, to produce a perfect result.

These funny font sizes make the lines intersect each other, and the corrections above avoid this issue. There is also another pdftohtml problem with this line:
<text top="483" left="47" width="351" height="14" font="5"><i>Magic.</i> His school books were very interesting. He lay on his bed </text>
having a smaller height than the other lines.

But there are also wrapping bugs in pdfreflow. It is wrapping too much text into some paragraphs, and getting confused about the start of new paragraphs, partly because of the drop cap.

I am away from my home and home computer until this Sunday, and it may be a week after I return before I put out a version with a bug fix.

jackie_w

06-04-2010, 06:47 AM

@Abe and Pranananda, Thank you for taking the time to explain to me.

I look forward to the next release.

humore

07-11-2010, 01:30 PM

It's the best I've ever seen! Being able to eliminate headers/footers and reflow the text, it really is one of a kind! Thaaaank you, Pranananda!

Toxaris

08-30-2010, 05:24 AM

Looks very very good!

amoroso

08-30-2010, 05:23 PM

I installed PDFReflow 0.8.6.1 on a Fedora 11 Linux system (I already had poppler-utils), but I am unable to generate HTML output. If I run

pdftohtml -xml mybook.pdf

and then

pdfreflow mybook.xml

the PDFReflow GUI starts. If I then select the PDF document in the GUI and click Reflow, a new GUI instance is started.

When I directly run the GUI, select the PDF document and click Reflow, a new GUI instance is started.

In all these cases, no HTML output is generated.

Pranananda

09-03-2010, 11:51 AM

Paolo,

If you are running under Linux, the lowercase pdfreflow command should run the command line interface. The GUI is in PDFReflow.jar, and requires you to run it as:

java -jar PDFReflow-0.8.6.1.jar

Could it be that you have another pdfreflow command? In the terminal, type:

which pdfreflow
type pdfreflow

to see the full path of the command.

If you created a shell script to invoke the GUI, this could be the problem. The lowercase pdfreflow command must be in your path, and it must be the command line version.

amoroso

09-03-2010, 04:46 PM

If you created a shell script to invoke the GUI, this could be the problem. The lowercase pdfreflow command must be in your path, and it must be the command line version.

I made two mistakes: 1) I forgot to download pdfreflow 2) I created a shell script named pdfreflow that ran the PDFReflow JAR. Installing pdfreflow and renaming the script solved my problem. PDFReflow works great, thanks.

Raulnuto

11-02-2010, 04:55 PM

Hello, is there any other tool which supports also pictures? pdfreflow simply removes them...

Thanks a lot, I found the information here very useful

Francesco

12-19-2010, 11:46 PM

pdfReflow + calibre = good job!
pdfReflow does a great job removing headers. Even though the result might not be too pleasing to the eye (I opened the resulting html in OpenOffice and didn't like what I saw: quite a few different font sizes, lots of space between paragraphs, etc), Calibre produced a quite decent document. This is cool stuff, thanks!

By the way, I used the GUI, haven't tried the command line.

michaelbr

01-27-2011, 09:55 PM

Hi Pranananda,
Thanks for this great job. I think I found two bugs in pdfreflow; could you please confirm them or tell me where I can report them? One suggestion: it would be great if the open file button remembered the last location (when you have some spare time, of course; this is just a nice feature).
1) There's a </body> at the top of the generated html file. I think this should be an opening <body> instead of a closing one (almost at the end, there's another </body>).
2) Sometimes a </p> is missing. For instance, a chapter sometimes has a closing </p> tag and sometimes doesn't (so 2 adjacent paragraphs are merged together; this seems random).

KevinH

01-30-2011, 05:20 PM

Hi,

Just found pdfreflow and was amazed at how good a job it does using the additional xml information available from the pdftohtml program.

I would very much like to see/study the source code to see if inter-paragraph spacing can be improved and how you detect paragraph starts but the svn command on sourceforge produces nothing.

Is there a tar.gz version of the latest java source available someplace? Or a new place to checkout the code from?

Very nice work btw!

KevinH

Pranananda

02-01-2011, 01:05 AM

@michaelbr, thanks for the bug report. I will incorporate bug fixes you reported into the next version of pdfreflow.

Pranananda

02-01-2011, 01:08 AM

@KevinH,

The source code for the command line is on sourceforge, and there is a link to the source code on the original post. I haven't put the Java code out there, because the building procedure is so different for the Mac and Windows platforms. But all the logic is in the command line, so you can build the command line program from the available source and substitute it for the one the GUI is using (though this doesn't allow you to change the GUI, I realize).

KevinH

02-05-2011, 12:22 AM

Hi,

Okay, I found the C source code for pdfreflow and looked it over, very nice job!

Do you have any plans to convert your styles from absolute margins using "px" values to relative values using either "em" or "%" as the default?

Given the page width in px is available, it would seem to be possible to scale things just before writing it to the file replacing margin-left px (and margin-right if needed) with % of width. This would allow better reflowing on smaller devices since larger fixed px margins can be a real pain for many mobile devices. If you change it just when written to the file it should change nothing else internally so no other code need change.
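To illustrate the scaling I have in mind, here is a tiny Python sketch of the px to percent conversion. It is not taken from the pdfreflow C source; the function name px_to_percent and the example numbers are made up, and it simply divides a margin by the page width, both in px.

```python
def px_to_percent(margin_px, page_width_px):
    """Express a pixel margin as a percentage of the page width.

    Hypothetical helper for illustration; rounds to one decimal place.
    """
    if page_width_px <= 0:
        raise ValueError("page width must be positive")
    return f"{100.0 * margin_px / page_width_px:.1f}%"


# e.g. a 153px left margin on a 612px wide page
print("margin-left:", px_to_percent(153, 612))
```

Doing the conversion only at the point where the style is written out would leave the internal px based layout logic untouched.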

Is the style Rect r.width for a paragraph aware/set at all by the page width or is it simply the width of the text in the paragraph? What would be the easiest way to access the page width in the htmlprintstyle routine to make the conversions?

If I enter "pdfreflow --top=36 --bottom=743 *.xml", the application processes the first file it encounters, and stops.

I have very little experience at the command line. Am I doing something wrong?

frabjous

11-03-2011, 09:16 AM

Kevin8or,

Probably best to use a for-loop, e.g., on mac or linux:

for file in *.pdf ; do pdftohtml -xml "$file" ; done

(Can't remember the right syntax for Windows off the top of my head, but I'll look it up if need be.)
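Since the synopsis is pdfreflow [options] [filename], with a single filename, the same one-file-at-a-time idea applies to the .xml files. Here is a rough Python sketch that builds one pdfreflow command per .xml file. The helper name reflow_commands is made up, the crop values are just the ones from the question, and nothing is run until the commands are passed to subprocess.

```python
def reflow_commands(filenames, options=()):
    """Build one pdfreflow command line per .xml file name.

    Hypothetical helper: non-.xml names are skipped, and names are sorted
    so the files are processed in a predictable order.
    """
    return [
        ["pdfreflow", *options, name]
        for name in sorted(filenames)
        if name.lower().endswith(".xml")
    ]


names = ["RT02.xml", "RT01.xml", "RT01.pdf"]
for command in reflow_commands(names, ["--top=36", "--bottom=743"]):
    print(" ".join(command))

# To run for real (assuming pdfreflow is in your path):
#   import os, subprocess
#   for command in reflow_commands(os.listdir("."), ["--top=36", "--bottom=743"]):
#       subprocess.run(command, check=True)
```

This should behave the same on Windows, Mac, and Linux, since Python does the file matching instead of the shell.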

Kevin8or

11-03-2011, 09:26 AM

Ah, cool .. Eh, no joy, but you've given me an avenue of exploration. Thankyou.

Edit: In my Windows XP reference, in a table of batch commands, the syntax is listed as:

for %%var in (set)
do [cmd] %%var

I made a .bat file with this:
for %%var in *.pdf
do pdftohtml -xml %%var
It didn't work, unsurprisingly. I assume I need to replace %%var with something related to the real names of the files I'm using, which are "RT01.pdf", "RT02.pdf", & "RT03.pdf". (I have more pdf files to do, but these are the ones I'm using for practice.)

frabjous

11-03-2011, 10:14 AM

My best guess is something like (all one line)

for %%I in (*.pdf) do pdftohtml -xml "%%I"

Or a single % in both instances if typing in at the command prompt. I don't use Windows so I can't test that.

Kevin8or

11-03-2011, 10:21 AM

You hit the mark! This did it:
for %I in (*.pdf) do pdftohtml -xml "%I"
Out of curiosity, what does the "I" stand for? I mean, why not "A" or "T"?

You've made my day frabjous. :D Thank you so much.

frabjous

11-03-2011, 10:22 AM

It's just the name of the variable. You can make it A or T or whatever you want.

Kevin8or

11-03-2011, 10:32 AM

It's just the name of the variable. You can make it A or T or whatever you want.

I see. Thanks again. :hatsoff: