Thank you for your quick answer.
I discovered the "crop box"; the explanation is clear.
For this document (no columns), I can use "calibre" to convert PDF to EPUB
(k2pdfopt is better for some details).
But calibre can't convert multi-column documents.
I'll see what happens next time, with a multi-column doc.

Thanks for your explanation for computing the text margins. k2pdfopt is an impressive and amazingly precise tool.

The conversion of a standard book like mine may need two kinds of commands: one for the cover page (and any other pages without margins), and another for normal pages with margins. So we may have several output files for one book. I can use pdfsam to merge these output files.

Thanks. You are correct that I don't presently have a good way to apply different options to different pages within a single k2pdfopt conversion, so converting different sets of pages with different options using consecutive commands and then assembling the outputs is the way to go. I had not heard of pdfsam. Thanks for the tip. I use jpdftweak for general PDF file manipulation.
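
A minimal sketch of that two-pass approach, assuming k2pdfopt's -p page-list option is used to select the pages for each pass. The file names (book.pdf, cover_k2opt.pdf, body_k2opt.pdf), the page split, and the -m 0.5 margin are hypothetical placeholders, and the DRY_RUN switch is only part of this sketch, not a k2pdfopt feature. By default the script just prints the commands it would run:

```shell
#!/bin/sh
# Sketch: convert the cover and the body of one book with different options.
# Assumptions: file names, page split, and margin values are placeholders.

# Print the command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pass 1: cover page only (page 1), no margin cropping.
convert_cover() {
    # $1 = source PDF, $2 = output PDF
    run k2pdfopt -p 1 -m 0 -o "$2" "$1"
}

# Pass 2: page 2 through the end, ignoring a 0.5 inch border.
convert_body() {
    # $1 = source PDF, $2 = output PDF
    run k2pdfopt -p 2- -m 0.5 -o "$2" "$1"
}

convert_cover book.pdf cover_k2opt.pdf
convert_body book.pdf body_k2opt.pdf
# The two outputs can then be merged with pdfsam or jpdftweak.
```

Keeping each pass in its own function makes it easy to add a third pass (for example, for an index with a different layout) without touching the others.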

Is there any way to OCR multi-language pages, for example a dictionary page, which is (usually) bilingual? I don't know whether Tesseract allows this, so it might be impossible to achieve.

This is a better question for the Tesseract folks. You can always just try the English language OCR in Tesseract and see what you get. For fun, I tried OCR-ing the attached document (multilingual.pdf), which I created using Google Translate. When I use the English Tesseract training pack (result in multi_eng.pdf), the first three pages (English, French, and German) OCR mostly correctly: some of the special French characters come through, but others are lost or recognized incorrectly, the German umlaut doesn't come through, and the Russian (Cyrillic) page isn't recognized correctly at all. When I use the Russian training pack (result in multi_rus.pdf), the Russian page is (mostly) correct, but none of the others are. So it depends partly on how different the languages are. Unfortunately, I don't see any generic "Romance language" training packs for Tesseract. English is the largest training data package (other than Asian languages), so I'd guess it's your best bet for English/French/Spanish and other Latin-alphabet languages, though I can't say for certain. Again, a Tesseract expert would have to weigh in.

Note that to see the Russian characters correctly, you need to copy and paste the Russian PDF page into a Unicode-aware application (like the Google Translate box in a modern browser). K2pdfopt does not use the correct Cyrillic font. The command I used for the English run was:
k2pdfopt -mode copy -ocr t -ocrvis t multilingual.pdf -ocrlang eng -o multi_eng.pdf
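
The Russian run presumably differed only in the language code and the output name. A sketch of running both, assuming the eng and rus training data are installed where k2pdfopt can find them; the loop, the helper names, and the DRY_RUN switch are illustrative and not part of k2pdfopt:

```shell
#!/bin/sh
# Sketch: OCR the same source once per Tesseract language pack.
# Assumption: the eng and rus training data are already installed.

# Print the command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ocr_with_lang() {
    # $1 = source PDF, $2 = Tesseract language code (e.g. eng, rus)
    run k2pdfopt -mode copy -ocr t -ocrvis t "$1" -ocrlang "$2" -o "multi_$2.pdf"
}

for lang in eng rus; do
    ocr_with_lang multilingual.pdf "$lang"
done
```

Comparing the resulting files page by page shows quickly which training pack copes best with each language in the document.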

Sorry--I missed this post. The problem is that your document size (4.5 x 7 inches) combined with k2pdfopt's default output resolution (167 dpi) results in no wrapping being required. So you have two options if you want wrapped text: (1) increase the output dpi (will make everything larger) to something like 200, or (2) use -wrap+, which will un-wrap the narrow column on the right so that all the text fits the width of your reader screen. You also should use -m 0 to avoid having any clipping since your viewable region runs right to the edge of the page. Finally, for cases like this I like to use -sm so that I can verify how k2pdfopt is interpreting the page layout. Final commands, then:

k2pdfopt -m 0 -sm -fc- -odpi 200 page17.pdf

or

k2pdfopt -m 0 -sm -fc- -wrap+ page17.pdf

(you can also combine -odpi 200 and -wrap+).
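
For reference, the combined form might look like the sketch below. The 200 dpi value is only the starting point suggested above and can be passed as a second argument to try other values; the helper names and the DRY_RUN switch are assumptions of this sketch, not k2pdfopt features:

```shell
#!/bin/sh
# Sketch: both suggestions (higher output dpi and -wrap+) in one command.
# Assumption: 200 dpi is only a starting value to experiment with.

# Print the command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reflow_page() {
    # $1 = source PDF, $2 = output dpi (defaults to 200)
    run k2pdfopt -m 0 -sm -fc- -odpi "${2:-200}" -wrap+ "$1"
}

reflow_page page17.pdf
```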

Thank you. In the meantime, since my previous post was my first and took a while to be moderated (I suspect that's why you missed it as well), I've mostly solved the issue using these arguments (possibly I'm forgetting something):
-m 0 -col 1 -fc- -wrap-
to prevent any wrapping, layout changes, or text resizing. It worked pretty well, since the text size actually fits the reader's (wide) screen quite well (so basically just dumb luck, plus switching to landscape orientation). I will try your solution to see how that works out.

Thanks again for your help!

It sounds like you've rotated the document so that you are viewing it in landscape mode on your reader, which the above options would not do. Maybe you used this?

-m 0 -mode fw

The -mode fw is a shortcut for several options. See my command-line options help page for the details. (Actually, you don't need -m 0 anymore with v1.65. It's now the default.) If you didn't try the above command, you should try it. It's a good solution if you don't need text re-flow.

Hi all,
just wanted to let you know that I have also updated my Windows GUI for k2pdfopt with a few of k2pdfopt's new options, most importantly the OCR functions. The GUI contains links to all Tesseract training files, so downloading them is pretty easy. The respective environment variable is set by the GUI; you only have to specify the path where you have extracted the language files.

I did not want to implement the Download and Extraction procedure into the GUI due to possible safety concerns users might have ("Why does that program connect to the internet?!"), so that part is handled by your trusted browser. ;-)
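
For anyone setting things up without the GUI, a small check like the following can confirm that the language files are where OCR will look for them. This is only a sketch: it assumes the variable in question is Tesseract's TESSDATA_PREFIX, and that the .traineddata files sit either directly in that folder or in a tessdata subfolder (the expected layout varies between Tesseract versions). The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: check that a Tesseract language file is reachable before running OCR.
# Assumptions: TESSDATA_PREFIX is the relevant variable, and the files are
# either in that folder or in its tessdata subfolder.

has_traineddata() {
    # $1 = language code such as eng or rus
    [ -n "${TESSDATA_PREFIX:-}" ] || return 1
    [ -f "$TESSDATA_PREFIX/$1.traineddata" ] ||
        [ -f "$TESSDATA_PREFIX/tessdata/$1.traineddata" ]
}

if has_traineddata eng; then
    echo "eng training data found"
else
    echo "eng training data not found; check TESSDATA_PREFIX"
fi
```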

I have a problem with a three-column text. On one page, an image is placed so that it overlaps two of the columns, and the program does not read the page correctly: it does not recognize the text as three-column. It renders the second page fine. If I cut the image out by specifying a 3.4" bottom margin, the columns get recognized (although the lines separating the columns do not get ignored, which is a minor problem).

could something be done about pages like this, or is it just too much play? Here is the file I have a problem with:

Layouts like this make me want to re-think the way I order regions in k2pdfopt, or at least to provide a couple more options, but I was able to get something reasonable with the existing version:

k2pdfopt -col 4 -cgr .4 -evl 1 -sm -mb 1.1 -ch 0.5 Az.pdf

-col 4 enables detection of up to 4 columns (2 levels of recursion).

-cgr .4 limits the horizontal search range for the column divider. The value of .4 gets k2pdfopt to treat the left column divider as the first divider, which is the key to correct layout on page 1.

-evl 1 erases the vertical lines, which helps k2pdfopt find the column dividers.

-sm shows you how k2pdfopt is flowing your document (in the ..._marked.pdf file). You can take that out on the final conversion since it slows things down considerably.

-mb 1.1 ignores the page numbers / footer on the bottom of each page by cropping off the bottom 1.1 inches from each source page.

-ch 0.5 allows regions as short as 0.5 inches in height to be separated into multiple columns, which is important for page 1 (the default is 1.5 inches).
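
Since -sm is only needed while checking the layout, one way to handle the two stages is a small wrapper that adds it on request. This is a sketch: the SHOW_MARKED and DRY_RUN switches and the function names are assumptions of this example, not k2pdfopt features, and it relies on the option order not mattering (here -sm is simply appended last).

```shell
#!/bin/sh
# Sketch: the multi-column command with -sm as an optional preview step.
# Assumption: the option order does not matter to k2pdfopt.

# Print the command by default; set DRY_RUN=0 to actually run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

convert_columns() {
    # $1 = source PDF
    src=$1
    # Column detection, divider search range, vertical-line erasing,
    # bottom crop, and minimum column height, as explained above.
    set -- -col 4 -cgr .4 -evl 1 -mb 1.1 -ch 0.5
    # -sm writes the marked layout preview; set SHOW_MARKED=0 to skip it
    # on the final conversion, since it slows things down.
    if [ "${SHOW_MARKED:-1}" = "1" ]; then
        set -- "$@" -sm
    fi
    run k2pdfopt "$@" "$src"
}

convert_columns Az.pdf
```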