I'm attempting a PDF to EPUB conversion (and yes, I know it's a dumb thing to do). All goes well except where the original PDF has split a word over 2 lines, which happens a lot in this document.

e.g. if PDF goes
line 1: xxxxxxxxxxxxxxxxxxxxxx al-
line 2: so xxxxxxxxxxxxxxxxx

then the epub comes out as "al‐ so",
but with the hyphen replaced by a thick black (bold?) vertical line after the "l" of "also". NB: it doesn't appear when I copy from the epub reader and paste here, but I also see it in the source window when I open the calibre wizard.

A text version of the source (in Notepad) shows
al-
so

i.e. there's a line break in there.

It must be to do with how a line break character in the PDF is being translated.

Is there any way to remove or suppress it?

Update: I ticked the "transliterate unicode" box and reconverted zip to epub. That removed the thick black character, so now I just see a broken word, e.g. "al- so".

Is it possible to force an automatic repair of all broken words somehow? It would be like a global replace of "- " with nothing, but filtering out the genuine uses of "-". Something like: remove all "- " except when preceded by a space?
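A minimal Python sketch of that rule, for anyone who wants to try it on a plain-text export. The function name and the sample sentence are made up for illustration; the pattern only handles the ordinary ASCII hyphen and assumes the second half of a broken word starts with a lowercase letter.

```python
import re

# A hyphen that directly follows a word character, then whitespace
# (space or line break), then a lowercase letter. A hyphen preceded
# by a space (used as a dash) is left alone because of the lookbehind.
BROKEN_WORD = re.compile(r"(?<=\w)-\s+(?=[a-z])")

def join_broken_words(text):
    """Delete the hyphen and the following whitespace in split words."""
    return BROKEN_WORD.sub("", text)

sample = "he al- so noted a well-known range - roughly 5 - 10 items"
print(join_broken_words(sample))
# he also noted a well-known range - roughly 5 - 10 items
```

One known weakness: a deliberate trailing hyphen such as "pre- and post-war" would be wrongly joined, so it is worth testing on a copy first.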

Actually Calibre does go through and remove hyphenated words intelligently. It uses the document itself as a dictionary to see if there is a variant of the word without a hyphen, and deletes the hyphen if there is a match.

The problem in this case is it's a crappy pdf with some other character encoded in addition to the hyphen. Unless this is a common issue across many pdfs (and I've never seen it with lots of test cases), it's probably not something that will get covered in the code.
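For illustration only, the document-as-dictionary idea described above might be sketched like this in Python. This is not calibre's actual code; the function name, the patterns, and the sample text are all assumptions made up for the example.

```python
import re

WORD = re.compile(r"[A-Za-z]+")
# Two word fragments separated by an ASCII hyphen plus whitespace
BROKEN = re.compile(r"([A-Za-z]+)-\s+([A-Za-z]+)")

def dehyphenate(text):
    """Join 'al- so' into 'also' only if 'also' occurs elsewhere in the text."""
    # The document itself acts as the dictionary of known words.
    vocab = {w.lower() for w in WORD.findall(text)}

    def fix(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text when the joined form is not a known word.
        return joined if joined.lower() in vocab else match.group(0)

    return BROKEN.sub(fix, text)

text = "It is also true. He al- so left. The pre- and post-war years."
print(dehyphenate(text))
# It is also true. He also left. The pre- and post-war years.
```

The sketch only looks for the ASCII hyphen, which shows why an unexpected character in place of (or next to) the hyphen would slip past this kind of check.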

Globalistan - Pepe Escobar
seems to be a good quality, non-commercial PDF, unless I'm misunderstanding the creative commons licence ?
QUOTE from http://www.nimblebooks.com/wordpress...mmons-license/
GLOBALISTAN free under Creative Commons License
Inspired by the example of the science fiction novelist Peter Watts, who released the full text of his outstanding novel BLINDSIGHT under a Creative Commons License last year to deservedly rapturous acclaim from Boing Boing! and many others, Pepe Escobar and I are happy to announce the Free GLOBALISTAN Project.

The full text of Pepe’s brilliant new book, GLOBALISTAN: HOW THE GLOBALIZED WORLD IS DISSOLVING INTO LIQUID WAR, is now available under a Creative Commons license in both PDF and html format
ENDQUOTE

Maybe I should try grabbing and converting an HTML version instead? Unfortunately the link to the HTML version at the above site seems broken; only the PDF link is working.

PS could someone please explain - if the book is being legally distributed for free, with the author's blessing , how come Amazon still want £5.27 for a Kindle version ?

The main problem is that while many of the end of line hyphens are there to break up words to improve the typography of the book, some will be genuinely hyphenated words that should remain so.

And there probably isn't an automated way of determining this during conversion.

On my Kindle, all the genuine hyphenated words appear like this: "xxxxx-xxxxx". All the faulty ones are like this: "xxxxx- xxxx", i.e. only the faulty ones have a space after the hyphen, so maybe an auto-fix IS possible?

UPDATE: I think I may have fixed it. I converted .mobi to .rtf and began a [find "- ", replace with nothing] process in Word. After doing a few manually, it seemed to be finding only correct items to fix, so I fired off Replace All, which made 1100+ changes. I'll convert back into .mobi now and see how it goes. Well, it improved the text, I think.

But a regex solution would maybe be better. I've preserved an unchanged epub version for possible further experimentation.

I see also that in the epub and mobi conversions some pictures are messed up. This is probably an epub format limitation. The original PDF contains charts that seem to be made of 6 or 7 panels appended together horizontally.
The conversion process has separated those into vertical stacks of picture slices. I guess I'll have to read the PDF to see those correctly.

PS could someone please explain - if the book is being legally distributed for free, with the author's blessing , how come Amazon still want £5.27 for a Kindle version ?

Because Amazon wants your money?
I was reading my Tom's Hardware recipe-created ebook today and saw that you could buy a Kindle version (probably with all the ads from the site) for only $.99 a month to replace my Calibre free version.

On my Kindle, all the genuine hyphenated words appear like this: "xxxxx-xxxxx". All the faulty ones are like this: "xxxxx- xxxx", i.e. only the faulty ones have a space after the hyphen, so maybe an auto-fix IS possible?

Just use the remove header/footer regex option to delete the hyphens then.

Code:

(?<=\w)‐\s

That is a different unicode code point than the hyphen that typically occurs in most documents. I'll look into adding that to the default de-hyphenation regex.
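To try the same idea outside calibre, here is a rough Python equivalent of the regex above, widened to a few hyphen-like code points. The particular set of characters is an assumption chosen for the example, not calibre's default de-hyphenation list.

```python
import re

# Assumed set for this sketch: ASCII hyphen-minus (U+002D), soft hyphen
# (U+00AD), Unicode hyphen (U+2010) and non-breaking hyphen (U+2011).
HYPHENS = "\u002d\u00ad\u2010\u2011"

# Same shape as (?<=\w)‐\s but with a character class instead of one hyphen.
PATTERN = re.compile(r"(?<=\w)[" + re.escape(HYPHENS) + r"]\s")

def strip_line_break_hyphens(text):
    """Remove a hyphen-like character plus the one whitespace after it."""
    return PATTERN.sub("", text)

print(strip_line_break_hyphens("al\u2010 so"))   # also
print(strip_line_break_hyphens("al- so"))        # also
```

A hyphen with no whitespace after it, as in "counter‐narrative", is not matched, so genuine compounds stay as they are.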

My repair job via Word is flowing well on Kindle, but as an additional test I ran the same Globalistan.pdf through DNAML software's pdftoepub, to see how it got on with the line-break words.

It screwed up: in the epub reader I see the vertical bars, and here I see exclamation marks (after copy and paste).
book extract showing the bug:
"context of re‐medievalization, where those who control power control weapons, money and The Word, this book also aims to provide a counter‐narrative."
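To check which character is really sitting in an extract like the one above, a small Python snippet can list every non-ASCII code point by name. The helper name is made up, and the sample string assumes the character is U+2010, which is how it appears in the text pasted here.

```python
import unicodedata

def describe_non_ascii(text):
    """Return 'U+XXXX NAME' for each distinct non-ASCII character, in order."""
    seen = {}
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen[ch] = f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
    return list(seen.values())

# Two words from the extract, with the hyphen written as an escape
# so that the code point is unambiguous (assumed to be U+2010).
extract = "re\u2010medievalization counter\u2010narrative"
print(describe_non_ascii(extract))
# ['U+2010 HYPHEN']
```

If a different code point shows up for a given file, that is the character a search-and-replace regex would need to target.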