Hi Folks
I'm new to Calibre and so far am impressed with all that it can do. I'd like to read screenplays on my eReader (a Nook) and have found several threads talking about this but no good solutions that preserve the simple but important formatting.

I'm interested in adding some knowledge of screenplay formatting rules to Calibre so it can carry forward the important formatting bits into converted docs.

I've read http://manual.calibre-ebook.com/develop.html and it suggests that I post here for both help in getting up to speed on the codebase and for advice on how to approach a problem and where in the code it should go.

A bit more about what I'm trying to do: first, I hear you when you say PDF is a poor source format. Yet PDF is the most likely format in which a screenplay exists. A far second is formatted text.

The important bits of screenplay format are pretty much just:

Code:

ONE UPCASED LINE ABOUT THE SCENE SETTING
An arbitrary amount of text describing setting or action. Often mentions a character like PRODUCER whose dialog will appear below:
PRODUCER
(wringing his hands)
Why is it so hard to get my screenplays to look good
on my new eReader?!

QUESTION: Is there a generalized way to add formats other than a chapter book? Right now it appears that there's one case that's in the UI via the "Structure Detection" and "Heuristic Processing" panels.

QUESTION: Where in the code should I try to add this?

QUESTION: What's the best way for a Calibre newbie to get up-to-speed on the code so I don't do things "the wrong way"?

Still curious - is it a design decision to not have user-selectable heuristics/modules for recognizing different formatting conventions other than chapter books? Or is it that the need hasn't yet been strong enough?
Joe

I'm a TV producer and can really tell you this is a desire/need. TV/Film screenplays follow a basic format. So, even though they are PDFs (which I know you stated are evil), are there any "presets" you can suggest in the Calibre conversion process that will work best? Calibre does a great job on its own, but the line spacing is problematic; the ideal output should look like this when finished:

**FADE IN:**

A RIVER.

We're underwater, watching a fat catfish swim along.

This is The Beast.

EDWARD (V.O.)
There are some fish that cannot be caught. It's not that they're faster or stronger than other fish. They're just touched by something extra. Call it luck. Call it grace. One such fish was The Beast.

The Beast's journey takes it past a dangling fish hook, baited with worms. Past a tempting lure, sparkling in the sun. Past a swiping bear claw. The Beast isn't worried.

EDWARD (V.O.)(CONT'D)
By the time I was born, he was already a legend. He'd taken more hundred-dollar lures than any fish in Alabama. Some said that fish was the ghost of Henry Walls, a thief who'd drowned in that river 60 years before. Others claimed he was a lesser dinosaur, left over from the Cretaceous period.

INT. WILL'S BEDROOM - NIGHT (1973)

WILL BLOOM, AGE 3, listens wide-eyed as his father EDWARD BLOOM, 40's and handsome, tells the story. In every gesture, Edward is bigger than life, describing each detail with absolute conviction.

EDWARD
I didn't put any stock into such speculation or superstition. All I knew was I'd been trying to catch that fish since I was a boy no bigger than you.
(closer)
And on the day you were born, that was the day I finally caught him.

EXT. CAMPFIRE - NIGHT (1977)

A few years later, and Will sits with the other INDIAN GUIDES as Edward continues telling the story to the tribe.

EDWARD
Now, I'd tried everything on it: worms, lures, peanut butter, peanut butter-and-cheese. But on that day I had a revelation: if that fish was the ghost of a thief, the usual bait wasn't going to work. I would have to use something he truly desired.

...I hear you when you say PDF is a poor source format. Yet PDF is the most likely format in which a screenplay exists. ...

The fact that it is a "likely format" to be found as a source for conversion in absolutely no way means that it is feasible, or even possible, to create an automated conversion routine that always works with all examples. Two PDFs that look identical can be vastly different in their internal construction.

1. Can you make Calibre's PDF translation better?
2. Assuming an "acceptably-translated" PDF, can you add a "screenplay" heuristic set that'll be savvy about screenplay format?

I see from responses above and throughout the forums that (1) is a sore subject around here. No problem. PDF is fine input for minds but poor for computers. So let's go to (2).

I've played with feeding the current PDF parser a bunch of screenplays and I think that what it generates fits my criteria of an "acceptably-translated" PDF for the heuristics I have in mind.

These heuristics would mainly use indentation to detect structure. A block of text at a given level of indentation would be the unit of reflow. Blank lines would also delimit a block - as well as passing through unaltered.

That's most of it right there. I suspect there would be a few tweaks to this - like parentheticals allowing either same-level or +1 indentation to match - so that

Code:

(this would
be one block)

but I think this would do a pretty nice job.
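
A minimal sketch of the indentation idea described above, assuming the input is plain text in which leading spaces carry the indentation. The names `split_blocks` and `reflow` are made up for illustration and are not part of Calibre. Consecutive lines at the same indent form one block, blank lines pass through, and an unclosed parenthetical may continue at +1 indent:

```python
import textwrap


def _same_block(block, indent):
    """True if a line at `indent` continues the given (indent, texts) block."""
    base, texts = block
    if indent == base:
        return True
    # Parenthetical tweak: an unclosed "(..." may continue one column deeper.
    open_paren = texts[0].startswith("(") and not texts[-1].endswith(")")
    return open_paren and indent == base + 1


def split_blocks(lines):
    """Group lines into (indent, [texts]) blocks; (None, []) marks a blank line."""
    blocks = []
    current = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            blocks.append((None, []))
            current = None
            continue
        indent = len(line) - len(line.lstrip(" "))
        if current is not None and _same_block(current, indent):
            current[1].append(line.strip())
        else:
            current = (indent, [line.strip()])
            blocks.append(current)
    return blocks


def reflow(lines, width=40):
    """Re-wrap each block to `width`, keeping the block's original indent."""
    out = []
    for indent, texts in split_blocks(lines):
        if indent is None:
            out.append("")
            continue
        pad = " " * indent
        out.extend(textwrap.wrap(" ".join(texts), width=width,
                                 initial_indent=pad, subsequent_indent=pad))
    return out
```

This keeps the source indent as-is; on a narrow screen the indents would probably need scaling down as well, which the sketch does not attempt.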

Am I missing something really big?

Last edited by joesh; 05-11-2012 at 07:03 AM.
Reason: fix the blockquote

The problem is that you can't just sidestep the PDF issue. It doesn't matter how many times what you guys are asking for gets rephrased...

PDFs have no "structure" such as indentation - many don't even contain text, being just images. As I understand it, the various PDF converters attempt to resurrect such indentation and line breaks, and apply heuristics to guess where paragraphs end and where indentation exists. But as has been repeated over and over, there are certain issues (some particular to calibre's current PDF converter) that result in corrupted text, such as the oft-quoted double-L issue (ligatures) etc.

Adobe themselves who invented this awful format can't come up with a tool that can convert to something more useful. Now if the originator of the format can't do it, what does that tell you? That it completely sucks for anything other than being rendered as a PDF.

So as I posted on the other thread your options are:

(1) Buy a decent sized tablet and open them in a PDF reader so you don't bother converting. That is what I and many others do, particularly for technical books which rely on layout. If you want an e-ink screen, go hunting for a Kindle DX or whatever other models might be out there...

(2) Do the conversion but live with the formatting being trashed. How trashed depends on a variety of factors such as which tool, what settings and how that PDF was authored. There are no magic settings, you might stumble on something that looks "mostly alright" for one PDF and find it doesn't work well with the next one.

(3) Do the conversion but spend many hours making it readable using an html editor.

In my opinion it is a non-starter, but then I've only dabbled around the edges with PDF conversions. Calibre's perpetually on hold "new" PDF engine contains some improvements that might be able to be built on, but until/if it ever gets released you really are pushing the proverbial uphill.

The existing heuristics are primarily living in calibre/ebooks/conversion/utils.py, though Kovid is correct in the sense that they're primarily called from preprocess.py (and you'll need to touch a handful of other files to add the option to the conversion pipeline). I would say there are two ways to solve your problem:

Contribute to the next gen pdf engine

Preferred solution in the sense that the new engine should convert many more types of pdf formatting accurately, and better screenplay formatting would get a free ride.

Add heuristics to try to format for screenplays

The existing heuristics are primarily regex based, and you could certainly add regexes/patterns for screenplays to a new heuristics option which tries to match the various patterns of a screenplay and insert the appropriate css. The way heuristics stands today you'd need to insert all your styles inline - later in the conversion pipeline Calibre would convert those inline styles to css. The replace nbsp indents and format scene break options both insert formatting along the lines of what I'm talking about.

The reason this option is less desirable, though, is that generalized rules like these are hard to ever get perfect. Note that perfection wasn't the original goal of heuristics - they were designed to take in garbage from a variety of formats and make it somewhat less trashy and potentially worth salvaging by hand.
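
To make the regex idea concrete, here is a rough sketch, assuming the text has already been split into one string per paragraph. Nothing here is Calibre's actual heuristics API; the function names, patterns and style values are all invented for illustration. It classifies each paragraph as a scene heading, character cue, parenthetical, dialogue or action, and wraps it in a paragraph tag with an inline style:

```python
import re
from html import escape

# Hypothetical patterns; real screenplays will need more cases than these.
SCENE = re.compile(r"^(INT\./EXT|INT|EXT|I/E)[. ]")
PAREN = re.compile(r"^\(.*\)$")
# Upper-case name not ending in a period, optionally followed by (V.O.) etc.
CUE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*[A-Z0-9](\s*\([^)]*\))*$")

# Assumed inline styles; the margin values are arbitrary placeholders.
STYLES = {
    "scene": "font-weight:bold; margin-top:1em",
    "action": "margin-top:1em",
    "cue": "margin-left:40%",
    "paren": "margin-left:30%",
    "dialogue": "margin-left:20%; margin-right:20%",
}


def classify(paragraphs):
    """Return one kind per paragraph, using the previous kind as context."""
    kinds = []
    prev = "action"
    for text in paragraphs:
        text = text.strip()
        if SCENE.match(text):
            kind = "scene"
        elif PAREN.match(text) and prev in ("cue", "dialogue"):
            kind = "paren"
        elif prev in ("cue", "paren"):
            kind = "dialogue"
        elif CUE.match(text):
            kind = "cue"
        else:
            kind = "action"
        kinds.append(kind)
        prev = kind
    return kinds


def to_html(paragraphs):
    """Wrap each paragraph in a <p> with the inline style for its kind."""
    return ['<p style="%s">%s</p>' % (STYLES[k], escape(p.strip(), quote=False))
            for k, p in zip(classify(paragraphs), paragraphs)]
```

Rules like these will misfire on some scripts (an all-caps line of action that looks like a name, for example), which is the "hard to ever get perfect" problem in practice.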

Edit - reading through your text I see one big problem for your heuristic approach - you're assuming pdfs have blank lines - they don't. They have 'start text at xyz coordinate'. Blank lines aren't a part of that deal.

In terms of indentation level, that data is also gone by the time it gets to heuristics. However, I have seen many pdfs with indentation information preserved by the pdftohtml function Calibre uses, through the use of multiple non-breaking spaces. These are currently removed early in the conversion pipeline (in preprocess.py for pdf) as they're troublesome to work with in the rest of the pipeline and not needed for a typical book. You could preserve them in cases where a user has enabled the screenplay heuristic - you'd want to convert them to inline styles with a left margin based on the number of spaces.
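
A small sketch of that last step, assuming the indents arrive as a run of non-breaking spaces (either the `&nbsp;` entity or the raw character) directly after a bare `<p>` tag. The actual markup pdftohtml produces may differ, and `nbsp_indent_to_margin` and its `em_per_space` factor are made-up names, not existing Calibre code:

```python
import re

NBSP = "\u00a0"
# A bare <p> followed by one or more non-breaking spaces (entity or raw char).
LEADING_NBSP = re.compile("<p>((?:&nbsp;|" + NBSP + ")+)")


def nbsp_indent_to_margin(html, em_per_space=0.5):
    """Replace leading nbsp runs with an inline left margin on the <p>."""
    def repl(match):
        # Normalise entities to single characters so len() counts spaces.
        count = len(match.group(1).replace("&nbsp;", NBSP))
        return '<p style="margin-left:%.1fem">' % (count * em_per_space)
    return LEADING_NBSP.sub(repl, html)
```

The `em_per_space` value is a guess; it would need tuning against real screenplay output so that cues, parentheticals and dialogue land at visibly different margins.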

ldolse - thanks for the considered response and education on how Calibre removes in preprocessing much of the formatting I was hoping to use.

As far as blank lines are concerned, certainly PDF doesn't have them but translators like pdftotext do create them in the text output - as does pdftohtml I believe.
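
For illustration only, here is roughly how a translator could synthesize blank lines and indentation from positioned text. This is not how pdftotext or pdftohtml are actually implemented; it assumes one text fragment per line, a fixed line height and a fixed character width, and `lines_from_fragments` is a hypothetical name:

```python
def lines_from_fragments(fragments, line_height=12.0, char_width=6.0):
    """Turn (x, y, text) fragments into text lines; y grows down the page."""
    out = []
    prev_y = None
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if prev_y is not None:
            gap = y - prev_y
            # Each extra line height of vertical gap becomes one blank line.
            out.extend([""] * max(0, round(gap / line_height) - 1))
        # Horizontal position becomes leading spaces.
        out.append(" " * round(x / char_width) + text)
        prev_y = y
    return out
```

Both the blank lines and the indents here are guesses derived from coordinates, so they are only as reliable as the assumed line height and character width.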

kiwidude - I really do understand that PDF is, in general, a programming language and a PostScript interpreter is a fairly large beast. That said, most screenplay PDFs are created by a small handful of programs and generally create PDFs that are easy enough for tools like pdftotext to render with pretty high fidelity.

[edit: I stand corrected - I've just found a script output from one of the big screenwriting programs that's not well rendered by pdftotext et al]

I'm sure not looking for perfection here. What pdftotext generates is very satisfactory. Which brings me to a different thought - most eReaders understand straight text, right? Perhaps an easier way to go would be to make a separate tool that'd rewrap paragraphs to a width appropriate for a given reader and then just send the resulting text file to the eReader. Comments?

Last edited by joesh; 05-17-2012 at 06:52 AM.
Reason: found that - as kiwidude said - even pdf for screenplays can be knotty to render

Have you tried using OCR software like ABBYY FineReader? I think it would preserve the formatting when used to convert. There is also http://pdftransformer.abbyy.com/