Short of using element transformations (-moz-transform, DXImageTransform etc.) which we found to be rather impractical, we encode the above HTML with a custom font created by transforming the original font. Here’s how our generated font looks in FontForge:

In the font screenshot above you may also notice that we reduce fonts to only the characters that are actually used in the document, which saves space and network bandwidth. Fonts in PDFs are usually already reduced, so this is not always necessary.
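As a rough illustration of the reduction step, the sketch below collects the code points each font actually needs. The `(font_name, text)` run format and the function name are assumptions made for this example; the post does not describe the converter's internal data model.

```python
from collections import defaultdict


def used_codepoints(text_runs):
    """Map each font name to the sorted code points it has to contain.

    text_runs is an iterable of (font_name, text) pairs. This shape is an
    assumption for illustration, not the converter's real data model.
    """
    used = defaultdict(set)
    for font_name, text in text_runs:
        used[font_name].update(ord(ch) for ch in text)
    return {name: sorted(cps) for name, cps in used.items()}


# Hypothetical runs, loosely based on the sample text in this post.
runs = [
    ("Body", "domain block"),
    ("Body", "range block"),
    ("Caption", "iterations"),
]
subset_plan = used_codepoints(runs)
print("".join(chr(cp) for cp in subset_plan["Caption"]))  # prints "aeinorst"
```

The resulting list per font is what a subsetting tool (for example fontTools' `pyftsubset`) would need in order to drop every unused glyph.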

Naturally, for fonts with diagonal characters every character needs to be offset to a different vertical position (we encode fonts as left-to-right). In fact, this is how other HTML converters basically work: they place every single character on the page using a div with position:absolute:
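The per-character placement described above can be sketched as follows. This is a simplified illustration, assuming a fixed advance width and glyphs that are already rotated inside the font, so that only each character's position changes. The function names and pixel values are made up for the example and are not taken from any particular converter.

```python
import math
from html import escape


def place_rotated_run(text, origin_x, origin_y, advance, angle_deg):
    """Return (char, left, top) for each character along a rotated baseline.

    advance is a fixed per-character advance in pixels (a simplification;
    real fonts have per-glyph advances). angle_deg is measured clockwise in
    screen coordinates, where y grows downwards.
    """
    theta = math.radians(angle_deg)
    dx, dy = advance * math.cos(theta), advance * math.sin(theta)
    return [(ch, origin_x + i * dx, origin_y + i * dy) for i, ch in enumerate(text)]


def to_absolute_divs(placed):
    """Emit one absolutely positioned div per character."""
    return "\n".join(
        f'<div style="position:absolute;left:{left:.1f}px;top:{top:.1f}px">{escape(ch)}</div>'
        for ch, left, top in placed
    )


# A short run on a baseline rotated by 30 degrees: every character gets
# its own vertical offset, which is why it cannot share a single text line.
print(to_absolute_divs(place_rotated_run("iter", 100, 200, 10, 30)))
```

With an angle of 0 every character keeps the same `top` value, which is the case where a whole line can be emitted as ordinary text instead.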

At Scribd, we invested a lot of time in optimizing this, to the degree that we can now convert almost all documents to “nice” HTML markup. We detect character spacing, line-heights, paragraphs, justification and a lot of other attributes of the input document that can be encoded natively in the HTML. So a PDF document uploaded to Scribd may, in its HTML version, look like this (style attributes omitted for legibility):

<span>domain block is in a different image than the range block), as</span>

<span>opposed to mappings which stay in the image (domain block</span>

<span>and range block are in the same image) - also see Fig. 5.</span>

<span>It's also possible to, instead of counting the number of</span>

</p>
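How characters end up grouped into line-level spans like the ones above is not spelled out here. One simple way to think about it, shown purely as a sketch, is to group positioned characters by baseline and insert a space wherever the horizontal gap is large. The tuple layout, the `space_gap` threshold and the function name are all assumptions for this example.

```python
from html import escape
from itertools import groupby


def chars_to_line_spans(chars, space_gap=2.0):
    """Group positioned characters into one <span> per text line.

    chars is a list of (char, x, baseline_y, width) tuples. A gap wider than
    space_gap between neighbouring characters is treated as a word space.
    Baselines are compared for exact equality here; real input would need a
    tolerance. Tuple layout and threshold are assumptions for this sketch.
    """
    ordered = sorted(chars, key=lambda c: (c[2], c[1]))
    spans = []
    for _, line in groupby(ordered, key=lambda c: c[2]):
        text, prev_end = "", None
        for ch, x, _, width in line:
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += ch
            prev_end = x + width
        spans.append(f"<span>{escape(text)}</span>")
    return spans


# Hypothetical glyph boxes: two lines, with a word gap in the first one.
chars = [
    ("I", 0, 10, 4), ("t", 4, 10, 4), ("i", 12, 10, 3), ("s", 15, 10, 4),
    ("o", 0, 22, 5), ("k", 5, 22, 5),
]
print(chars_to_line_spans(chars))
```

Once text is grouped this way, spacing and line height can be expressed as style attributes on the span rather than as one positioned element per character.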

Together with tags for graphic elements on pages, we can now represent every PDF document in HTML while preserving fonts, layout and style, with text selectability, searchability, and making full use of the optimized rendering engines built into browsers.

They definitely have more work to do. The example document they link from that post (http://www.scribd.com/documents/5/Image-Cluster-Compression) is horribly broken in the latest Safari. It works in Chrome, so it’s not really clear why it wouldn’t work in Safari, but there’s a bunch of stuff that’s just all over the page in wrong places.

I wish they had been more specific than “impractical” so we knew the issues they ran into. I haven’t found that to be the case, and (re hacker news thread) the image-based transform of IE’s matrix transform means that things like a transformed textarea have little impact on performance, which can’t be said for some more modern browsers that are forced into a slower rendering path (and selection works just fine in the textarea).

Main annoyances I’ve had to work around
1) Major performance drops in (Windows-based, non-hardware-accelerated) WebKit and Opera when used with features such as textareas or border-radius.
2) IE’s image-based transforms mean that scaling uses bilinear sampling and can look pretty bad for even modest values.
3) IE doesn’t anti-alias text at all if the element that contains it and is being transformed has a transparent background color.
4) Cairo (on Windows, at least) renders even transformed glyphs at integer coordinates. This can be very obvious with small text as the rounding used can make each character have essentially its own little additional rotation.
5) Also on Windows, and I think just Windows, #4 combines with some layout issue where a changing transform on one element can cause transformed (but unchanging) text elsewhere on the page to jitter back and forth from frame to frame.

#4 and #5 are the most annoying because they can’t really be worked around and can be pretty distracting. There are bugs filed, but they haven’t gotten much attention. However, Bas’s Direct2D Cairo backend renders text so beautifully (even at extreme skews) that I’m willing to hold out.

I’ve had good luck using SVG to rotate text. The SVG rendering path is pretty well optimized for rotated text. The advantage is that it’s well-supported, even on Opera, and even on older browser versions. Except for IE, of course. The disadvantage is that you have to use JavaScript to embed SVG nodes into HTML, but I suppose you could use a script to replace HTML elements with SVG elements.
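For illustration, here is a minimal sketch of what such a script might produce, written so that the SVG markup is generated ahead of time rather than in the browser. The function name, default sizes and coordinates are assumptions; only the `transform="rotate(angle cx cy)"` attribute on `<text>` is standard SVG.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def rotated_text_svg(text, angle_deg, x=10, y=50, width=200, height=100):
    """Return a standalone SVG fragment with text rotated around (x, y).

    The default position and canvas size are arbitrary values for this sketch.
    """
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<text x="{x}" y="{y}" transform="rotate({angle_deg} {x} {y})">'
        f"{escape(text)}</text>"
        "</svg>"
    )


svg = rotated_text_svg("iterations", -90)
ET.fromstring(svg)  # raises ParseError if the markup is not well-formed
print(svg)
```

The text stays a real text node inside the SVG, so whether it can be searched or selected depends on the browser's SVG support.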

The clear advantage of the glyph approach over SVG et al. is that you can use easier, semantically understandable markup. In short: you can Ctrl+F-search even for distorted text like the caption “iterations” in the screenshot.

As we are used to this feature from PDF readers, it’s nice to have it in the HTML version too.

Furthermore, this will improve the “indexability” of the converted PDF documents. Until now, PDF-to-HTML converters often ignored captions in diagrams and drawings.

Good point about Ctrl+F and text selection. I tested it on a simple rotated SVG text demo. It works in WebKit, Opera and the IE9 preview. It doesn’t work in Firefox (5-year-old bug 292498). The bug notes say that Firefox fails the SVG test suite because of this.

Reducing fonts to only the characters that are actually used in the document saves network bandwidth, which is especially helpful for mobile apps that, even with 4G, are constrained for bandwidth. What other techniques can be used to save network bandwidth?