In short, it was: for those of us who won't know the final character set until after the ePUB is basically created, and who have 20-30 XHTML files... is there an easier way to obtain all the text than copying and pasting each of the XHTML files into the box? And is there any way that you can see to integrate this with, say, ePUBtweak.exe? So that Font Shrinker could scour the exploded files when you have ePUBtweak open, and obtain the character sets that way?

For now, the only method is copy/pasting. I know that is not always handy, but it was the easiest to implement and I needed the program right away. I see the added value in having it read ePUB and/or XHTML, but that will be some work. The main problem would be identifying only the characters required by a given class.
I will take a look at ePUBtweak to see if I can use the output from it. It might be a good idea.

The next version of Sigil will have a report listing all the characters visible in Book View. It's not by class though, so limiting it to sections of text would still require you to do some work. It may need to be changed to use Code View: this might allow seeing what is in a class, but it would be guessing at what is actually visible in Book View (e.g. if a style hides the text using display:none or similar).

I have thought about this for a while. I think I will start working on the following over the weekend (depends on a lot of personal stuff...):
- ability to select an ePUB
- parse XHTML to find all characters in use by a certain CSS class
- open the fonts used in the ePUB and shrink each one according to the characters used for that font
- replace the fonts in the ePUB with the shrunk ones.

Don't expect it to be ready soon though. It needs quite some testing, and the most difficult part will probably be parsing the stylesheet to find the classes where a font is defined/used. There might be an intermediate version where the style class names have to be entered manually.
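The "characters used by a certain CSS class" step could be sketched roughly as follows, for the intermediate case where the class name is entered manually. This is only an illustration in Python using the standard library, not how Font Shrinker actually does it; the name ClassCharCollector is made up here. It assumes well-formed XHTML, treats everything nested inside an element with the class as belonging to it, and ignores the stylesheet entirely (so text hidden with display:none is still counted).

```python
from html.parser import HTMLParser


class ClassCharCollector(HTMLParser):
    """Collect characters found inside elements that carry a given class.

    Hypothetical helper for illustration; assumes well-formed XHTML so that
    every start tag has a matching end tag.
    """

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.class_name = class_name
        self.stack = []    # one flag per open element: inside the class?
        self.chars = set()

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(inherited or self.class_name in classes)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1]:
            # Keep spaces, drop line breaks and tabs from the markup layout.
            self.chars.update(ch for ch in data if ch not in "\r\n\t")


collector = ClassCharCollector("chapter")
collector.feed('<body><h1 class="chapter">One</h1><p>Body text</p></body>')
collector.close()
print(sorted(collector.chars))  # ['O', 'e', 'n']
```

Finding out which classes a font is attached to would still need the stylesheet parsing mentioned above; this only covers the XHTML side.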

As a special service to JSWolf, I will automatically add the ligatures to the unique characters used.

Well, that's a hell of a wishlist, and it would rock, but I'd be thrilled if it could simply peruse an ePUB for all the characters used in that ePUB, even if the classes are not discovered. By which I mean: let's say I have two fonts. One for the body; one for the chapter heads. By definition, the font for the body will have more characters, in all likelihood. However, I wouldn't care, at this point in time, if I had to feed the Shrinker all the chars in the ePUB, to shrink the Chapter head font.

To have it perfect, later, would be, as I said, amazing, but right this second, what I'd love is if it could just open the ePUB and say, "VOILA!" I don't even care if I have to manually replace the fonts, that's not a big deal.
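For what it's worth, the "just open the ePUB and give me every character" version is fairly small. Here is a rough sketch in Python (standard library only), under a few assumptions: the content documents are UTF-8 and end in .xhtml, .html or .htm, and anything inside style or script elements is skipped. The names TextCollector and epub_characters are invented for this example, and it makes no attempt to tell which font each character belongs to.

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text character outside <style> and <script> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.skip_depth = 0
        self.chars = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chars.update(ch for ch in data if ch not in "\r\n\t")


def epub_characters(path):
    """Return the set of characters used in all content files of an ePUB."""
    chars = set()
    with zipfile.ZipFile(path) as epub:
        for name in epub.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = TextCollector()  # fresh parser per file
                collector.feed(epub.read(name).decode("utf-8", errors="replace"))
                collector.close()
                chars |= collector.chars
    return chars
```

Something like "".join(sorted(epub_characters("book.epub"))) would then give a string to paste into the box, where book.epub stands for whatever file you are working on.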

Not that I'd turn DOWN Shrinker with all the extra goodies...just thinking aloud about what I, personally, need most. I realize my needs are probably different than almost everyone else's.

OH, also: a way to direct the location of the output of the created subsetted font would be super. While I'm wish-listing.

And if I didn't say it loudly enough, before: seriously, you are fabulous.

I'm really enthusiastic that this particular idea of epub tweaking found that much positive resonance - really no joking here.

Tox: while you work on manipulating the font files, you should consider auto-renaming them: both the filename and the font name stored inside the font file. AFAIR it's often required, even by the licences of free fonts, when the fonts are changed. While it's relatively meaningless for personal use, it's crucial as soon as your tool matures enough to become part of the toolchain used by professional producers. (And aren't better-optimized professional books a goal we all wish for?)

I'm not sure. Do you have an example epub (a link or just a small file is fine) that contains ligatures? The code literally just reports each unicode character that appears in the text (and if it has an entity name).

The ePub does not contain the ligatures. ADE 2.0 and Calibre (and maybe other reading software) convert to using ligatures. So for example, if your text has a word such as "flight", the "fl" will be converted to the ligature and displayed that way. Your code would have to handle "fl" both as the separate letters f and l and as the ligature, to cover reading software that does and does not convert to ligatures.

Oh and would it be possible to display each character for a given font for embedded fonts?

I'll try to explain. A text typically has no explicit ligatures, it could have some, but it should not, and I've only seen some very old text files with them. What a text has is just normal unicode characters, let's say a text consists of the single word "office", that's only 5 different letters: c, e, f, i, o.

Now, a font could have ligatures defined, and a reading software may use them (although many do not, I'm afraid). Let's say that the font we are dealing with has the ligatures "fi", "ffi" and "fj" defined. Defining a ligature means that the font has a glyph (a character shape) for the combination "fi" and some instructions saying that whenever there's an f and an i in the text, they should be rendered as the "fi" ligature and not as the separate characters (ditto for "ffi" and "fj").

OK, then our text will ideally be displayed as 4 glyphs: "o", "ffi", "c", "e". There are different things a font subsetter could do:

1) Remove everything but "o", "f", "i", "c", "e", including ligatures and their definition. This is not ideal, but it's probably the simplest.

2) Same as 1, but do not remove ligatures or their definition. That's much better, but it leaves unused glyphs, such as "fi" or "fj".

3) Detect ligatures, find out that "i" and "f" are never used alone, and remove everything but "o", "ffi", "c", "e". This is not a good idea, as renderers that do not support ligatures will not be able to display "f" and "i".

4) Remove all unused single characters, and related ligatures. This would remove "fj", since "j" is not in the source text, but leave "fi" since both "f" and "i" are, although the "fi" ligature is never used (because we have "ffi" already). I think this is the perfect combination of subsetting and not too demanding.

5) Remove some or all ligatures (the glyphs), but do not remove their definitions. This is not a good idea either, and I think this was the bug in Calibre. It means a renderer supporting ligatures would believe there is a ligature to use for "ffi", but it wouldn't find it.

So, if you can, go for #4. But things may be significantly harder. A font (particularly an OTF one) may contain other alternate shapes for glyphs (final forms, swash forms, older variants, small-caps, etc.). Those are currently unused by practically all renderers, but there's still hope that some day we'll be able to enjoy some more advanced typesetting options...

For now I will probably just add the few ligature glyphs. There aren't that many, so the impact on the size is limited. I should think about the small caps, but that one will be at the bottom of the list.
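One possible way to "add the few ligature glyphs" to a character list, sketched in Python: Unicode has a small block of Latin ligature code points (U+FB00 to U+FB06: ff, fi, fl, ffi, ffl, long s-t, st), and their compatibility decomposition gives the plain letters. The sketch below adds a ligature code point whenever its letter sequence occurs in the text. This rests on an assumption that may not hold for a given font: that the font maps these code points at all. Many fonts reach their ligature glyphs only through substitution rules, in which case this does nothing useful. The name with_ligatures is invented for the example.

```python
import unicodedata

# The Latin ligature code points U+FB00..U+FB06.
LATIN_LIGATURES = [chr(cp) for cp in range(0xFB00, 0xFB07)]


def with_ligatures(text):
    """Return the characters of text plus any Unicode ligature it could use."""
    chars = set(text)
    for lig in LATIN_LIGATURES:
        # NFKD turns e.g. U+FB02 into the two letters "fl".
        components = unicodedata.normalize("NFKD", lig)
        if components in text:
            chars.add(lig)
    return chars


extra = with_ligatures("flight") - set("flight")
print([f"U+{ord(ch):04X}" for ch in extra])  # ['U+FB02']
```

Adding all seven unconditionally would be even simpler and, as noted, costs very little in file size.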