Not a good solution, as FontForge is primarily a Linux program: to get it to run on Windows, Cygwin has to be installed, and Python has to be installed under Cygwin. That's a very poor solution for Windows users. We need a more universal solution.

In a couple of other threads, the idea has come up that it would be very useful to have a tool that could subset the fonts embedded in ePub and/or KF8 books. That would both reduce the file sizes and make more fonts available, since some font licences require subsetting to permit embedding.

Someone raised the question of ligatures and alternate characters. As I said, these are very good questions, as the creator of the ePub has no control over the glyph choice of the display software.

In PDFs, the subsetting task is a lot simpler, as all the glyphs used (not just characters) are fixed in the PDF.

For ePubs and KF8, I think we must take this into account in any solution. But this doesn't need to be part of the font subsetting code, which should work from a passed list of glyphs that should be included. (And should return an error if any are missing from the font.)
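To make the "passed list of glyphs, error if any are missing" idea concrete, here is a minimal sketch of just that check. It does no actual subsetting; the function name `select_glyphs` and the idea of passing in the font's glyph names as a plain set are my own assumptions, not part of any existing tool.

```python
def select_glyphs(font_glyphs, requested):
    """Return the requested glyph names, or raise if the font lacks any.

    font_glyphs: set of glyph names present in the font (assumed to have
                 been read from the font by some other code).
    requested:   iterable of glyph names the subset must contain.
    """
    wanted = set(requested)
    missing = sorted(wanted - set(font_glyphs))
    if missing:
        # Fail loudly instead of silently producing an incomplete subset.
        raise ValueError("glyphs missing from font: " + ", ".join(missing))
    return sorted(wanted)
```

A real subsetter would then keep only the returned glyphs and drop everything else from the font.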

For ligatures we might need to generate not only a list of all characters in a file, but also of all character pairs. But, of course, there are also three-character ligatures (ffi in English, for example) and I suppose some languages might have more.
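As a rough sketch of collecting pairs and triples as well as single characters, something like the following would do. The name `collect_sequences` and the `max_len` parameter are just illustrative, and it makes no attempt to stop at word boundaries.

```python
def collect_sequences(text, max_len=3):
    """Return every run of 1..max_len consecutive characters found in text."""
    found = set()
    for n in range(1, max_len + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            found.add(text[i:i + n])
    return found
```

For the text "ffi" this gives the single characters f and i, the pairs ff and fi, and the triple ffi. The set grows quickly on real books, which is one argument for working from the font's own ligature list instead.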

Hmmm... Perhaps we just need to include all ligatures for which the source file includes all the characters in the ligature.
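That rule is easy to sketch, assuming we already have the set of characters used in the book and a list of the character sequences the font has ligatures for. Both inputs, and the name `usable_ligatures`, are hypothetical here.

```python
def usable_ligatures(used_chars, ligatures):
    """Keep the ligatures whose component characters all occur in the book.

    used_chars: iterable of characters found in the source files.
    ligatures:  list of component strings, e.g. "ffi" for the ffi ligature.
    """
    used = set(used_chars)
    return [lig for lig in ligatures if set(lig) <= used]
```

This over-includes a little (the characters may all be present without ever appearing next to each other), but it never drops a ligature the reader software might choose.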

Or perhaps we also need a script to get information on ligatures present in a font, so that that information can be used when parsing the XHTML.

Or should we start off with a very basic solution, and elaborate once that's working?

Are you sure Python would work with that Windows compiled version of Fontforge?

Unfortunately, the MinGW-based FontForge Windows binary installer doesn't seem to install the fontforge Python module, but theoretically it should be possible to use Python scripts to control FontForge.

Because I have much more time than sense, I've done some more work on the script that counts/collects the characters used in files.

Building on the core that Man Eating Duck posted, this script will work for a single ePub, (x)html, or text file. In addition to filtering all of the html code/attributes from the results, it will also convert entities (named or otherwise) to their rendered equivalents.
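For anyone who wants the general idea without reading the whole script, here is a stripped-down sketch of the tag-filtering and entity-conversion part using only the standard library. It is not the script itself; the names `CharCollector` and `count_characters` are made up for this example, and it handles a single markup string rather than a whole ePub.

```python
from collections import Counter
from html.parser import HTMLParser


class CharCollector(HTMLParser):
    """Count the characters in rendered text, ignoring tags and attributes."""

    SKIP = {"script", "style"}  # element content that is never displayed

    def __init__(self):
        # convert_charrefs=True turns &eacute; or &#8212; into real characters.
        super().__init__(convert_charrefs=True)
        self.counts = Counter()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            # Whitespace is left out of the counts in this sketch.
            self.counts.update(ch for ch in data if not ch.isspace())


def count_characters(markup):
    parser = CharCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.counts
```

Calling `count_characters` on each (x)html file in the book and adding the resulting counters together would give the totals for the whole ePub.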

It also has the ability to limit the results to a single specified CSS class (handy for determining the font-subset required for headings or drop-caps).
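The class filter can be sketched along the same lines: keep a stack with one flag per open element, and only collect text while the innermost element is inside the wanted class. Again this is a simplified illustration with invented names (`ClassTextCollector`, `chars_in_class`), and it assumes reasonably well-formed markup, as ePub XHTML should be.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect characters that appear inside elements with a given class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.chars = set()
        self._stack = []  # one bool per open element: inside the class?

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        parent_inside = self._stack[-1] if self._stack else False
        self._stack.append(parent_inside or self.css_class in classes)

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1]:
            self.chars.update(ch for ch in data if not ch.isspace())


def chars_in_class(markup, css_class):
    parser = ClassTextCollector(css_class)
    parser.feed(markup)
    parser.close()
    return parser.chars
```

Text in nested elements (an em inside a heading, say) is included, since it would normally be rendered in the heading's font.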

Python will almost always have issues printing certain Unicode characters to the console on Windows, so Windows users should consider just writing the results to a file and then viewing that file with an editor that supports the required character encoding.
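Writing to a file sidesteps the console entirely as long as the encoding is given explicitly. A minimal sketch, assuming the results are a mapping of character to count (the function name `write_report` and the tab-separated layout are just one possible choice):

```python
def write_report(counts, path):
    """Write one line per character: code point, the character, its count."""
    # An explicit encoding avoids depending on the Windows default code page.
    with open(path, "w", encoding="utf-8") as out:
        for ch, n in sorted(counts.items()):
            out.write(f"U+{ord(ch):04X}\t{ch}\t{n}\n")
```

Including the code point on each line means the report is still useful even if the editor's font cannot display a particular character.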