forrest: Yes, Unicode/UTF-8 should be the default
charset and encoding for Advogato (technically, UTF-8 is not a
charset). So basically I need to convert all the Latin-1 stuff in the
database over, then switch over the reported charset.
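
For what it's worth, the conversion itself is the easy part. Here is a
minimal sketch in Python, assuming each stored field can be pulled out
as raw bytes. The function names are mine, not anything in the
Advogato code, and the "already valid UTF-8" check is only a
heuristic: a short Latin-1 string can occasionally happen to be valid
UTF-8 too.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode that text as UTF-8."""
    # Every byte value is a valid Latin-1 character, so decode cannot fail.
    return raw.decode("latin-1").encode("utf-8")


def convert_field(raw: bytes) -> bytes:
    """Convert one stored field, skipping data that is already valid UTF-8.

    Assumption: anything that decodes cleanly as UTF-8 (including plain
    ASCII) has already been converted, so running this twice is harmless.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return latin1_to_utf8(raw)
```

The guard in convert_field is there so the migration can be re-run
after an interruption without double-encoding anything.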

By the way, Google search results are now multilingual, with Russian,
Japanese, and other alphabets all mixed in on the same page. They seem
to have gone back and forth on this; even recently I got the "results
can not be displayed in this character set" message. In any case, I
think it's cool.

More blog navel-gazing

I expected to get a lot of response from my last
entry, but I didn't. I tried to argue it fairly and carefully, to
best reach an audience of journalists (to whom I expect it would be
considered quite controversial), but to my usual readers I expect I'm
preaching to the choir. Perhaps if I had blamed the media for their
role in the unbelievable ignorance of Americans, it would have stirred
up more response.

In any case, there are some downsides to blogging, or at least areas
where it needs work. For one, not everybody is capable of critical
reading (from the survey above, the fraction would seem to be less
than 17%). The mainstream media is actually pretty good at distilling
a story down to a form where busy people can absorb it quickly. Blogs
aren't, at least not yet. I'm hopeful that technical innovations can
help with that, not least the use of trust metrics to ferret out the
good material, but of course people have to be writing that first.

Needless to say, I didn't get any e-mails from newspaper editors on
why they're not covering Bruce Kushnick's book. The most parsimonious
answer is that their souls are simply 0wnz0red, and they're no more
capable of breaking a story on the corruption of the telecoms industry
than Hilary Rosen is capable of writing an editorial on how music
trading is sometimes good for artists.

But (and this is a big but), the blog world is not (yet) doing a good
job covering this story either. Bruce's publication of the book is a
good start, but there's a lot of followup work to be done:
fact-checking, correcting mistakes, unearthing more evidence,
summarizing the highlights, getting the word out. This is exactly the
sort of thing that journalists claim to be good at, because they have
the resources to do it. Perhaps bloggers don't, although my personal
belief is that it's the kind of work that lends itself to the sort of
distributed effort that's so effective in creating free software.

Word to PDF

Thanks for the great feedback from cinamod and cuenca on this topic.
I'll try to respond.

I'm not sure whether it's better to try to create a batch renderer
project now, or whether it's best to work on existing tools, such as
the renderer in AbiWord. If the latter is really, really good, then it
can be used as a batch renderer, and we're done.

Even if everybody's needs are being well met by the existing projects,
in retrospect I think there would have been significant advantages to
doing the batch renderer first. As cuenca points out, it's a
considerably simpler problem because you don't have to design your
data structures for incremental update and so on. So I think there
would have been high-quality rendering much earlier than we're seeing
now with the GUI-focussed work.

In any case, for people contemplating new projects to work with
complex file formats, I think the advice is sound: do the batch
processor first, then adapt it to work interactively. ImageMagick and
netpbm happened before Gimp, and for a good reason.
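
To make the "batch first" idea concrete, here is a toy sketch in
Python. Everything in it is made up for illustration (the greedy line
breaker, the layout function, the InteractiveView class); it is
nowhere near what a real Word renderer needs. The point is only that
the batch pass is a plain function over the whole document, and the
first interactive version can simply re-run it after every edit.
Incremental update can come later, once the output is right.

```python
from typing import List


def layout(words: List[str], width: int) -> List[str]:
    """Batch pass: greedy line breaking over the whole document at once."""
    lines: List[str] = []
    current = ""
    for word in words:
        candidate = word if not current else current + " " + word
        # A word longer than the width still gets a line of its own.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


class InteractiveView:
    """Naive interactive adapter: every edit re-runs the full batch layout."""

    def __init__(self, words: List[str], width: int) -> None:
        self.words = list(words)
        self.width = width
        self.lines = layout(self.words, self.width)

    def insert(self, index: int, word: str) -> None:
        self.words.insert(index, word)
        # No incremental data structures yet; just redo the batch pass.
        self.lines = layout(self.words, self.width)
```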

A regression suite is absolutely an important part of such a project.
Even better, it should be possible to use such a suite with other
word processors, such as GUI editors.
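
A rough sketch of what such a suite might look like, in Python. The
names (Renderer, digest, run_suite) are hypothetical, and I'm assuming
a renderer can be wrapped as a function from input bytes to output
bytes. Because the suite only sees that function, any renderer, batch
or GUI, could be plugged in. Comparing digests demands byte-identical
output, which is probably too strict for real use; a real suite would
more likely compare rendered pages with some tolerance.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical interface: a renderer maps document bytes to output bytes.
Renderer = Callable[[bytes], bytes]


def digest(data: bytes) -> str:
    """Fingerprint of a rendering, stored as the reference for a test case."""
    return hashlib.sha256(data).hexdigest()


def run_suite(
    render: Renderer,
    cases: Dict[str, bytes],
    expected: Dict[str, str],
) -> List[str]:
    """Return the names of cases whose output differs from the reference.

    A case with no stored reference digest is reported as a failure.
    """
    failures: List[str] = []
    for name, source in sorted(cases.items()):
        if digest(render(source)) != expected.get(name):
            failures.append(name)
    return failures
```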

I'm not enthusiastic about transcoding into another existing document
format such as TeX. This path makes it easy to get basic formatting
right, but probably much harder to get it really good. The idea of TeX
code to match Word's formatting quirks makes me cringe.

AlanShutko:
It's not surprising that Word's layout has changed over the years. In
fact, it's fair to say that interchange and compatibility in the Word
universe only work well if everybody is using the same version. I'm
sure that the fact that this fuels upgrading is merely a
coincidence :)

Even so, that doesn't make the problem impossible, just harder. I
believe that Word documents self-identify the version of Word that
generated them. Therefore, in theory at least, it should be possible
to create a pixel-perfect rendering of the document as seen by the
writer. SMB has many implementation variances, but that doesn't stop
Samba from being viable. The goal, as usual, should be "least
surprise".

Of course the rendering depends on the font metrics. Is there
anyone who believes it shouldn't? Depending on the printer is a
misfeature, of course, but as I've argued above, a "best effort" is
likely to make people happy.