I really don't know much about this, but I would disagree with Will's
assumption that because non-ascii show up as entities when you do "Save
as HTML" that implies that they are stored *internally* within
Word/Excel/Ppt files a entities. That just seems unlikely to me.
Maybe somebody over at the Poi project [1] knows something about this?
My 2 cents...
dwh
[1] http://jakarta.apache.org/poi/index.html
>-----Original Message-----
>From: wallen@Cyveillance.com [mailto:wallen@Cyveillance.com]
>Sent: Thursday, May 20, 2004 10:43 PM
>To: lucene-user@jakarta.apache.org
>Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese
>
>I believe MS apps store non-ascii characters as entities internally instead
>of using unicode. You can see evidence of this if you save your file as an
>HTML file and look at the source. You will have to adjust your parser to
>convert the Windows-1252 characters/entities to unicode (UTF-8 or UTF-16).
>
>-Will
>
>