Topic: How can I extract text if encoding is Identity-H Posted: 16 Nov 2016 at 12:25pm

Dear community!

I am trying to extract text from a PDF file in my application (C#).

In one PDF file (uploads/1345/1.pdf) I encountered a Tj operator whose operand
is the hex string "<001400170011001C001C0003005000F0>". The current font uses
the Identity-H encoding. I tried converting the hexadecimal elements (0014,
0017, ... 00F0) to integers and the integers to characters, but with Identity-H
this does not give the correct result. The text should be "14.99 m²".

The 'Identity-H' encoding in a PDF means that the bytes in the string specify Glyph IDs: values that identify specific glyphs in a TrueType or OpenType font file. These Glyph IDs are not the same from font to font (e.g. 'A' may map to glyph ID 12 in a Helvetica font file, but to glyph ID 37 in Comic Sans). Because the string directly depends on a specific font, an 'Identity-H' encoding requires that the font file be embedded or subset within the PDF.
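To make this concrete, here is a minimal sketch (in Python for brevity; the same logic ports directly to C#) that splits the hex string from the question into 2-byte codes and shows why converting them straight to characters fails. The function name `split_glyph_ids` is just an illustrative name, not part of any library.

```python
def split_glyph_ids(hex_string: str) -> list:
    """Split a PDF hex string such as '<00140017>' into 2-byte big-endian codes."""
    # Drop the angle brackets and any whitespace inside the string.
    digits = "".join(hex_string.strip().strip("<>").split())
    # A PDF hex string with an odd number of digits is padded with a trailing 0.
    if len(digits) % 2:
        digits += "0"
    data = bytes.fromhex(digits)
    if len(data) % 2:
        raise ValueError("Identity-H strings must contain whole 2-byte codes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


codes = split_glyph_ids("<001400170011001C001C0003005000F0>")
print(codes)  # [20, 23, 17, 28, 28, 3, 80, 240]

# The naive approach from the question: treat each code as a character code.
# This yields control characters, 'P' and 'ð' instead of "14.99 m²".
naive = "".join(chr(c) for c in codes)
print(repr(naive))
```

The eight codes line up one-to-one with the eight characters of "14.99 m²", but the numbers themselves are glyph IDs, so they only become text once a mapping for this particular font is applied.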

In order to convert the Glyph IDs in the string back to Unicode, the embedded font should carry a 'ToUnicode' table--a reverse lookup table that maps the glyph IDs back to Unicode code points. If the embedded font does not carry a 'ToUnicode' table (meaning the PDF generator didn't include one), it will be virtually impossible to extract the text that uses that font.
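As a rough illustration of the reverse lookup, the sketch below applies such a table to the glyph IDs from the question. Note that the table here is hypothetical: it was built by hand from the expected text "14.99 m²" given in the question, not read from the PDF. In a real file the entries would come from the font's 'ToUnicode' stream.

```python
# Hypothetical glyph ID -> Unicode table, hand-built from the expected text.
# A real table must be read from the font's ToUnicode stream.
to_unicode = {
    0x0003: " ",
    0x0011: ".",
    0x0014: "1",
    0x0017: "4",
    0x001C: "9",
    0x0050: "m",
    0x00F0: "\u00b2",  # superscript two
}


def decode(glyph_ids, table, missing="\ufffd"):
    """Map each glyph ID to text; unknown IDs become U+FFFD."""
    return "".join(table.get(g, missing) for g in glyph_ids)


# Glyph IDs taken from <001400170011001C001C0003005000F0>.
glyph_ids = [0x14, 0x17, 0x11, 0x1C, 0x1C, 0x03, 0x50, 0xF0]
text = decode(glyph_ids, to_unicode)
print(text)  # 14.99 m²
```

Using a replacement character for unmapped IDs keeps the output aligned with the input, which makes gaps in the table easy to spot.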

Note that the embedded font's 'ToUnicode' table is in a specific format--rather than being a simple table, it's basically a set of PostScript instructions for constructing the table. For this reason, it's usually easier and more reliable to use a third-party library for PDF text extraction than to do it directly in your own code.
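If you still want to see what reading that format involves, here is a simplified sketch that handles only the two most common constructs, `beginbfchar` and the simple three-value form of `beginbfrange`. The sample CMap text is hypothetical (written to match the text in the question, not taken from the PDF), and the sketch assumes the stream has already been decompressed and that every destination is a single 4-digit UTF-16 value. The array form of `bfrange` and multi-character destinations are not handled.

```python
import re

# Hypothetical, already-decompressed ToUnicode CMap fragment.
SAMPLE_CMAP = """
3 beginbfchar
<0003> <0020>
<0011> <002E>
<00F0> <00B2>
endbfchar
2 beginbfrange
<0014> <001C> <0031>
<0050> <0050> <006D>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build a {glyph ID: text} dict from bfchar and simple bfrange sections."""
    table = {}
    # bfchar: one source code followed by one destination per entry.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: first code, last code, destination of the first code.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            first, last, base = int(lo, 16), int(hi, 16), int(dst, 16)
            for offset in range(last - first + 1):
                table[first + offset] = chr(base + offset)
    return table


table = parse_to_unicode(SAMPLE_CMAP)
glyph_ids = [0x14, 0x17, 0x11, 0x1C, 0x1C, 0x03, 0x50, 0xF0]
print("".join(table[g] for g in glyph_ids))  # 14.99 m²
```

Real-world CMaps contain more than this, which is the main reason a dedicated library is the safer choice.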

Here there is no reference to any object that is a ToUnicode map. I do not
understand how Acrobat Reader knows which map must be used here
(Acrobat displays this text correctly), and where Acrobat takes this
map from.

"In this case the entry 0014 will display the glyph of the embedded font at this position (1)."

???? What must I do? Where is the list of these "positions"? I have not found anything similar in the font descriptor. But I have not parsed the FontFile2 stream. Is it in that stream? I think that's right, or not? If so, where can I find a description of FontFile2?
