Text extraction from PDF files is a requirement that many developers encounter in their software projects. While some prefer to use a third-party library (PDFkitten, for example) for this task, others want to implement it from scratch.

This article is the first in a series that will show how to implement this feature from scratch. While the code in the articles uses the CoreGraphics CGPDF* API to parse PDF files, the general concepts apply to any programming language. Basic knowledge of the PDF file structure will help.

Text-showing operators

The PDF specification defines several page content operators for displaying text on a PDF page.

Tj - shows a text string. It has a single operand: stringObject Tj

TJ - shows an array of strings. It has a single operand: arrayObject TJ
- the array can also contain numbers that adjust the spacing between characters

' (single quote) - moves to the next line and shows a text string. It has a single operand: stringObject '

" (double quote) - moves to the next line and shows a text string while setting the word and character spacing. It has 3 operands: wordSpacing characterSpacing stringObject "
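The numbers inside a TJ array matter for extraction. They are expressed in thousandths of a unit of text space and are subtracted from the current position, so a large negative number opens a gap that often stands in for a word space. The following is a minimal sketch of that idea in plain C, not the CGPDF* API: TJElement, flatten_tj and the -200 threshold are hypothetical names and values chosen for illustration, and the strings are assumed to be already decoded to text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one element of a TJ array:
   a string when text is non-NULL, otherwise a spacing adjustment. */
typedef struct {
    const char *text;   /* string element, or NULL for a number element */
    double adjustment;  /* thousandths of a text space unit; used when text is NULL */
} TJElement;

/* Assumed heuristic: adjustments at or below this value count as a word gap. */
#define WORD_GAP_THRESHOLD (-200.0)

/* Concatenates the string elements into out (capacity outSize bytes),
   inserting one space for each large negative adjustment.
   Output is truncated if it does not fit.
   Returns the number of bytes written, excluding the terminator. */
size_t flatten_tj(const TJElement *items, size_t count, char *out, size_t outSize)
{
    size_t used = 0;
    if (outSize == 0) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        const char *piece = items[i].text;
        if (piece == NULL) {
            if (items[i].adjustment > WORD_GAP_THRESHOLD) {
                continue;           /* small kerning tweak, not a word gap */
            }
            piece = " ";
        }
        size_t len = strlen(piece);
        size_t room = outSize - 1 - used;
        if (len > room) {
            len = room;             /* truncate rather than overflow */
        }
        memcpy(out + used, piece, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With this sketch, an array such as [(Hel) -20 (lo) -300 (World)] TJ flattens to "Hello World": the -20 is treated as kerning and the -300 as a word gap. A real extractor would likely scale the threshold by the font's space width instead of using a fixed constant.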

The stringObject operand is a sequence of bytes, not a string in a text encoding such as WinAnsi or UTF-8. These bytes are converted to text using the current font's encoding and, when present, its ToUnicode CMap. This leads to another operator that needs to be handled:

Tf - sets the current font and size. It has 2 operands: fontResourceName fontSize Tf
The fontResourceName is a name object that we'll use to locate the font object in the Resources dictionary.
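To make the relationship between Tf and the string operands concrete, here is a minimal sketch in plain C. It does not use the CGPDF* API: SimpleFont, find_font and decode_string are hypothetical names. It also assumes a heavily simplified font in which every code is one byte and maps to one Unicode code point, whereas real ToUnicode CMaps can use multi-byte codes and map a code to several code points.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified font: one-byte codes, one code point per code. */
typedef struct {
    const char *resourceName;  /* e.g. "F1", the name operand of Tf */
    uint32_t toUnicode[256];   /* 0 means "no mapping for this code" */
} SimpleFont;

/* Finds the font selected by a Tf operand; returns NULL if it is not present. */
const SimpleFont *find_font(const SimpleFont *fonts, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(fonts[i].resourceName, name) == 0) {
            return &fonts[i];
        }
    }
    return NULL;
}

/* Writes code point cp as UTF-8; returns bytes written, or 0 if room is too small.
   The table is assumed to hold valid code points only. */
static size_t put_utf8(uint32_t cp, char *out, size_t room)
{
    if (cp < 0x80) {
        if (room < 1) return 0;
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        if (room < 2) return 0;
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (room < 3) return 0;
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (room < 4) return 0;
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes the bytes of a string operand to UTF-8 using the font's table.
   Unmapped codes become U+FFFD. Returns bytes written, excluding the terminator. */
size_t decode_string(const SimpleFont *font, const unsigned char *bytes, size_t len,
                     char *out, size_t outSize)
{
    size_t used = 0;
    if (outSize == 0) {
        return 0;
    }
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = font->toUnicode[bytes[i]];
        if (cp == 0) {
            cp = 0xFFFD;  /* replacement character for unmapped codes */
        }
        size_t n = put_utf8(cp, out + used, outSize - 1 - used);
        if (n == 0) {
            break;        /* output buffer is full */
        }
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

The point of the sketch is the order of operations: the extractor has to remember the font selected by the most recent Tf, because the same bytes in a later Tj can decode to different text under a different font.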

The page content is parsed using the CGPDFScanner* functions. The operator table for the operators above is set up like this: