How Evernote’s Image Recognition Works

Evernote’s ability to search for text within images is a popular feature. In this post, I’ll describe how the process works and answer some frequently-asked questions.

How images are processed

When a note is sent to Evernote (via synchronization), any Resources included in the note that match the MIME types for PNG, JPG, or GIF are sent to a dedicated set of servers whose sole job is performing Optical Character Recognition (OCR) on the supplied image and reporting back with whatever it finds. These results are added to the note in the form of a hidden metadata attribute called recoIndex (hidden in the sense that it isn't visible when viewing the note). The full recoIndex node is visible when a note is exported as an ENEX file.

For example, I dug around and found an old note in my account containing only a single photo of a bottle of beer:

When I export this note as an ENEX file—a portable XML export format for Evernote notes—and jump to the bottom of the file, I’ll find the recoIndex element. Contained within recoIndex are a number of item nodes. Each item represents a rectangle Evernote’s OCR system believes to contain text.

Each item contains four attributes: x and y indicating the coordinates of the top-left corner of the area represented by the item, as well as w and h representing the width and height of the item.

As an image is evaluated for textual content, a set of possible matches is created as child elements of their corresponding item. Each match is assigned a weight (represented by the w attribute of the match element, not to be confused with the item's width attribute): a numeric value indicating the likelihood that the given match text is the same as the text in the image.

The OCR results are embedded in the note, which is subsequently synchronized back to the user’s client applications. At this point, the text found in the image is available for search.

Here’s a portion of the recoIndex element from the note shown earlier, containing item and t (match) elements. You’ll notice that most of the item elements have multiple t children, each assigned the weight value described earlier. When a user issues a search within an Evernote client, the content of these t elements is searched:
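To make the structure concrete, here's a minimal sketch of searching recoIndex data pulled from an exported ENEX file. The XML snippet, the sample text, and the search_reco_index helper are all hypothetical illustrations of the format described above; a real recoIndex contains many more items and attributes.

```python
# Minimal sketch: searching a recoIndex fragment as an Evernote client might.
# The XML below is a hypothetical example of the structure described above.
import xml.etree.ElementTree as ET

RECO_INDEX = """\
<recoIndex docType="picture" objType="image" recoType="service">
  <item x="120" y="45" w="210" h="60">
    <t w="92">LAGER</t>
    <t w="31">LACER</t>
  </item>
  <item x="118" y="130" w="305" h="48">
    <t w="78">BREWING COMPANY</t>
    <t w="22">BREWING COMPARE</t>
  </item>
</recoIndex>
"""

def search_reco_index(xml_text, query):
    """Return (match text, weight, bounding box) for every t element
    whose text contains the query, case-insensitively."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iter("item"):
        # x/y give the top-left corner; w/h give width and height.
        box = tuple(int(item.get(k)) for k in ("x", "y", "w", "h"))
        for t in item.iter("t"):
            if query.lower() in t.text.lower():
                hits.append((t.text, int(t.get("w")), box))
    return hits

print(search_reco_index(RECO_INDEX, "brewing"))
```

Note how a single rectangle carries several candidate matches with different weights; this is why a search for a slightly misread word can still find the note.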

How PDFs are processed

Evernote’s OCR system can also process PDF files, but they’re handled differently from images. When a PDF is processed, a second PDF document that contains the recognized text is created and embedded in the note containing the original PDF. This second PDF is not visible to the user and exists only to facilitate search. It also doesn’t count against the user’s monthly upload allowance.

For a PDF to be eligible for OCR, it must meet certain requirements:

It must contain a bitmap image

It must not contain selectable text (or, at most, only a minimal amount)

In practical terms, this excludes most PDFs generated from text-based formats by applications such as word processors and other authoring tools. PDFs generated by hardware scanners generally meet the above requirements. If the scanner software performs its own OCR on the PDF, however, the file already contains selectable text and won't be processed by Evernote's OCR service.

If you export a note containing a PDF that has been processed by the OCR system, there will be two nodes in the document: data and alternate-data. The data node contains a base-64 encoded version of the original PDF, and the alternate-data node contains the searchable version of the same PDF.
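As a sketch, both payloads can be pulled out of an export and decoded with the standard library. The ENEX fragment here is a simplified, hypothetical example; real exports wrap these nodes in a full en-export/note/resource structure, and the payloads are complete PDFs rather than the tiny stub shown.

```python
# Sketch: decoding the original and searchable PDFs from an ENEX resource.
# The fragment below is a simplified, hypothetical example of the structure.
import base64
import xml.etree.ElementTree as ET

ENEX_FRAGMENT = """\
<resource>
  <data encoding="base64">JVBERi0xLjQ=</data>
  <alternate-data encoding="base64">JVBERi0xLjQ=</alternate-data>
  <mime>application/pdf</mime>
</resource>
"""

root = ET.fromstring(ENEX_FRAGMENT)
original_pdf = base64.b64decode(root.findtext("data"))
searchable_pdf = base64.b64decode(root.findtext("alternate-data"))

# Both payloads begin with the standard PDF magic bytes.
print(original_pdf[:5])  # b'%PDF-'
```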

Common questions

What kind of text can be recognized?

Anything that the OCR system believes to be text. Printed text (e.g., street signs or posters) and handwritten notes (even if your handwriting isn't the neatest in the world) are both evaluated by the OCR service, provided the service can detect them.

The orientation of the text is a factor, as well. Text found within an image will be evaluated if it matches one of the following orientations within a few degrees:

0° — normal horizontal orientation

90° — vertical orientation

270° — vertical orientation

Text that does not match one of these orientations will be ignored (including diagonal and inverted text).
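The orientation rule above can be illustrated with a toy filter. The function name, the exact tolerance, and the idea of a per-region detected angle are all assumptions for illustration; the post only says "within a few degrees" of the three listed orientations.

```python
# Toy illustration of the orientation rule: keep only text regions whose
# detected angle is within a few degrees of 0, 90, or 270.
ALLOWED_ANGLES = (0, 90, 270)
TOLERANCE = 5  # "within a few degrees" -- the exact tolerance is an assumption

def is_searchable_orientation(angle_degrees):
    """Return True if text at this angle would be evaluated by the OCR system."""
    angle = angle_degrees % 360
    # Check distance to each allowed angle, accounting for wraparound at 360.
    return any(
        abs(angle - a) <= TOLERANCE or abs(angle - a) >= 360 - TOLERANCE
        for a in ALLOWED_ANGLES
    )

print(is_searchable_orientation(2))    # True  -- near horizontal
print(is_searchable_orientation(92))   # True  -- near vertical
print(is_searchable_orientation(180))  # False -- inverted text is ignored
print(is_searchable_orientation(45))   # False -- diagonal text is ignored
```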

It’s important to remember that no OCR system is perfect and it’s possible that text you expect to be recognized may not be. That said, the OCR engine is being constantly refined and tuned for better accuracy.

Can Evernote’s OCR be used to create a text version of an image that contains text?

No. As described above, the matching done by the OCR system doesn't produce one-to-one matches. Rather, there will usually be several potential matches for a given rectangle of text, and many of them will be inexact.

How long does it take for an image to be processed by OCR?

When a user syncs a note containing an image, the image is sent to the aforementioned group of servers for OCR processing. The system is queue-based, meaning the submitted image takes its place in line and will be processed after all other images ahead of it in the queue. Images synced by Premium users, however, are moved to the front of the queue ahead of all images synced by free users.

As to how long it will take, this depends on the size of the queue when the image is sent for processing. For Premium users, image processing generally completes within a few minutes (though, it can take longer in some instances). For free users, the wait can be substantially longer if there is a large number of images in the processing queue.
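The queueing behavior described above can be sketched as a simple two-tier priority queue: Premium images are dequeued before free images, with first-in-first-out ordering within each tier. This is a hypothetical model of the behavior the post describes, not Evernote's actual scheduler.

```python
# Sketch of a two-tier OCR queue: Premium images jump ahead of free images,
# FIFO within each tier. This models the described behavior, not the real system.
import heapq
import itertools

class OCRQueue:
    PRIORITY = {"premium": 0, "free": 1}  # lower number = dequeued first

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving FIFO order

    def submit(self, image_id, tier):
        heapq.heappush(
            self._heap, (self.PRIORITY[tier], next(self._counter), image_id)
        )

    def next_image(self):
        return heapq.heappop(self._heap)[2]

queue = OCRQueue()
queue.submit("free-1", "free")
queue.submit("free-2", "free")
queue.submit("premium-1", "premium")
print(queue.next_image())  # premium-1 -- jumps ahead of earlier free images
print(queue.next_image())  # free-1
```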

How many languages does the Evernote OCR system support?

Currently, Evernote’s OCR system can index 28 typewritten languages and 11 handwritten languages. New languages are added regularly and existing languages are optimized and improved. Users can control which language is used when indexing their data by changing the Recognition Language setting in their account’s Personal Settings.