Recognize Text in Images with ML Kit on Android

You can use ML Kit to recognize text in images. ML Kit has both a
general-purpose API suitable for recognizing text in images, such as the
text of a street sign, and an API optimized for recognizing the text of
documents. The general-purpose API has both on-device and cloud-based models.
Document text recognition is available only as a cloud-based model. See the
overview for a comparison of the
cloud and on-device models.

If you do not enable install-time model downloads, the model will be
downloaded the first time you run the on-device detector. Requests you make
before the download has completed will produce no results.

If you want to use the cloud-based model and you have not already enabled
the cloud-based APIs for your project, enable them in the Firebase console
before continuing.

Input image guidelines

For ML Kit to accurately recognize text, input images must contain
text that is represented by sufficient pixel data. Ideally, for Latin
text, each character should be at least 16x16 pixels. For Chinese,
Japanese, and Korean text (only supported by the cloud-based APIs), each
character should be 24x24 pixels. For all languages, there is generally no
accuracy benefit for characters to be larger than 24x24 pixels.

So, for example, a 640x480 image might work well to scan a business card
that occupies the full width of the image. To scan a document printed on
letter-sized paper, a 720x1280 pixel image might be required.
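
As a rough check on these numbers, the arithmetic can be sketched in a few lines of Java. This is only an illustration: CharacterSizeEstimate and its methods are hypothetical helpers, not part of ML Kit, and the estimate assumes the subject fills the image along the measured side. The 4 mm character height used in the usage note is an assumed value.

```java
// Hypothetical helper (not an ML Kit class): estimates how many pixels a
// printed character covers in a captured image.
final class CharacterSizeEstimate {
    // Minimum size for Latin characters suggested by the guidelines above.
    static final int MIN_LATIN_CHAR_PX = 16;

    /**
     * Scales a physical character height to pixels, assuming the subject
     * spans the full image along the measured side.
     *
     * @param imageSidePx   image size in pixels along one side
     * @param subjectSideMm physical size of the subject along the same side
     * @param charHeightMm  physical height of one character (an assumption)
     */
    static double charHeightPx(int imageSidePx, double subjectSideMm, double charHeightMm) {
        if (subjectSideMm <= 0) {
            throw new IllegalArgumentException("subjectSideMm must be positive");
        }
        return imageSidePx * charHeightMm / subjectSideMm;
    }

    static boolean meetsLatinMinimum(double charPx) {
        return charPx >= MIN_LATIN_CHAR_PX;
    }
}
```

For example, letter-sized paper is 279.4 mm on its long side. With 4 mm characters, charHeightPx(1280, 279.4, 4.0) gives roughly 18 pixels, which clears the 16-pixel minimum, while the same page captured at 640 pixels on that side gives roughly 9 pixels, which does not.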

If you are recognizing text in a real-time application, you might also
want to consider the overall dimensions of the input images. Smaller
images can be processed faster, so to reduce latency, capture images at
lower resolutions (keeping in mind the above accuracy requirements) and
ensure that the text occupies as much of the image as possible. Also see
Tips to improve real-time performance.

Recognize text in images

To recognize text in an image using either an on-device or cloud-based model,
run the text recognizer as described below.

1. Run the text recognizer

To recognize text in an image, create a FirebaseVisionImage object
from a Bitmap, media.Image, ByteBuffer, byte array, or a file on
the device. Then, pass the FirebaseVisionImage object to the
FirebaseVisionTextRecognizer's processImage method.

If you create the FirebaseVisionImage object from a Bitmap object, the
image represented by the Bitmap must be upright, with no additional
rotation required.

To create a FirebaseVisionImage object from a
media.Image object, such as when capturing an
image from a device's camera, first determine the angle by which the
image must be rotated to compensate for both the device's
rotation and the orientation of the camera sensor in the device. Then
pass the image and that rotation value to
FirebaseVisionImage.fromMediaImage().
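
The rotation arithmetic described above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the definitive implementation: RotationCompensation is a hypothetical helper, the display rotation is passed as a quarter-turn index (0 to 3, the same numbering as Android's Surface.ROTATION_0 through Surface.ROTATION_270 constants), and the sensor orientation is the value in degrees that the camera reports through CameraCharacteristics.SENSOR_ORIENTATION.

```java
// Hypothetical helper (not an ML Kit class): combines the display rotation
// and the camera sensor orientation into the clockwise angle, in degrees,
// by which a captured frame must be rotated to appear upright.
final class RotationCompensation {
    // Compensation in degrees for display rotations of 0, 90, 180 and 270.
    private static final int[] COMPENSATION_BY_ROTATION = {90, 0, 270, 180};

    /**
     * @param displayRotationQuarterTurns 0..3, matching Surface.ROTATION_* values
     * @param sensorOrientationDegrees    0, 90, 180 or 270
     * @return 0, 90, 180 or 270
     */
    static int degrees(int displayRotationQuarterTurns, int sensorOrientationDegrees) {
        if (displayRotationQuarterTurns < 0 || displayRotationQuarterTurns > 3) {
            throw new IllegalArgumentException("display rotation must be 0..3");
        }
        if (sensorOrientationDegrees < 0 || sensorOrientationDegrees % 90 != 0) {
            throw new IllegalArgumentException("sensor orientation must be a multiple of 90");
        }
        int compensation = COMPENSATION_BY_ROTATION[displayRotationQuarterTurns];
        // Adding 270 and wrapping keeps the result in the range 0..270.
        return (compensation + sensorOrientationDegrees + 270) % 360;
    }
}
```

In an app, the returned angle would then be translated to the matching FirebaseVisionImageMetadata rotation constant (ROTATION_0, ROTATION_90, ROTATION_180, or ROTATION_270) before the image is created. With a sensor orientation of 90 and the device in its natural orientation, degrees(0, 90) returns 90.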

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, a
FirebaseVisionText object will be passed to the success
listener. A FirebaseVisionText object contains the full text recognized in
the image and zero or more TextBlock objects.

Each TextBlock represents a rectangular block of text, which contains zero or
more Line objects. Each Line object contains zero or more
Element objects, which represent words and word-like
entities (dates, numbers, and so on).

For each TextBlock, Line, and Element object, you can get the text
recognized in the region and the bounding coordinates of the region.
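
The nesting described above (blocks containing lines, lines containing elements) is typically walked with three nested loops. The sketch below uses small stand-in classes so that it runs without the Android SDK; Block, Line, and Element here are hypothetical types that only mirror the shape of the result, not the ML Kit classes. With the real result object, the same loops would read from accessors such as getTextBlocks(), getLines(), getElements(), getText(), and getBoundingBox().

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model of the recognition result (NOT the ML Kit classes): each
// region carries its text and a bounding box as {left, top, right, bottom}.
final class TextHierarchySketch {
    static final class Element {
        final String text;
        final int[] box;
        Element(String text, int[] box) { this.text = text; this.box = box; }
    }

    static final class Line {
        final List<Element> elements;
        Line(List<Element> elements) { this.elements = elements; }
    }

    static final class Block {
        final List<Line> lines;
        Block(List<Line> lines) { this.lines = lines; }
    }

    // Walks blocks, then lines, then elements, collecting each word in
    // reading order together with the left/top corner of its bounding box.
    static List<String> describeWords(List<Block> blocks) {
        List<String> out = new ArrayList<>();
        for (Block block : blocks) {
            for (Line line : block.lines) {
                for (Element element : line.elements) {
                    out.add(element.text + " @ (" + element.box[0] + "," + element.box[1] + ")");
                }
            }
        }
        return out;
    }
}
```

Because every level can contain zero children, the loops simply produce an empty list when nothing was recognized, so no separate empty check is needed.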

Note: Recognized languages are provided only when using the cloud model. To
identify languages with the on-device model, use ML Kit's
language identification API.

Tips to improve real-time performance

If you want to use the on-device model to recognize text in a real-time
application, follow these guidelines to achieve the best frame rates:

Throttle calls to the text recognizer. If a new video frame becomes
available while the text recognizer is running, drop the frame.

If you are using the output of the text recognizer to overlay graphics on
the input image, first get the result from ML Kit, then render the image
and overlay in a single step. By doing so, you render to the display surface
only once for each input frame. See the CameraSourcePreview and
GraphicOverlay classes in the quickstart sample app for an example.

If you use the Camera2 API, capture images in
ImageFormat.YUV_420_888 format.

If you use the older Camera API, capture images in
ImageFormat.NV21 format.

Consider capturing images at a lower resolution. However, also keep in mind
this API's image dimension requirements.
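
The first guideline above, dropping frames that arrive while the recognizer is still running, can be sketched with a single atomic flag. FrameThrottle is a hypothetical helper written for illustration and is not part of ML Kit; it assumes the camera callback calls tryAcquire() for each frame and that release() is called from both the success and the failure listener of the recognition task.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper (not an ML Kit class): lets at most one frame be in
// flight at a time so that frames arriving during recognition are dropped.
final class FrameThrottle {
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /**
     * Returns true if the caller may submit this frame to the recognizer.
     * Returns false if a previous frame is still being processed, in which
     * case the caller should drop the frame.
     */
    boolean tryAcquire() {
        return busy.compareAndSet(false, true);
    }

    /** Marks the in-flight frame as finished; call on success and on failure. */
    void release() {
        busy.set(false);
    }
}
```

An atomic compare-and-set is used because camera frames and recognizer callbacks may arrive on different threads; a plain boolean field could let two frames through at once.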

Recognize text in images of documents

To recognize the text of a document, configure and run the cloud-based
document text recognizer as described below.

The document text recognition API, described below, provides an interface that
is intended to be more convenient for working with images of documents. However,
if you prefer the interface provided by the FirebaseVisionTextRecognizer API,
you can use it instead to scan documents by configuring the cloud text
recognizer to use the dense text model.

To use the document text recognition API:

1. Run the text recognizer

To recognize text in an image, create a FirebaseVisionImage object from
a Bitmap, media.Image, ByteBuffer, byte array, or a file on the device.
Then, pass the FirebaseVisionImage object to the
FirebaseVisionDocumentTextRecognizer's processImage method.

If you create the FirebaseVisionImage object from a Bitmap object, the
image represented by the Bitmap must be upright, with no additional
rotation required.

To create a FirebaseVisionImage object from a
media.Image object, such as when capturing an
image from a device's camera, first determine the angle by which the
image must be rotated to compensate for both the device's
rotation and the orientation of the camera sensor in the device. Then
pass the image and that rotation value to
FirebaseVisionImage.fromMediaImage().

2. Extract text from blocks of recognized text

If the text recognition operation succeeds, it will return a
FirebaseVisionDocumentText object. A
FirebaseVisionDocumentText object contains the full text recognized in the
image and a hierarchy of objects that reflect the structure of the recognized
document: