This is the third part of a four-part series on implementing Augmented Reality in your games and apps. Check out the first part and the second part of the series here!

Welcome to the third part of this tutorial series! In the first part of this tutorial, you used the AVFoundation classes to create a live video feed for your game, showing the video from the rear-facing camera.

In the second part, you learned how to implement the game controls and leverage Core Animation to create some great-looking explosion effects.

Your next task is to implement the target-tracking that brings the Augmented Reality into your app.

If you saved your project from the last part of this tutorial, then you can pick up right where you left off. If you don’t have your previous project, or prefer to start anew, you can download the starter project for this part of the tutorial.

Augmented Reality and Targets

Before you start coding, it’s worth discussing targets for a moment.

From retail shelves to train tickets to advertisements in bus shelters, the humble black-and-white QR code has become an incredibly common sight around the world. QR codes are a good example of what’s technically known as a marker.

Markers are real-world objects placed in the field-of-view of the camera system. Once the computer vision software detects the presence of one or more markers in the video stream, the marker can be used as a point of reference from which to initiate and render the rest of the augmented reality experience.

Marker detection comes in two basic flavors:

Marker-Based Object Tracking — The marker must be a black-and-white image composed of geometrically simple shapes such as squares or rectangles, like the QR code above.

Markerless Object Tracking — The marker can be pretty much anything you like, including photographs, magazine covers or even human faces or fingertips. You can use almost any color you wish, although color gradients can be difficult for a CV system to classify.

Admittedly the term markerless object tracking is confusing, since you are still tracking an image “marker”, albeit one that is more complicated and colorful than a simple collection of black-and-white squares. To confuse matters even further, you’ll find other authors who lump all of the above image-detection techniques into a single bucket they call “marker-based” object tracking, and who instead reserve the term markerless object tracking for systems where GPS or geolocation services are used to locate and interact with AR resources.

While the distinction between marker-based object tracking and markerless object tracking may seem arbitrary, what it really comes down to is CPU cycles.

Marker-based object tracking systems can utilize very fast edge-detection algorithms running in grayscale mode, so high-probability candidate regions in the video frame — where the marker is most likely to be located — can be quickly identified and processed.

Markerless object tracking, on the other hand, requires far more computational power.

Pattern detection in a markerless object tracking system usually involves three steps:

Feature Detection — The sample image is scanned to identify a collection of keypoints, also called features or points of interest, that uniquely characterize the sample image.

Feature Descriptor Extraction — Once the system has identified a collection of keypoints, it uses a second algorithm to extract a vector of descriptor objects from each keypoint in the collection.

Feature Descriptor Matching — The feature descriptor sets of both the input query image and the reference marker pattern are then compared. The greater the number of matching descriptors that the two sets have in common, the more likely it is that the image regions “match” and that you have “found” the marker you are looking for.

All three stages must be performed on each frame in the video stream, in addition to any other image processing steps needed to adjust for such things as scale- and rotation-invariance of the marker, pose estimation (i.e., the angle between the camera lens and the 2D-plane of the marker), ambient lighting conditions, whether or not the marker is partially occluded, and a host of other factors.

Consequently, marker-based object tracking has generally been the preferred technique on small, hand-held mobile devices, especially early-generation mobile phones. Markerless object tracking, on the other hand, has generally been relegated to the larger, iPad-style tablets with their correspondingly greater computational capabilities.

Designing the Pattern Detector

In this tutorial you’ll take the middle ground between these two standard forms of marker detection.

Your target pattern is more complicated than a simple black-and-white QR code, but not by much. You should be able to cut some corners while still retaining most of the benefits of markerless object tracking.

Take another look at the target pattern you’re going to use as a marker:

Clearly you don’t have to worry about rotational invariance as the pattern is already rotationally symmetrical. You won’t have to deal with pose estimation in this tutorial as you’ll keep things simple and assume that the target will be displayed on a flat surface with your camera held nearly parallel to the target.

In other words, you won’t need to handle the case where someone prints out a hard copy of the target marker, lays it down on the floor somewhere and tries to shoot it from across the room at weird angles.

The fastest OpenCV API that meets all these requirements is cv::matchTemplate(). It takes the following four arguments:

Query Image — This is the input image which is searched for the target. In your case, this is a video frame captured from the camera.

Template Image — This is the template pattern you are searching for. In your case, this is the bull’s-eye target pattern illustrated above.

Output Array — An output array of floats that range from 0.0 to 1.0. This is the “answer” you’re looking for. Candidate match regions are indicated by areas where these values reach local minima or maxima. Whether the best possible match is indicated by a minimum or maximum is determined by the statistical matching heuristic used to compare the images as explained below.

Matching Method — One of six possible parameters specifying the statistical heuristic to use when comparing the query and template images. In your case, better matches will correspond to higher numerical values in the output array. However, OpenCV supports matching heuristics where better matches are indicated by lower numerical values in the output array as well.

The caller must ensure that the dimensions of the template image fit within those of the query image and that the dimensions of the output array are sized correctly relative to the dimensions of both the query image and the template pattern.

The matching algorithm used by cv::matchTemplate() is based on a Fast Fourier Transform (FFT) of the two images and is highly optimized for speed.

cv::matchTemplate() does what it says on the tin:

It “slides” the template pattern over the top of the query image, one pixel at a time.

At each pixel increment, it compares the template pattern with the “windowed sub-image” of the underlying query image to see how well the two images match.

The quality of the match at that point is normalized on a scale of 0.0 to 1.0 and saved in the output array.

Once the algorithm terminates, an API like cv::minMaxLoc() can be used to identify both the point at which the best match occurs and the quality of the match at that point. You can also set a “confidence level” below which you will ignore candidate matches as simple noise.

A moment’s reflection should convince you that if the dimensions of the query image are (W,H), and the dimensions of the template pattern are (w,h), with 0 < w < W and 0 < h < H, then the dimensions of the output array must be (W-w+1, H-h+1).

The following picture may be worth a thousand words in this regard:

There's one tradeoff you'll make with this API — scale-invariance. If you're searching an input frame for a 200 x 200 pixel target marker, then you're going to have to hold the camera at just the right distance away from the marker so that it fills approximately 200 x 200 pixels on the screen.

The sizes of the two images don't have to match exactly, but the detector won't track the target if your device is too far away from, or too close to the marker pattern.

Converting Image Formats

It's time to start integrating the OpenCV APIs into your AR game.

OpenCV uses its own high-performance, platform-independent container for managing image data. Therefore you must implement your own helper methods for converting the image data back and forth between the formats used by OpenCV and UIKit.

This type of data conversion is often best accomplished using categories. The starter project you downloaded contains a UIImage+OpenCV category for performing these conversions; it's located in the Detector group, but it hasn't yet been implemented. That's your job! :]

Open UIImage+OpenCV.h and add the following three method declarations:

The first two declarations are for static class methods that convert an OpenCV image container into a UIImage, and vice-versa.

The final declaration is for an instance method that converts a UIImage object directly into an OpenCV image container.

You'll be providing the code for these methods in the next few paragraphs, so be prepared for a few warnings. These warnings will go away once you finish adding all the methods.

Note: If you find the syntax cv::Mat to be an odd way of designating an image reference, you're not alone. cv::Mat is actually a reference to a 2-D algebraic matrix, which is how OpenCV2 stores image data internally for reasons of performance and convenience.

The older, legacy version of OpenCV used two very similar, almost interchangeable data structures for the same purpose: CvMat and IplImage. CvMat is likewise a simple 2-D matrix, while the "Ipl" in IplImage stands for the Intel Image Processing Library, a hint at OpenCV's roots with the chip-manufacturing giant.

This static method converts an instance of UIImage into an OpenCV image container. It works as follows:

You retrieve the width and height attributes of the UIImage.

You then construct a new OpenCV image container of the specified width and height. The CV_8UC4 flag indicates that the image consists of 4 color channels — red, green, blue and alpha — and that each channel consists of 8 bits per component.

Next you create a Core Graphics context and draw the image data from the UIImage object into that context.

Finally, you return the OpenCV image container reference to the caller.

The corresponding instance method is even simpler.

Add the following code to UIImage+OpenCV.mm:

- (cv::Mat)toCVMat
{
    return [UIImage toCVMat:self];
}

This is a convenience method which can be invoked directly on UIImage objects, converting them to cv::Mat format using the static method you just defined above.

This static method converts an OpenCV image container into an instance of UIImage as follows:

It first creates a new color space: if the image has only one color channel, it creates a grayscale color space; if the image has multiple color channels, it creates an RGB color space instead.

Next, the method creates a new Core Foundation data reference that points to the image container's data. elemSize() returns the size of an image pixel in bytes, while total() returns the total number of pixels in the image. The total size of the byte array to be allocated comes from multiplying these two numbers.

It then constructs a new CGImage reference that points to the OpenCV image container.

Next it constructs a new UIImage object from the CGImage reference.

Then it releases the locally defined Core Foundation objects before exiting the method.

Finally, it returns the newly-constructed UIImage instance to the caller.

Build and run your project; nothing visible has changed in your game, but occasional incremental builds are good practice, if only to validate that newly added code hasn't broken anything.

The constructor takes a reference to the marker pattern to look for. You'll pass a reference to the bull's-eye target marker pattern as an argument into this constructor.

The object provides an API for scanning input video frames and searching those frames for instances of the marker pattern used to initialize the object in the constructor. You'll pass the video frames as they are captured as arguments into this API. Depending on the power of your hardware, you can expect to invoke this API at least 20 to 30 times per second.

The object provides a collection of APIs for reporting match scores: the exact point in the video frame where a candidate match has been identified; the confidence, or match score, with which that candidate match has been made; and the threshold confidence, below which candidate matches will be discarded as spurious noise.

The object provides a boolean API indicating whether or not it is presently tracking the marker. If the current confidence level, or match score, exceeds the threshold level, this API returns TRUE. Otherwise, it returns FALSE.

m_patternImage is a reference to the original marker pattern. In your code, this will be a reference to the bull's-eye target marker pattern. m_patternImageGray is simply a reference to a grayscale version of m_patternImage. Most image processing algorithms run an order of magnitude faster on grayscale images than on color images. In your code, this will be a reference to a black-and-white version of the bull's-eye target marker pattern.

m_patternImageGrayScaled is a smaller version of m_patternImageGray. This is the actual image reference used for pattern detection where its size has been optimized for speed. In your code, this will be a reference to a small version of the black-and-white version of the bull's-eye target marker pattern.

These elements are simply supporting data members, whose purpose will become clear as you work your way through the rest of this tutorial.

Add the following code to the top of PatternDetector.cpp, just beneath the include directives:

kDefaultScaleFactor is the amount by which m_patternImageGrayScaled will be scaled down from m_patternImageGray. In your code, you'll be cutting the image dimensions down by a factor of two, thus improving performance by a factor of about four, since the total area of the image will be about a quarter of that of the original.

Normalized match scores range from 0.0 to 1.0. kDefaultThresholdValue specifies the score below which candidate matches will be discarded as spurious. In your code, you'll discard candidate matches unless the reported confidence of the match is higher than 0.5.

Now add the following definition for the constructor to PatternDetector.cpp:

You then convert the original marker pattern to grayscale using the OpenCV function cv::cvtColor() to reduce the number of color channels if necessary.

You reduce the dimensions of the grayscale marker pattern by a factor of m_scaleFactor — in your code, this is set to 2.

CV_TM_CCOEFF_NORMED is one of six possible matching heuristics used by OpenCV to compare images. With this heuristic, increasingly better matches are indicated by increasingly large numerical values (i.e., closer to 1.0).

Construct a new cv::Mat image container from the video frame data. Then convert the image container to grayscale mode to accelerate the speed at which matches are performed.

Reduce the dimensions of the grayscale image container by a factor of m_scaleFactor to further accelerate things.

Invoke cv::matchTemplate() at this point. The calculation used here to determine the dimensions of the output array was discussed earlier. The output array will be populated with floats ranging from 0.0 to 1.0 with higher numbers indicating greater confidence in the candidate match for that point.

Use cv::minMaxLoc() to identify the point in the frame where the largest value occurs, as well as the exact value at that point. For most of the matching heuristics used by OpenCV — including the one you're using — larger numbers correspond to better matches. However, for the matching heuristics CV_TM_SQDIFF and CV_TM_SQDIFF_NORMED, better matches are indicated by lower numerical values; you handle these as special cases in a switch block.

Note: OpenCV documentation frequently speaks of "brightness values" in connection with the values saved in the output array. Larger values are considered "brighter" than the others. In OpenCV, images and matrices share the same data type: cv::Mat.

Since the type of resultImage is cv::Mat, the output array can be rendered on-screen as a black-and-white image where brighter pixels indicate better match points between the two images. This can be extremely useful when debugging.

This method is clearly not "game-ready" in its current state; all you're doing here is quickly checking whether or not the detector is tracking the marker.

If you point the camera at a bull's-eye target marker, the match score will shoot up to almost 1.0 and the detector will indicate that it is successfully tracking the marker. Conversely, if you point the camera away from the bull's-eye target marker, the match score will drop to near 0.0 and the detector will indicate that it is not presently tracking the marker.

However, if you were to build and run your app at this point you'd be disappointed to learn that you can't seem to track anything; the detector consistently returns a matchValue() of 0.0, no matter where you point the camera. What gives?

That's an easy one to solve — you're not processing any video frames yet!

Return to ViewController.mm and add the following line to the very end of frameReady:, just after the dispatch_async() GCD call:

m_detector->scanFrame(frame);

The full definition for frameReady: should now look like the following:

Previously, frameReady: simply drew video frames on the screen, thereby creating a "real time" video feed as the visual backdrop for the game. Now, each video frame is passed off to the pattern detector, where the OpenCV APIs scan the frame looking for instances of the target marker.

All right, it's showtime! Build and run your app; open the console in Xcode, and you'll see a long list of "NO" messages indicating the detector can't match the target.

Now, point your camera at the tracker image below:

When the camera is aimed directly at the bull's-eye target and the pattern detector is successfully tracking the marker, you'll see a “YES” message being logged to the console along with the confidence level of the match.

The threshold confidence level is set at 0.5, so you may need to fiddle with the position of your device until the match scores surpass that value.

Note: The pattern detection method you're using does not support scale invariance. This means the target image has to fill up just the “right” amount of space on your screen in order to track. If you’re pointing the camera at the target, but not able to get it to track, try adjusting the distance between the camera lens and marker until you see a “YES” message in the console.

If you're using an iPhone, you should expect a match if you hold the device at a distance where the height of the bull's-eye image covers a little less than one third of the height of the iPhone screen in landscape orientation.

Your console log should look like the following once the device starts tracking the marker:

Paul is a mobile developer based in Los Angeles. He has over 15 years of experience doing professional software development, and has been publishing apps on the iTunes Store since 2009. When he's not busy with work, he enjoys Stanley Kubrick movies, urban hiking and spending time with his wife and son. You can find and follow him on Twitter or GitHub.