Overview

In June 2017, at the Apple Worldwide Developers Conference (WWDC), Apple announced that ARKit would be available in iOS 11. To highlight how IBM’s Watson services can be used with Apple’s ARKit, I created a code pattern that matches a person’s face using Watson Visual Recognition and Core ML. The app then retrieves information (stored in a database) about that person and returns the information to the augmented reality (AR) application. You can find the source code on GitHub and a demonstration on YouTube.
In this blog post, I give a brief overview of ARKit and explain how you can quickly start creating an app that builds an AR experience using facial recognition and Core ML.

What is ARKit?

ARKit simplifies the task of creating an AR experience by combining device motion tracking, scene processing, camera scene capture, and display conveniences. AR can add 2D or 3D objects to the camera view or live view so that these objects seem as if they are interacting with the real world. A few notable AR games have already been created, including Pokémon Go and Zombies, Run!

ARKit uses world and camera coordinate systems that follow a right-handed convention, which means that the x-axis points to the right, the y-axis points up, and the z-axis points toward the viewer. To track the device’s position in world coordinates, ARKit uses a technique called visual-inertial odometry, which combines information from the iOS device’s motion-sensing hardware with vision analysis of the scene visible to the camera. World tracking also analyzes and understands the content of the scene. This means that it can identify horizontal and vertical planes in the camera image and track their position and size.
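To make the axis convention concrete, here is a minimal sketch that positions a small marker in front of the origin of its parent's coordinate space. The function name makeMarker and the sphere size are illustrative assumptions. Because the z-axis points toward the viewer, "in front" means a negative z value, and ARKit distances are measured in meters.

```swift
import SceneKit

// Hypothetical helper: builds a small sphere placed `distance` meters
// in front of the origin of whatever node it is added to.
func makeMarker(metersInFront distance: Float) -> SCNNode {
    let marker = SCNNode(geometry: SCNSphere(radius: 0.02))
    // Right-handed convention: +x is right, +y is up, +z points toward
    // the viewer, so content in front of the origin has a negative z.
    marker.position = SCNVector3(0, 0, -distance)
    return marker
}
```

Adding the result to the scene's root node, for example with sceneView.scene.rootNode.addChildNode(makeMarker(metersInFront: 0.5)), would place the marker relative to the world origin rather than the current camera position.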

World tracking does not always give you exact metrics because it relies on the device’s physical environment, which is not always consistent and often difficult to measure. There is a certain degree of error when mapping the real world in the camera view for AR experiences. To build high-quality AR experiences, you must consider the following points:

Design AR experiences for predictable lighting conditions: World tracking requires clear image analysis. To increase the quality of the AR experience, design the app for well-lit environments where scene details can be analyzed.

Use tracking quality information to provide user feedback: World tracking works best when device motion is combined with clear images. ARKit reports the current tracking quality, and you can use this information to tell the user how to resolve low-quality tracking situations.

Allow time for plane detection to produce clear results and disable plane detection when you have the results you need: ARKit refines its estimate of a plane’s position and extent over time. When a plane is first detected, its position and extent might not be accurate, but the estimate improves as long as the plane remains in the scene.
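The last two points can be sketched in code. The following is a minimal illustration, not the code from the pattern: the function names trackingHint and stopPlaneDetection and the hint wording are assumptions, while ARCamera.TrackingState, planeDetection, and session.run(_:) are ARKit APIs.

```swift
import ARKit

// Hypothetical helper: maps ARKit's tracking state to a short hint
// for the user. Returns nil when tracking is normal.
func trackingHint(for state: ARCamera.TrackingState) -> String? {
    switch state {
    case .normal:
        return nil
    case .notAvailable:
        return "Tracking is unavailable."
    case .limited(.excessiveMotion):
        return "Move the device more slowly."
    case .limited(.insufficientFeatures):
        return "Point the camera at a well-lit, detailed surface."
    case .limited(.initializing):
        return "Initializing the AR session."
    case .limited:
        // Any other limited reason, such as relocalizing.
        return "Tracking is limited."
    }
}

// Hypothetical helper: re-runs the session with plane detection off.
// Running a new configuration without reset options keeps the anchors
// that were already detected.
func stopPlaneDetection(in sceneView: ARSCNView) {
    let configuration = ARWorldTrackingConfiguration()
    configuration.planeDetection = []
    sceneView.session.run(configuration)
}
```

One place to call trackingHint(for:) is the session(_:cameraDidChangeTrackingState:) callback, which ARKit invokes on the view's delegate whenever the tracking state changes.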

ARKit terminology

SceneKit view: A component that lets you easily import, manipulate, and render 3D assets, which lets you set up many defining attributes of a scene quickly.

ARSCNView: A SceneKit view class that includes an ARSession object to manage the motion tracking and image processing that is required to create an AR experience.

ARSession: An object that manages the motion tracking and image processing.

ARWorldTrackingConfiguration: A class that provides high-precision motion tracking and enables features to help you place virtual content in relation to real-world surfaces.

SCNNode: A structural element of a scene graph, representing a position and transform in a 3D coordinate space to which you can attach geometry, lights, cameras, or other displayable content.

App components

Architecture

The user initiates the AR app.

The app opens the camera view to detect the face and crops it.

Watson Visual Recognition classifies the cropped face.

The app retrieves additional information about the person from a Cloudant database based on the classification from Watson Visual Recognition. The app downloads the model from Watson Visual Recognition to the phone so that it can recognize the face using the local model rather than calling the Visual Recognition API.

The app places the information from the database in front of the original person’s face in the mobile device view.

How to create the app

The following steps explain how to create the AR app.

Step 1: Create a project in Xcode

Launch Xcode and choose to create a new project based on a template. Select the Augmented Reality App template.

Step 2: Configure and run AR session

After the project is set up, you need to configure and run the AR session. The template already sets up an ARSCNView that includes an ARSession object. The ARSession object manages motion tracking and image processing. To run this session, you need at minimum an ARWorldTrackingConfiguration object. The following code sets up the session and configures it with horizontal plane detection.
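A minimal sketch of that setup, assuming the view controller and the sceneView outlet that the Augmented Reality App template generates:

```swift
import UIKit
import ARKit

class ViewController: UIViewController, ARSCNViewDelegate {
    // Outlet name assumed to match the Xcode template.
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        // World tracking with horizontal plane detection enabled.
        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = .horizontal

        // Start (or resume) the session owned by the view.
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)

        // Pause tracking while the view is off screen.
        sceneView.session.pause()
    }
}
```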

Important: If your app requires ARKit for its core functionality, use the arkit key in the UIRequiredDeviceCapabilities section of your app’s Info.plist file to make your app available only on devices that support ARKit. If AR is a secondary feature of your app, use the isSupported property of the configuration class to determine whether to offer AR-based features.
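For the second case, a check along these lines could gate the AR feature. The function name startARIfAvailable is a hypothetical helper; isSupported is the class property that ARKit configurations expose.

```swift
import ARKit

// Hypothetical helper: starts world tracking only on supported devices.
// Returns false so the caller can hide or replace the AR feature.
func startARIfAvailable(on sceneView: ARSCNView) -> Bool {
    guard ARWorldTrackingConfiguration.isSupported else {
        return false
    }
    sceneView.session.run(ARWorldTrackingConfiguration())
    return true
}
```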

Step 3: Add 3D content to the detected plane

After the ARSession is set up, you can use SceneKit to place virtual content in the view. The project includes a sample file called ship.scn in the art.scnassets directory that you can place in the view. The following code loads the 3D ship into the scene view, where it appears superimposed on the real world through the app.

// Create a new scene
let scene = SCNScene(named: "art.scnassets/ship.scn")!
// Set the scene to the view
sceneView.scene = scene

For example, the following image shows a plane superimposed over the real-world view.
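The snippet above loads the ship at a fixed position in the scene. To attach content to a plane that ARKit has detected, one option is the renderer(_:didAdd:for:) delegate callback. The sketch below is an assumption about how that could look, not code from the pattern: the class name PlaneContentDelegate is hypothetical, and the node name "ship" assumes the structure of the template's ship.scn file.

```swift
import ARKit
import SceneKit

// Hypothetical delegate: places the ship on the first detected plane.
final class PlaneContentDelegate: NSObject, ARSCNViewDelegate {
    private var hasPlacedShip = false

    func renderer(_ renderer: SCNSceneRenderer,
                  didAdd node: SCNNode,
                  for anchor: ARAnchor) {
        // Only react to plane anchors, and only once.
        guard !hasPlacedShip,
              let planeAnchor = anchor as? ARPlaneAnchor,
              let shipScene = SCNScene(named: "art.scnassets/ship.scn"),
              let ship = shipScene.rootNode.childNode(withName: "ship",
                                                      recursively: true)
        else { return }

        // The anchor's center is expressed in the anchor node's own
        // coordinate space, so the ship is added as a child of `node`.
        ship.position = SCNVector3(planeAnchor.center.x, 0, planeAnchor.center.z)
        node.addChildNode(ship)
        hasPlacedShip = true
    }
}
```

To use it, keep a strong reference to an instance and assign it to sceneView.delegate before running the session.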

Step 4: Use iOS Vision for face detection

After you have tested that the 3D ship appears in the camera view, you can proceed to the next step, detecting faces. The iOS Vision framework can detect a face in an image. You then crop the image to the face before sending it to the Watson Visual Recognition service to be classified.
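As a rough sketch of this step, the following function runs a Vision face-rectangle request on a camera frame and crops the first face it finds. The function name cropFirstFace and the default orientation are assumptions; the pixel buffer would typically come from the current ARFrame's capturedImage.

```swift
import Vision
import CoreImage

// Hypothetical helper: detects faces in a camera frame and returns the
// first one as a cropped image, or nil if no face is found.
// `.right` is an assumed default for a device held in portrait.
func cropFirstFace(in pixelBuffer: CVPixelBuffer,
                   orientation: CGImagePropertyOrientation = .right) throws -> CIImage? {
    // Rotate the frame upright so Vision and the crop share one space.
    let image = CIImage(cvPixelBuffer: pixelBuffer).oriented(orientation)

    let request = VNDetectFaceRectanglesRequest()
    let handler = VNImageRequestHandler(ciImage: image, options: [:])
    try handler.perform([request])

    guard let face = (request.results as? [VNFaceObservation])?.first else {
        return nil
    }

    // boundingBox is normalized (0...1) with a lower-left origin, the
    // same origin Core Image uses, so it only needs scaling to pixels.
    let extent = image.extent
    let faceRect = VNImageRectForNormalizedRect(face.boundingBox,
                                                Int(extent.width),
                                                Int(extent.height))
        .offsetBy(dx: extent.origin.x, dy: extent.origin.y)

    return image.cropped(to: faceRect)
}
```

Vision requests can be slow enough to stall rendering, so in a real app this call would normally run on a background queue.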

Step 5: Use Watson Visual Recognition and Core ML to classify the face

You can now use Watson Visual Recognition to classify the face and send back information about the user (which has been stored in a database) as a JSON response. An IBM Cloud account is required to use the Watson Visual Recognition service. Create the service from the catalog. After you’ve created the service, an API key is created for you that you can use in an application. IBM publishes SDKs for Watson to support various programming languages, and in this example, we’ll use the Watson Swift SDK.

The GitHub code creates three default classifiers when the app initializes. When the classifiers reach the ready status, the trained models are downloaded to the device. The app can then recognize the face by classifying against the local model. Local classification is done using the Watson Visual Recognition API, which uses Apple’s Core ML as the underlying technology. The initial Watson Visual Recognition classification is done using the following code:

The code not only creates the classification but also saves the default user’s details to a Cloudant NoSQL database using the API. The following code classifies the face using the Watson Visual Recognition API, which uses Core ML as the underlying technology.
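As a rough illustration of what local classification involves, the sketch below uses Apple's Vision and Core ML APIs directly instead of the Watson Swift SDK calls that the app relies on. The function name classifyFace and the 0.5 threshold are assumptions, and model stands for whichever Core ML classifier has been downloaded to the device.

```swift
import Vision
import CoreML
import CoreImage

// Hypothetical helper: classifies a cropped face with a local Core ML
// model and returns the best label, or nil if no label is confident
// enough. This is a stand-in for the SDK's local classification call.
func classifyFace(_ faceImage: CIImage,
                  with model: MLModel,
                  minimumConfidence: Float = 0.5) throws -> String? {
    // Wrap the Core ML model so that Vision can scale and feed the image.
    let visionModel = try VNCoreMLModel(for: model)
    let request = VNCoreMLRequest(model: visionModel)

    let handler = VNImageRequestHandler(ciImage: faceImage, options: [:])
    try handler.perform([request])

    // Pick the label with the highest confidence.
    let observations = request.results as? [VNClassificationObservation] ?? []
    guard let best = observations.max(by: { $0.confidence < $1.confidence }),
          best.confidence >= minimumConfidence else {
        return nil
    }
    return best.identifier
}
```

The returned identifier plays the role of the class name in the classification response, which the app can then use to look up the person's record in the database.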

Step 6: Update the node to place 3D content and text

After the Watson Visual Recognition service classifies the face, the response is a JSON blob that indicates a potential match with the training data and how confident Watson is about the match. This result is passed to the Cloudant NoSQL database (which was populated in a separate stage) to retrieve information about the user. The data retrieved from the database has the following fields:

You can now update the SCNNode with these details. SCNNode is a structural element of a scene graph, representing a position and transform in a 3D coordinate space, to which you can attach geometry, lights, cameras, or other displayable content. For each text node, you need to define its font, alignment, and material. The material includes properties for 3D content such as the diffuse color, the specular color, and whether the surface is double-sided. In this example, to display the full name from the previous JSON information, you could do the following:
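A minimal sketch of such a text node follows. The function name makeNameNode, the font size, the colors, and the scale factor are illustrative assumptions; SCNText, SCNMaterial, and SCNNode are SceneKit classes.

```swift
import SceneKit
import UIKit

// Hypothetical helper: builds a node that renders a person's full name
// as 3D text at the given position.
func makeNameNode(fullName: String, at position: SCNVector3) -> SCNNode {
    let text = SCNText(string: fullName, extrusionDepth: 0.5)
    text.font = UIFont.systemFont(ofSize: 12)
    text.alignmentMode = CATextLayerAlignmentMode.center.rawValue

    // Material: diffuse and specular colors, rendered on both sides.
    let material = SCNMaterial()
    material.diffuse.contents = UIColor.white
    material.specular.contents = UIColor.white
    material.isDoubleSided = true
    text.materials = [material]

    let node = SCNNode(geometry: text)
    node.position = position
    // SCNText is sized in font points, while ARKit units are meters,
    // so the node is scaled down to a readable size.
    node.scale = SCNVector3(0.005, 0.005, 0.005)
    return node
}
```

The returned node can be added to the scene with addChildNode(_:) at a position near the detected face.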

Conclusion

With the release of ARKit on iOS 11, there are endless opportunities to build solutions that map data to the real world. Personally, I think augmented reality is an emerging technology in the market, and developers from various industries are experimenting with it to create unique applications for gaming, construction, aviation, and so on. Augmented reality will become more mature over time, and I believe it will be commonplace in the future.